Putting Amazon CodeGuru Reviewer To The Test

AWS introduced CodeGuru Reviewer in June of 2020, as a service that “uses program analysis and machine learning to detect potential defects that are difficult for developers to find.” It was trained on “millions of lines of Java and Python code from the Amazon code base and other sources.” And, like other AWS services, you pay for what you use; in this case based on the number of lines of code in your repository.

CodeGuru also offers a 90-day free tier for repositories of less than 100,000 lines of code. So I decided to give it a go with my log4j-aws-appenders project, which is just under 50,000 lines of code (including whitespace and comments). I’ve been working on this project since 2017, and it’s been through three major revisions; each added significant new functionality, and (in my opinion) improved the internal architecture. It’s currently downloaded 5,000 times a month from 3,000 unique IPs, so I’m going to assume that it gets a decent level of usage.

Those users have found the occasional bug, but in general it’s been pretty solid. However, another set of “eyes” on the codebase certainly couldn’t hurt. If you’d like to see the results, go here.

What Are Static Analysis Tools?

Static analysis tools attempt to identify problems by examining the source code of a computer program. They’ve been around since the late 1970s; one of the earliest, lint, was created at Bell Labs in 1978. It identified errors in C code such as defining variables without using them, or ignoring the return value from functions. Modern tools have far more checks, and can be quite good at calling out problematic code.

The Java ecosystem has several existing analysis tools; here are three popular ones:

  • FindBugs, originally developed at the University of Maryland, looks for patterns in compiled Java bytecode. I’ve been using it almost since it was released, circa 2007, and my logging library runs it as part of the build. Unfortunately, it hasn’t been updated since 2015, and does not recognize bugs involving newer language features.
  • PMD parses the project source code to look for questionable practices. It’s open-source, integrates with build tools such as Maven, and also examines a handful of languages other than Java. It also allows you to configure rules: for example, by default it calls out any catch block that doesn’t contain code, but you can configure it to consider a comment as acceptable (on the assumption that developers will actually explain why it’s OK to ignore the exception); there’s a sketch of that case after this list.
  • SonarQube is primarily an enterprise product, although it does have a limited-scope “community” edition. It supports a wide range of languages and tools, ranging from COBOL to Terraform.
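
To illustrate that PMD rule, here’s a minimal sketch (the class and method names are invented for this example, not taken from my library): with PMD’s default configuration the catch block below is reported as empty, but with the rule configured to accept comments, the explanatory comment is enough to satisfy it.

import java.io.IOException;
import java.io.Writer;

public class CloseQuietlyExample
{
    public static void closeQuietly(Writer writer)
    {
        if (writer == null)
            return;

        try
        {
            writer.close();
        }
        catch (IOException ignored)
        {
            // nothing useful to do if close() fails; suppressing it is intentional
        }
    }
}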

All of these tools are based around explicit rules: the tool’s developers identified a problem and codified how the tool would search for that problem.

A tool based on machine learning, however, should be able to identify questionable code on its own: if, during training, it sees a certain code segment repeated many times throughout a codebase, but then sees something that’s slightly different, it should flag the different version for human review. Perhaps it’s OK, perhaps it’s a problem; in the future, the tool will know the difference. After training, it should be able to flag problems in real-world code.

At least, that’s the theory. And, before I go on, I should make clear that I have no insights into how AWS trained their model; they may have taken a completely different approach than what I just described. Or CodeGuru may rely more heavily on explicit rules than it does on learned patterns.

Running CodeGuru Reviewer

This section contains my notes on using CodeGuru Reviewer. The official docs for associating your repository and running CodeGuru are here.

Initial analysis

To analyze your project, you must first associate the project repository with CodeGuru. For a CodeCommit repository owned by the same account, this is simple: you select the repository from a pre-populated list in the Console, and tell it the name of the main branch. For GitHub (and, I’ll assume, Bitbucket), you must log in to the remote service and grant CodeGuru permission to read your repositories. I did not go through the entire process, but it appears to grant access to all repositories, in all linked organizations that allow connections.

Once you associate a repository, CodeGuru runs a full scan of the main branch. At least, it’s documented as doing so, although I found at least one case where it didn’t. That may have been because I disassociated and re-associated the same repository.

Once the full scan runs, you can browse the issues. This is a bit clumsy: by default the Console only shows 10 recommendations at a time, and you have to page through them. You can increase the display to 100 recommendations, and there’s some rudimentary filtering, but the Console doesn’t remember your settings. I found it easier to download the JSON file, then use jq to get a summary of issues:

cat recommendations.json | jq -c '.RecommendationSummaries[]|[.RuleMetadata.RuleName,.FilePath,.StartLine]'

CodeGuru doesn’t provide any way to disable recommendations by type, or to modify how a rule is applied. The only control you have is to ignore entire files.

Pull-request analysis

Once you’ve associated a repository with CodeGuru, it will run an analysis on any pull requests. The most important thing to understand is that it only analyzes the code that was changed. It does not use any context from the file that the code appears in. So, for example, if you delete the contents of an exception handler, ignoring the exception, CodeGuru won’t flag the change. Hopefully AWS changes this.

What CodeGuru had to say about my project

I uploaded my repository code into CodeCommit, and ran a full repository scan. I’ve selected a few of the findings as representative of what CodeGuru can and cannot do; as I mentioned above, the full results are here.

The Good: AWS API Usage

One of the pleasant surprises of CodeGuru is its knowledge of AWS API calls. For example:

This code might not produce accurate results if the operation returns paginated results instead of all
results. Consider adding another call to check for additional results.

The code in question (there are actually several examples) retrieved a list of Kinesis streams. This is a paginated operation: if there are more streams than will fit into a single response, the API provides a token for retrieving subsequent streams. In this specific case, part of my test suite, I knew that there would be only a single stream with the given name. In the general case, someone using this API should always check for pagination, and forgetting to do so is usually an error.
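
For reference, here’s what handling that pagination looks like with the version 1 AWS SDK for Java. This is a hedged sketch rather than the library’s actual code (the class and method names are mine), but the SDK calls are real: each request starts after the last stream name already returned, and the loop continues until the service reports that there are no more streams.

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.model.ListStreamsRequest;
import com.amazonaws.services.kinesis.model.ListStreamsResult;

public class KinesisStreamLister
{
    // retrieves every stream name, following the pagination marker until exhausted
    public static List<String> listAllStreams(AmazonKinesis client)
    {
        List<String> streamNames = new ArrayList<>();
        ListStreamsRequest request = new ListStreamsRequest();
        ListStreamsResult response;
        do
        {
            response = client.listStreams(request);
            List<String> page = response.getStreamNames();
            streamNames.addAll(page);
            if (! page.isEmpty())
            {
                // the next page starts after the last name in this one
                request.setExclusiveStartStreamName(page.get(page.size() - 1));
            }
        }
        while (Boolean.TRUE.equals(response.getHasMoreStreams()));
        return streamNames;
    }
}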

In another case, CodeGuru flagged me for not setting the description when creating a Parameter Store entry. I’m a big fan of little nudges toward better coding.
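
For what it’s worth, following that recommendation is a one-liner with the version 1 AWS SDK for Java; this sketch (the parameter name, value, and description are invented) just shows where the description goes.

import com.amazonaws.services.simplesystemsmanagement.AWSSimpleSystemsManagement;
import com.amazonaws.services.simplesystemsmanagement.AWSSimpleSystemsManagementClientBuilder;
import com.amazonaws.services.simplesystemsmanagement.model.ParameterType;
import com.amazonaws.services.simplesystemsmanagement.model.PutParameterRequest;

public class ParameterStoreExample
{
    public static void main(String[] args)
    {
        AWSSimpleSystemsManagement client = AWSSimpleSystemsManagementClientBuilder.defaultClient();

        // the description is optional, but it's the kind of nudge CodeGuru gives you
        client.putParameter(new PutParameterRequest()
                            .withName("/example/logging/batch-delay")
                            .withValue("2000")
                            .withType(ParameterType.String)
                            .withDescription("Milliseconds to wait before sending a batch of log messages"));
    }
}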

The Bad: Exception Handling

Exception handling is an area of programming where there are no right answers other than “don’t ignore them.” How you handle an exception depends on what you need to do as a result. However, CodeGuru has some very specific ideas on the subject.

For example, there’s this message:

**Problem**: An exception has been ignored. The catch-block hides information about the stack trace.
Not logging or forwarding the stack trace of an exception can mask unexpected errors and complicate
debugging.

**Fix**: Use a catch block to handle the exception and write a log message declaring that the exception
happened.

In several cases, this message was arguably valid: the code did indeed ignore an exception. And while each case had a comment explaining why the exception was ignored, I can understand that CodeGuru might not consider that acceptable. But then there’s this case:

catch (ResourceNotFoundException ex)
{
    return null;
}

Sorry, CodeGuru, but this is a perfectly valid action to take, and there’s no need to log the exception or re-throw it. Nor (in other reported cases) is there always a need to package an exception cause when transforming an exception.

In my opinion, these reported issues are a result of using machine learning: if a particular piece of code does not match the training corpus, it will be flagged. And while that may be a good default, CodeGuru does not have the flexibility to ignore patterns, leading to repeated false positives.

Here’s another issue that CodeGuru flagged related to exceptions:

Catch `InvocationTargetException` explicitly when you call `Method.invoke()`. Use `Throwable.getCause()`
or `getTargetException()` in the catch handler to access more information about the underlying exception.

And here’s the code that was flagged:

try
{
    // operation using reflection to invoke a method
}
catch (Throwable ex)
{
    if (ex instanceof InvocationTargetException)
        ex = ex.getCause();

    throw new RuntimeException("exception invoking factory method: " + fullyQualifiedMethodName, ex);
}

You might consider this code too clever for its own good. I look at it as a sensible unwrapping of InvocationTargetException, so that catching code (if any) doesn’t have to. It is, however, another case where machine learning fails because it looks at the program as a sequence of tokens, without actually understanding what those tokens mean. A strictly rule-based analysis tool would “see” that I’m catching Throwable, which means that I’m also catching its subclass InvocationTargetException, and move on.
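
For comparison, here’s roughly what the code would look like if I followed CodeGuru’s suggestion literally (a sketch only; factoryMethod stands in for the reflected Method object): the explicit catch performs exactly the unwrapping that the Throwable handler above already does.

try
{
    factoryMethod.invoke(null);
}
catch (InvocationTargetException ex)
{
    // unwrap the exception thrown by the factory method itself
    throw new RuntimeException("exception invoking factory method: " + fullyQualifiedMethodName, ex.getCause());
}
catch (Throwable ex)
{
    // reflection failures: IllegalAccessException, IllegalArgumentException, ...
    throw new RuntimeException("exception invoking factory method: " + fullyQualifiedMethodName, ex);
}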

The Ugly: Cyclomatic Complexity

Cyclomatic complexity is a measure of the number of independent paths through a program’s source code. High cyclomatic complexity means that you need more tests to provide complete coverage, and — combined with large functions — can make code difficult to understand. I’m sure we’ve all seen examples of thousand-line methods built around horrific if-else chains, and would concur that cyclomatic complexity is a valid concern. CodeGuru, however, seems to be a little over-zealous in calling out these complexity issues, as well as not quite getting the point:

The cyclomatic complexity of this method is 11. By comparison, 98% of the methods in the CodeGuru
reference dataset have a lower cyclomatic complexity. This indicates the method has a high number
of decisions and it can make the logic difficult to understand and test.

We recommend that you simplify this method or break it into multiple methods. For example, consider
extracting the code block on lines 128-143 into a separate method.

The method in question performs validation on a configuration object. The total method length is 39 lines of code (including blank lines). The recommendation is to extract the tests on a single field of the configuration into a separate method.

This might be reasonable if the extracted method were then public and separately testable. But that would be a bad API design: you should not advertise the ability to validate just one part of a data object. And extracting code into a private method doesn’t change cyclomatic complexity at all — just the ability of CodeGuru to measure that complexity.
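
To make that concrete, here’s a hypothetical validator in the same spirit as the flagged method (this is not the actual code from my library, and AppenderConfig is a made-up name): whether the per-field checks live inline or in an extracted private helper, every branch still has to be exercised through validate() to test it.

public List<String> validate(AppenderConfig config)
{
    List<String> errors = new ArrayList<>();

    if (config.getStreamName() == null)
        errors.add("missing stream name");
    else if (config.getStreamName().length() > 128)
        errors.add("stream name too long");

    // extracting the two checks above into a private validateStreamName(config, errors)
    // lowers this method's reported complexity, but the total number of paths that a
    // test must exercise through validate() stays exactly the same

    if (config.getBatchDelay() < 0)
        errors.add("batch delay must be non-negative");

    return errors;
}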

I think this example indicates both the promise of machine-learning approaches to code analysis, and their current limitations compared to a rules-based evaluation. This method is an example where high cyclomatic complexity is all but irrelevant: there is a minimal amount of code for each path, and the function can be fully tested in isolation. A well-trained model should be able to recognize those facts. Unfortunately, CodeGuru doesn’t take method size into consideration.

Conclusions

A good code reviewer — human or machine — provides relevant, actionable advice without dwelling on irrelevancies. It’s a tough job, and so far FindBugs is the only tool that I’ve found that passes the bar (and, unfortunately, only up to Java 8). By comparison, most of PMD’s review of my logging library involved unnecessary parentheses in if statements.

And as I said above, a machine model trained on millions of lines of code should be able to identify best practices and call out sketchy ones. In my opinion, CodeGuru Reviewer isn’t quite there. I would not rely on it as my sole reviewer.

I do think it can be a useful partner, pointing out areas where an experienced human should take a closer look. But to be an effective partner, that human needs to be able to adjust some dials: for example, allowing high cyclomatic complexity in a short method, but flagging that same level of complexity in a larger method.

And it definitely needs to take context into account when evaluating pull requests. At the very least, it should look at the surrounding method, to avoid a situation where deleting a single line of code results in an unflagged bug.