What’s the Difference? Creating Diffs with JGit

Rudiger Herrmann, June 19th, 2016 (Last Updated: June 18th, 2016)


In this post, I will dig into the details of how to diff revisions and create patches with JGit, starting with the high-level DiffCommand and working all the way down to the more versatile APIs that find particular changes in a file.

DiffCommand, Take I

The diff command can be used to compare two revisions and report which files were changed, added or removed. A revision, in this context, may be a commit as well as the work directory or the index.

The simplest form of creating a diff in JGit looks like this:

git.diff().setOutputStream( System.out ).call();

If invoked without specifying which revisions to compare, the differences between the work directory and the index are determined.

The command prints a textual representation of the diff to the designated output stream:

diff --git a/file.txt b/file.txt
index 19def74..d5fcacb 100644
--- a/file.txt
+++ b/file.txt
@@ -1 +1,2 @@
 existing line
+added line
\ No newline at end of file
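
The line starting with @@ in this output is the hunk header. It says that the hunk covers one line starting at line 1 of the old file and two lines starting at line 1 of the new file; an omitted count stands for a single line. As a minimal sketch in plain Java (no JGit involved; the class HunkHeader and its members are hypothetical names made up for illustration), such a header could be decoded like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration, not part of JGit
final class HunkHeader {
  // Matches "@@ -oldStart[,oldCount] +newStart[,newCount] @@"
  private static final Pattern PATTERN =
    Pattern.compile( "^@@ -(\\d+)(?:,(\\d+))? \\+(\\d+)(?:,(\\d+))? @@" );

  final int oldStart;
  final int oldCount;
  final int newStart;
  final int newCount;

  private HunkHeader( int oldStart, int oldCount, int newStart, int newCount ) {
    this.oldStart = oldStart;
    this.oldCount = oldCount;
    this.newStart = newStart;
    this.newCount = newCount;
  }

  static HunkHeader parse( String line ) {
    Matcher matcher = PATTERN.matcher( line );
    if( !matcher.find() ) {
      throw new IllegalArgumentException( "Not a hunk header: " + line );
    }
    return new HunkHeader(
      Integer.parseInt( matcher.group( 1 ) ),
      countOf( matcher.group( 2 ) ),
      Integer.parseInt( matcher.group( 3 ) ),
      countOf( matcher.group( 4 ) ) );
  }

  // An omitted line count stands for exactly one line
  private static int countOf( String group ) {
    return group == null ? 1 : Integer.parseInt( group );
  }

  public static void main( String[] args ) {
    HunkHeader header = parse( "@@ -1 +1,2 @@" );
    System.out.println( "old file: " + header.oldCount + " line(s) starting at line " + header.oldStart );
    System.out.println( "new file: " + header.newCount + " line(s) starting at line " + header.newStart );
  }
}
```

Note that the line numbers in a hunk header are one-based, unlike the zero-based line indices that JGit's Edit uses.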

In addition, the call() method also returns a list of DiffEntries. These data structures describe the added, removed, and changed files and can also be used to determine the changes within a certain file.

But how can two arbitrary revisions be compared? By taking a closer look at the DiffCommand it becomes apparent that it actually compares two trees instead of revisions. And that explains why the work directory and index (which are trees themselves) can also be compared without extra effort.

Consequently, the diff command expects parameters of type AbstractTreeIterator to specify the old and new tree to be compared. Sometimes old and new are also referred to as source and destination or simply a and b. To learn more about what trees in Git are, you may want to read Explore Git Internals with the JGit API.

Tree Iterators

But how to get hold of a specific tree iterator? Looking at the type hierarchy of AbstractTreeIterator reveals that there are three implementations of interest.

The FileTreeIterator can be used to access the work directory of a repository. Once the repository is passed to its constructor, it is ready to use:

AbstractTreeIterator treeIterator = new FileTreeIterator( git.getRepository() );

The DirCacheIterator reveals the contents of the dir cache (aka index) and can be created in a similar way as the FileTreeIterator. Given a repository, we can tell it to read the index and pass this instance to the DirCacheIterator like so:

AbstractTreeIterator treeIterator = new DirCacheIterator( git.getRepository().readDirCache() );

Most interesting, however, is probably the CanonicalTreeParser. It can be configured to parse an arbitrary Git tree object. To do so, it needs to be reset with the id of a tree object from the repository. Once set up, it can be used to iterate over the contents of this tree.

This is best illustrated with the following example:

CanonicalTreeParser treeParser = new CanonicalTreeParser();
ObjectId treeId = repository.resolve( "my-branch^{tree}" );
try( ObjectReader reader = repository.newObjectReader() ) {
  treeParser.reset( reader, treeId );
}

The tree parser is now configured to iterate over the tree of the commit to which my-branch points.

Beware that it is undefined what resolve() returns if there are multiple matches. For example, the call resolve( "aabbccdde^{tree}" ) may return the wrong tree if there is a branch and an abbreviated commit id with this name. Therefore prefer fully qualified references like refs/heads/my-branch to reference the branch my-branch or refs/tags/my-tag for the tag named my-tag.

If the id of a commit is already available in the form of an ObjectId (or AnyObjectId), use the following snippet to obtain the tree id thereof:

try( RevWalk walk = new RevWalk( git.getRepository() ) ) {
  // parseCommit() fails if commitId does not reference a commit
  RevCommit commit = walk.parseCommit( commitId );
  ObjectId treeId = commit.getTree().getId();
  try( ObjectReader reader = git.getRepository().newObjectReader() ) {
    // a null prefix places the tree at the root, i.e. paths get no prefix
    return new CanonicalTreeParser( null, reader, treeId );
  }
}

The code assumes that the given object id references a commit and resolves the associated RevCommit, which in turn holds the id of the corresponding tree.

DiffCommand Revisited

Now that we know how to obtain a tree iterator, the rest is simple:

git.diff()
  .setOldTree( oldTreeIterator )
  .setNewTree( newTreeIterator )
  .call();

With the setOldTree() and setNewTree() methods, the trees to be compared can be specified.

Besides these principal properties, several other aspects of the command can be controlled:

setPathFilter restricts the scanned files to certain paths within the repository.
setSourcePrefix and setDestinationPrefix change the prefix of source (old) and destination (new) paths. The default values are a/ and b/.
setContextLines changes the number of context lines, i.e. the number of lines printed before and after a modified line. The default value is three.
setProgressMonitor allows progress to be tracked while the diffs are computed. You can implement your own progress monitor or use one of the predefined ones that come with JGit.
setShowNameAndStatusOnly, as the name suggests, skips generating the textual output and just returns the computed list of DiffEntries.
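
To illustrate what the number of context lines means for the printed output, here is a small sketch in plain Java. It does not call JGit; the class ContextWindow and its method are hypothetical, and the sketch assumes a single change whose context is simply clamped at the beginning and end of the file:

```java
// Hypothetical helper for illustration, not part of JGit
final class ContextWindow {
  // Returns { first, end } of the old-file lines that would be printed for a
  // change covering the zero-based lines begin (inclusive) to end (exclusive).
  // The returned range is zero-based as well, with an exclusive end.
  static int[] printedRange( int totalLines, int begin, int end, int contextLines ) {
    int first = Math.max( 0, begin - contextLines );
    int last = Math.min( totalLines, end + contextLines );
    return new int[] { first, last };
  }

  public static void main( String[] args ) {
    // A change to the sixth line (index 5) of a ten-line file, default context of three
    int[] range = printedRange( 10, 5, 6, 3 );
    System.out.println( "printed lines: " + range[ 0 ] + " (inclusive) to " + range[ 1 ] + " (exclusive)" );
  }
}
```

With the default of three context lines, a one-line change in the middle of a file is therefore shown together with up to three unchanged lines on either side.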

Apart from the properties described so far, the DiffCommand reads these configuration settings from the diff section:

noPrefix: if set to true, the source and destination prefixes are empty by default instead of a/ and b/.
renames: if set to true, the command attempts to detect renamed files based on similar content. More on renamed content later.
algorithm: the diff algorithm that should be used. JGit currently supports myers and histogram.

DiffEntry, Take I

As mentioned before, the principal output of the diff command is the DiffEntry, which deserves a closer look. For each file that was added, removed or modified, a separate DiffEntry is returned. The getChangeType() method indicates the type of the change, which is either ADD, DELETE, or MODIFY. If a rename detector was involved while scanning for changes, the change type may also be RENAME or COPY.

In addition, a DiffEntry holds information about the old and the new state – including path, mode and id – of a file. The methods are named accordingly: getOldPath/Mode/Id and getNewPath/Mode/Id. Depending on whether the entry represents an addition or removal, the getOld or getNew methods may return ‘empty’ values. The JavaDoc explains in detail which values are returned. Note that the id references the blob object in the repository database that contains the file’s content.

Under the Covers of the DiffCommand

In some cases, the DiffCommand may not be sufficient to accomplish the task, for example to detect renames and copies when comparing two revisions or to create customized patches. In these cases, don’t hesitate to take a look under the covers.

The DiffCommand primarily uses the DiffFormatter, which can also be accessed directly to scan for changes and create patches.

Its scan() method expects iterators for the old and new tree and returns a list of DiffEntries. There are also overloaded versions that accept the ids of tree objects.

A simple example that scans for changes looks like this:

OutputStream outputStream = DisabledOutputStream.INSTANCE;
try( DiffFormatter formatter = new DiffFormatter( outputStream ) ) {
  formatter.setRepository( git.getRepository() );
  List<DiffEntry> entries = formatter.scan( oldTreeIterator, newTreeIterator );
}

The output stream to be used by format() is specified in the constructor. Since we aren’t interested in the output right now, a null output stream is supplied. With setRepository, the repository that should be scanned is specified. Finally, the tree iterators are passed to the scan() method, which returns the list of changes between the two.

Note that the DiffFormatter needs to be closed explicitly or used in a try-with-resources statement as shown in the example code.

In order to create patches, one of the format() methods can be used. The patch is expressed as instructions to modify the old tree to make it the new tree.

Like the scan() methods, the format() methods accept pointers to or iterators for an old and a new tree. Either side may be null to indicate that the tree has been added or removed. In this case, the diff will be computed against nothing.

The snippet below uses format() to write a patch to the output stream that was passed to the constructor.

OutputStream outputStream = new ByteArrayOutputStream();
try( DiffFormatter formatter = new DiffFormatter( outputStream ) ) {
  formatter.setRepository( git.getRepository() );
  formatter.format( oldTreeIterator, newTreeIterator );
}

There are also overloaded format() methods that print a single DiffEntry or a list of DiffEntries, possibly obtained by a previous call to scan().

While the outcome of the above example can also be accomplished with a plain DiffCommand, let’s have a look at what else the DiffFormatter has to offer.

As mentioned earlier, renamed files can be associated while computing diffs. To enable rename detection, the DiffFormatter must be told to do so with setDetectRenames( true ). Thereafter, the RenameDetector can be obtained for fine-tuning with getRenameDetector().

Remember that Git is a content tracker and does not track renames. Instead, renames are deduced from similar content during a diff operation.
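
To make the idea of deducing renames from similar content more tangible, here is a toy sketch in plain Java. It is not JGit's algorithm: the RenameDetector computes its own similarity score, whereas this stand-in simply counts the lines that a deleted and an added file have in common. The class ToyRenameCheck, its methods and the threshold of 60 percent used in main() are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for illustration, not how JGit's RenameDetector works internally
final class ToyRenameCheck {
  // Percentage (0 to 100) of lines shared by both contents,
  // relative to the larger of the two line counts
  static int similarity( String oldContent, String newContent ) {
    List<String> oldLines = Arrays.asList( oldContent.split( "\n" ) );
    List<String> remaining = new ArrayList<>( Arrays.asList( newContent.split( "\n" ) ) );
    int newSize = remaining.size();
    int shared = 0;
    for( String line : oldLines ) {
      // remove() consumes the matched line so that it cannot be matched twice
      if( remaining.remove( line ) ) {
        shared++;
      }
    }
    int larger = Math.max( oldLines.size(), newSize );
    return shared * 100 / larger;
  }

  // A deleted and an added file are paired if they are similar enough
  static boolean looksLikeRename( String deleted, String added, int thresholdPercent ) {
    return similarity( deleted, added ) >= thresholdPercent;
  }

  public static void main( String[] args ) {
    String deleted = "a\nb\nc\nd";
    String added = "a\nb\nc\nx";
    System.out.println( "similarity: " + similarity( deleted, added ) + "%" );
    System.out.println( "rename? " + looksLikeRename( deleted, added, 60 ) );
  }
}
```

The point of the sketch is only that a rename is a pairing of a deleted and an added path whose contents are similar enough, which is why the result depends on the chosen threshold.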

In addition, the DiffFormatter has several further properties to fine-tune its behavior, listed below:

setAbbreviationLength: the number of digits to print of an object id.
setDiffAlgorithm: the algorithm that should be used to construct the diff output.
setBinaryFileThreshold: files larger than this size will be treated as though they are binary and not text. Default is 50 MB.
setDiffComparator: the comparator used to determine if two lines of text are identical. The comparator can be configured to ignore various types of white space. However, I wasn’t able to let the DiffFormatter ignore all white spaces.

DiffEntry Revisited

If you are interested in the changes that took place in a certain file, you may want to have another look at the DiffEntry and DiffFormatter.

With diffFormatter.toFileHeader(), a so-called FileHeader can be obtained from a given DiffEntry. And through its toEditList() method, a list of edits can be obtained.

The following code sample shows how to obtain the edit list for the first diff entry that results from a scan:

OutputStream outputStream = DisabledOutputStream.INSTANCE;
try( DiffFormatter formatter = new DiffFormatter( outputStream ) ) {
  formatter.setRepository( git.getRepository() );
  List<DiffEntry> entries = formatter.scan( oldTreeIterator, newTreeIterator );
  FileHeader fileHeader = formatter.toFileHeader( entries.get( 0 ) );
  return fileHeader.toEditList();
}

This list can be interpreted as the modifications that need to be applied to the old content in order to transform it to the new content.

Each Edit describes a region that was inserted, deleted or replaced and the lines that are affected.
The lines are counted starting with zero and can be queried with getBeginA(), getEndA(), getBeginB() and getEndB(). The begin index is inclusive, the end index is exclusive.

For example, given a file with these two lines of content:

line 1
line 3

Inserting line 2 between the two lines would result in an Edit of type INSERT with A(1-1) and B(1-2). In other words, nothing is removed from the old content (the end index is exclusive, so A(1-1) is an empty region at index 1), and the line at index 1 of the new content is inserted at that position. Deleting line 2 again results in the inverse of the inserting Edit: DELETE with A(1-2) and B(1-1). And changing the text of line 2 will yield an Edit of type REPLACE with the same A and B region 1-2.
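
The region arithmetic can be tried out without a repository. The following sketch in plain Java uses a stand-in class named Region in place of JGit's Edit (Region, EditRegions and apply() are hypothetical names). It assumes zero-based line indices with an exclusive end and regions that are sorted and do not overlap. It classifies a region and applies a list of regions to the old lines in order to rebuild the new content:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for illustration; with JGit you would work with Edit and EditList instead
final class EditRegions {
  enum Type { INSERT, DELETE, REPLACE, EMPTY }

  static final class Region {
    final int beginA;
    final int endA;
    final int beginB;
    final int endB;

    Region( int beginA, int endA, int beginB, int endB ) {
      this.beginA = beginA;
      this.endA = endA;
      this.beginB = beginB;
      this.endB = endB;
    }

    // An empty A region means nothing was removed, an empty B region nothing was added
    Type getType() {
      if( beginA < endA ) {
        return beginB < endB ? Type.REPLACE : Type.DELETE;
      }
      return beginB < endB ? Type.INSERT : Type.EMPTY;
    }
  }

  // Rebuilds the new content: unchanged lines come from oldLines, changed lines from newLines
  static List<String> apply( List<String> oldLines, List<String> newLines, List<Region> regions ) {
    List<String> result = new ArrayList<>();
    int positionA = 0;
    for( Region region : regions ) {
      // copy the unchanged old lines in front of the region
      result.addAll( oldLines.subList( positionA, region.beginA ) );
      // take the lines of the B region from the new content
      result.addAll( newLines.subList( region.beginB, region.endB ) );
      // skip the old lines covered by the A region
      positionA = region.endA;
    }
    result.addAll( oldLines.subList( positionA, oldLines.size() ) );
    return result;
  }

  public static void main( String[] args ) {
    List<String> oldLines = Arrays.asList( "line 1", "line 3" );
    List<String> newLines = Arrays.asList( "line 1", "line 2", "line 3" );
    Region insert = new Region( 1, 1, 1, 2 );
    System.out.println( insert.getType() );
    System.out.println( apply( oldLines, newLines, Arrays.asList( insert ) ) );
  }
}
```

Applying the INSERT region A(1-1), B(1-2) from the example above to the two old lines yields the three new lines, which is exactly the interpretation of an edit list as instructions to transform the old content into the new content.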

Concluding Creating Diffs with JGit

While the DiffCommand is rather straightforward to use, the DiffFormatter has a scary API. But before using JGit in your project, you would certainly isolate yourself from the library anyway, wouldn’t you?!? … and thereby choose a more suitable API.

But apart from that, JGit provides means to accomplish most if not all tasks related to diffs and patches in Git.

The snippets shown throughout the article are excerpts from a collection of learning tests. The full source code can be found here:
https://gist.github.com/rherrmann/5341e735ce197f306949fc58e9aed141

If you would like to experiment with the examples listed here yourself, I recommend setting up JGit with access to the sources and JavaDoc so that you have meaningful context information, content assist, debug sources, etc.

If you have difficulties or further questions, please leave a comment or ask the friendly and helpful JGit community for assistance.

Reference:

What’s the Difference? Creating Diffs with JGit from our JCG partner Rudiger Herrmann at the Code Affine blog.