The structure of Apache Lucene

Edmund KirwanMay 27th, 2015Last Updated: May 25th, 2015

0 65 9 minutes read

The inestimably noble Apache Software Foundation produces many of the blockbuster products (Ant, CouchDB, Hadoop, JMeter, Maven, OpenOffice, Subversion, etc.) that help build our digital universe. One perhaps less well-known gem is Lucene, which, ” … provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.” Despite its shying from headlines, Lucene forms a quiet but integral component of many Apache (and third-party) projects.

Let us take a look at the structure the underlies this wonderful and highly successful product.

Before we begin, the usual four caveats.

Being a syntactic structural analysis, this review cares little for either program semantics or, no matter the exquisiteness, delivered user-experience.
Structure merits investigation in and of itself because it governs the predictability of the potential cost of change. Poorly-structured systems exhibit excessive interconnectedness in which ripple effects drastically blunt the precision of change-cost estimations. Well-structured systems may not necessarily be cheaper to maintain and upgrade but they generally spring fewer nasty surprises.
This analysis depicts a package-structure as a spoiklin diagram in which a circle represents a package, a straight line represents a dependency from a package drawn above to one drawn below and a curved line represents a dependency from a package drawn below to one drawn above. The colour of a package indicates the relative number of transitive package dependencies of which it partakes: the redder, the more transitive dependencies.
No diagram can prove structural value or cost. High-level analyses only ever prompt questions whose answers lie buried deep under geological layers of code.

And so, to business …

The rise …

Figure 1: Package structure of Lucene version 1.4.3.

Figure 1 shows one of the earliest version of Lucene still archived, version 1.4.3. Recall that a simple test of structure suggests the selection of a package at random and asks, “If this package changes, which other packages will it most likely impact?”

Take index for example. Clearly both queryParser and spans depend on it and hence might be impacted by any change to index, and that curved line shows that search depends on it too. This ease of dependency identification characterizes the entire figure, making this a well-structured design.

Bravo, Lucene, you’re off to a good start.

Figure 2: Package structure of Lucene version 2.0.

Figure 2 shows version 2.0 (note that we shall not investigate every release, but evenly spaced milestones along the entire release path), and the simplicity of interconnectedness continues. Despite the number of methods rising from version 1.4.3’s 1,637 to version 2.0’s 2,085, the number of packages has fallen from 11 to 10. This has prompted a slight fall in potent coupling efficiency – from 41% to 37% – but nonetheless good design principles clearly master this system.

Figure 3: Package structure of Lucene version 2.4.

Presented above in figure 3, version 2.4 – although far from an obviously bad structure – shows the first signs of distress.

True, many of the packages stand in clear relationship to their neighbours; but now some do not. In particular, search and index seem to have become embroiled in one another’s affairs.

This mild degradation of structure, however, belies the tumultuous changes that have taken place behind the scenes. Where version 2.0 had 2,085 methods, version 2.4 has more than doubled in size to 4,176 methods. And where version 2.0 had just 9,767 transitive dependencies, version 2.4 sags beneath a burdensome 48,370 transitive dependencies. Some structural crack has opened deep down on method-level to trigger this five-fold increase in dependencies, a crack which Lucene’s programmers never detect or seal, and which plagues later revisions, as we shall see.

Not only has the number of dependencies dramatically increased, but the depth of the program – the average length of its transitive dependencies – has increased, too, jumping from version 2.0’s 7 to version 2.4’s 8.6, not only laying more tracks over which ripple effects may trundle, but extending those tracks to shunt spurious impacts further afield.

Still, this structure presents no unsolvable problems. Focused design could reinstate the simplicity enjoyed by the earlier versions.

Figure 4: Package structure of Lucene version 3.0.

Alas, version 3.0 – shown above in figure 4 – seems to continue, ever so slightly, the downward trend. Again, figure 4 does not present an irredeemable structure: we can tease apart the packages to see how most connect with one another. The task, however, has become harder.

Both analysis and spans have been sucked into the tangle sparked by search and index. Predicting the impact of changing any of these four packages would now seem to require an automatic investigation all others.

Contributing to this increase in interconnectedness is the addition of 800 methods to this revision; and even though the number of transitive dependencies has admirably fallen to 46,917, nevertheless the average length has again risen, this time to 9.3.

Is the system’s structure beyond hope? Not at all: many of the packages enjoy clear dependency relations with their colleagues. Just around the corner, however, lies version 3.5 and a surge of transitive dependencies which, though not immediately fatal, proves a disease resistant to all medicines.

And the fall …

Figure 5: Package structure of Lucene version 3.5.

On positive note, version 3.5, shown in figure 5 above, introduces an extra three packages – bringing the total to 18 – in an attempt to distribute and separate the system’s functionality. The generous might also offer that, although the package structure has clearly decayed again from the previous revision’s, that decay remains somewhat localized: bad-boys analysis, spans, search and index continue to terrorize the rest of Lucene-town’s largely well-behaved population.

But the generosity ends there.

For despite adding only another 1,800 methods, the number of revision 3.5’s transitive dependencies has soared to 109,357, and the average length of those dependencies hits 11 methods long, a sad maximum for the entire evolution. Given this phenomenal rise in structural complexity, we wonder how the package design seems as good as it does – and indeed any such harmony proves short-lived, as the strain finally destroys all semblance of control in the next revision milestone.

Figure 5: Package structure of Lucene version 4.0.

Revision 4.0, shown in figure 5, adds 1,600 methods to the previous revision, bringing the total to 8,474 and raising the number of transitive dependencies relatively modestly to 116,211, but as can be seen from the figure, something terrible has happened.

The burgeoning interconnectedness of the previous revisions has suddenly systematized, causing the structure to implode into the dreaded ball of tangled dependencies that makes code-impact prediction wildly unreliable.

True, this revision adds another two packages – raising the potential coupling efficiency to 43% – and reduces (slightly) transitive dependency length to 10.4, but the sheer effort of controlling this vast number of transitive dependencies has simply broken the system. It will not recover.

Figure 6: Package structure of Lucene version 4.5.

In revision 4.5, shown in figure 6, some heroic action has reduced the number of transitive dependencies to 106,242 while still raising the number of methods to 9,562, and perhaps some packages have managed to distance themselves from the ravenous black hole spinning manically at the system’s core. But the work is too little, too late.

Figure 7: Package structure of Lucene version 5.0.

Revision 5.0, shown in figure 7, attempts to tame the beast by removing 200 methods, yet this curiously results in again raising the number of transitive dependencies to 113,556.

Does revision 5.0 look as bad as revision 4.5? Well, perhaps not. Something looks a little cleaner. We should not, however, allow this to blind us to the grand dis-structure on display in figure 7: this system weeps in pain. Predicting the costs of changing any of those central packages has become foolhardy.

Why?

To understand what happened to destroy this system’s initial structural integrity, we must examine revision 3.5. Again, this may not look like the worst structure, but this revision heralded the changes that lead to eventual ruin.

The main change was not just one of size: bigger systems need not necessarily fall to poor structure. Revision 3.5 increased the number of methods by 35% – but revision 2.4 increased the number of methods by more than 100% without wrecking the overall organization.

Instead, the primary culprits were the number of transitive dependencies and their distribution across the system.

The sheer number of new transitive dependencies introduced in revision 3.5 is astounding, rising from 46,917 to 109,357. This brought the dependency-to-method ratio to an artery-hardening 16.

Figure 8: Comparing Lucene's transitive-dependencies-per-method ratio. — Figure 8: Comparing Lucene’s transitive-dependencies-per-method ratio.

The dependency-to-method ratio had already been too high. In previous revisions, however, these transitive dependencies largely confined themselves to just one or two packages. In revision 3.0, 95% of all transitive method dependencies either terminated in their originating package or in a package just one dependency away. This gave hope that changes might in some sense limit themselves to a region close to the origin point, leaving few changes to spill out all over the system and to defy cost prediction.

Revision 3.5, however, saw that figure plummet to just 75%. This means that 25% of all revision 3.5’s transitive dependencies spill into three or more packages. Combining both these factors reveals that more than 33,000 dependencies lie in wait to catapult changes far from their origins. More than anything else, this dooms the product to further structural decay.

Figure 9: Percentage of Lucene transitive dependencies spanning fewer than 3 packages.

This, then, concludes the examination of the Lucene’s package-level structure. Should we delve below package level? Should we comb through individual packages to examine various class constellations? No. According to the Blighttown corollary, if the package-level structure is bad, we should not hope to find diamonds below. So we won’t.

Final score

Let us attempt an objective scoring of Lucene’s structure (its final revision examined here, 5.0).

We shall use the average of four factors. The first measures Lucene’s attempt to limit the number of dependencies that are possible to form. The second and third attempt to capture transitive dependency length, and the fourth attempts to capture the number of transitive dependencies. Of course, large systems will always have, say, more dependencies than small systems, so we cannot say that System A is more well-structured than System B simply because it has fewer dependencies. Instead, we must derive measurements that can be fairly compared by either normalizing for size or making the measurements in some sense self-referential.

First, we shall measure its absolute ideal efficiency: this analyzes the structure’s potential coupling, and basically asks how many methods are encapsulated away from other methods, and thus how many dependencies could conceivably be created. If every method were put in one class, then every method would be visible to every other, and so the efficiency would be 0%. The value rises the more methods are made private and put in separate package-private classes, thus increasingly encapsulating methods from one another.

Lucene scores 44%, indicating that it has at least attempted to encapsulate its functionality, but much more could be done.

Second, we shall measure the length of Lucene’s transitive dependencies in a form which allows fair comparisons between programs. For this, we shall use a CDF graph showing how long Lucene’s transitive method dependencies are as a percentage of its longest transitive dependency.

Figure 10: Lucene's transitive dependency CDF. — Figure 10: Lucene’s transitive dependency CDF.

In figure 10 above, we see that half of Lucene’s transitive dependencies are shorter than 45% of the length of its longest transitive dependency. This is bad. A system’s resistance to ripple effects relies on most of its dependencies being short; half of JUnit‘s transitive dependencies, for example, are only 30% of the length of its longest dependency.

As we require a figure that rises with improved structure, we shall use 100 minus this figure, so Lucene will score 100 – 45 = 55, a value which should be closer to 70.

The third factor we shall use has already been discussed: the percentage of methods that span two packages or fewer, a figure found to be 75.5%. This sounds high, but with modern structuring techniques there is little reason for this value to be less than 90%.

Finally, we need a factor that measures how many dependencies wriggle through a system, as the fewer the number of dependencies, the better. To normalize for size, we would like to measure the number of method-dependencies per method. Here we must unfortunately estimate an industry lowest-possible score. Some research suggests that 25 seems an appropriate figure: if system contains more than 25 dependencies per method then that system’s structure is so bad that all other all other metrics lose their importance.

We saw earlier that Lucene has a huge 12 dependencies per method; so the figure we shall use is 25-12 = 13, expressed as a percentage of 25, giving 52%. As figure 8 presented, other systems reach as low as 6 dependencies per method, a figure which yields over 70% for this metric.

This gives Lucene a final score of 226.5/400 points, or 57%. With firm structural principles, modern programs easily score above 80%, so this is a poor score indicating, alas, a poor structure. Lucene finds itself second to last on the leader-board of systems analyzed so far in this series.

Factor	Score
Absolute potential couple efficiency %	44
100 – (% length of longest dependency that half the system is shorter than)	55
% Method transitive dependencies spanning 2 packages or fewer	75.5
((25 – (number of transitive method dependencies per method) / 25) as % of 25	52
Average	57%

Table 1: Lucene 5.0’s structural evaluation.