Author Archives: Michael McCandless

Open-source collaboration, or how we finally added merge-on-refresh to Apache Lucene

The open-source software movement is clearly a powerful phenomenon. A diverse (in time, geography, interests, gender (hmm not really, not yet, hrmph), race, skills, use-cases, age, corporate employer, motivation, IDEs (or, Emacs (with all of its recursive parens)), operating system, …) group of passionate developers works together, using surprisingly primitive digital tooling and asynchronous communication channels, devoid of emotion and ...

Apache Lucene performance on 128-core AMD Ryzen Threadripper 3990X

Almost a decade ago, I started running Lucene’s nightly benchmarks, and have been trying with mixed success to keep them running every night, through the numerous amazing changes relentlessly developed by the passionate Lucene community. The benchmarks run on the tip of Lucene’s mainline branch each night, which is understandably a volatile and high-velocity code base. Sure, Lucene’s wonderful randomized unit ...

Concurrent query execution in Apache Lucene

Apache Lucene is a wonderfully concurrent pure Java search engine, easily able to saturate the available CPU or IO resources on your server, if you ask it to. The concurrency model for a “typical” Lucene application is one thread per query at search time, but did you know Lucene can also execute a single query concurrently using multiple threads to ...

Lucene’s near-real-time segment index replication

[TL;DR: Apache Lucene 6.0 quietly introduced a powerful new feature called near-real-time (NRT) segment replication, for efficiently and reliably replicating indices from one server to another, and taking advantage of ever faster and cheaper local area networking technologies. Neither of the popular search servers (Elasticsearch, Solr) is using it yet, but it should bring a big increase in indexing and ...

Lucene gets concurrent deletes and updates!

Long ago, Lucene could only use a single thread to write new segments to disk. The actual indexing of documents, which is the costly process of inverting incoming documents into in-memory segment data structures, could run with multiple threads, but back then, the process of writing those in-memory indices to Lucene segments was single-threaded. We fixed that, more than 6 ...

Apache Lucene 7.0 Is Coming Soon!

The Apache Lucene project will likely ship its next major release, 7.0, in a few months! Remember that Lucene developers generally try hard to backport new features for the next non-major (feature) release, and the upcoming 6.5 already has many great changes, so a new major release is exciting because it means the 7.0-only features, which I now describe, are the particularly big ...

Jirasearch 2.0 dog food: using Lucene to find our Jira issues

A few years ago I first built and released Jirasearch as a fun dog-food test case for the thin-wrapper Lucene server, to expose a powerful search UI over our Jira issues. This is a great showcase of a number of Lucene’s important features: Using block join queries to model parent (the original Jira issue) and children (each comment) documents. This basic relational ...

Apache Lucene 5.0.0 is coming!

At long last, after a strong series of 4.x feature releases, most recently 4.10.2, we are finally working towards another major Apache Lucene release! There are no promises for the exact timing (it’s done when it’s done!), but we already have a volunteer release manager (thank you Anshum!). A major release in Lucene means all deprecated APIs (as of 4.10.x) ...

A new proximity query for Lucene, using automatons

The simplest Apache Lucene query, TermQuery, matches any document that contains the specified term, regardless of where the term occurs inside each document. Using BooleanQuery you can combine multiple TermQuerys, with full control over which terms are optional (SHOULD) and which are required (MUST) or required not to be present (MUST_NOT), but still the matching ignores the relative positions of ...
