Full Text Search of Dialogues with Apache Lucene: A Tutorial
Apache Lucene is a Java library used for the full text search of documents, and is at the core of search servers such as Solr and Elasticsearch. It can also be embedded into Java applications, such as Android apps or web backends.
While Lucene’s configuration options are extensive, they are intended for use by generic developers on a generic corpus of text. If your documents have a particular structure or type of content, you can take advantage of that to improve both search quality and query capability.
As an example of this sort of customization, in this Lucene tutorial we will index the corpus of Project Gutenberg, which offers thousands of free e-books. We know that many of these books are novels. Suppose we are especially interested in the dialogue within these novels. Neither Lucene, Elasticsearch, nor Solr provides out-of-the-box tools to identify content as dialogue. In fact, they throw away punctuation at the earliest stages of text analysis, which runs counter to identifying portions of the text that are dialogue. It is therefore in these early stages that our customization must begin.
Pieces of the Apache Lucene Analysis Pipeline
The Lucene analysis JavaDoc provides a good overview of all the moving parts in the text analysis pipeline.
At a high level, you can think of the analysis pipeline as consuming a raw stream of characters at the start and producing “terms”, roughly corresponding to words, at the end.
The standard analysis pipeline can be visualized as such:
We will see how to customize this pipeline to recognize regions of text marked by double-quotes, which I will call dialogue, and then bump up matches that occur when searching in those regions.
Reading Characters
When documents are initially added to the index, the characters are read from a Java InputStream, and so they can come from files, databases, web service calls, etc. To create an index for Project Gutenberg, we download the e-books, and create a small application to read these files and write them to the index. Creating a Lucene index and reading files are well travelled paths, so we won’t explore them much. The essential code for producing an index is:
IndexWriter writer = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(... fileInputStream ...));
Document document = new Document();
document.add(new StringField("title", fileName, Store.YES));
document.add(new TextField("body", reader));
writer.addDocument(document);
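If you want to follow along, the elided IndexWriter can be created in the usual way. Here is a minimal sketch, assuming an on-disk index directory named “lucene-index” (the path and the stock StandardAnalyzer are placeholders; we will swap in our own analyzer later):

// FSDirectory, IndexWriterConfig and IndexWriter come from the org.apache.lucene packages.
Directory directory = FSDirectory.open(Paths.get("lucene-index")); // index stored on disk
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer()); // placeholder analyzer for now
IndexWriter writer = new IndexWriter(directory, config);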
We can see that each e-book will correspond to a single Lucene Document, so, later on, our search results will be a list of matching books. Store.YES indicates that we store the title field, which is just the filename. We don’t want to store the body of the e-book, however, as it is not needed when searching and would only waste disk space.
The actual reading of the stream begins with addDocument. The IndexWriter pulls tokens from the end of the pipeline. This pull proceeds back through the pipe until the first stage, the Tokenizer, reads from the InputStream.
Also note that we don’t close the stream, as Lucene handles this for us.
Tokenizing Characters
The Lucene StandardTokenizer throws away punctuation, and so our customization will begin here, as we need to preserve quotes.
The documentation for StandardTokenizer invites you to copy the source code and tailor it to your needs, but this solution would be unnecessarily complex. Instead, we will extend CharTokenizer, which allows you to specify characters to “accept”, where those that are not “accepted” will be treated as delimiters between tokens and thrown away. Since we are interested in words and the quotations around them, our custom Tokenizer is simply:
public class QuotationTokenizer extends CharTokenizer {
@Override
protected boolean isTokenChar(int c) {
return Character.isLetter(c) || c == '"';
}
}
Given an input stream of [He said, "Good day".], the tokens produced would be [He], [said], ["Good], [day"].
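If you want to see these tokens for yourself, a few throwaway lines will print them (a sketch for experimentation only; it is not part of the indexing pipeline):

QuotationTokenizer tokenizer = new QuotationTokenizer();
tokenizer.setReader(new StringReader("He said, \"Good day\"."));
CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();
while (tokenizer.incrementToken()) {
    System.out.println("[" + term + "]"); // prints each token on its own line
}
tokenizer.end();
tokenizer.close();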
Note how the quotes are interspersed within the tokens. It is possible to write a Tokenizer that produces separate tokens for each quote, but Tokenizer is also concerned with fiddly, easy-to-screw-up details such as buffering and scanning, so it is best to keep your Tokenizer simple and clean up the token stream further along in the pipeline.
Splitting Tokens using Filters
After the tokenizer comes a series of TokenFilter objects. Note, incidentally, that filter is a bit of a misnomer, as a TokenFilter can add, remove, or modify tokens.

Many of the filter classes provided by Lucene expect single words, so it won’t do to have our mixed word-and-quote tokens flow into them. Thus, our Lucene tutorial’s next customization must be the introduction of a filter that will clean up the output of QuotationTokenizer.
This cleanup will involve the production of an extra start quote token if the quote appears at the beginning of a word, or an end quote token if the quote appears at the end. We will put aside the handling of single quoted words for simplicity.
Creating a TokenFilter subclass involves implementing one method: incrementToken. This method must call incrementToken on the previous filter in the pipe, and then manipulate the results of that call to perform whatever work the filter is responsible for. The results of incrementToken are available via Attribute objects, which describe the current state of token processing. After our implementation of incrementToken returns, it is expected that the attributes have been manipulated to set up the token for the next filter (or the index if we are at the end of the pipe).
The attributes we are interested in at this point in the pipeline are:
- CharTermAttribute: Contains a char[] buffer holding the characters of the current token. We will need to manipulate this to remove the quote, or to produce a quote token.
- TypeAttribute: Contains the “type” of the current token. Because we are adding start and end quotes to the token stream, we will introduce two new types using our filter.
- OffsetAttribute: Lucene can optionally store references to the location of terms in the original document. These references are called “offsets”, which are just start and end indices into the original character stream. If we change the buffer in CharTermAttribute to point to just a substring of the token, we must adjust these offsets accordingly.
You may be wondering why the API for manipulating token streams is so convoluted and, in particular, why we can’t just do something like String#split on the incoming tokens. This is because Lucene is designed for high-speed, low-overhead indexing, whereby the built-in tokenizers and filters can quickly chew through gigabytes of text while using only megabytes of memory. To achieve this, few or no allocations are done during tokenization and filtering, and so the Attribute instances mentioned above are intended to be allocated once and reused. If your tokenizers and filters are written in this way, and minimize their own allocations, you can customize Lucene without compromising performance.
With all that in mind, let’s see how to implement a filter that takes a token such as ["Hello], and produces the two tokens, ["] and [Hello]:
public class QuotationTokenFilter extends TokenFilter {
private static final char QUOTE = '"';
public static final String QUOTE_START_TYPE = "start_quote";
public static final String QUOTE_END_TYPE = "end_quote";
private final OffsetAttribute offsetAttr = addAttribute(OffsetAttribute.class);
private final TypeAttribute typeAttr = addAttribute(TypeAttribute.class);
private final CharTermAttribute termBufferAttr = addAttribute(CharTermAttribute.class);
We start by obtaining references to some of the attributes that we saw earlier. We suffix the field names with “Attr” so it will be clear later when we refer to them. It is possible that some Tokenizer implementations do not provide these attributes, so we use addAttribute to get our references. addAttribute will create an attribute instance if it is missing, or otherwise grab a shared reference to the existing attribute of that type. Note that Lucene does not allow multiple instances of the same attribute type at once.
private boolean emitExtraToken;
private int extraTokenStartOffset, extraTokenEndOffset;
private String extraTokenType;
Because our filter will introduce a new token that was not present in the original stream, we need a place to save the state of that token between calls to incrementToken. Since we’re splitting an existing token into two, it is enough to know just the offsets and type of the new token. We also have a flag that tells us whether the next call to incrementToken will be emitting this extra token. Lucene actually provides a pair of methods, captureState and restoreState, which will do this for you. But these methods involve the allocation of a State object, and can actually be trickier than simply managing that state yourself, so we’ll avoid using them.
@Override
public void reset() throws IOException {
emitExtraToken = false;
extraTokenStartOffset = -1;
extraTokenEndOffset = -1;
extraTokenType = null;
super.reset();
}
As part of its aggressive avoidance of allocation, Lucene can reuse filter instances. In this situation, it is expected that a call to reset will put the filter back into its initial state. So here, we simply reset our extra token fields.
@Override
public boolean incrementToken() throws IOException {
if (emitExtraToken) {
advanceToExtraToken();
emitExtraToken = false;
return true;
}
...
Now we’re getting to the interesting bits. When our implementation of incrementToken is called, we have an opportunity to not call incrementToken on the earlier stage of the pipeline. By doing so, we effectively introduce a new token, because we aren’t pulling a token from the Tokenizer.

Instead, we call advanceToExtraToken to set up the attributes for our extra token, set emitExtraToken to false to avoid this branch on the next call, and then return true, which indicates that another token is available.
@Override
public boolean incrementToken() throws IOException {
... (emit extra token) ...
boolean hasNext = input.incrementToken();
if (hasNext) {
char[] buffer = termBufferAttr.buffer();
if (termBufferAttr.length() > 1) {
if (buffer[0] == QUOTE) {
splitTermQuoteFirst();
} else if (buffer[termBufferAttr.length() - 1] == QUOTE) {
splitTermWordFirst();
}
} else if (termBufferAttr.length() == 1) {
if (buffer[0] == QUOTE) {
typeAttr.setType(QUOTE_END_TYPE);
}
}
}
return hasNext;
}
The remainder of incrementToken will do one of three different things. Recall that termBufferAttr is used to inspect the contents of the token coming through the pipe:
- If we’ve reached the end of the token stream (i.e. hasNext is false), we’re done and simply return.
- If we have a token of more than one character, and one of those characters is a quote, we split the token.
- If the token is a solitary quote, we assume it is an end quote. To understand why, note that starting quotes always appear to the left of a word (i.e., with no intermediate punctuation), whereas ending quotes can follow punctuation (such as in the sentence, [He told us to "go back the way we came."]). In these cases, the ending quote will already be a separate token, and so we need only to set its type.
splitTermQuoteFirst and splitTermWordFirst will set attributes to make the current token either a word or a quote, and set up the “extra” fields to allow the other half to be consumed later. The two methods are similar, so we’ll look at just splitTermQuoteFirst:
private void splitTermQuoteFirst() {
int origStart = offsetAttr.startOffset();
int origEnd = offsetAttr.endOffset();
offsetAttr.setOffset(origStart, origStart + 1);
typeAttr.setType(QUOTE_START_TYPE);
termBufferAttr.setLength(1);
prepareExtraTerm(origStart + 1, origEnd, TypeAttribute.DEFAULT_TYPE);
}
Because we want to split this token with the quote appearing in the stream first, we truncate the buffer by setting the length to one (i.e., one character; namely, the quote). We adjust the offsets accordingly (i.e. pointing to the quote in the original document) and also set the type to be a starting quote.
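For reference, the mirror method might look roughly like this (a sketch of splitTermWordFirst; the exact version is in the repository linked below):

private void splitTermWordFirst() {
    int origStart = offsetAttr.startOffset();
    int origEnd = offsetAttr.endOffset();
    // Keep the word as the current token by dropping the trailing quote...
    offsetAttr.setOffset(origStart, origEnd - 1);
    termBufferAttr.setLength(termBufferAttr.length() - 1);
    // ...and queue the trailing quote up as the extra token.
    prepareExtraTerm(origEnd - 1, origEnd, QUOTE_END_TYPE);
}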
prepareExtraTerm will set the extra* fields and set emitExtraToken to true. It is called with offsets pointing at the “extra” token (i.e., the word following the quote).
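The remaining helpers are small. Given the fields above, they might look roughly like this (a sketch; it assumes the characters cut off by setLength are still sitting in the term buffer, which holds here because setLength does not clear them and no intervening filter replaces the buffer):

private void prepareExtraTerm(int startOffset, int endOffset, String type) {
    extraTokenStartOffset = startOffset;
    extraTokenEndOffset = endOffset;
    extraTokenType = type;
    emitExtraToken = true;
}

private void advanceToExtraToken() {
    if (QUOTE_END_TYPE.equals(extraTokenType)) {
        // The extra token is the trailing quote itself.
        termBufferAttr.setEmpty().append(QUOTE);
    } else {
        // The extra token is the word that followed a starting quote; its
        // characters are still in the buffer, one position to the right.
        int length = extraTokenEndOffset - extraTokenStartOffset;
        char[] buffer = termBufferAttr.buffer();
        System.arraycopy(buffer, 1, buffer, 0, length);
        termBufferAttr.setLength(length);
    }
    offsetAttr.setOffset(extraTokenStartOffset, extraTokenEndOffset);
    typeAttr.setType(extraTokenType);
}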
The entirety of QuotationTokenFilter is available on GitHub.
As an aside, while this filter only produces one extra token, this approach can be extended to introduce an arbitrary number of extra tokens. Just replace the extra* fields with a collection or, better yet, a fixed-length array if there is a limit on the number of extra tokens that can be produced. See SynonymFilter and its PendingInput inner class for an example of this.
Consuming Quote Tokens and Marking Dialogue
Now that we’ve gone to all that effort to add those quotes to the token stream, we can use them to delimit sections of dialogue in the text.
Since our end goal is to adjust search results based on whether terms are part of dialogue or not, we need to attach metadata to those terms. Lucene provides PayloadAttribute for this purpose. Payloads are byte arrays that are stored alongside terms in the index, and can be read later during a search. This means that our flag will wastefully occupy an entire byte; additional flags could be packed into that byte as bit flags to save space.
Below is a new filter, DialoguePayloadTokenFilter, which is added to the very end of the analysis pipeline. It attaches the payload indicating whether or not the token is part of dialogue.
public class DialoguePayloadTokenFilter extends TokenFilter {
private final TypeAttribute typeAttr = getAttribute(TypeAttribute.class);
private final PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
private static final BytesRef PAYLOAD_DIALOGUE = new BytesRef(new byte[] { 1 });
private static final BytesRef PAYLOAD_NOT_DIALOGUE = new BytesRef(new byte[] { 0 });
private boolean withinDialogue;
protected DialoguePayloadTokenFilter(TokenStream input) {
super(input);
}
@Override
public void reset() throws IOException {
this.withinDialogue = false;
super.reset();
}
@Override
public boolean incrementToken() throws IOException {
boolean hasNext = input.incrementToken();
while(hasNext) {
boolean isStartQuote = QuotationTokenFilter
.QUOTE_START_TYPE.equals(typeAttr.type());
boolean isEndQuote = QuotationTokenFilter
.QUOTE_END_TYPE.equals(typeAttr.type());
if (isStartQuote) {
withinDialogue = true;
hasNext = input.incrementToken();
} else if (isEndQuote) {
withinDialogue = false;
hasNext = input.incrementToken();
} else {
break;
}
}
if (hasNext) {
payloadAttr.setPayload(withinDialogue ?
PAYLOAD_DIALOGUE : PAYLOAD_NOT_DIALOGUE);
}
return hasNext;
}
}
Since this filter only needs to maintain a single piece of state, withinDialogue, it is much simpler. A start quote indicates that we are now within a section of dialogue, while an end quote indicates that the section of dialogue has ended. In either case, the quote token is discarded by making a second call to incrementToken, so in effect, start quote or end quote tokens never flow past this stage in the pipeline.
For example, DialoguePayloadTokenFilter will transform the token stream:
[the], [program], [printed], ["], [hello], [world], ["]
into this new stream:
[the][0], [program][0], [printed][0], [hello][1], [world][1]
Tying Tokenizers and Filters Together
An Analyzer is responsible for assembling the analysis pipeline, typically by combining a Tokenizer with a series of TokenFilters. Analyzers can also define how that pipeline is reused between analyses. We don’t need to worry about that as our components don’t require anything except a call to reset() between uses, which Lucene will always do. We just need to do the assembly by implementing Analyzer#createComponents(String):
public class DialogueAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
QuotationTokenizer tokenizer = new QuotationTokenizer();
TokenFilter filter = new QuotationTokenFilter(tokenizer);
filter = new LowerCaseFilter(filter);
filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
filter = new DialoguePayloadTokenFilter(filter);
return new TokenStreamComponents(tokenizer, filter);
}
}
As we saw earlier, filters contain a reference back to the previous stage in the pipeline, so that is how we instantiate them. We also slide in a few filters from StandardAnalyzer: LowerCaseFilter and StopFilter. These two must come after QuotationTokenFilter to ensure that any quotes have been separated. We can be more flexible in our placement of DialoguePayloadTokenFilter, since anywhere after QuotationTokenFilter will do. We put it after StopFilter to avoid wasting time injecting the dialogue payload into stop words that will ultimately be removed.
Here is a visualization of our new pipeline in action (minus those parts of the standard pipeline that we have removed or already seen):
DialogueAnalyzer can now be used as any other stock Analyzer would be, and now we can build the index and move on to search.
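Before indexing a whole corpus, it is worth sanity-checking the analyzer on a small string. A quick throwaway test might look like this (the field name and sample text are arbitrary):

try (Analyzer analyzer = new DialogueAnalyzer();
     TokenStream stream = analyzer.tokenStream("body", "He said, \"Good day\".")) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    PayloadAttribute payload = stream.addAttribute(PayloadAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        BytesRef p = payload.getPayload();
        boolean dialogue = p != null && p.bytes[p.offset] == 1;
        System.out.println(term + " -> " + (dialogue ? "dialogue" : "not dialogue"));
    }
    stream.end();
}

Once we are satisfied with its output, DialogueAnalyzer is what we pass to IndexWriterConfig in place of the placeholder analyzer shown at the beginning.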