Full text search in MongoDB
This is a custom implementation created by the MongoDB developers as a specific index type, and is due to be launched as an experimental feature in MongoDB 2.4. It has features such as:
- Full text search as an index type when creating new indexes, just like any other.
- Indexing of multiple fields, with weighting to give different fields higher priority.
- Support for Latin-based languages initially, with plans for other character sets later. Initially this will be: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.
- Support for advanced queries, similar to the Google search syntax, e.g. negation and phrase matching.
- Stemming, to deal with plurals.
- Stop words (see the list here).
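To make the stop word and stemming features a bit more concrete, here is a toy sketch in plain JavaScript. It is not how MongoDB implements them: the stop word list is a tiny made-up sample, and `toyStem` and `indexTerms` are hypothetical helpers that only strip simple plural endings.

```javascript
// Toy illustration only: MongoDB's real stop word lists and stemmer are far
// more thorough. STOP_WORDS, toyStem and indexTerms are hypothetical names.
const STOP_WORDS = new Set(["a", "an", "and", "as", "in", "is", "of", "the", "was"]);

// Strip a plural ending so "novels" and "novel" map to the same index term.
function toyStem(word) {
  if (word.length > 4 && word.endsWith("ies")) {
    return word.slice(0, -3) + "y"; // "copies" becomes "copy"
  }
  if (word.length > 3 && word.endsWith("s") && !word.endsWith("ss")) {
    return word.slice(0, -1); // "novelists" becomes "novelist"
  }
  return word;
}

// Lowercase, split on non-letters, drop stop words, then stem what is left.
function indexTerms(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w))
    .map(toyStem);
}

// Both remaining words reduce to the same term, "novelist".
console.log(indexTerms("The novelists and the novelist"));
```

The point is that the index stores the reduced terms rather than the raw words, so a search for a plural can match a document that only contains the singular.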
This looks like a good, general-purpose full text search engine, which fits well with how MongoDB is developing into a good multi-purpose database.
Examples
First we enable full text search in the latest unstable nightly and insert some test documents:
use test
db.adminCommand( { setParameter : "*", textSearchEnabled : true } );
tc = db.test
tc.save( { _id: 1, title: "Olivia Shakespear", text: "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr." } );
tc.save( { _id: 2, title: "Linn-Kristin Riegelhuth Koren", text: "Linn-Kristin Riegelhuth Koren (born 1 August 1984, in Ski) is a Norwegian handballer playing for Larvik HK and the Norwegian national team. She is commonly known as Linka. Outside handball she is a qualified nurse." } );
Then we can create a new index on the title field:
tc.ensureIndex( { "title": "text" } );
and we can now search:
> res = tc.runCommand( "text", { search: "Olivia" } );
{
    "queryDebugString" : "olivia||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 0.75,
            "obj" : {
                "_id" : 1,
                "title" : "Olivia Shakespear",
                "text" : "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr."
            }
        }
    ],
    "stats" : {
        "nscanned" : 1,
        "nscannedObjects" : 0,
        "n" : 1,
        "timeMicros" : 128
    },
    "ok" : 1
}
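The search string can hold more than a single word. As a rough illustration of the Google-like syntax mentioned earlier, here is a small plain JavaScript sketch that splits a search string into plain terms, negated terms and phrases. It assumes the usual convention of a leading `-` for negation and double quotes around phrases; `parseSearch` is a hypothetical helper, not MongoDB's parser.

```javascript
// Rough sketch, not MongoDB's parser: parseSearch is a hypothetical helper
// that splits a search string into phrases, negated terms and plain terms.
function parseSearch(search) {
  const phrases = [];
  // Pull out "quoted phrases" first so their words are not split apart.
  const rest = search.replace(/"([^"]*)"/g, (match, phrase) => {
    if (phrase.trim().length > 0) {
      phrases.push(phrase.trim().toLowerCase());
    }
    return " ";
  });

  const terms = [];
  const negated = [];
  for (const token of rest.split(/\s+/)) {
    if (token.length === 0) continue;
    if (token.startsWith("-") && token.length > 1) {
      negated.push(token.slice(1).toLowerCase()); // -word excludes matches
    } else {
      terms.push(token.toLowerCase());
    }
  }
  return { terms, negated, phrases };
}

// One plain term, one phrase and one negated term.
console.log(parseSearch('novelist "marriage problem" -handballer'));
```

A real engine would also stem the terms and drop stop words before looking anything up in the index. This sketch only shows the splitting step.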
We can then add the text field to the index. Note that a collection can only have one full text index, so I have to drop the original one first, then recreate it as a compound index:
tc.dropIndexes()
tc.ensureIndex( { "title": "text", "text": "text" } );
and test stemming:
> res = tc.runCommand( "text", { search: "novelists" } );
{
    "queryDebugString" : "novelist||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 0.5116279069767442,
            "obj" : {
                "_id" : 1,
                "title" : "Olivia Shakespear",
                "text" : "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr."
            }
        }
    ],
    "stats" : {
        "nscanned" : 1,
        "nscannedObjects" : 0,
        "n" : 1,
        "timeMicros" : 90
    },
    "ok" : 1
}
We can inspect the index we created. The output also shows the default language and the name of the document field that can override it:
> tc.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.test",
        "name" : "_id_"
    },
    {
        "v" : 0,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "ns" : "test.test",
        "name" : "title_text_text_text",
        "weights" : {
            "text" : 1,
            "title" : 1
        },
        "default_language" : "english",
        "language_override" : "language"
    }
]
You can specify the weights and default_language options when creating the index, e.g.:
tc.ensureIndex( { "title": "text", "text": "text" }, { weights: { title: 10 }, default_language: "norwegian" } );
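To give a feel for what a weight of 10 on the title means for ranking, here is a toy scoring sketch in plain JavaScript. It is not MongoDB's scoring formula, just the simplest thing that shows the idea: every occurrence of the term in a field adds that field's weight to the score. `toyScore` and the two sample documents are made up for illustration.

```javascript
// Toy scoring, not MongoDB's actual formula: each occurrence of the term in
// a field adds that field's weight to the document's score.
function toyScore(doc, term, weights) {
  let score = 0;
  for (const field of Object.keys(weights)) {
    const words = String(doc[field] || "").toLowerCase().split(/[^a-z]+/);
    const hits = words.filter((w) => w === term.toLowerCase()).length;
    score += hits * weights[field];
  }
  return score;
}

// Title counts ten times as much as the body text.
const weights = { title: 10, text: 1 };

// Made-up sample documents, only for this illustration.
const docs = [
  { _id: 1, title: "Olivia Shakespear", text: "A British novelist and playwright." },
  { _id: 2, title: "British handball", text: "A Norwegian handballer." },
];

// Rank documents for the term "british": a title hit outweighs a body hit.
const ranked = docs
  .map((doc) => ({ _id: doc._id, score: toyScore(doc, "british", weights) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked);
```

Document 2 comes first because its match is in the title, even though both documents contain the word exactly once.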
And that’s basically it (from what I can see from the tests). Nice and simple.