Full text search in MongoDB

This is a custom implementation created by the MongoDB developers as a specific index type, and is due to be launched as an experimental feature in MongoDB 2.4. It has features such as:

  • Full text search as an index type when creating new indexes, just like any other.
  • Indexing of multiple fields, with weighting to give different fields higher priority.
  • Support for Latin-based languages initially, with plans for other character sets later. The initial list is: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish.
  • Support for advanced queries, similar to the Google search syntax, e.g. negation and phrase matching.
  • Stemming, to deal with plurals.
  • Stop words (see the list here).
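
To make the query syntax point more concrete, here is a rough sketch of how a search string with negation and a phrase could be split up. It is plain JavaScript, not MongoDB's own parser: parseSearch is a hypothetical helper, and it assumes that a leading - negates a term and that double quotes delimit a phrase. It does not handle negated phrases.

```javascript
// Hypothetical helper, for illustration only (not part of MongoDB).
// Assumes "-word" negates a term and "double quotes" mark a phrase.
function parseSearch(search) {
  const parsed = { terms: [], negated: [], phrases: [] };
  // Match either a quoted phrase or a run of non-space characters.
  const tokenPattern = /"([^"]+)"|(\S+)/g;
  let match;
  while ((match = tokenPattern.exec(search)) !== null) {
    const [, phrase, word] = match;
    if (phrase !== undefined) {
      parsed.phrases.push(phrase.toLowerCase());
    } else if (word.startsWith("-") && word.length > 1) {
      parsed.negated.push(word.slice(1).toLowerCase());
    } else {
      parsed.terms.push(word.toLowerCase());
    }
  }
  return parsed;
}

console.log(parseSearch('"marriage problem" -plays novelist'));
```

The real engine presumably also runs the terms through stop word removal and stemming before hitting the index.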

This looks like a good, general-purpose full text search engine, which fits well with how MongoDB is developing into a multi-purpose database.

Examples

First we enable full text search in the latest unstable nightly and insert some test documents:

use test

db.adminCommand( { setParameter : "*", textSearchEnabled : true } );

tc = db.test

tc.save( { _id: 1, title: "Olivia Shakespear", text: "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr." } );

tc.save( { _id: 2, title: "Linn-Kristin Riegelhuth Koren", text: "Linn-Kristin Riegelhuth Koren (born 1 August 1984, in Ski) is a Norwegian handballer playing for Larvik HK and the Norwegian national team. She is commonly known as Linka. Outside handball she is a qualified nurse." } );

Then we can create a new index on the title field:

tc.ensureIndex( { "title": "text" } );

and we can now search:

> res = tc.runCommand( "text", { search: "Olivia" } );
{
    "queryDebugString" : "olivia||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 0.75,
            "obj" : {
                "_id" : 1,
                "title" : "Olivia Shakespear",
                "text" : "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr."
            }
        }
    ],
    "stats" : {
        "nscanned" : 1,
        "nscannedObjects" : 0,
        "n" : 1,
        "timeMicros" : 128
    },
    "ok" : 1
}

We can then add the text field to the index. Note that a collection can only have one full text index, so I have to drop the original one first, then recreate it as a compound index:

tc.dropIndexes()
tc.ensureIndex( { "title": "text", "text": "text" } );

and test stemming:

> res = tc.runCommand( "text", { search: "novelists" } );
{
    "queryDebugString" : "novelist||||||",
    "language" : "english",
    "results" : [
        {
            "score" : 0.5116279069767442,
            "obj" : {
                "_id" : 1,
                "title" : "Olivia Shakespear",
                "text" : "Olivia Shakespear (born Olivia Tucker; 17 March 1863 – 3 October 1938) was a British novelist, playwright, and patron of the arts. She wrote six books that are described as \"marriage problem\" novels. Her works sold poorly, sometimes only a few hundred copies. Her last novel, Uncle Hilary, is considered her best. She wrote two plays in collaboration with Florence Farr."
            }
        }
    ],
    "stats" : {
        "nscanned" : 1,
        "nscannedObjects" : 0,
        "n" : 1,
        "timeMicros" : 90
    },
    "ok" : 1
}
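
The queryDebugString shows the search for "novelists" being reduced to "novelist", which is why it matches the document. A toy version of that normalisation step might look like the sketch below. This is plain JavaScript for illustration only: STOP_WORDS, toyStem and normalize are hypothetical names, the stop word list is a tiny made-up sample, and the plural stripping is a crude stand-in for a real stemming algorithm.

```javascript
// Tiny illustrative stop word list (an assumption; the real lists are per language and much longer).
const STOP_WORDS = new Set(["a", "an", "and", "the", "of", "was", "is"]);

// Naive plural stripping, standing in for a real stemmer.
function toyStem(word) {
  if (word.length > 3 && word.endsWith("s") && !word.endsWith("ss")) {
    return word.slice(0, -1);
  }
  return word;
}

// Lowercase, split on non-alphanumerics, drop stop words, then stem.
function normalize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0 && !STOP_WORDS.has(word))
    .map(toyStem);
}

console.log(normalize("novelists"));
console.log(normalize("was a British novelist, playwright"));
```

Because both the indexed text and the query go through the same normalisation, the plural in the query and the singular in the document end up as the same token.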

 

 

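We can list the index we created. The output shows its weights and default language, plus the language_override setting, which names the document field used to override the language for an individual document: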

> tc.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.test",
        "name" : "_id_"
    },
    {
        "v" : 0,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "ns" : "test.test",
        "name" : "title_text_text_text",
        "weights" : {
            "text" : 1,
            "title" : 1
        },
        "default_language" : "english",
        "language_override" : "language"
    }
]
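
The default_language and language_override values in that listing can be read as a simple lookup: use the document's own language field if it has one, otherwise fall back to the index default. The sketch below is my reading of that behaviour in plain JavaScript, not MongoDB code, and languageFor is a hypothetical helper.

```javascript
// Hypothetical helper: pick the language used to index a document.
// Assumes the override only applies when the named field holds a string.
function languageFor(doc, indexSpec) {
  const override = doc[indexSpec.language_override];
  return typeof override === "string" ? override : indexSpec.default_language;
}

// Index options copied from the getIndexes() output.
const indexSpec = { default_language: "english", language_override: "language" };

console.log(languageFor({ _id: 1, title: "Olivia Shakespear" }, indexSpec));
console.log(languageFor({ _id: 2, title: "Linn-Kristin Riegelhuth Koren", language: "norwegian" }, indexSpec));
```

So a Norwegian document stored with a language field of "norwegian" would be stemmed with the Norwegian rules even though the index default is English.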

You can specify the weights and default_language options when creating the index, e.g.

tc.ensureIndex( { "title": "text", "text": "text" }, { weights: { title: 10 }, default_language: "norwegian" } );
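
To get a feel for what a weight of 10 on title does, here is a deliberately simplified scoring sketch in plain JavaScript. It is not the formula MongoDB uses (the scores in the output above clearly involve more than raw counts); weightedScore is a hypothetical function that just multiplies the number of exact term matches in each field by that field's weight.

```javascript
// Hypothetical scoring function, for illustration only.
// Score = sum over fields of (field weight * number of exact matches of the term).
function weightedScore(doc, term, weights) {
  let score = 0;
  for (const [field, weight] of Object.entries(weights)) {
    const words = String(doc[field] || "").toLowerCase().split(/[^a-z0-9]+/);
    const hits = words.filter((word) => word === term).length;
    score += weight * hits;
  }
  return score;
}

const doc = {
  title: "Olivia Shakespear",
  text: "Olivia Shakespear (born Olivia Tucker) was a British novelist",
};

console.log(weightedScore(doc, "olivia", { title: 1, text: 1 }));
console.log(weightedScore(doc, "olivia", { title: 10, text: 1 }));
```

The point is only that a match in a heavily weighted field contributes more to the final ranking than the same match in the body text.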

And that’s basically it (from what I can see from the tests). Nice and simple.

 
