
How should I organize my collection for documents like this:

{ 
 "path" : "\\192.168.77.1\user\1.wav", // unique text index
 "sex" : "male", "age" : 28 // some fields 
}

I use this scheme in Python (pymongo):

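from pymongo import MongoClient, TEXT

client = MongoClient(addr)  # addr: MongoDB host or connection string
db = client['some']
# Unique text index on "path" (this is what triggers the error below)
db.files.ensure_index([('path', TEXT)], unique=True)

data = [
    {"path": r'\\192.168.77.5\1.wav', "base": "CAGS2"},
    {"path": r'\\192.168.77.5\2.wav', "base": "CAGS2"}
]
sid = db.files.insert(data)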

But an error occurs:

pymongo.errors.DuplicateKeyError: insertDocument :: 
caused by :: 11000 E11000 duplicate key error index: 
some.files.$path_text  dup key: { : "168", : 0.75 }

If I remove all the dots ('.') from the path values, everything works. What is wrong?

Max Tkachenko
  • can you provide the index you have on the "path" field? – Kevin Brady Feb 03 '15 at 17:09
  • if you mean self.db.files.index_information(): {u'_id_': {u'key': [(u'_id', 1)], u'v': 1}, u'path_text': {u'default_language': u'english', u'weights': SON([(u'path', 1)]), u'key': [(u'_fts', u'text'), (u'_ftsx', 1)], u'v': 1, u'language_override': u'language', u'unique': True, u'textIndexVersion': 2}} – Max Tkachenko Feb 03 '15 at 17:17

3 Answers


Why are you creating a unique text index? For that matter, why is MongoDB letting you? When you create a text index, the input field values are tokenized:

"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]

The tokens are stemmed, meaning they are reduced (in a language-specific way) to a different form to support natural language matching like "loves" with "love" and "loving" with "love". Stopwords like "the", which are common words that would be more harmful than helpful to match on, are thrown out.

["the", "brown", "fox", "jumps"] -> ["brown", "fox", "jump"]

The index entries for the document are the stemmed tokens of the original field value, each with a score calculated from how important the term is in the value string. So when you put a unique index on these values, you are ensuring that no two documents can contain terms that stem to the same thing and have the same score. This is almost never what you want, because it is hard to predict what it will reject. Here is an example:

> db.test.drop()
> db.test.ensureIndex({ "t" : "text" }, { "unique" : true })
> db.test.insert({ "t" : "ducks are quacking" })
WriteResult({ "nInserted" : 1 })
> db.test.insert({ "t" : "did you just quack?" })
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 11000,
        "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$t_text  dup key: { : \"quack\", : 0.75 }"
    }
})
> db.test.insert({ "t" : "though I walk through the valley of the shadow of death, I will fear no quack" })
WriteResult({ "nInserted" : 1 })

The stemmed term "quack" will result from all three documents, but in the first two it receives the score of 0.75, so the second insertion is rejected by the unique constraint. It receives a score of 0.5625 in the third document.
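The same thing happens with the paths in the question. The following is only a rough sketch: it assumes that every non-alphanumeric character acts as a delimiter, and the hypothetical helper `rough_tokens` is a simplified stand-in, not MongoDB's actual tokenizer (it does no stemming, stopword removal, or scoring). It is enough to show that the two paths share almost all of their terms, including the "168" reported in the error message.

```python
import re


def rough_tokens(text):
    """Split on every non-alphanumeric character and lowercase the pieces.

    Simplified stand-in for a text-index tokenizer (an assumption, not
    MongoDB's real implementation).
    """
    return {tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok}


first = rough_tokens(r"\\192.168.77.5\1.wav")
second = rough_tokens(r"\\192.168.77.5\2.wav")

# Terms that both documents would contribute to the index
shared = sorted(first & second)
print(shared)  # ['168', '192', '5', '77', 'wav']
```

Any shared term that also gets the same score in both documents violates the unique constraint, which is consistent with the second path being rejected.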

What are you actually trying to achieve with the index on the path? A unique text index is not what you want.

wdberkeley

Have you escaped all the text in the input fields to ensure that it is a valid JSON document?

Here is a valid JSON document:

{
    "path": "\\\\192.168.77.1\\user\\1.wav",
    "sex": "male",
    "age": 28
}

You have set the text index to be unique. Is there already a document in the collection with a path value of "\\192.168.77.1\user\1.wav"?

MongoDB may also be treating the punctuation in the path as delimiters, which may affect how the value is stored. See the MongoDB documentation for the $search field.

Kevin Brady

I created the collection with a TEXT index on 'path', and that index was saved in the database. Later I changed TEXT to ASCENDING/DESCENDING in the code, but nothing changed, because I had not reset the index (by dropping it, or by deleting and recreating the whole database).

As wdberkeley explains in another answer, when you create a text index, the input field values are tokenized:

"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]

A TEXT index is not the right solution for file names. Use an ASCENDING or DESCENDING index instead.
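A minimal sketch of that fix, assuming `collection` is a pymongo collection such as `MongoClient(addr)['some'].files`. The helper name `rebuild_path_index` is hypothetical; the index name `path_text` is the one shown by `index_information()` in the question's comments. The old text index has to be dropped explicitly, because it stays in the database after the code changes.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def rebuild_path_index(collection):
    """Replace the unique text index on "path" with a unique ascending one.

    `collection` is expected to behave like a pymongo Collection.
    Returns the name of the newly created index.
    """
    # Drop the stale text index if it is still stored in the database.
    if "path_text" in collection.index_information():
        collection.drop_index("path_text")
    # An ordinary index compares whole path strings, so uniqueness applies
    # to the complete path rather than to individual tokens.
    return collection.create_index([("path", ASCENDING)], unique=True)
```

With this index, the two example paths differ as whole strings, so both documents can be inserted, and only an exact repeat of a path raises DuplicateKeyError.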

Max Tkachenko