
Background:

I am working on an open-source for-fun side project which uses MongoDB as its database. The project is supposed to be a 'repository catalogue'.

It is supposed to gather data about hundreds or thousands of software projects: basic info such as name, description and tags, as well as e.g. a list of the files in each project. A sample document is included below.

Requirements:

The primary purpose of having a catalogue is that it's easily and reliably searchable, and the search should support 'partial matches' as well.
So, for example, if a project is called 'XmlValidator', I should be able to find it with the search string 'Xml'.
If a project contains a file called 'GoogleDriveSynchronizer.cs', I should be able to find it by 'Google' or 'GoogleDrive', etc.
This does not work out of the box.

The search should also be fast.
Size-wise, I realistically don't expect to exceed 10,000 documents, with an average document size of 2 KB, but let's say I want it to perform well even with 100k documents of 3 KB average size. For performance reasons I am not considering regex searches (though I'm not sure; perhaps 100k docs of 3 KB each are not that hard to scan through?).
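
For reference on the regex option: MongoDB can only use a regular index for a case-sensitive, left-anchored prefix pattern; anything unanchored falls back to a full collection scan. A minimal sketch with the .NET driver (assuming a `projects` handle of type IMongoCollection<ProjectInfo>):

    // A case-sensitive prefix pattern ("^Xml") can be satisfied by a
    // regular index on ProjectName; an unanchored pattern ("Google")
    // forces MongoDB to scan every document.
    var prefixFilter = Builders<ProjectInfo>.Filter
        .Regex(x => x.ProjectName, new BsonRegularExpression("^Xml"));
    var unanchoredFilter = Builders<ProjectInfo>.Filter
        .Regex(x => x.ProjectName, new BsonRegularExpression("Google"));
    List<ProjectInfo> hits = await projects.Find(unanchoredFilter).ToListAsync();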

Current situation:

My text index is currently set up as follows (it covers most of the fields, but not all):

    IndexKeysDefinition<ProjectInfo> keys = Builders<ProjectInfo>.IndexKeys
        .Text(x => x.ProjectName)
        .Text(x => x.ProjectDescription)
        .Text(x => x.AssemblyName)
        .Text(x => x.ProjectUri)
        .Text(x => x.Tags)
        .Text($"{nameof(ProjectInfo.Properties)}.{nameof(Property.Value)}")
        .Text($"{nameof(ProjectInfo.Components)}.{nameof(ComponentManifest.Name)}")
        .Text($"{nameof(ProjectInfo.Components)}.{nameof(ComponentManifest.Description)}")
        .Text($"{nameof(ProjectInfo.Components)}.{nameof(ComponentManifest.DocumentationUri)}")
        .Text($"{nameof(ProjectInfo.Components)}.{nameof(ComponentManifest.Tags)}");
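
For completeness, those keys are turned into the actual index in the usual driver way (`collection` being the IMongoCollection<ProjectInfo>):

    await collection.Indexes.CreateOneAsync(new CreateIndexModel<ProjectInfo>(keys));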

Sample document

       "_id": {
            "$oid": "5e67ce562ee2d4d141822a17"
        },
        "AddedDateTime": {
            "$date": {
                "$numberLong": "1583861334692"
            }
        },
        "ProjectName": "XmlValidatorFake",
        "Autogenerated": false,
        "Owner": null,
        "ProjectDescription": null,
        "ProjectUri": null,
        "DocumentationUri": null,
        "DownloadLocation": null,
        "AssemblyName": null,
        "OutputType": null,
        "TargetExtension": null,
        "RepositoryId": {
            "$oid": "5e67ce558d980a7b344dac5f"
        },
        "RepositoryStamp": "2020-03-10T17:28:54.8444190Z",
        "Tags": [],
        "Properties": [{
            "Key": "Files", 
//the value will more often be a normal string, but could be a collection as well


  "Value": {
            "_t": "System.Collections.Generic.List`1[System.String]",
            "_v": ["FileNumberOne.cs", "FileNumberTwo.cs"]
        }
    }],
    "Components": []
}

The question/solution idea:

So, my idea was to create another field on the document, into which I would drop all the 'tokens' the document should be findable by.
All the strings in all relevant fields would be tokenized (split on PascalCase boundaries, hyphens, underscores, etc.) and stored in that field (e.g. search_data).
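
A sketch of the tokenizer I have in mind (my own helper, nothing built-in; it splits on non-alphanumerics first, then on PascalCase humps):

    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class SearchTokenizer
    {
        // "GoogleDriveSynchronizer.cs" -> ["google", "drive", "synchronizer", "cs"]
        public static IEnumerable<string> Tokenize(string value)
        {
            if (string.IsNullOrWhiteSpace(value)) yield break;

            // First split on anything that is not a letter or digit
            // (dots, hyphens, underscores), then split each chunk on
            // PascalCase/camelCase boundaries.
            foreach (var chunk in Regex.Split(value, @"[^A-Za-z0-9]+"))
            {
                foreach (Match part in Regex.Matches(
                    chunk, @"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+"))
                {
                    yield return part.Value.ToLowerInvariant();
                }
            }
        }
    }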

I would then create a different text index, which would only look at the search_data field.
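
The new index would then be trivial (SearchData being a hypothetical string-array property backing the search_data field):

    IndexKeysDefinition<ProjectInfo> searchKeys = Builders<ProjectInfo>.IndexKeys
        .Text(x => x.SearchData);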

Considerations are:

  1. This will make the documents roughly twice as large (almost all of the data would be duplicated in the search_data field).
  2. I could not assign weights to the tokens... unless I create several search_data fields with different weights, each containing tokenized values grouped by relevance (see the sketch after this list).
  3. It still would not solve the problem of values that cannot be tokenized: e.g. a file named 'stackoverflowhackingattempt.cs' won't be tokenized, so it won't be findable by a 'hack' query - unless it's a regex search.
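
For point 2, the weighted variant would look roughly like this (the field names are made up; Weights on CreateIndexOptions is how the driver passes per-field text weights):

    var weightedKeys = Builders<ProjectInfo>.IndexKeys
        .Text("search_data_primary")    // tokens from ProjectName, Tags, ...
        .Text("search_data_secondary"); // tokens from file lists, properties, ...

    var options = new CreateIndexOptions
    {
        Weights = new BsonDocument
        {
            { "search_data_primary", 10 },
            { "search_data_secondary", 1 }
        }
    };

    await collection.Indexes.CreateOneAsync(
        new CreateIndexModel<ProjectInfo>(weightedKeys, options));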

Does this approach make sense?
Also, with this approach, is a regex search expected to perform any faster than in my current setup? I'd like to know what you think before I go ahead and redesign the entire thing.
Cheers!

Bartosz
  • You're pretty much out of luck with values that cannot be tokenized, because MongoDB can't use indexes for partial matches. I think Lucene/Elasticsearch would be a good candidate for this project, given the requirements. – Dĵ ΝιΓΞΗΛψΚ Mar 11 '20 at 03:54
  • @ĐĵΝιΓΞΗΛψΚ - thanks - that's actually a very cool idea for the future. – Bartosz Mar 11 '20 at 07:57
  • Also, at the expense of disk space, you can store your words as n-grams and use the text indexes to get results. I've done a similar thing with phonetic matching for my MongoDB library to get [fuzzy results](https://github.com/dj-nitehawk/MongoDB.Entities/wiki/11.-Fuzzy-Text-Search). – Dĵ ΝιΓΞΗΛψΚ Mar 11 '20 at 10:55
  • Could you take a look at this answer: https://stackoverflow.com/a/69767176/12011575 – JayCodist Oct 29 '21 at 10:23

0 Answers