
I have a CSV file with ~9 million rows. I want to be able to search for a row in this file quickly. I decided to use Python Whoosh to index the data and then search it, like below.

    import os

    from whoosh.analysis import CharsetFilter, LowercaseFilter, RegexTokenizer
    from whoosh.fields import Schema, TEXT
    from whoosh.index import create_in
    from whoosh.support.charset import accent_map

    # Tokenize, lowercase and accent-fold the content so searches are case- and accent-insensitive
    schema = Schema(content=TEXT(stored=True,
                                 analyzer=RegexTokenizer() | LowercaseFilter() | CharsetFilter(accent_map)))

    if not os.path.exists("index"):
        os.mkdir("index")
    ix = create_in("index", schema)  # creates a fresh, empty index in the directory

    writer = ix.writer()

    # Add each CSV line as one document
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            writer.add_document(content=line)

    writer.commit()
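
For searching, this is roughly what I have in mind (just a sketch, I am not sure it is the right way either):

    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    ix = open_dir("index")
    with ix.searcher() as searcher:
        # Parse the search term against the single "content" field
        query = QueryParser("content", ix.schema).parse("1058895")
        for hit in searcher.search(query, limit=10):
            print(hit["content"])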

I am not sure if this is the correct/fastest way to index the data. Does changing the schema make indexing faster? If not, is the general idea of using Whoosh or another indexing library a good one for a file this large?

The good thing is that indexing will only be done once, so I am willing to wait if it gives a fast search time. I am not experienced in full-text search. Does anyone know, with my setup, how long indexing will take?

This is a sample of my CSV:

ID,TYPE,TEXT,ID2
1058895,1,Be,1067806
1058895,2,Hosl,101938
1058895,3,370,None
1058895,4,Tnwg,10582
bcsta

2 Answers


Your approach is fine. The trade-off with full-text search is that you accept slow indexing in exchange for fast queries, so you do as much of the work up front as possible.

Compared to Lucene and its ecosystem (Solr, Elasticsearch), Whoosh indexing is very slow. From some quick testing on my PC, it indexes around 540 rows per second. You can expect 9 million rows to index in about 4.5 to 5 hours. The more fields you add, especially fields that will be analyzed, the longer it'll take, so make sure you analyze only necessary fields and index only what you need.
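
For instance, if only the TEXT column needs full-text matching, a schema along these lines (just a sketch, with field names taken from your sample CSV) keeps the analysis work to a minimum:

    from whoosh.fields import ID, NUMERIC, Schema, TEXT

    # Only the text field is tokenized and analyzed; the ID columns are stored as-is for retrieval
    schema = Schema(
        id=ID(stored=True),
        type=NUMERIC(stored=True),
        text=TEXT(stored=True),
        id2=ID(stored=True),
    )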

If you plan to query this index many times, then the investment in time can be worthwhile and your queries should be reasonably fast. If you'll need to index large files frequently and can't afford to wait 5 hours every time, then consider switching to Lucene, Solr or Elasticsearch.

Your data also seems quite simple. If exact Boolean queries suffice, a Pandas query lets you search a DataFrame much faster. Pandas also offers fast vectorized methods for basic text processing that you could use for things like lowercasing, stopword removal, character mapping and regular expressions.
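
For example, a case-insensitive exact match over your columns could look roughly like this (a sketch; column names and values taken from your sample):

    import pandas as pd

    df = pd.read_csv(file_path, dtype=str)

    # Vectorized, case-insensitive exact match on the TEXT column
    matches = df[df["TEXT"].str.lower() == "hosl"]

    # Or a Boolean query across several columns
    matches = df.query("TYPE == '2' and ID == '1058895'")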

You only truly need full-text search if you need the scoring and NLP facilities, such as stemming and word breaking.

Steven

I am sharing here a part of the official documentation page on this exact issue (https://whoosh.readthedocs.io/en/latest/batch.html). See the link for other approaches, such as increasing the number of cores.

Increase the amount of RAM available to the Whoosh indexer. It is set to only 128 MB by default:

    from whoosh import index

    ix = index.open_dir("indexdir")
    writer = ix.writer(limitmb=2048)

Allow more cache for the analyzer, which can otherwise slow down indexing significantly.

    w = myindex.writer()
    # Get the analyzer object from a text field
    stem_ana = w.schema["content"].format.analyzer
    # Set the cachesize to -1 to indicate unbounded caching
    stem_ana.cachesize = -1
    # Reset the analyzer to pick up the changed attribute
    stem_ana.clear()
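
The same page also describes indexing with multiple processes; roughly, you can give the writer several processes and skip the final segment merge (a sketch based on that page):

    from whoosh import index

    ix = index.open_dir("indexdir")
    # Spread indexing over 4 processes with 256 MB each; multisegment=True leaves the segments unmerged
    writer = ix.writer(procs=4, limitmb=256, multisegment=True)
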
mateuszb