I have a CSV file with ~9 million rows, and I want to be able to look up a row from this file quickly. I decided to use Python Whoosh to index the data and then search it, like below.
import os
from whoosh.analysis import CharsetFilter, LowercaseFilter, RegexTokenizer
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.support.charset import accent_map

# One field that stores the whole CSV line, lowercased and accent-folded
schema = Schema(content=TEXT(stored=True,
                             analyzer=RegexTokenizer() | LowercaseFilter() | CharsetFilter(accent_map)))

if not os.path.exists("index"):   # create the index directory and schema once
    os.mkdir("index")
    create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
with open(file_path, "r", encoding="utf-8") as file:
    for line in file:             # one document per CSV line
        writer.add_document(content=line)
writer.commit()
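For reference, this is roughly how I plan to search the index afterwards (the query term here is just a placeholder):

from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("hosl")  # placeholder search term
    for hit in searcher.search(query, limit=10):              # top 10 matching rows
        print(hit["content"])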
I am not sure whether this is the correct or fastest way to index the data. Does changing the schema make indexing faster? If not, is the general idea of using Whoosh (or another indexing library) a good one for a file this large?
The good news is that indexing only has to be done once, so I am willing to wait if that buys me a fast search time. I have no experience with full-text search. Can anyone estimate, given my setup, how long indexing will take?
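One thing I did come across in the Whoosh documentation is that the writer accepts batch-indexing options. Something like the following is what I was planning to try; the numbers are guesses on my part, not benchmarked values:

writer = ix.writer(limitmb=256,        # RAM limit per indexing process (default is 128 MB)
                   procs=4,            # split indexing across multiple processes
                   multisegment=True)  # let each process write its own segment instead of merging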
Here is a sample of my CSV:
ID,TYPE,TEXT,ID2
1058895,1,Be,1067806
1058895,2,Hosl,101938
1058895,3,370,None
1058895,4,Tnwg,10582
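In case it affects the answer: the schema change I had in mind was indexing each column as its own field instead of indexing whole lines. An untested sketch (the directory name "index_by_column" is just an example):

import csv
import os
from whoosh.fields import ID, Schema, TEXT
from whoosh.index import create_in

# One field per CSV column instead of a single "content" field
schema = Schema(id=ID(stored=True), type=ID(stored=True),
                text=TEXT(stored=True), id2=ID(stored=True))

os.makedirs("index_by_column", exist_ok=True)
ix = create_in("index_by_column", schema)

writer = ix.writer()
with open(file_path, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        writer.add_document(id=row["ID"], type=row["TYPE"],
                            text=row["TEXT"], id2=row["ID2"])
writer.commit()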