I am trying to create a Whoosh index from a 150 MB file, but it fails with the error "list index out of range". The line responsible for the error is for x in range(len(id)):. Logically, each index record should correspond to the ID number of a document.

import os

from whoosh import index
from whoosh.fields import Schema, ID, TEXT, NUMERIC

id = []
body = []
Score = []
count=0
doc_path='C:/Users/Abhi/Desktop/My_Experiments_with_truth/extracted_xml.txt'
with open(doc_path,'r+',encoding="utf8") as line:
    for f in line:
        count=count+1
        if f.startswith('Id : '):
            a = f.replace('Id : ','')
            id.append(a)
            #print(a)
        elif f.startswith('body : '):
            b = f.replace('body : ','')
            body.append(b)
            #print(b)
        elif f.startswith('Score :'):
            c = f.replace('Score :','')
            Score.append(c)
            #print(c)

if not os.path.exists("index"):
        os.mkdir("index")
#design the Schema

schema=Schema(id_details=ID(stored=True),body_details=TEXT(stored=True),Score_details=NUMERIC(stored=True))

print(schema)


#creation of the index

ix = index.create_in("index", schema)

writer = ix.writer()
#Opening writer


for x in range(len(id)):
    writer.add_document(id_details=id[x],body_details=body[x],Score_details=Score[x])
writer.commit()
print("Index created")
Assem
Abhishek Kaushik

1 Answer

I think the issue is not with Whoosh but with the way you parse your input file. If the data is read inconsistently from the input file, the lists id, body, and Score end up with different sizes, which causes this line to fail:

  writer.add_document(id_details=id[x],body_details=body[x],Score_details=Score[x])

This happens because the loop is bounded only by the length of the list id: range(len(id)). If body or Score is shorter than id, indexing it with x raises the error.

Try to improve your parsing of the file, or at least bound x by the length of the shortest list among id, body, and Score.
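
One way to make the parsing more robust is to build one record per document instead of three independent lists, so a missing field cannot shift the lists out of step. The following is only a minimal sketch: the helper name parse_records is hypothetical, and it assumes every document starts with an 'Id : ' line and that each field fits on a single line, which may not hold for your file.

```python
# Hypothetical helper, not part of Whoosh. The prefixes are the ones used in the question.
PREFIXES = {"Id : ": "id", "body : ": "body", "Score :": "score"}
REQUIRED = ("id", "body", "score")


def parse_records(lines):
    """Group prefixed lines into one dict per document.

    Assumption: an 'Id : ' line marks the start of a new document.
    Returns (complete, incomplete) so missing fields become visible.
    """
    complete, incomplete = [], []
    current = None

    def flush(record):
        # Sort the finished record into the right bucket.
        if record is None:
            return
        if all(key in record for key in REQUIRED):
            complete.append(record)
        else:
            incomplete.append(record)

    for line in lines:
        for prefix, key in PREFIXES.items():
            if line.startswith(prefix):
                value = line[len(prefix):].strip()  # also drops the trailing newline
                if key == "id":
                    flush(current)
                    current = {}
                if current is not None:
                    current[key] = value
                break  # lines without a known prefix are ignored, as in the question
    flush(current)
    return complete, incomplete


# Possible usage with the writer from the question (int() assumes integer scores):
# with open(doc_path, encoding="utf8") as fh:
#     complete, incomplete = parse_records(fh)
# for record in complete:
#     writer.add_document(id_details=record["id"],
#                         body_details=record["body"],
#                         Score_details=int(record["score"]))
```

Printing len(incomplete) and a few of its entries should show which documents lose a field during parsing, which is usually quicker than guessing from the IndexError alone.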

Assem
Thanks for your answer, but I have already figured out the problem: it is the encoding of the file, and I am not able to parse the file completely. I tried different encodings such as utf-8, 16, 32 and 64, latin-1, and ascii, but each one reads only part of the data. The text is in English. It was an XML file that was parsed into a text file; I only have the text file, and the original XML file is not available to me. Can you guide me on the encoding problem? – Abhishek Kaushik Nov 19 '17 at 22:28