The reason the classic csv reader doesn't work on term-document arrays is that the first column of the CSV file contains terms, not values. The file therefore has the following syntax:
"";"label1";"label2";"label3" ...
"term1";1;0;8;...
"term2";0;0;3;...
.................................
I need to build a dictionary whose keys are label1, label2, label3, etc., and whose values are the column vectors (here that would be: dict[label1] -> 1,0; dict[label2] -> 0,0; etc.), meaning the terms themselves are completely useless to me.
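Concretely, for the two sample rows above, this is the structure I'm after (a sketch built from the example values):

from collections import OrderedDict

# Column vectors taken from the rows "term1";1;0;8 and "term2";0;0;3
expected = OrderedDict([
    ('label1', [1, 0]),
    ('label2', [0, 0]),
    ('label3', [8, 3]),
])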
I have implemented a custom solution which goes something like this:
from collections import OrderedDict

def read_term_document_csv(path):
    with open(path) as f:
        # 1st line of the csv: "";"label1";"label2";...
        keys = [k.strip('"') for k in f.readline().strip().split(';')]
        keys = keys[1:]  # skipping the leading ""
        d = OrderedDict((key, []) for key in keys)  # one column vector per label
        for line in f:
            # split on ';' and drop the quoted term in the first field
            values = line.strip().split(';')[1:]
            for key, value in zip(keys, values):
                d[key].append(int(value))
    return d
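For illustration, assuming a file named matrix.csv holding the sample rows above (the file name is hypothetical), the function is driven like this:

d = read_term_document_csv('matrix.csv')  # hypothetical file name
print(d['label1'])  # -> [1, 0] for the sample above

Appending to per-column lists this way makes a single pass over each file, touching every cell exactly once.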
However, reading 8 csv files (12 MB in total) takes over 90 minutes on my laptop.
Does anyone know a more efficient way to deal with this?