Read a term-document matrix from csv using python

Question

The reason classic csv reader doesn't work on term-document arrays is that the first column of the csv file are terms, not values. Thus the file has the following syntax:

"";"label1";"label2";"label3" ...
"term1";1;0;8;...
"term2";0;0;3;...
.................................

I need to build a dictionary whose keys are label1, label3, etc... and values are the column vectors (here it would be: dict[label1]-> 1,0 , dict[label2] -> 0,0 etc), meaning that the terms are completely useless to me.

I have implemented a custom solution which goes something like this:

....
keys = f.readline().split('";"') #1st line of the csv
keys = keys[1:]                  #skipping ""
zeros = [0] * len(keys)          #dicts initial values will be 0
d = OrderedDict(zip(keys, zeros))
lines = f.readlines()
for line in lines:
    ...
    splittting, stripping etc I get a list with values (eg: 1,0,8 - see example above)
    ...
    for value in values:
        ....

However reading 8 csv files (total: 12MB) takes over 90 minutes with my laptop.

Does anyone know a more efficient way to deal with this?

have you considered loading the file with [pandas](http://pandas.pydata.org/) and then iterating through the columns and rows to make the dictionary that way? — Ryan Saxe, May 08 '13 at 17:12

score 1 · Accepted Answer · answered May 08 '13 at 17:14

1

You could use the csv module anyway to read the CSV files into memory, then transpose the rows using zip(*rows) or itertools.izip(*rows):

with open(somecsv, 'rb') as infile:
    reader = csv.reader(infile, delimiter=';')
    headers = next(reader)
    data = list(reader)
    data = dict(zip(headers, zip(*data)))

This creates a data dictionary with the headers as keys and the columns as values. You can delete the '' 'terms' column from the dictionary if needed.

For your input example, the data dictionary looks like this after executing the above code:

{'': ('term1', 'term2'), 'label1': ('1', '0'), 'label2': ('0', '0'), 'label3': ('8', '3')}

answered May 08 '13 at 17:14

Martijn Pieters

1,048,767
296
4,058
3,343

isn't it `reader.next()`? I get this if I do `next(reader)`: `Traceback (most recent call last): File "", line 1, in StopIteration` – Ryan Saxe May 08 '13 at 17:48
@RyanSaxe: `next()` is a function that'll call `reader.next()`; on Python 3 you'd have to call `reader.__next__()`, the `next()` function is the proper API to use. You have opened an empty file or have *already* read all the contents of the file; `reader.next()` will raise `StopIteration` *too*. – Martijn Pieters May 08 '13 at 17:50
Superb! 90+ minutes became 5-6 seconds! Thank you so much! – stelios May 08 '13 at 18:24

score 1 · Answer 2 · answered May 08 '13 at 17:44

1

pandas is clearly the way to go! All you have to do is load the dataframe into a dictionary and it makes one. Here is all the code, it's quick and efficient:

import pandas as pd
data = pd.read_csv(filename)
my_dict = dict(data)

quick and easy!

answered May 08 '13 at 17:44

Ryan Saxe

17,123
23
80
128

Read a term-document matrix from csv using python

2 Answers2