8

I have a tab-separated file with 1 billion lines like these (imagine 200 columns instead of 3):

abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232

I want to create a dictionary where the string in the first column is the key and the rest are the values. I've been doing it like this but it's computationally expensive:

import io

dictionary = {}

with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        k, v = kv[0], kv[1:]
        dictionary[k] = list(map(float, v))

How else can I get the desired dictionary? Actually, a numpy array would be more appropriate than a list of floats for the values.
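
For example, a minimal variant of the loop above that stores numpy arrays directly (a sketch, assuming numpy is installed):

import io
import numpy as np

dictionary = {}
with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        # np.fromiter avoids building an intermediate list of floats
        dictionary[kv[0]] = np.fromiter(map(float, kv[1:]), dtype=float)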

logic
alvas

5 Answers

5

You can use pandas to load the data into a df, construct a new df as desired, and then call to_dict:

In [99]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None)
df = pd.DataFrame(columns = df[0], data = df.ix[:,1:].values)
df.to_dict()
Out[99]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}

EDIT

A more dynamic method, and one which avoids constructing a temporary df:

In [121]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
# determine the number of cols, we'll use this in usecols
col_len = pd.read_csv(io.StringIO(t), sep='\s+', nrows=1).shape[1]
col_len
# read the first col we'll use this in names
cols = pd.read_csv(io.StringIO(t), sep='\s+', usecols=[0], header=None)[0].values
# now read and construct the df using the determined usecols and names from above
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None, usecols = list(range(1, col_len)), names = cols)
df.to_dict()
Out[121]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}

Further update

Actually you don't need the first read; the number of value columns can be derived implicitly from the length of the first column anyway:

In [128]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
cols = pd.read_csv(io.StringIO(t), sep='\s+', usecols=[0], header=None)[0].values
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None, usecols = list(range(1, len(cols)+1)), names = cols)
df.to_dict()
Out[128]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
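
Since the question asks for numpy arrays as values, you could also build the dict straight from the DataFrame's columns rather than calling to_dict (a sketch, reusing df from the cell above; the values are the same as in the nested dicts shown):

dictionary = {col: df[col].values for col in df.columns}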
EdChum
  • can i then do `df['foo'].values()` to get an ordered list by the feature keys? Or would the `values()` jumble up the order of the columns? – alvas Apr 28 '15 at 13:32
  • The values are always in order – EdChum Apr 28 '15 at 13:38
  • Uh, but the dynamic method would be reading a billion lines csv file thrice? or am I missing something here? – Zero Apr 28 '15 at 13:47
  • @JohnGalt nope for the first read it just reads a single row to determine the number of columns, for the second read it just reads the columns, the third read, reads just the data – EdChum Apr 28 '15 at 13:48
  • @EdChum Excellent - My bad, didn't clearly see through the lines. Apologies! – Zero Apr 28 '15 at 13:50
  • @alvas there is a `low_memory` param (which is True by default) and `memory_map` param which is `False` by default you can set that to `True`, as to whether this works, I've never tried but it's supposed to: http://stackoverflow.com/questions/24249690/python-pandas-read-csv-low-memory-and-memory-map-flags, they're both undocumented by the way: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv – EdChum Apr 28 '15 at 13:52
  • Reading into the data frame returns this error: `pandas.parser.CParserError: Too many columns specified: expected 263188 and found 401`. There are 401 columns for each line. I've ran through asserts on each line of the file to count the columns. Here's the file: http://clic.cimec.unitn.it/composes/materials/EN-wform.w.5.cbow.neg10.400.subsmpl.txt.gz – alvas Apr 29 '15 at 11:40
  • Works for me: `pd.read_csv(r'c:\data\EN-wform.w.5.cbow.neg10.400.subsmpl.txt', nrows=1, sep='\s+')` this showed: ` Int64Index: 1 entries, 0 to 0 Columns: 401 entries, the to 0.016487999999999999 dtypes: float64(400), object(1) memory usage: 3.1+ KB` – EdChum Apr 29 '15 at 12:07
3

You could make use of the numpy.genfromtxt() function, if you specify the number of columns:

import numpy as np

a = np.genfromtxt('bigfile.csv', dtype=str, usecols=0)        # first column: the keys
b = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t',
                  usecols=range(1, 4))                        # <- enter # of cols here

d = dict(zip(a, b.tolist()))    # if you want a numpy array, just remove .tolist()

print(d)

Output:

{'abc': [-0.123, 0.6524, 0.325], 'bar': [0.23123, -0.123124, -0.1232], 'foo': [-0.9808, 0.874, -0.2341]}

Note: To programmatically find the number of cols you could do:

with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())

And then use num_cols for the usecols parameter.
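
Put together, a sketch (assuming the same tab-separated layout, with the keys in column 0):

import numpy as np

with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())     # total number of columns, key included

a = np.genfromtxt('bigfile.csv', dtype=str, usecols=0)
b = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t',
                  usecols=range(1, num_cols))
d = dict(zip(a, b))                          # values are numpy arrays here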

logic
  • I'm getting `Line #1 (got 1 columns instead of 400)` from `b = np.genfromtxt('bigfile',dtype='float',delimiter='\t',usecols=range(1,401)` 400 is the number of columns. – alvas Apr 28 '15 at 15:24
  • @alvas Are you certain there are actually 400 columns separated by `'\t'`? Also, you don't need quotations around `float` – logic Apr 28 '15 at 15:28
  • @alvas You should also add an extension to `'bigfile'`, for example, `'bigfile.csv'` – logic Apr 28 '15 at 15:32
  • it's strange. the same `got 1 columns` happens with `b = np.genfromtxt('bigfile.csv',dtype=float,delimiter='\t',usecols=range(1,401))`. It's surely tab separated because i can do `cut -f` on unix and surely 401 columns from `awk '{print NF;quit}' bigfile.csv ` – alvas Apr 28 '15 at 15:37
  • @alvas I just tried reproducing your error and it only shows up when there is exactly 1 column. I honestly don't know what could be wrong. It could help if you let us see the file, but I understand privacy concerns. – logic Apr 28 '15 at 15:41
  • No worries, it's some open data. It's a pretty big file: http://clic.cimec.unitn.it/composes/materials/EN-wform.w.5.cbow.neg10.400.subsmpl.txt.gz – alvas Apr 28 '15 at 16:00
  • did you get the same `got 1 columns` problem with the dataset from the link? – alvas Apr 28 '15 at 20:31
2

You can use the csv module to read the file, to get rid of splitting the lines yourself, then use np.array to convert the float values to a numpy array object:

import csv
import numpy as np

dictionary = {}
with open('bigfile.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='\t')
    for row in spamreader:
        k, v = row[0], row[1:]  # in python3 do  k, *v = row
        dictionary[k] = np.array(map(float, v))
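
In Python 3 the file has to be opened in text mode and map() returns an iterator, so a variant along these lines (a sketch) would be needed:

import csv
import numpy as np

dictionary = {}
with open('bigfile.csv', 'r', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='\t')
    for row in spamreader:
        k, *v = row                               # first field is the key, the rest are values
        dictionary[k] = np.array(v, dtype=float)  # parse the strings straight into a float array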
Mazdak
0

One way is to use Pandas. Assuming you do df = pd.read_csv(file) and the df is like

In [220]: df
Out[220]:
     k       a1        a2      a3
0  abc -0.12300  0.652400  0.3250
1  foo -0.98080  0.874000 -0.2341
2  bar  0.23123 -0.123124 -0.1232

I've added dummy column names; you have the flexibility to change that while reading the csv file.

Then you could do the following.

In [221]: df.set_index('k').T.to_dict('list')
Out[221]:
{'abc': [-0.12300000000000001, 0.65239999999999998, 0.32500000000000001],
 'bar': [0.23123000000000002, -0.123124, -0.1232],
 'foo': [-0.98080000000000001, 0.87400000000000011, -0.2341]}
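
For instance, to set the column names while reading the csv file, a sketch (the 'k'/'a1'... names and sep='\t' are assumptions about the real file):

import pandas as pd

ncols = 200   # number of value columns in the real file
names = ['k'] + ['a%d' % i for i in range(1, ncols + 1)]
df = pd.read_csv('bigfile', sep='\t', header=None, names=names)
dictionary = df.set_index('k').T.to_dict('list')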
Zero
0

Sorry, this is not really an answer, but it is too long for a comment.

You say you have 1 billion lines with 200 columns of floats. That means a minimum memory footprint of

10^9 * 200 * 8 = 1.6 * 10^12 bytes

That is more than 1.5 TB, not counting the overhead of the dict.

Of course, you can try to use numpy arrays instead of lists of floats, but each array is small (200 elements), so I highly doubt that the gain will be important.

IMHO, for that much data, you should not consider the load phase independently of how you will process the data. If you really need a dictionary of one billion records of 200 float values each, your current implementation is correct, as is a numpy-array one.

You could get a significant gain in further processing if you kept all the data in a single numpy array and used numpy for the processing part, but without knowing more about it, that is just speculation.
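
For illustration, a minimal sketch of that idea, assuming a tab-separated file and keeping a separate key -> row-index dict alongside one big array:

import numpy as np

keys = {}
rows = []
with open('bigfile', 'r') as fin:
    for i, line in enumerate(fin):
        parts = line.rstrip().split('\t')
        keys[parts[0]] = i                             # key -> row index
        rows.append(np.array(parts[1:], dtype=float))

data = np.vstack(rows)          # one (n_lines, n_cols) array holding all the values
# e.g. data[keys['abc']] gives the row of values for 'abc'

In practice you would preallocate or memory-map data rather than stacking a billion small arrays, but the lookup pattern stays the same.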

Serge Ballesta