
I have a 10GB CSV file (too big to fit in RAM) of the format:

Col1,Col2,Col3,Col4
1,2,3,4
34,256,348,
12,,3,4

So we have columns and missing values, and I want to calculate the means of columns 2 and 3. With plain Python I would do something like:

def means(rng):
    s, e = rng  # start and end column indices

    with open("data.csv") as fd:
        title = next(fd)
        titles = title.strip().split(',')
        print "Means for", ",".join(titles[s:e])

        ret = [0] * (e - s)
        for c, l in enumerate(fd):
            vals = l.split(",")[s:e]
            for i, v in enumerate(vals):
                try:
                    ret[i] += int(v)
                except ValueError:
                    pass  # missing value, skip it

        # divide by the row count (missing values included in the denominator)
        return map(lambda s: float(s) / (c + 1), ret)

But I suspect there is a much faster way to do this with numpy (I am still a novice at it).

fakedrake
    Do you want the sum to be divided by the number of rows, or by the number of non-missing values (for each column)? – unutbu Sep 10 '14 at 18:06
  • It doesn't really matter; the missing values shouldn't be more than 1% of the column and I am not interested in that much accuracy. Whichever is easier. – fakedrake Sep 10 '14 at 18:29

2 Answers


Pandas is your best friend:

from pandas.io.parsers import read_csv

# Load 10000 rows at a time; you can play with this number to get better
# performance on your machine.
my_data = read_csv("data.csv", chunksize=10000)

total = 0
count = 0

for chunk in my_data:
    # With fillna(0.0), missing values count as zeros and the denominator
    # below includes every row. If you want to exclude NAs from the
    # average instead, remove the next line.
    chunk = chunk.fillna(0.0)

    total += chunk.sum(skipna=True)
    count += chunk.count()

avg = total / count

col1_avg = avg["Col1"]
# ... etc. ...
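
If you only need columns 2 and 3, a small tweak (a sketch, untested against the real file) is to restrict read_csv to those columns with usecols; the "Col2"/"Col3" names are taken from the sample header in the question:

from pandas.io.parsers import read_csv

# Sketch: only parse the two columns of interest to save memory.
my_data = read_csv("data.csv", chunksize=10000, usecols=["Col2", "Col3"])

total = 0
count = 0
for chunk in my_data:
    total += chunk.sum(skipna=True)  # NAs excluded from the sums
    count += chunk.count()           # ...and from the denominators

avg = total / count  # per-column means over the non-missing values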
robbrit

Try:

import numpy
# read the csv into a masked array; missing values come back masked
df = numpy.genfromtxt('test.csv', delimiter=',', usecols=(1, 2), skip_header=1, usemask=True)
# column means; masked (missing) entries are excluded automatically
ans = numpy.mean(df, axis=0)

ans.data will contain an array of all the means for the columns.
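
As a quick sanity check, if the 4-row sample from the question is saved as test.csv, the non-missing values are 2 and 256 for Col2, and 3, 348 and 3 for Col3, so:

print(ans.data)  # [ 129.  118.]  i.e. (2+256)/2 and (3+348+3)/3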

EDITS for Updated Question

If you have a 10GB file you can chunk it with numpy as well; see this answer.

Something like this:

import itertools
import numpy

sums = numpy.array((0, 0))
counts = numpy.array((0, 0))
fH = open('test.csv')
fH.readline()  # skip header
while True:
    try:
        # parse the next 1000 rows into a masked array
        df = numpy.genfromtxt(itertools.islice(fH, 1000), delimiter=',', usecols=(1, 2), usemask=True)
    except StopIteration:
        break
    sums = sums + numpy.sum(df, 0)                    # masked entries are skipped
    counts = counts + numpy.sum(df.mask == False, 0)  # non-missing per column
fH.close()
means = sums / counts
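
A minimal way to print the result alongside the column names (assuming the Col2/Col3 header from the question's sample):

for name, m in zip(("Col2", "Col3"), means):
    print(name, m)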
Mark
  • :P found that out the hard way. Thought `genfromtxt` meant 'generator from text' and froze my laptop. – fakedrake Sep 10 '14 at 18:22
  • @fakedrake, ha, ha, sorry. Was having too much fun playing with the options in `genfromtxt` that I missed your update... – Mark Sep 10 '14 at 18:24
  • I tried this: `with open('test.csv') as f: dat = np.genfromtxt(iter(f), delimiter=',', skip_header=1, usecols=range(s,e), usemask=True) ret = np.mean(dat, axis=0).data` like your link suggests and it keeps accumulating RAM (I killed it at about 1G). Any idea why? (edit: sorry about the formatting) – fakedrake Sep 10 '14 at 18:58
  • @fakedrake, see edits above, that should get you pretty close. – Mark Sep 10 '14 at 19:39