I have a 10 GB CSV file (too large to fit in RAM) in the following format:
Col1,Col2,Col3,Col4
1,2,3,4
34,256,348,
12,,3,4
The file has a header row and some missing values, and I want to calculate the means of columns 2 and 3. In plain Python I would do something like:
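def means(rng):
    # rng is a (start, end) pair of 0-based column indices; end is exclusive.
    s, e = rng
    with open("data.csv") as fd:
        titles = next(fd).rstrip("\n").split(",")
        print("Means for " + ",".join(titles[s:e]))
        ret = [0] * (e - s)
        rows = 0
        for l in fd:
            rows += 1
            vals = l.split(",")[s:e]
            for i, v in enumerate(vals):
                try:
                    ret[i] += int(v)
                except ValueError:
                    pass  # missing value: adds nothing to the sum
    # Note: divides by the total row count, so missing values count as 0.
    return [float(t) / rows for t in ret]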
But I suspect there is a much faster way to do this with NumPy (I am still a novice at it).
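To make the question concrete, here is a rough sketch of the direction I have in mind: read the file in batches of lines so it never has to fit in RAM, parse each batch with `np.genfromtxt` (which turns empty fields into NaN), and accumulate per-column sums and counts. The function name `chunked_means`, the `cols` argument, and the `chunk_lines` default are placeholders I made up for illustration. Unlike the plain loop, this version leaves missing cells out of the denominator, and nothing here establishes that it is actually faster.

```python
import itertools
import numpy as np

def chunked_means(path, cols=(1, 2), chunk_lines=100000):
    """Mean of the given 0-based columns, ignoring missing cells.

    Hypothetical helper: reads `chunk_lines` lines at a time so the
    whole file is never held in memory.
    """
    sums = np.zeros(len(cols))
    counts = np.zeros(len(cols))
    with open(path) as fd:
        next(fd)  # skip the header row
        while True:
            lines = list(itertools.islice(fd, chunk_lines))
            if not lines:
                break
            # Empty fields are parsed as NaN for the default float dtype.
            block = np.genfromtxt(lines, delimiter=",", usecols=cols)
            # A single-row batch comes back 1-D; force (rows, columns).
            block = block.reshape(-1, len(cols))
            sums += np.nansum(block, axis=0)
            counts += (~np.isnan(block)).sum(axis=0)
    # A column with no valid values at all yields NaN here.
    return sums / counts
```

Calling `chunked_means("data.csv", cols=(1, 2))` would cover columns 2 and 3 of the sample above, since the indices are 0-based.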