
Recently I needed to create a histogram showing the frequency distribution of a large data set. This would be a simple job if the data set were small. However, the data set I need to plot contains about 800,000,000 numbers (let's assume each number takes 4 bytes), all stored in one text file with one number per line. The text file is about 4 GB. I tried gnuplot, but it complains that there is not enough memory to handle this data set. Can someone suggest how to solve this problem, or any other tools for the job?

Thanks, Tom

Tom Z
  • I think you need to be a little more explicit about what you have. You say each number takes 4 bytes, but then you imply that it is an ASCII file, since you have "one number per line" "stored in one text file". Are the numbers floating point? Integers? Do you need to bin the data, or is just getting a count good enough? – mgilson Jan 21 '13 at 17:20

1 Answer


I'd use Python. It's as easy as building a dictionary, and reading the file line by line keeps memory usage proportional to the number of distinct values rather than the file size. Assuming your file contains integers:

from collections import defaultdict

d = defaultdict(int)
with open('datafile') as fin:
    for line in fin:
        d[int(line)] += 1

for item, number_of_occurrences in sorted(d.items()):
    print(item, number_of_occurrences)

If you're on a newer version of Python (2.7 or later), this is even easier with a Counter:

from collections import Counter

with open('datafile') as fin:
    d = Counter(int(line) for line in fin)

for item, number_of_occurrences in sorted(d.items()):
    print(item, number_of_occurrences)
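If the numbers turn out to be floats rather than integers (the point raised in the comment above), counting exact values won't give a useful histogram; you can instead bin them on the fly as you stream the file. A minimal sketch — the bin width of 0.5 and the file name 'datafile' are assumptions you'd adjust for your data:

```python
from collections import Counter

def bin_counts(lines, bin_width=0.5):
    """Count values from an iterable of numeric strings into fixed-width bins.

    Each value is mapped to the lower edge of its bin, so memory stays
    proportional to the number of distinct bins, not the input size.
    """
    counts = Counter()
    for line in lines:
        counts[bin_width * (float(line) // bin_width)] += 1
    return counts

# Stream the file one line at a time ('datafile' is a placeholder name):
# with open('datafile') as fin:
#     for edge, n in sorted(bin_counts(fin).items()):
#         print(edge, n)
```

Because the aggregated counts are tiny compared to the 4 GB input, you can write the sorted (edge, count) pairs to a small file and plot that with gnuplot (e.g. plot 'counts.dat' with boxes) without running out of memory.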
mgilson