
I have a file "uniprot.tab" that is 3.8 GB in size.

I'm trying to draw a histogram from this file, but the calculation never finishes because the file is too big.

I previously tested my code on a small file, "mock.tab", and it works correctly.

EDIT: Some lines of "mock.tab" as an example:

Entry   Status  Cross-reference (PDB)
A1WYA9  reviewed    
Q6LLK1  reviewed    
Q1ACM9  reviewed    
P10994  reviewed    1OY8;1OY9;1OY9;1OY9;
Q0HV56  reviewed    
Q2NQJ2  reviewed    
B7HCE7  reviewed    
P0A959  reviewed    4CVQ;
B7HLI3  reviewed    
P31224  reviewed    1IWG;1OY6;1OY8;1OY9;4CVQ;

Here is the code I used on the small file:

import matplotlib.pyplot as plt

occurrences = []
with open('/home/martina/Documents/webstormProj/unpAnalysis/mock.tab', 'r') as f:
    next(f) #do not read the heading
    for line in f:
        col_third = line.split('\t')[2] #take third column
        occ = col_third.count(';') # count how many times it finds ; in each line
        occurrences.append(occ)

x_min = min(occurrences)
x_max = max(occurrences)


x = [] # x-axis
x = list(range(x_min, x_max + 1))

y = [] # y-axis
for i in x:
    y.append(occurrences.count(i))

plt.bar(x,y,align='center') # draw the plot
plt.xlabel('Bins')
plt.ylabel('Frequency')
plt.show()

How can I draw this plot for my large file?

mb925
  • Try using multiprocessing to calculate the values; parallelizing the process might help. – Sundeep Pidugu Aug 29 '19 at 13:20
  • How long is a typical line in the file? How many lines are there in total? What do you see if you add some output in the first loop, how fast does it iterate the lines? – tobias_k Aug 29 '19 at 13:21
  • Can you provide a small sample of mock.tab, just a few lines. – samredai Aug 29 '19 at 13:21
  • Have you determined which part of the code it's stuck on? Most likely candidates are reading the file, or the `for i in x:` loop. – glibdud Aug 29 '19 at 13:27
  • The indentation in your `with` block is wrong in your example code. Is that just with the example, or is it in your real code? – jjramsey Aug 29 '19 at 13:27
  • @jjramsey OP stated their code is working with a smaller file, so I'd say the actual code is ok otherwise they would have gotten a SyntaxError regardless of the file size – DeepSpace Aug 29 '19 at 13:29
  • Is that dupe really applicable here? Looks like just a line of the file is being read at a time... it's already processing it more or less "lazily". – glibdud Aug 29 '19 at 13:29
  • @tobias_k 168,321,807 lines in total. Each line consists of roughly 3 words. In the for loop I can print each line, but it never finishes printing. – mb925 Aug 29 '19 at 13:29
  • @DeepSpace That dupe doesn't seem appropriate here; the file is already being read line by line so chunking is not necessary. – tzaman Aug 29 '19 at 13:30
  • @SamLegesse this is a typical line of 3 words: P0A959 reviewed 4CVQ; This is repeated 168,321,807 times. – mb925 Aug 29 '19 at 13:31
  • @glibdud it is blocked already in the first for loop, when it needs to read each line. – mb925 Aug 29 '19 at 13:33
  • Probably running out of memory because you are trying to create a list `occurrences` which would have 168,321,807 entries. – DisappointedByUnaccountableMod Aug 29 '19 at 13:33
  • The RAM is my guess, too. 160E6 integers in a list require around 6GB of RAM on my machine. Counting the histogram bins instead as proposed by @tzaman's answer could help. If the file lines loop is still slow, it could be anything regarding I/O... – Jeronimo Aug 29 '19 at 13:39

1 Answer


Instead of building a list of all the values and then counting occurrences for each value, it'll be much faster to build the histogram directly while you iterate. You can use a collections.Counter for this:

from collections import Counter

histogram = Counter()
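with open(my_file, 'r') as f:
    next(f)  # skip the header line
    for line in f:
        # count the ';' separators in the third column, as in the question's code
        occ = line.split('\t')[2].count(';')
        histogram[occ] += 1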

# now histogram is a dictionary containing each "occurrence" value and the count of how many times it was seen.

x_axis = list(range(min(histogram), max(histogram)+1))
y_axis = [histogram[x] for x in x_axis]
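
Putting the pieces together, the approach above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names `count_semicolons` and `plot_histogram` are made up for this example, the file is assumed to be tab-separated with one header line as in the question's sample, and rows that lack a third column are treated as having zero entries.

```python
from collections import Counter


def count_semicolons(path, column=2):
    """Stream a tab-separated file and tally the ';' counts in one column.

    Hypothetical helper: only the Counter is kept in memory, never a list
    with one entry per line.
    """
    histogram = Counter()
    with open(path, 'r') as f:
        next(f, None)  # skip the header line, if there is one
        for line in f:
            fields = line.rstrip('\n').split('\t')
            # Assumption: a row without the column counts as zero entries.
            occ = fields[column].count(';') if len(fields) > column else 0
            histogram[occ] += 1
    return histogram


def plot_histogram(histogram):
    """Draw the bar chart from the tallies (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    x = list(range(min(histogram), max(histogram) + 1))
    y = [histogram[i] for i in x]  # a Counter returns 0 for keys it never saw
    plt.bar(x, y, align='center')
    plt.xlabel('Bins')
    plt.ylabel('Frequency')
    plt.show()
```

Calling `plot_histogram(count_semicolons('uniprot.tab'))` would then read the file once and plot the result. The Counter only holds one entry per distinct count, so its size does not grow with the number of lines.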
tzaman
  • Thank you, this is a cleaner way to plot the histogram and it works on the smaller file. Unfortunately it's not working on the big file (never finishes calculating). – mb925 Aug 29 '19 at 13:50
  • @mb925 What part is taking so long? Reading the file, or creating the plot? Note that reading 168M lines in itself might take some time (a few minutes maybe?) – tobias_k Aug 29 '19 at 14:12
  • @tobias_k Yes, after a few minutes it has now calculated the plot. Thanks! – mb925 Aug 29 '19 at 14:21
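
Since the run did finish after a few minutes, an occasional progress message would have shown that the loop was advancing rather than stuck. Printing every one of the 168 million lines is itself likely to be very slow, so the sketch below reports only every N lines. The function name `count_with_progress` and the `every` and `out` parameters are invented for this illustration; it is one possible way to do it, not something taken from the thread.

```python
import sys
import time
from collections import Counter


def count_with_progress(path, every=10_000_000, out=sys.stderr):
    """Tally ';' counts in the third column, reporting progress as it goes.

    Hypothetical helper: `every` is how many lines pass between messages.
    """
    histogram = Counter()
    start = time.monotonic()
    with open(path, 'r') as f:
        next(f, None)  # skip the header line, if there is one
        for n, line in enumerate(f, start=1):
            fields = line.rstrip('\n').split('\t')
            # Assumption: a row without a third column counts as zero entries.
            occ = fields[2].count(';') if len(fields) > 2 else 0
            histogram[occ] += 1
            if n % every == 0:
                elapsed = time.monotonic() - start
                print(f'{n:,} lines in {elapsed:.1f} s', file=out)
    return histogram
```

If the messages keep arriving at a steady rate, the script is simply working through a large file; if they stop, the problem is more likely memory or I/O, as guessed in the comments on the question.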