3

I am pretty new to Python. I am trying to process data in a very large, tab-delimited .csv file (~6.8 million lines). Example lines look like this:

Group1.1    57645   0.0954454545 
Group1.1    57662   0.09556544778
Group1.13   500 0.357114538 
Group1.13   504 0.320618298 
Group1.13   2370    0.483851368 
Group1.14   42  0.5495688

The first column gives the group, the second gives the position, and the third gives the value I am reading in to run a calculation on. I am trying to perform these calculations in a "sliding window" based on the position. Each group also has to be calculated separately, because the position number restarts for each group. In my code I first try to read the group IDs into a list before doing anything else, "uniqify" that list, and then use it as the basis for running the "sliding window" over one specific group at a time. I then move to the next group ID in the unique list and run the calculation again. Here are the basics of my code (the unique1 function is a simple method to uniqify a list):

for row in reader:
    scaffolds.append(row[0])
    unique1(scaffolds)
    newfile.seek(0)
    reader=csv.reader((line.replace('\0','') for line in newfile), delimiter="\t")
    if row[0] == unique_scaffolds[i]:
        #...perform the calculations
    else:
        i+=1

The problem I am running into is that it only reads the very first line of my data set and nothing more. If I insert a "print row" right after the "for row in reader", I get output like this:

['Group1.1', '424', '0.082048032']

If I write this exact same code without any of the further calculations and loops following, it will print every single row in the data set. In this situation how would I read in every line at the beginning of this loop?

Vadim Kotov
abovezero

3 Answers

1

You are re-initializing reader on every pass through the loop, which causes it to get stuck on the first line. Create the reader once, before the loop. Try this:

reader=csv.reader((line.replace('\0','') for line in newfile), delimiter="\t")
for row in reader:
    scaffolds.append(row[0])
    unique1(scaffolds)
    newfile.seek(0)

    if row[0] == unique_scaffolds[i]:
        #...perform the calculations
    else:
        i+=1
PearsonArtPhoto
  • I tried your suggestion of taking the reader object outside of the loop. It did not affect the calculations inside the loop, but in the first part, where I am trying to make a unique list of all group IDs, it is still only reading the first line from the file... – abovezero Nov 19 '12 at 18:25
  • Realize that csv.reader only reads one line at a time. You will have to build your own list by reading the rows in, one line at a time. – PearsonArtPhoto Nov 19 '12 at 18:30
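
To illustrate why the loop looks stuck, here is a minimal sketch, assuming Python 3 and an in-memory io.StringIO standing in for newfile. The first row is taken from the question's output; the other two rows are made-up sample data. Real file objects (and Python 2's buffered file iteration in particular) can differ in detail, so treat this as a demonstration of the idea rather than an exact reproduction.

```python
import csv
import io
from itertools import islice

# Stand-in for the asker's file; rows 2 and 3 are invented for illustration.
data = io.StringIO(
    "Group1.1\t424\t0.082048032\n"
    "Group1.1\t430\t0.09\n"
    "Group1.13\t500\t0.357114538\n"
)

reader = csv.reader(data, delimiter="\t")
seen = []
# islice caps the loop at 5 rows; without the cap it would never finish.
for row in islice(reader, 5):
    seen.append(row[:2])
    data.seek(0)  # rewinding here sends the next read back to line 1

print(seen)
```

Every entry in seen is the first row, because each seek(0) resets the position before the reader asks for its next line. Removing the seek from the loop body lets the reader advance through the file.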
0

It looks to me like you're replacing your reader object inside the loop. Fix that (or get rid of it) and you'll probably have better luck getting this to work.
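
One way to get rid of the rewind entirely is a single pass that groups consecutive rows by their first column. The sketch below is written for Python 3 and assumes that all rows of a group are contiguous in the file, as they are in the sample. The question does not say what the window calculation is, so the mean over a fixed position width used here is only a placeholder, and the names sliding_means and process_groups are hypothetical.

```python
import csv
import io
from collections import deque
from itertools import groupby
from operator import itemgetter

def sliding_means(rows, width):
    """Yield (position, mean of values whose position lies in (pos - width, pos]).

    Assumes width > 0 and positions in ascending order within a group.
    """
    window = deque()
    total = 0.0
    for pos, value in rows:
        window.append((pos, value))
        total += value
        # Drop entries that have fallen out of the window.
        while window[0][0] <= pos - width:
            total -= window.popleft()[1]
        yield pos, total / len(window)

def process_groups(handle, width):
    """Read the file once and run the window calculation per group."""
    reader = csv.reader((line.replace('\0', '') for line in handle), delimiter="\t")
    rows = (row for row in reader if row)  # skip blank lines
    results = {}
    for group, group_rows in groupby(rows, key=itemgetter(0)):
        parsed = ((int(r[1]), float(r[2])) for r in group_rows)
        results[group] = list(sliding_means(parsed, width))
    return results

sample = io.StringIO(
    "Group1.1\t57645\t0.0954454545\n"
    "Group1.1\t57662\t0.09556544778\n"
    "Group1.13\t500\t0.357114538\n"
    "Group1.13\t504\t0.320618298\n"
    "Group1.13\t2370\t0.483851368\n"
    "Group1.14\t42\t0.5495688\n"
)
results = process_groups(sample, width=100)
```

Because the position restarts for each group, a fresh window is created per group, and no list of unique IDs is needed up front. If a group's rows can be scattered through the file, this grouping would split them, and a different approach would be required.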

Eric
0

Realize that csv.reader only reads one line at a time. You will have to build your own list by reading the rows in, one line at a time.
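
A minimal sketch of that idea, assuming Python 3: finish one complete pass to build the list of group IDs, rewind once, and only then start the second pass. The helper unique_in_order stands in for the asker's unique1, whose implementation is not shown, and make_reader and rows_by_group are hypothetical names used for illustration.

```python
import csv
import io

def unique_in_order(items):
    """Remove duplicates while keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

def make_reader(handle):
    """Tab-delimited reader that strips NUL characters, as in the question."""
    return csv.reader((line.replace('\0', '') for line in handle), delimiter="\t")

def rows_by_group(handle):
    # Pass 1: read every row and collect the group IDs.
    scaffolds = [row[0] for row in make_reader(handle) if row]
    unique_scaffolds = unique_in_order(scaffolds)

    # Rewind exactly once, between the passes, never inside a loop.
    handle.seek(0)

    # Pass 2: collect (position, value) pairs for each group.
    grouped = {name: [] for name in unique_scaffolds}
    for row in make_reader(handle):
        if row:
            grouped[row[0]].append((int(row[1]), float(row[2])))
    return unique_scaffolds, grouped

sample = io.StringIO(
    "Group1.1\t57645\t0.0954454545\n"
    "Group1.1\t57662\t0.09556544778\n"
    "Group1.13\t500\t0.357114538\n"
    "Group1.13\t504\t0.320618298\n"
    "Group1.13\t2370\t0.483851368\n"
    "Group1.14\t42\t0.5495688\n"
)
unique_scaffolds, grouped = rows_by_group(sample)
```

This holds every row in memory, which may or may not be acceptable for ~6.8 million lines depending on the machine; the per-group lists can then be fed to whatever window calculation is needed.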

Adnan