I am pretty new to Python. I am trying to process data in a very large .csv file (~6.8 million lines). Example lines look like this:
Group1.1 57645 0.0954454545
Group1.1 57662 0.09556544778
Group1.13 500 0.357114538
Group1.13 504 0.320618298
Group1.13 2370 0.483851368
Group1.14 42 0.5495688
The first column gives the group, the second gives the position, and the third gives the value I am reading in to run a calculation on. I am trying to perform these calculations in a "sliding window" based on the position. Each group has to be calculated separately, because the position number restarts for each group. In my code I first read the group IDs into a list, "uniqify" that list, and then use it as the basis for running the "sliding window" over one specific group at a time. I then move to the next group ID in the unique list and run the calculation again. Here are the basics of my code (the unique1 function is a simple method to uniqify a list):
for row in reader:
    scaffolds.append(row[0])
unique1(scaffolds)
newfile.seek(0)
reader=csv.reader((line.replace('\0','') for line in newfile), delimiter="\t")
if row[0] == unique_scaffolds[i]:
    #...perform the calculations
else:
    i+=1
The problem I am running into is that it only reads the very first line of my data set and nothing more. If I insert a "print row" right after the "for row in reader" line, I get output like this:
['Group1.1', '424', '0.082048032']
If I run this exact same code without any of the calculations and loops that follow, it prints every single row in the data set. In this situation, how would I read in every line at the beginning of this loop?
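For reference, the per-group sliding window described above could be sketched in a single pass, without first collecting the unique group IDs and rewinding the file. This is only a rough sketch under several assumptions: the file is tab-delimited with exactly three columns, all rows of a group are adjacent (as in the sample), and Python 3 is used. The names sliding_means, window and step are hypothetical, and the mean is a placeholder for the real calculation, which is not shown in the question.

```python
import csv
from bisect import bisect_left
from itertools import groupby
from operator import itemgetter


def sliding_means(lines, window=1000, step=500):
    """Yield (group, window_start, mean) for every non-empty window.

    Assumptions (not from the original code): rows of one group are
    adjacent, `window`/`step` are made-up sizes, and the mean stands in
    for the real per-window calculation.
    """
    # Strip NUL bytes the same way the original reader does.
    reader = csv.reader((line.replace('\0', '') for line in lines),
                        delimiter='\t')
    rows = (row for row in reader if row)  # skip blank lines

    # groupby hands over one run of consecutive rows per group ID.
    for group, group_rows in groupby(rows, key=itemgetter(0)):
        # Sort by position so bisect can locate the window edges.
        data = sorted((int(pos), float(val)) for _, pos, val in group_rows)
        positions = [pos for pos, _ in data]
        values = [val for _, val in data]

        # Slide the window [start, start + window) along the positions.
        for start in range(0, positions[-1] + 1, step):
            lo = bisect_left(positions, start)
            hi = bisect_left(positions, start + window)
            if lo < hi:
                yield group, start, sum(values[lo:hi]) / (hi - lo)
```

It could be driven with something like `with open(path) as f: for group, start, mean in sliding_means(f): ...`, where `path` is whatever file is being processed. Only one group is held in memory at a time, so the full 6.8 million lines never need to be loaded at once.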