6

Ahoy, I'm writing a Python script to filter some large CSV files.

I only want to keep rows which meet my criteria.

My input is a CSV file in the following format:

Locus         Total_Depth  Average_Depth_sample   Depth_for_17
chr1:6484996  1030         1030                   1030
chr1:6484997  14           14                     14
chr1:6484998  0            0                      0

I want to return lines where the Total_Depth is 0.

I've been following this answer to read the data, but I am stuck trying to iterate over the rows and pull out the lines that meet my condition.

Here is the code I have so far:

import csv

f = open("file path", 'rb')
reader = csv.reader(f) #reader object which iterates over a csv file(f)
headers = reader.next() #assign the first row to the headers variable
column = {} #dictionary mapping each header to a list of that column's values
for h in headers: #for each header
    column[h] = []
for row in reader: #for each row in the reader object
    for h, v in zip(headers, row): #combine header names with row values (v) in a series of tuples
        column[h].append(v) #append each value to the relevant column

I understand that my data is now in a dictionary format, and I want to filter it based on the "Total_Depth" key, but I am unsure how to do this. I'm aiming to use an 'if' statement to select the relevant rows, but not sure how to do this with the dictionary structure.

Any advice would be greatly appreciated. SB :)

s_boardman

3 Answers

11

Use a list comprehension with csv.DictReader.

import csv

with open("filepath", 'rb') as f:
    reader = csv.DictReader(f)
    # Keep only the rows whose Total_Depth is zero; csv yields strings, so compare with '0'
    rows = [row for row in reader if row['Total_Depth'] == '0']

for row in rows:
    print row

See csv.DictReader in the csv module documentation.
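
For readers on Python 3, the same idea might look like the sketch below, which keeps the rows where Total_Depth is '0', as the question asks. It is only an illustration under stated assumptions: the data is assumed to be comma-delimited, and `SAMPLE` and `zero_depth_rows` are hypothetical names that stand in for the real file and are not part of the original answer.

```python
import csv
import io

# Hypothetical in-memory sample standing in for the real file.
# Assumption: the real data is comma-delimited.
SAMPLE = """Locus,Total_Depth,Average_Depth_sample,Depth_for_17
chr1:6484996,1030,1030,1030
chr1:6484997,14,14,14
chr1:6484998,0,0,0
"""


def zero_depth_rows(lines):
    """Return the rows whose Total_Depth field is the string '0'."""
    reader = csv.DictReader(lines)
    # csv yields every field as a string, so compare against '0', not 0.
    return [row for row in reader if row['Total_Depth'] == '0']


rows = zero_depth_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row['Locus'], row['Total_Depth'])
```

With a real file in Python 3, you would pass `open(path, newline='')` instead of the `io.StringIO` object; the `'rb'` mode used above applies to Python 2 only.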

falsetru
2

If you store the zipped header/value pairs as a dictionary, you can check the appropriate header before appending:

...
for row in reader: #for each row in the reader object
    r = dict(zip(headers, row)) #map each header to this row's value
    if r['Total_Depth'] == '0': #csv values are strings, so compare with '0'
        for h, v in r.items():
            column[h].append(v)
blazetopher
  • @s_boardman I'm not sure if it fits your problem, but you might have a look at [numpy.genfromtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html). The potential benefit is that the function returns a structured numpy.ndarray, which allows for advanced slicing. You'd also have fine-grained control over your data types (if that's important). – blazetopher Jun 21 '13 at 16:02
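
A rough sketch of the numpy.genfromtxt suggestion, assuming NumPy is installed and Python 3 is in use. `SAMPLE` is a hypothetical in-memory stand-in for the real file, written whitespace-delimited as the sample appears in the question; for comma-delimited data you would presumably add `delimiter=','`.

```python
import io

import numpy as np

# Hypothetical sample; whitespace-delimited, as the data is displayed in the question.
SAMPLE = """Locus Total_Depth Average_Depth_sample Depth_for_17
chr1:6484996 1030 1030 1030
chr1:6484997 14 14 14
chr1:6484998 0 0 0
"""

# names=True takes the field names from the first line;
# dtype=None lets NumPy infer a type for each column.
data = np.genfromtxt(io.StringIO(SAMPLE), names=True, dtype=None, encoding="utf-8")

# A boolean mask on one field selects whole rows of the structured array.
zero_depth = data[data["Total_Depth"] == 0]
print(zero_depth["Locus"])
```

Because Total_Depth is inferred as an integer column here, the comparison is against the number 0 rather than the string '0'.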
1

The dictionary of lists that you are using makes row operations quite difficult, because each row is spread across several parallel lists that you have to keep in sync. namedtuples are a much more convenient way to collect and operate on tabular data.

The other answers satisfy the exact problem you have. Using a more friendly data structure will help with the problems you have tomorrow.
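
As a minimal sketch of what the namedtuple approach could look like (Python 3 syntax; `SAMPLE`, `Record` and `read_records` are hypothetical names, and the data is assumed to be comma-delimited with headers that are valid Python identifiers):

```python
import csv
import io
from collections import namedtuple

# Hypothetical in-memory sample standing in for the real file.
SAMPLE = """Locus,Total_Depth,Average_Depth_sample,Depth_for_17
chr1:6484996,1030,1030,1030
chr1:6484997,14,14,14
chr1:6484998,0,0,0
"""


def read_records(lines):
    """Read CSV lines into a list of namedtuples, one per data row."""
    reader = csv.reader(lines)
    headers = next(reader)  # the first row holds the column names
    # Assumption: every header is a valid identifier, as in the sample.
    Record = namedtuple('Record', headers)
    return [Record(*row) for row in reader]


records = read_records(io.StringIO(SAMPLE))

# Each row is one object, so filtering is a plain comprehension.
zero_depth = [r for r in records if int(r.Total_Depth) == 0]
for r in zero_depth:
    print(r.Locus, r.Total_Depth)
```

Each row stays together as a single record, so a filter on one field never has to touch the other columns separately.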

msw
  • Thanks @msw, I'll try digging in to namedtuples and see if I can build a better version of the script with that. :) – s_boardman Jun 21 '13 at 15:31