I have a very large csv file with roughly 30,000 lines and 25 columns that gets produced daily. I need to filter this file to contain only rows that are of interest for me. It is of the form:

date, time, user, entity, party1, party2
20131001, 00:01, user1, ABC, XXX, XXX
20131002, 00:01, user2, XYZ/ABC, XXX, ABC
20131003, 00:01, user1, DEF, ABC, XXX

For example I need to delete all rows that have entity=ABC. I was thinking of either

  1. read the file in and delete each line that contains ABC. But that would also remove lines I actually need; I only want to delete rows where ABC appears in the entity column.

  2. use the csv module in Python and try to achieve the same. I've read through the functions available in csv, but none of them seems to let me filter on the value of a specific column.

I am not necessarily looking for an answer in code, but any general advice on how to solve this problem would be welcome.

Thanks a lot.

thefourtheye
Eric

2 Answers


You can certainly do what you want with Python's csv module, as you suggest and as @DhruvPathak outlines in his answer, but I think it's much simpler to do it with a one-line awk script:

$ awk -F ', ' '{ if ($4 != "ABC") print; }' < file.txt
date, time, user, entity, party1, party2
20131002, 00:01, user2, XYZ/ABC, XXX, ABC
20131003, 00:01, user1, DEF, ABC, XXX

where file.txt contains your data.
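For comparison, the same filter in Python's csv module might look like the sketch below (it uses the sample data from the question inline; the `strip()` call accounts for the space that follows each comma in the file):

```python
import csv
import io

data = """date, time, user, entity, party1, party2
20131001, 00:01, user1, ABC, XXX, XXX
20131002, 00:01, user2, XYZ/ABC, XXX, ABC
20131003, 00:01, user1, DEF, ABC, XXX
"""

# Keep only rows whose entity column (4th field, index 3) is not "ABC".
# strip() removes the leading space left after splitting on bare commas.
kept = [row for row in csv.reader(io.StringIO(data))
        if row[3].strip() != "ABC"]

for row in kept:
    print(row)
```

In a real script you would replace `io.StringIO(data)` with an open file handle, and write `kept` back out with `csv.writer`.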

nickie
for mycsv_line in csv_reader:
    # entity is the 4th column, i.e. index 3 (not 4);
    # strip() removes the space that follows each comma
    if mycsv_line[3].strip() != "ABC":
        result.append(mycsv_line)
DhruvPathak