
I am trying to parse a gzipped csv file (where the fields are separated by | characters), to test if reading the file directly in Python will be faster than zcat file.gz | python in parsing the contents.

I have the following code:

#!/usr/bin/python3

import gzip

if __name__ == "__main__": 
    total=0
    count=0

    f=gzip.open('SmallData.DAT.gz', 'r')
    for line in f.readlines():
        split_line = line.split('|')
        total += int(split_line[52])
        count += 1

    print(count, " :: ", total)

But I get the following error:

$ ./PyZip.py 
Traceback (most recent call last):
  File "./PyZip.py", line 11, in <module>
    split_line = line.split('|')
TypeError: a bytes-like object is required, not 'str'

How can I modify this to read the line and split it properly?

I'm interested mainly in just the 52nd field as delimited by |. The lines in my input file are like the following:

field1|field2|field3|...field52|field53

Is there a faster way than what I have in summing all the values in the 52nd field?

Thanks!

Rusty Lemur
  • Does this answer your question? [Cannot split, a bytes-like object is required, not 'str'](https://stackoverflow.com/questions/50829364/cannot-split-a-bytes-like-object-is-required-not-str) – mkrieger1 Dec 21 '22 at 10:03

2 Answers


You should decode each line before splitting, because gzip.open in 'r' mode reads the file in binary mode and yields bytes:

split_line = line.decode('utf-8').split('|')
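
To illustrate the bytes/str mismatch behind the error, here is a minimal sketch. The sample line is made up for illustration; it stands in for a line as returned by gzip.open(..., 'r'). Either decoding first or splitting on a bytes separator avoids the TypeError.

```python
# Made-up sample line, as bytes, like the lines gzip.open(..., 'r') yields.
line = b"a|b|c\n"

# Option 1: decode to str, then split on a str separator.
as_text = line.decode("utf-8").split("|")

# Option 2: stay in bytes and split on a bytes separator.
as_bytes = line.split(b"|")

print(as_text)   # ['a', 'b', 'c\n']
print(as_bytes)  # [b'a', b'b', b'c\n']
```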

The code you have for summing all the values in the 52nd field is fine. There's no way to make it faster, because every line has to be read and split in order to locate the 52nd field.

blhsing

Try decoding the bytes object to a string, i.e.:

line.decode('utf-8')

Updated script:

#!/usr/bin/python3
import gzip

if __name__ == "__main__": 
    total=0
    count=0

    f=gzip.open('SmallData.DAT.gz', 'r')
    for line in f.readlines():
        # gzip.open in 'r' mode yields bytes, so decode before splitting
        split_line = line.decode("utf-8").split('|')
        total += int(split_line[52])
        count += 1

    print(count, " :: ", total)
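
As an alternative to decoding each line by hand, gzip.open also accepts text mode ('rt'), in which lines come back as str. The sketch below assumes the file is UTF-8 encoded; sum_field is a hypothetical helper name, not part of the original script. It iterates over the file object directly, so lines are read one at a time instead of being collected into the full list that readlines() builds.

```python
import gzip

def sum_field(path, index, sep="|"):
    """Return (line_count, total) for the integer field at position `index`.

    Hypothetical helper for illustration; assumes UTF-8 text and that the
    field at `index` always holds an integer.
    """
    total = 0
    count = 0
    # 'rt' opens the gzip stream in text mode, so each line is already a str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # Iterating the file object yields one line at a time.
        for line in f:
            total += int(line.split(sep)[index])
            count += 1
    return count, total
```

With the file from the question, the call would be sum_field('SmallData.DAT.gz', 52), which uses the same index as the original script.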