Unable to do regex based operations in a gzip file in Python

Question

I have a .gz file that contains several lines of text, and I need to run several regex-based operations on the data in it.

I get the following error when I call re.findall() on the lines extracted from the file:

File "C:\Users\santoshn\AppData\Local\Continuum\anaconda3\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)

TypeError: cannot use a string pattern on a bytes-like object

I have tried opening the file with mode "r", with the same result.

Do I have to decompress the file first and then run the regex operations, or is there a way to address this?

The data contains several lines of text; an example line is shown below:

ThreadContext 432 mov (8) <8;1,2>r2  <8;3,3>r4 Instruction count

possible duplicate of https://stackoverflow.com/questions/30478736/cant-use-string-pattern-on-bytes-like-object-pythons-re-error/30478822 and https://stackoverflow.com/questions/31019854/typeerror-cant-use-a-string-pattern-on-a-bytes-like-object-in-re-findall. Also you may want to read this: https://docs.python.org/3/library/gzip.html#gzip.open — bruno desthuilliers, Jun 18 '18 at 11:07
The above pointers pertain to HTML files processing. Mine are text files and these options do not work — Santosh Narayanan, Jun 18 '18 at 13:04
HTML __is__ text, and those links __are__ relevant to your question. — bruno desthuilliers, Jun 18 '18 at 13:28

score 1 · Answer 1 · answered Jun 19 '18 at 05:21

I was able to fix this issue by reading the file with gzip.open():

with gzip.open(file, "rb") as f:
    binFile = f.readlines()

After the file is read, each line is decoded from bytes to a str using the 'ascii' codec. After that, regex operations such as re.search() and re.findall() work fine.

for line in binFile:  # go over each line
    line = line.strip().decode('ascii')
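Putting the two steps together, a minimal self-contained sketch might look like the following. The temporary file name and the register pattern are made up for illustration; only the sample line comes from the question. It also shows an alternative that may be useful in some cases: keeping the data as bytes and using a bytes pattern (the `rb"..."` prefix), in which case the matches come back as bytes.

```python
import gzip
import os
import re
import tempfile

# Sample line from the question, written to a throwaway .gz file.
# The file name "sample.gz" is hypothetical.
sample = "ThreadContext 432 mov (8) <8;1,2>r2  <8;3,3>r4 Instruction count\n"
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wb") as f:
    f.write(sample.encode("ascii"))

# Hypothetical pattern: register operands such as r2 or r4.
register_pattern = re.compile(r"\br\d+\b")

matches = []
with gzip.open(path, "rb") as f:
    for raw in f:                            # each item is a bytes object
        line = raw.strip().decode("ascii")   # bytes -> str
        matches.extend(register_pattern.findall(line))

# Alternative: leave the data as bytes and use a bytes pattern instead.
byte_matches = re.findall(rb"\br\d+\b", sample.encode("ascii"))

print(matches)
print(byte_matches)
```

Note that decoding with 'ascii' raises UnicodeDecodeError if the file contains any non-ASCII byte, so a different codec may be needed for other data.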

score 0 · Answer 2 · answered Sep 10 '20 at 16:30

I know this is an old question, but I stumbled on it (as well as the other HTML references in the comments) while trying to sort out the same issue. Rather than opening the gzip file as binary ("rb") and then decoding it to ASCII, the gzip docs led me to simply open the file in text mode ("rt"), which allows normal string manipulation afterwards:

with gzip.open(filepath, "rt") as f:
    data = f.readlines()
    for line in data:
        split_string = date_time_pattern.split(line)
        # Whatever other string manipulation you may need.

The date_time_pattern variable is simply my compiled regex for different log date formats.
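For anyone who wants to try this end to end, here is a rough runnable sketch of the same idea. The log lines, the file name, and the exact date format in the pattern are assumptions for illustration, not my real data. It passes an explicit encoding, since text mode otherwise falls back to a platform-dependent default, and it iterates over the file object directly instead of calling readlines().

```python
import gzip
import os
import re
import tempfile

# Build a small gzip-compressed log to read back.
# The file name and log contents are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "app.log.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("2020-09-10 16:30:00 service started\n")
    f.write("2020-09-10 16:31:05 service stopped\n")

# Assumed timestamp format: YYYY-MM-DD HH:MM:SS.
date_time_pattern = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

messages = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:  # lines are already str, so str patterns work
        parts = date_time_pattern.split(line.rstrip("\n"))
        messages.append(parts[-1].strip())  # text after the timestamp

print(messages)
```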
