3

I have a .gz file that contains several strings. I need to perform several regex-based operations on the data in that file.

I get the following error when I call re.findall() on the lines extracted from the file:

File "C:\Users\santoshn\AppData\Local\Continuum\anaconda3\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)

TypeError: cannot use a string pattern on a bytes-like object

I have tried opening the file with mode "r", with the same result.

Do I have to decompress the file first and then do the regex operations, or is there a way to address this?

The data contains several text lines; an example line is shown below:

ThreadContext 432 mov (8) <8;1,2>r2  <8;3,3>r4 Instruction count
Austin
  • Possible duplicate of https://stackoverflow.com/questions/30478736/cant-use-string-pattern-on-bytes-like-object-pythons-re-error/30478822 and https://stackoverflow.com/questions/31019854/typeerror-cant-use-a-string-pattern-on-a-bytes-like-object-in-re-findall. Also you may want to read this: https://docs.python.org/3/library/gzip.html#gzip.open – bruno desthuilliers Jun 18 '18 at 11:07
  • The above pointers pertain to HTML files processing. Mine are text files and these options do not work – Santosh Narayanan Jun 18 '18 at 13:04
  • HTML __is__ text, and those links __are__ relevant to your question. – bruno desthuilliers Jun 18 '18 at 13:28

2 Answers

1

I was able to fix this issue by reading the file with gzip.open() in binary mode:

with gzip.open(file, "rb") as f:
    binFile = f.readlines()

After the file is read, each line is decoded from bytes to a string using the 'ascii' codec. After that, all regex operations such as re.search() and re.findall() work fine.

for line in binFile:  # go over each line
    line = line.strip().decode('ascii')
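To make the cause of the error concrete, here is a minimal, self-contained sketch. It assumes a small throwaway .gz file built from the sample line in the question; the names sample, gz_path, and the pattern r"r\d+" are made up for illustration and are not from the original post. It shows that gzip.open() with mode "r" still returns bytes, and that you can either decode each line or use a bytes pattern instead.

```python
import gzip
import os
import re
import tempfile

# Hypothetical sample data, modelled on the example line in the question.
sample = "ThreadContext 432 mov (8) <8;1,2>r2  <8;3,3>r4 Instruction count\n"

# Write a small .gz file so the example is self-contained.
gz_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(sample.encode("ascii"))

# With gzip.open(), mode "r" is binary, so every line is a bytes object.
with gzip.open(gz_path, "r") as f:
    raw_lines = f.readlines()

# Option 1: decode each line to str, then use an ordinary str pattern.
decoded_lines = [raw.strip().decode("ascii") for raw in raw_lines]
registers = [re.findall(r"r\d+", line) for line in decoded_lines]

# Option 2: keep the data as bytes and use a bytes pattern (note the b prefix).
registers_bytes = [re.findall(rb"r\d+", raw) for raw in raw_lines]
```

Decoding is usually the more convenient choice if the results are going to be printed or compared with ordinary strings later on.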

0

I know this is an old question, but I stumbled on it (as well as the other HTML references in the comments) while trying to sort out the same issue. Rather than opening the gzip file as binary ("rb") and then decoding it to ASCII, the gzip docs led me to simply open the file as text ("rt"), which allows normal string manipulation afterwards:

with gzip.open(filepath,"rt") as f: 
    data = f.readlines()
    for line in data:
        split_string = date_time_pattern.split(line)
        # Whatever other string manipulation you may need.

The date_time_pattern variable is simply my compiled regex for different log date formats.
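For anyone who wants something to run as-is, here is a rough end-to-end sketch of the same idea. The regex, the log lines, and the file name are all placeholders I made up (my real date_time_pattern is different), and the explicit encoding="utf-8" is an assumption about the file contents rather than something the approach requires.

```python
import gzip
import os
import re
import tempfile

# Hypothetical pattern and log text, standing in for the real ones.
date_time_pattern = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
log_text = "2018-06-18 11:07:00 start\n2018-06-18 13:04:00 stop\n"

# Create a small gzip file in text mode so the example is self-contained.
gz_path = os.path.join(tempfile.mkdtemp(), "sample.log.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write(log_text)

messages = []
# "rt" makes gzip.open() yield str lines, so str patterns work directly.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    for line in f:  # iterate lazily instead of loading every line at once
        parts = date_time_pattern.split(line.rstrip("\n"))
        # Keep the text that follows the timestamp.
        messages.append(parts[-1].strip())
```

Iterating over the file object directly avoids holding the whole decompressed file in memory, which can matter for large logs.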

William