
I'm trying to speed up the processing of several large log files in order to extract some metrics and store them in a Postgres database. For now I'm only working on the first step: keeping only the relevant lines of the log once they have been loaded.

This is the sample code that currently works with plain pandas:

import os
import regex as re
import pandas as pd

fp = "server.log"
data_lines = []

with open(fp, "rt", encoding="utf8") as file:
    lines = file.readlines()
    # data_lines += [
    #     line for line in lines
    #     if "POST" in line
    # ]
    data_lines += lines

# Processing
df = pd.DataFrame({"src": data_lines})
df.src = df.src.astype("string")

df = df[df.src.str.contains("POST")]

However, when I replace import pandas as pd with import modin.pandas as pd, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 67: invalid continuation byte

As shown, the text file is being opened with the correct encoding, and no error is thrown when running the same code with pandas. Please advise if this is not the intended way to use Modin.
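For reference, the encoding claim can be checked independently of both pandas and Modin. The following is a minimal sketch using only the standard library; the function name find_invalid_utf8_lines is hypothetical and not part of either library. It reads the file in binary mode and reports every line that fails strict UTF-8 decoding:

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, error_message) pairs for lines that are not valid UTF-8.

    Hypothetical helper for diagnosis only; not part of pandas or Modin.
    """
    bad_lines = []
    # Binary mode: no decoding happens while reading, so nothing can raise here.
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on invalid bytes
            except UnicodeDecodeError as exc:
                bad_lines.append((lineno, str(exc)))
    return bad_lines
```

Calling find_invalid_utf8_lines("server.log") and getting an empty list would suggest that the file itself decodes cleanly and that the error arises somewhere else; a non-empty list would point to the offending lines.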
