
I have a directory full of tab-separated CSV files that each have 1M rows or more. Each file has an identical structure, including a "Date" column with no obvious ordering of the dates.

I want to read in the csv files and then split them up by month-year combination and then write them out again.

lazy_df = pl.scan_csv('data/file_*.tsv', separator='\t')
lazy_df = lazy_df.with_columns(pl.col("Date").str.to_datetime("%Y-%m-%d").alias("Date"))

lazy_df = lazy_df.with_columns(pl.col("Date").apply(lambda date: date.strftime("%Y-%m")).alias("year_month"))

year_months = lazy_df.select(['year_month']).unique().collect()

for year_month in year_months["year_month"]:
    df_month = lazy_df.filter(pl.col("year_month") == year_month).collect()

    # Write this dataframe to a new TSV file
    output_filename = f"data_{year_month}.tsv"
    df_month.write_csv(output_filename, separator='\t')

My jupyter kernel keeps crashing so I'm not sure if I'm using Polars the right way.

The code I tried is above.

deanm1

1 Answer


Using the .apply method defeats the performance advantage of Polars and is strongly discouraged unless the transformation can't be expressed as a native Polars expression.

I think something like this should do the trick for you:

lazy_df = (
    pl.scan_csv("data/file_*.tsv", separator="\t")
    .with_columns(pl.col("Date").str.to_datetime("%Y-%m-%d").alias("Date_dt"))
    .with_columns(
        pl.col("Date_dt")
        .dt.strftime("%Y-%m")
        .alias("year_month")
    )
    .collect()
)
lazy_df

where I've replaced the apply with the native .dt.strftime expression method. I also chained the whole pipeline onto the scan_csv call. In my test of 3 files, each with 100,000 rows containing a date string (YYYY-MM-DD) and a value (random.random(), i.e. some float x where 0.0 <= x < 1.0), the 300,000 rows were lazily gathered into a single DataFrame in 192 ms on my 4-core desktop.

Screenshot showing the resulting Polars lazily collected DataFrame with newly created 'year_month' columns via Polars string expression

And then for your subsequent Polars operations you don't need to call the .collect method again, because the frame has already been collected, so:

year_months = lazy_df.select(["year_month"]).unique()

year_months = year_months.to_dict()["year_month"].to_list()

for year_month in year_months:
    df_month = lazy_df.filter(pl.col("year_month") == year_month)

    # Write this dataframe to a new TSV file
    output_filename = f"./data/year_months_{year_month}.tsv"
    df_month.write_csv(output_filename, separator="\t")


EDIT: Extended Polars expression on pl.scan_csv

Alternatively, you could try something like:

lazy_df = (
    pl.scan_csv("data/file_*.tsv", separator="\t")
    .with_columns(pl.col("Date").str.to_datetime("%Y-%m-%d").alias("Date_dt"))
    .with_columns(
        pl.col("Date")
        .str.extract(r"([0-9]{4}-[0-9]{2})", 1)
        .alias("year_month")
    )
    .groupby("year_month")  # renamed to .group_by in Polars >= 0.19
    .agg(
        pl.count(),
        pl.col("*"),
    )
    .sort("count", descending=True)
    .collect()
)
lazy_df

which gives (using the same dataset of 300,000 dates mentioned previously): Groupby added to Polars expression

John Collins