
I have 3000+ CSVs with 10+ columns each. What I need is to get all unique values from just two of those columns. I am able to read the unique values in Polars:

import polars as pl

df1 = pl.read_csv("test1.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

I can read the remaining files one by one, e.g.:

df2 = pl.read_csv("test2.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

and check whether the two frames differ:

if not df1.frame_equal(df2):
    df = df1.vstack(df2)
    del(df1)
    del(df2)  

then call .drop_duplicates() on the result. But since all the input files are already sorted on the two columns (chr, pos), and the differences are only in the thousands out of ~16M input rows, I hope there is a better way to do it.
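For context, a minimal sketch of this approach applied to all files in a loop, reusing the parameter and method names from the snippets above (older Polars API); the glob pattern is hypothetical:

import glob
import polars as pl

combined = None
for fn in glob.glob("*.biobank.tsv.gz"):  # hypothetical pattern
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8,
                     columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    if combined is None:
        combined = df
    else:
        # stack the new rows and deduplicate right away to keep memory bounded
        combined = combined.vstack(df).drop_duplicates()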

Thank you for your help in advance

DK

Edit:

There is another way to do it using Polars and DuckDB.

  • create parquet files for each of the inputs
import glob
import polars as pl

tsv_pattern = "gwas_*.gz"

for fn in glob.glob(tsv_pattern):
    print(fn)
    parquet_fn = fn.replace(".gz", ".chr_pos.parquet")
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    df.to_parquet(parquet_fn, compression='zstd')
    del(df)

  • run DuckDB and execute (see the Python sketch below):
CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet';
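For completeness, a minimal sketch of the same step through the DuckDB Python API instead of the CLI; the directory and output file names are hypothetical:

import duckdb

con = duckdb.connect()  # in-memory database
con.execute(
    "CREATE TABLE my_table AS "
    "SELECT DISTINCT * FROM 'my_directory/*.parquet'"
)
# write the deduplicated (chr, pos) pairs back out as a single parquet file
con.execute("COPY my_table TO 'unique_chr_pos.parquet' (FORMAT PARQUET)")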

Credits go to Mark Mytherin from DuckDB

  • Why do you want something other than `drop_duplicates`? Is it running too slow? Semantically it's the same as what you do with DuckDB. – ritchie46 Nov 18 '21 at 04:08
  • Polars is super fast for drop_duplicates (15s for 16M rows, outputting zstd-compressed parquet per file). I was looking for a way to do it across 3k files, preferably in Polars. – darked89 Nov 18 '21 at 07:08
  • You can read all the files into a list of `DataFrame`s and then use `pl.concat(list_of_dfs).drop_duplicates()`. With `pl.concat` we concatenate all the `DataFrame`s into a single table, and then finally we drop all duplicates. – ritchie46 Nov 18 '21 at 07:17
  • I will try to do that, but I am a bit worried about holding 3K data frames with 16M rows. If frame_equal() is super fast I may compare each new frame to the already created ones. Or do this in batches. What would you suggest? – darked89 Nov 18 '21 at 08:02
  • What is your worry? Memory capacity? You could reduce the size by calling `drop_duplicates` after each CSV is parsed and then do one final `drop_duplicates`, similar to what you do in DuckDB (see the sketch after these comments). Otherwise it doesn't really matter: if all the data were in a single CSV, the compute would be approximately the same. The concats are cheap (if you don't rechunk). – ritchie46 Nov 18 '21 at 08:11
  • Yes, RAM may be an issue IMHO. To be on the safe side I wrote parquet Partitioned Datasets using Polars & Arrow, computed md5 on all .parquets, and am currently running DuckDB on the unique parquets from a given partition. – darked89 Nov 19 '21 at 15:31
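A minimal sketch of what these comments suggest (per-file drop_duplicates, then a single pl.concat and one final drop_duplicates), again reusing the older Polars API names from the question; the glob pattern is hypothetical:

import glob
import polars as pl

dfs = []
for fn in glob.glob("*.biobank.tsv.gz"):  # hypothetical pattern
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8,
                     columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    dfs.append(df)  # per-file dedupe keeps each frame small

# cheap vertical concatenation (no rechunk), then one final dedupe over everything
unique_chr_pos = pl.concat(dfs, rechunk=False).drop_duplicates()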

2 Answers


It sounds like merging k sorted arrays. I've found an article describing a solution, which I hope helps: https://medium.com/outco/how-to-merge-k-sorted-arrays-c35d87aa298e
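Since the input files are already sorted on (chr, pos), a k-way merge can stream through all of them while keeping only one row per file in memory. Below is a minimal sketch of that idea using the standard library's heapq.merge over plain gzipped-TSV readers; it is not the linked article's code, the glob pattern is hypothetical, and it assumes each file is sorted so that the (chr, int(pos)) tuples are non-decreasing:

import csv
import glob
import gzip
import heapq
import itertools

def chr_pos_rows(path):
    # yield (chr, pos) tuples from one sorted, gzipped TSV
    with gzip.open(path, "rt", newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            yield (row["#chrom"], int(row["pos"]))

streams = [chr_pos_rows(fn) for fn in glob.glob("*.biobank.tsv.gz")]  # hypothetical pattern

# heapq.merge lazily merges the sorted streams; groupby collapses
# consecutive duplicates, which is all that is needed on sorted input
for chrom, pos in (key for key, _ in itertools.groupby(heapq.merge(*streams))):
    print(chrom, pos)

Note that this opens all files at once, so with 3000+ inputs you may hit the open-file limit and need to merge in batches.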

prof_FL

You can use glob patterns to read the CSVs and then call `unique`.

(pl.scan_csv("**/*.csv")
 .unique()
 .collect())
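If the lazy CSV scanner does not pick up your gzipped TSVs directly, the same pattern can be pointed at the per-file parquet outputs from the edit in the question; the directory name is hypothetical:

(pl.scan_parquet("my_directory/*.parquet")
 .unique()
 .collect())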
ritchie46