
I have 3000+ CSVs with 10+ columns each. What I need is to get all unique values from just two of those columns. I am able to read the unique values in Polars:

import polars as pl

df1 = pl.read_csv("test1.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

I can read the remaining files one by one, e.g.:

df2 = pl.read_csv("test2.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

and check whether the two frames differ:

if not df1.frame_equal(df2):
    df = df1.vstack(df2)
    del(df1)
    del(df2)  

then call .drop_duplicates() on the result. But since all the input files are already sorted on the two columns (chr, pos), and the differences are only in the thousands out of ~16M input rows, I hope there is a better way to do it.
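For context, a minimal sketch of this approach applied to all files in a loop, reusing the parameter and method names from the snippets above (older Polars API); the glob pattern is hypothetical:

import glob
import polars as pl

combined = None
for fn in glob.glob("*.biobank.tsv.gz"):  # hypothetical pattern
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8,
                     columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    if combined is None:
        combined = df
    else:
        # stack the new rows and deduplicate right away to keep memory bounded
        combined = combined.vstack(df).drop_duplicates()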

Thank you for your help in advance

DK

Edit:

There is another way to do it using Polars and DuckDB.

  • create parquet files for each of the inputs
import glob
import polars as pl

tsv_pattern = "gwas_*.gz"

for fn in glob.glob(tsv_pattern):
    print(fn)
    parquet_fn = fn.replace(".gz", ".chr_pos.parquet")
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    df.to_parquet(parquet_fn, compression='zstd')
    del(df)

  • run DuckDB and execute (see the Python sketch below):
CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet';
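For completeness, a minimal sketch of the same step through the DuckDB Python API instead of the CLI; the directory and output file names are hypothetical:

import duckdb

con = duckdb.connect()  # in-memory database
con.execute(
    "CREATE TABLE my_table AS "
    "SELECT DISTINCT * FROM 'my_directory/*.parquet'"
)
# write the deduplicated (chr, pos) pairs back out as a single parquet file
con.execute("COPY my_table TO 'unique_chr_pos.parquet' (FORMAT PARQUET)")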

Credits go to Mark Mytherin from DuckDB

  • Why do you want something other than `drop_duplicates`? Is it running too slow? Semantically it's the same as what you do with DuckDB. – ritchie46 Nov 18 '21 at 04:08
  • Polars is super fast for drop_duplicates (15s for 16M rows, outputting zstd-compressed parquet per file). I was looking for a way to do it across 3k files, preferably in Polars. – darked89 Nov 18 '21 at 07:08
  • You can read all the files into a list of `DataFrame`s and then use `pl.concat(list_of_dfs).drop_duplicates()`. With `pl.concat` we concatenate all the `DataFrame`s into a single table, and then finally we drop all duplicates. – ritchie46 Nov 18 '21 at 07:17
  • I will try to do that, but I am a bit worried about holding 3K data frames with 16M rows. If frame_equal() is super fast I may compare each new frame to the already created ones. Or do this in batches. What would you suggest? – darked89 Nov 18 '21 at 08:02
  • What is your worry? Memory capacity? You could reduce the size by calling `drop_duplicates` after each CSV is parsed and then do one final `drop_duplicates`, similar to what you do in DuckDB (see the sketch after these comments). Otherwise it doesn't really matter: if all the data were in a single CSV, the compute would be approximately the same. The concats are cheap (if you don't rechunk). – ritchie46 Nov 18 '21 at 08:11
  • Yes, RAM may be an issue IMHO. To be on the safe side I wrote parquet Partitioned Datasets using Polars & Arrow, computed md5 on all .parquets, and am currently running DuckDB on the unique parquets from a given partition. – darked89 Nov 19 '21 at 15:31
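A minimal sketch of what these comments suggest (per-file drop_duplicates, then a single pl.concat and one final drop_duplicates), again reusing the older Polars API names from the question; the glob pattern is hypothetical:

import glob
import polars as pl

dfs = []
for fn in glob.glob("*.biobank.tsv.gz"):  # hypothetical pattern
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8,
                     columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    dfs.append(df)  # per-file dedupe keeps each frame small

# cheap vertical concatenation (no rechunk), then one final dedupe over everything
unique_chr_pos = pl.concat(dfs, rechunk=False).drop_duplicates()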

2 Answers


It sounds like merging k sorted arrays. I've found an article describing a solution, which I hope helps: https://medium.com/outco/how-to-merge-k-sorted-arrays-c35d87aa298e
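Since the input files are already sorted on (chr, pos), a k-way merge can stream through all of them while keeping only one row per file in memory. Below is a minimal sketch of that idea using the standard library's heapq.merge over plain gzipped-TSV readers; it is not the linked article's code, the glob pattern is hypothetical, and it assumes each file is sorted so that the (chr, int(pos)) tuples are non-decreasing:

import csv
import glob
import gzip
import heapq
import itertools

def chr_pos_rows(path):
    # yield (chr, pos) tuples from one sorted, gzipped TSV
    with gzip.open(path, "rt", newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            yield (row["#chrom"], int(row["pos"]))

streams = [chr_pos_rows(fn) for fn in glob.glob("*.biobank.tsv.gz")]  # hypothetical pattern

# heapq.merge lazily merges the sorted streams; groupby collapses
# consecutive duplicates, which is all that is needed on sorted input
for chrom, pos in (key for key, _ in itertools.groupby(heapq.merge(*streams))):
    print(chrom, pos)

Note that this opens all files at once, so with 3000+ inputs you may hit the open-file limit and need to merge in batches.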

prof_FL

You can use glob patterns to read the CSVs and then call `unique`.

(pl.scan_csv("**/*.csv")
 .unique()
 .collect())
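If the lazy CSV scanner does not pick up your gzipped TSVs directly, the same pattern can be pointed at the per-file parquet outputs from the edit in the question; the directory name is hypothetical:

(pl.scan_parquet("my_directory/*.parquet")
 .unique()
 .collect())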
ritchie46