
Pandas' read_csv accepts converters to pre-process each field. This is very useful, especially for int64 validation or mixed date formats. Could you please provide a way to read multiple columns as pl.Utf8 and then cast them to Int64, Float64, Date, etc.?

  • I'm not familiar with Polars, but looking at the docs, it seems like you could use polars.io.scan_csv with a dtype set to string for the column. That lazily reads in the dataframe, and then you could do the conversion after that. – Nick ODell Feb 15 '22 at 00:34
  • @narayanb I think you're probably looking for this: https://stackoverflow.com/questions/71106690/polars-specify-dtypes-for-all-columns-at-once-in-read-csv/ – daviewales Feb 15 '22 at 04:44
  • Specifically: [pl.Series.cast](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Series.cast.html) – daviewales Feb 15 '22 at 04:46

1 Answer


If you need to preprocess a column the way converters do in pandas, you can read that column with the pl.Utf8 dtype and use Polars expressions to process it before a cast.


csv = """a,b,c
#12,1,2,
#1,3,4
1,45,5""".encode()

(pl.read_csv(csv, dtypes={"a": pl.Utf8})
     .with_column(pl.col("a").str.replace("#", "").cast(pl.Int64))
)

Or, if you want to do the same for multiple columns of that dtype:

csv = """a,b,c,str_col
#12,1#,2foo,
#1,3#,4,bar
1,45#,5,ham""".encode()

pl.read_csv(
    file = csv,
).with_columns([
    pl.col(pl.Utf8).exclude("str_col").str.replace("#","").cast(pl.Int64),
])
ritchie46