Pandas' `read_csv` accepts `converters` to pre-process each field, which is very useful for things like Int64 validation or mixed date formats. Is there a way in Polars to read multiple columns as `pl.Utf8` and then cast them to `Int64`, `Float64`, `Date`, etc.?
- I'm not familiar with Polars, but looking at the docs, it seems like you could use `polars.io.scan_csv` with the dtype set to string for the column. That lazily reads in the dataframe, and then you could do the conversion after that. – Nick ODell Feb 15 '22 at 00:34
- @narayanb I think you're probably looking for this: https://stackoverflow.com/questions/71106690/polars-specify-dtypes-for-all-columns-at-once-in-read-csv/ – daviewales Feb 15 '22 at 04:44
- Specifically: [pl.Series.cast](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Series.cast.html) – daviewales Feb 15 '22 at 04:46
1 Answer
If you need to preprocess some columns the way `converters` do in pandas, you can read those columns with the `pl.Utf8` dtype and use polars expressions to process them before casting.
import polars as pl

csv = """a,b,c
#12,1,2
#1,3,4
1,45,5""".encode()

(
    pl.read_csv(csv, dtypes={"a": pl.Utf8})
    .with_columns(pl.col("a").str.replace("#", "").cast(pl.Int64))
)
Or, if you want to do the same to multiple columns of that dtype:
csv = """a,b,c,str_col
#12,1#,2#,foo
#1,3#,4,bar
1,45#,5,ham""".encode()

pl.read_csv(csv).with_columns(
    pl.col(pl.Utf8).exclude("str_col").str.replace("#", "").cast(pl.Int64)
)

ritchie46