
In Polars, how can one specify a single dtype for all columns in read_csv?

According to the docs, the dtypes argument to read_csv can take either a mapping (dict) in the form of {'column_name': dtype}, or a list of dtypes, one for each column. However, it is not clear how to specify "I want all columns to be a single dtype".
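
For reference, the mapping form looks like this (the column names a and b are just placeholders):

pl.read_csv('sample.csv', dtypes={'a': pl.Utf8, 'b': pl.Int64})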

If you wanted all columns to be Utf8, for example, and you knew the total number of columns, you could do:

pl.read_csv('sample.csv', dtypes=[pl.Utf8]*number_of_columns)

However, this doesn't work if you don't know the total number of columns. In Pandas, you could do something like:

pd.read_csv('sample.csv', dtype=str)

But this doesn't work in Polars.

daviewales

2 Answers


Reading all the data in a CSV as any type other than pl.Utf8 is likely to produce a lot of null values. We can use expressions to declare how we want to deal with those null values.

If you read a CSV with infer_schema_length=0, Polars does not infer the schema and reads all columns as pl.Utf8, since that is a supertype of all Polars types.

Once the data is read as Utf8, we can use expressions to cast all columns.

(pl.read_csv("test.csv", infer_schema_length=0)
   # strict=False turns unparseable values into nulls instead of raising an error
   .with_columns(pl.all().cast(pl.Int32, strict=False)))
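
The nulls produced by the failed casts can then be handled with a follow-up expression; a minimal sketch, using fill_null(0) as one arbitrary choice:

(pl.read_csv("test.csv", infer_schema_length=0)
   .with_columns(pl.all().cast(pl.Int32, strict=False))
   .fill_null(0))
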
ritchie46

If you want to read all columns as str (pl.Utf8 in Polars), set infer_schema_length=0; Polars then falls back to string as the default type when reading CSVs:

pl.read_csv('sample.csv', infer_schema_length=0)
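
As a quick check (assuming sample.csv exists), every column should then come back as Utf8:

df = pl.read_csv('sample.csv', infer_schema_length=0)
print(df.dtypes)  # one Utf8 entry per column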

This is the TL;DR of ritchie46's more detailed answer. I broke it out into a separate answer because his code snippet solves the general case of casting to any datatype, not just the special but common case of reading everything as strings.

Cornelius Roemer