I am attempting to read data from a .csv file into a Polars dataframe for analysis. I followed a prior Stack Overflow answer for the layout of the CsvReader's chained methods.
use polars::prelude::*;

fn read_csv_to_dataframe(path: &str) -> PolarsResult<DataFrame> {
    // Read the CSV, treating the first row as the header.
    CsvReader::from_path(path)?.has_header(true).finish()
}
When I run the code, I'm met with the following runtime error from Polars:
Could not parse `"TA1305000009"` as dtype `i64` at column 'end_station_id' (column number 8).
The current offset in the file is 108268613 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `"TA1305000009"` to the `null_values` list.
What I'd like to do:
- Parse the CSV file.
- Supply an override for the end_station_id column's i64 type, casting it to a string type.
- Return the resulting dataframe.
I can get the dataframe to render when I use `.with_ignore_errors(true)`, but I specifically need the end_station_id and start_station_id columns, so casting them to strings would be preferable.
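For reference, a minimal sketch of the ignore_errors variant I tried (the function name is just for illustration; it assumes the same Polars version as the snippet above, where unparseable values are read as nulls):

fn read_csv_ignoring_errors(path: &str) -> PolarsResult<DataFrame> {
    // Values that fail to parse under the inferred dtype become nulls.
    CsvReader::from_path(path)?
        .has_header(true)
        .with_ignore_errors(true)
        .finish()
}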
From reading the docs for CsvReader, I've run into a few possibilities:
1- with_dtypes() - Overwrite the schema with the dtypes in this given Schema. The given schema may be a subset of the total schema.
This looks to be the optimal solution, since I don't want to overwrite the whole schema; I just need to override a specific field.
2- with_schema() - Set the CSV file’s schema. This only accepts datatypes that are implemented in the csv parser and expects a complete Schema. It is recommended to use with_dtypes instead.
Ideally I would avoid this second option, since I'd rather not write out the whole schema from scratch.
I don't know how to implement either of these solutions, since both appear to ask for a Schema definition and I can't quite follow how to provide one.
Help for this newbie Rustacean would be appreciated!
EDIT:
I may have solved my own problem.
use polars::prelude::*;

fn read_csv_from_file(filepath: &str) -> PolarsResult<DataFrame> {
    // Infer each column's dtype from the first 10000 rows.
    let num_records: Option<usize> = Some(10000);
    CsvReader::from_path(filepath)?
        .has_header(true)
        .infer_schema(num_records)
        .finish()
}

fn main() {
    let path: String = String::from("/example_path/example_csv.csv");
    let df: PolarsResult<DataFrame> = read_csv_from_file(&path);
    println!("{:?}", df);
}
By inferring the schema over the first 10000 rows, I allowed Polars to read 10000 rows from the CSV and use that information to check the types within end_station_id. It could then see that there were string values as well as i64 values, and it correctly guessed that I'd need strings. (Note that if the first string value appeared after row 10000, inference would still guess i64, so an explicit dtype override is more robust.) I'd still want to figure out how to override dtypes, but this will work for now.
EDIT 2:
Doing a bit more digging, I found another similar question that appears to answer the question I had. Not 100%, but close.
https://stackoverflow.com/a/67135702/11098442
My resulting solution, which converts end_station_id to a string dtype by supplying an overriding schema:
use polars::prelude::*;
use std::error::Error;
use std::sync::Arc;

fn read_csv_from_file(filepath: &str) -> PolarsResult<DataFrame> {
    // A partial schema: only the fields listed here are overridden;
    // every other column's dtype is still inferred.
    let my_schema = Schema::from_iter(vec![Field::new("end_station_id", DataType::Utf8)]);
    println!("{:?}", my_schema);
    CsvReader::from_path(filepath)?
        .has_header(true)
        .with_dtypes(Some(Arc::new(my_schema)))
        .finish()
}
fn main() -> Result<(), Box<dyn Error>> {
    configure_the_environment(); // This sets a few of my preferred Polars configs; can be changed on a user basis.
    let path: String = String::from("/example_path/example_csv.csv");
    let df: DataFrame = read_csv_from_file(&path)?;
    println!("{:?}", df);
    Ok(())
}
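Since I need start_station_id as well, the same partial-schema approach should extend to both columns. A sketch under that assumption (the helper name read_csv_with_both_overrides is hypothetical):

fn read_csv_with_both_overrides(filepath: &str) -> PolarsResult<DataFrame> {
    // Override both station-id columns; every other dtype is still inferred.
    let my_schema = Schema::from_iter(vec![
        Field::new("start_station_id", DataType::Utf8),
        Field::new("end_station_id", DataType::Utf8),
    ]);
    CsvReader::from_path(filepath)?
        .has_header(true)
        .with_dtypes(Some(Arc::new(my_schema)))
        .finish()
}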