
I'm using polars and I would like to define the type of the columns while loading a dataframe. In pandas, I can use dtype:

df=pd.read_csv("iris.csv", dtype={'petal_length':str})

I'm trying to do the same thing in polars, but so far without success. Here is what I have tried:

use polars::prelude::*;
use std::fs::File;
use std::collections::HashMap;


fn main() {
    let df = example();
    println!("{:?}", df.expect("Cannot find dataframe").head(Some(10)))
}

fn example() -> Result<DataFrame> {
    let file = File::open("iris.csv")
                    .expect("could not read file");
    let mut myschema = HashMap::new();
    myschema.insert("sepal_length", f64);
    myschema.insert("sepal_width", f64); 
    myschema.insert("petal_length",String); 
    myschema.insert("petal_width", f64); 
    myschema.insert("species", String); 

    CsvReader::new(file)
            .with_schema(myschema)
            .has_header(true)
            .finish()
}

My question is: what type does with_schema expect? I printed the schema of the DataFrame loaded using infer_schema(None). This prints an object that looks like a dictionary:

Schema { fields: [Field { name: "sepal_length", data_type: Float64 }, Field { name: "sepal_width", data_type: Float64 }, Field { name: "petal_length", data_type: Float64 }, Field { name: "petal_width", data_type: Float64 }, Field { name: "species", data_type: Utf8 }] }

But I cannot figure out what object I should use to implement my schema.

Also, is there a way to specify the type of one column, instead of all of them?

ritchie46
Lucas

2 Answers


The with_schema method expects an Arc<Schema>, not a HashMap.

The following code works:

use polars::prelude::*;
use std::sync::Arc;

fn example() -> Result<DataFrame> {
    let file = "iris.csv";

    let myschema = Schema::new(
        vec![
            Field::new("sepal_length", DataType::Float64),
            Field::new("sepal_width", DataType::Float64),
            Field::new("petal_length", DataType::Utf8),
            Field::new("petal_width", DataType::Float64),
            Field::new("species", DataType::Utf8),
        ]
    );

    CsvReader::from_path(file)?
        .with_schema(Arc::new(myschema))
        .has_header(true)
        .finish()
}
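
For readers unfamiliar with Arc, here is a minimal standard-library sketch of what wrapping a schema in Arc buys you: several owners can hold the same schema without copying its fields. MiniSchema and column_count are hypothetical stand-ins invented for this illustration, not polars types or functions.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a schema: column names paired with type names.
#[derive(Debug)]
struct MiniSchema {
    fields: Vec<(String, String)>,
}

// Hypothetical reader-like function that takes shared ownership of the schema.
fn column_count(schema: Arc<MiniSchema>) -> usize {
    schema.fields.len()
}

fn main() {
    let schema = Arc::new(MiniSchema {
        fields: vec![
            ("petal_length".to_string(), "Utf8".to_string()),
            ("species".to_string(), "Utf8".to_string()),
        ],
    });

    // Arc::clone copies the pointer and bumps a reference count;
    // the field vector itself is not copied.
    let shared = Arc::clone(&schema);
    println!("columns: {}", column_count(shared));

    // The original handle is still usable after a clone was handed away.
    println!("handles alive: {}", Arc::strong_count(&schema));
    println!("{:?}", schema);
}
```

This is why the schema is passed as Arc::new(myschema) above: the reader can keep a handle to the schema while the caller keeps its own.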

Also, is there a way to specify the type of one column, instead of all of them?

Yes, you can use with_dtype_overwrite, which expects a partial schema.
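
To make "partial schema" concrete, here is a small standard-library model of the idea: columns named in the partial schema take the overriding type, and every other column keeps its inferred type. DType and apply_overrides are hypothetical names made up for this sketch. It is not how polars implements with_dtype_overwrite, and the exact signature of that method depends on the polars version, so check the docs for the release you use.

```rust
// Conceptual model only: these are hypothetical stand-ins, not polars types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    Float64,
    Utf8,
}

/// Apply a partial schema on top of an inferred one: columns named in
/// `overrides` take the overriding dtype, all others keep the inferred dtype.
fn apply_overrides(
    inferred: &[(&str, DType)],
    overrides: &[(&str, DType)],
) -> Vec<(String, DType)> {
    inferred
        .iter()
        .map(|&(name, dtype)| {
            let chosen = overrides
                .iter()
                .find(|&&(o_name, _)| o_name == name)
                .map(|&(_, o_dtype)| o_dtype)
                .unwrap_or(dtype);
            (name.to_string(), chosen)
        })
        .collect()
}

fn main() {
    // What schema inference reported for iris.csv in the question.
    let inferred = [
        ("sepal_length", DType::Float64),
        ("sepal_width", DType::Float64),
        ("petal_length", DType::Float64),
        ("petal_width", DType::Float64),
        ("species", DType::Utf8),
    ];
    // Partial schema: only the column whose type should change.
    let overrides = [("petal_length", DType::Utf8)];

    for (name, dtype) in apply_overrides(&inferred, &overrides) {
        println!("{name}: {dtype:?}");
    }
}
```

With polars itself, the partial schema would contain only the petal_length field, built the same way as the full schema above.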

ritchie46

A slight update to ritchie46's answer. As Robert stated, the vector needs to be changed to an iterator, and it looks like we should now use from instead of new. I have not executed the code below, but it compiles.

...
        let myschema = Schema::from(
            vec![
                Field::new("sepal_length", DataType::Float64),
                Field::new("sepal_width", DataType::Float64),
                Field::new("petal_length", DataType::Utf8),
                Field::new("petal_width", DataType::Float64),
                Field::new("species", DataType::Utf8),
            ]
            .into_iter(),
        );
...
C. Thomas Brittain