0

Problem

How to read a dataframe in polars from mysql. Docs are silent on the issue. Currently probably there is only support for parquet, json, ipc, etc, and no direct support for sql as mentioned here.

Regardless what would be an appropriate method to read in data using libraries like: sqlx or mysql

Current Approach

Currently I am following this approach as provided in this answer:

  1. Read in a Vec<Struct> using sqlx
  2. Convert it into a tuple of vecs (Vec<T>, Vec<T>) using the code below
  3. Convert (Vec<T>, Vec<T>) into (Series, Series)
  4. Create a dataframe using: DataFrame::new(vec![s0, s1]); where s0 and s1 are Series
struct A(u8, i8);

fn main() {
    let v = vec![A(1, 4), A(2, 6), A(3, 5)];

    let result = v.into_iter()
        .fold((vec![], vec![]), |(mut u, mut i), item| {
            u.push(item.0);
            i.push(item.1);
            (u, i)
        });
    
    dbg!(result);

    // `result` is just a tuple of vectors
    // let (unsigneds, signeds): (Vec<u8>, Vec<i8>) = result;
}
Hamza Zubair
  • 1,232
  • 13
  • 21

2 Answers2

4

This can help you?

let schema = Arc::new(Schema::new(vec![
    Field::new("country", DataType::Int64, false),
    Field::new("count", DataType::Int64, false),
]));

let datas = RecordBatch::try_new(
    schema.clone(),
    vec![
        Arc::new(Int64Array::from(vec![1, 1, 2])),
        Arc::new(Int64Array::from(vec![1, 2, 3])),
    ],
)
.unwrap();

let mut df = DataFrame::try_from(datas)?;
Zeppi
  • 1,175
  • 6
  • 11
4

Same answer as in this question, seems quite duplicate IMO.

You could use the builders for that or collect from iterators. Collecting from iterators is often fast, but in this case it requires you to loop the Vec<Country> twice, so you should benchmark.

Below is an example function for both the solutions shown.

use polars::prelude::*;

struct Country {
    country: String,
    count: i64,
}

fn example_1(values: &[Country]) -> (Series, Series) {
    let ca_country: Utf8Chunked = values.iter().map(|v| &*v.country).collect();
    let ca_count: NoNull<Int64Chunked> = values.iter().map(|v| v.count).collect();
    let mut s_country: Series = ca_country.into();
    let mut s_count: Series = ca_count.into_inner().into();
    s_country.rename("country");
    s_count.rename("country");
    (s_count, s_country)
}

fn example_2(values: &[Country]) -> (Series, Series) {
    let mut country_builder = Utf8ChunkedBuilder::new("country", values.len(), values.len() * 5);
    let mut count_builder = PrimitiveChunkedBuilder::<Int64Type>::new("count", values.len());

    values.iter().for_each(|v| {
        country_builder.append_value(&v.country);
        count_builder.append_value(v.count)
    });

    (
        count_builder.finish().into(),
        country_builder.finish().into(),
    )
}

Once you've got the Series, you can use DataFrame::new(columns) where columns: Vec<Series> to create a DataFrame.

Btw, if you want maximum performance, I really recommend connector-x. It has got polars and arrow integration and has got insane performance.

ritchie46
  • 10,405
  • 1
  • 24
  • 43