I have a dataframe that looks like this: [image]

I would like to convert the first column, 'time', which holds date strings, into a column of plain numbers in "YYYYMMDD" format (e.g. 20230531), stored as u64.

I tried writing a function to do this, but I am struggling, especially with how to remove the hyphens from the date strings.

pub fn convert_str_to_num(df: &DataFrame) -> Result<DataFrame, PolarsError> {
    let mut final_df = df.clone();
    let col_name = String::from("time");
    let time_col = df.column(col_name.as_str())?;
    let mut new_time_col = time_col.clone().replace("-", "")?;
    // replace the old column with the new one
    final_df.replace(col_name.as_str(), new_time_col.as_mut())?;
    Ok(final_df)
}

This fails to compile with:

error[E0599]: no method named `replace` found for struct `polars::prelude::Series` in the current scope
  --> src/main.rs:13:45
   |
13 |     let mut new_time_col = time_col.clone().replace("-", "")?;
   |                                             ^^^^^^^ method not found in `Series`
Arthur Zhang
    I've gotta ask, are you sure this is what you want? Using numbers as strings is generally a bad idea — if you can't add them (zip codes, SSNs, etc), they're not *really* numbers. – BallpointBen Jun 08 '23 at 14:24

2 Answers

Assuming you already obtained the date string from the first column, I would use a function as in the following example.

It starts by splitting the string-slice according to the '-' separator. This provides an iterator delivering sub-slices of the input string-slice, but does not involve any copy of any part of the original string.

At each iteration, we try to parse the delivered sub-slice in order to extract a u64 value. If this fails, the function reports the error thanks to ?. When it succeeds, we simply update the value as you expect (100×100×year + 100×month + day).

In the end, we must ensure three parts have been parsed (year, month, day) and report an error if it is not the case.

Finally, the value which was updated during the three iterations is the expected result.

Note that we could also add bounds checking on the month and the day.

fn txt_date_to_u64(
    txt_date: &str
) -> Result<u64, Box<dyn std::error::Error>> {
    let mut part_count = 0;
    let mut year_month_day = 0;
    for part in txt_date.split('-') {
        year_month_day = year_month_day * 100 + str::parse::<u64>(part)?;
        part_count += 1;
    }
    if part_count != 3 {
        Err("unexpected date")?;
    }
    Ok(year_month_day)
}

fn main() {
    for txt_date in [
        "2023-05-31",
        "what???",
        "2004-01-07",
        "2004-01",
        "2004-01-07-19",
    ] {
        match txt_date_to_u64(txt_date) {
            Ok(d) => {
                println!("{:?} ~~> {:?}", txt_date, d);
            }
            Err(e) => {
                println!("{:?} ~~> !!! {:?} !!!", txt_date, e);
            }
        }
    }
}
/*
"2023-05-31" ~~> 20230531
"what???" ~~> !!! ParseIntError { kind: InvalidDigit } !!!
"2004-01-07" ~~> 20040107
"2004-01" ~~> !!! "unexpected date" !!!
"2004-01-07-19" ~~> !!! "unexpected date" !!!
*/
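
The bounds checking mentioned above could look roughly like the following. This is only a sketch: the name `txt_date_to_u64_checked` and the error messages are made up for illustration, and the checks are deliberately coarse (month 1 to 12, day 1 to 31), so a date such as "2023-02-31" is still accepted.

```rust
use std::error::Error;

/// Hypothetical variant of `txt_date_to_u64` that also range-checks
/// the month and the day (name and messages are illustrative only).
fn txt_date_to_u64_checked(txt_date: &str) -> Result<u64, Box<dyn Error>> {
    // Split once and require exactly three parts: year, month, day.
    let parts: Vec<&str> = txt_date.split('-').collect();
    if parts.len() != 3 {
        return Err("unexpected date".into());
    }
    let year: u64 = parts[0].parse()?;
    let month: u64 = parts[1].parse()?;
    let day: u64 = parts[2].parse()?;
    // Coarse bounds only: month lengths and leap years are not considered.
    if !(1..=12).contains(&month) {
        return Err("month out of range".into());
    }
    if !(1..=31).contains(&day) {
        return Err("day out of range".into());
    }
    // Same arithmetic as before: 100*100*year + 100*month + day.
    Ok(year * 10_000 + month * 100 + day)
}
```

Collecting the parts first costs a small allocation compared with the loop version, but it gives each component a name, which makes the range checks easier to read.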
prog-fh

It turns out I have solved my own question.

use polars::prelude::*;

fn convert_str_to_int(mut df: DataFrame, date_col_name: &str) -> Result<DataFrame, PolarsError> {
    // Get the date column as a Series
    let date_col = df.column(date_col_name)?;
    // Convert each date string into an unsigned 32-bit integer of the form YYYYMMDD
    let int_values = date_col
        .utf8()?
        .into_iter()
        .map(|date_str| {
            // Remove the hyphens; `unwrap` panics if the entry is null
            let int_str = date_str.unwrap().replace('-', "");
            // Parse the remaining digits as u32; `unwrap` panics if they are not a number
            int_str.parse::<u32>().unwrap()
        })
        .collect::<Vec<_>>();
    // Build a UInt32Chunked from the parsed values
    let u32_col = UInt32Chunked::new(date_col_name, int_values).into_series();
    // Replace the original column in place with the converted one
    df.replace(date_col_name, u32_col)?;
    Ok(df)
}
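
The closure above panics on a null entry or on a string that is not a number once the hyphens are gone, and it produces u32 rather than the u64 asked for in the question. A possible non-panicking per-value step is sketched below using only the standard library. The helper name `date_str_to_u64` is hypothetical, and how the resulting `Option<u64>` values are collected back into a column depends on the Polars version, so that part is not shown.

```rust
/// Hypothetical helper: strips the hyphens and parses the rest as u64.
/// Returns None for a missing value or for text that is not a number.
fn date_str_to_u64(date_str: Option<&str>) -> Option<u64> {
    // `?` propagates a missing (null) entry as None instead of panicking.
    let digits = date_str?.replace('-', "");
    // `.ok()` turns a parse failure into None.
    digits.parse::<u64>().ok()
}
```

Like the original closure, this does not validate that the digits form a real date; it only guarantees that bad input yields `None` instead of a panic.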
Arthur Zhang