
I'm fairly new to Rust and am trying to implement a simple database. Users should create a table by giving a table name, a vector of column names, and a vector of column types (represented by an enum). Tables should be filled from CSV files. However, deserializing CSV records seems to require the structure of a table row to be specified at compile time, as in this basic example:

use std::error::Error;
use std::fs;

use csv::ReaderBuilder;
use serde::Deserialize;

// The row layout is fixed at compile time, which is the limitation in question.
#[derive(Debug, Deserialize, Eq, PartialEq)]
struct Row {
    key: u32,
    name: String,
    comment: String,
}

fn read_from_file(path: &str) -> Result<(), Box<dyn Error>> {
    // Propagate I/O errors to the caller instead of panicking.
    let data = fs::read_to_string(path)?;
    let mut rdr = ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b'|')
        .from_reader(data.as_bytes());
    let mut iter = rdr.deserialize();

    // Deserialize only the first record into the statically typed `Row`.
    if let Some(result) = iter.next() {
        let record: Row = result?;
        println!("{:?}", record);
        Ok(())
    } else {
        Err(From::from("expected at least one record but got none"))
    }
}

Is there a possibility to use the generic table information instead of the "Row"-struct to cast the results from the deserialization? Is it possible to simply allocate memory according to the combined sizes of the column types and parse the records in? I would do something like this in C...

E_net4
C4st1el
  • serde_json has a generic `Value` type, that offers runtime dynamic building and mapping of JSON. https://stackoverflow.com/questions/59047280/how-to-build-json-arrays-or-objects-dynamically-with-serde-json has some more explanation. This might be a first direction to look into. – berkes Oct 15 '20 at 11:15
  • Your `result` has type [`StringRecord`](https://docs.rs/csv/1.1.3/csv/struct.StringRecord.html) which can be handled more or less as an array of strings. – Jmb Oct 15 '20 at 11:45
  • @Jmb That is right, I can store each row as a Vector of Strings and convert to the actual type each time I access it. However, this seems not very efficient. – C4st1el Oct 15 '20 at 12:00
  • So your question isn't so much about CSV reading, and more along the lines of "How can I store values of different types when the type is only known at run time?" Then you want to use an [`enum`](https://doc.rust-lang.org/stable/book/ch06-01-defining-an-enum.html) with variants for each possible type. – Jmb Oct 15 '20 at 12:23
  • Since you are reading from a CSV file, all your values will be of `String` type. Conversion to a particular type can happen when you insert. You can leverage the `From` trait to achieve this. E.g., if your table A needs the value to be of type `u32`, then you can do `impl From<String> for u32` (this might be implemented already; not sure). And during insert you can do `let value_to_be_inserted_in_table_A: u32 = string_value.into();` You can read more about the `From` and `Into` traits here: https://sjoshid.blog/2020/06/07/from-and-to-traits-in-rust/ – Boss Man Oct 15 '20 at 14:23
  • @SujitJoshi Right, but that is only the mechanism to cast the input, which is not the main problem. The main problem is how to actually store the cast value (e.g. in which structure) if the type was not known at compile time. Outoftime's answer seems to solve that problem. – C4st1el Oct 16 '20 at 06:35
  • @C4st1el OK, I see what you wanted now, but as @outoftime already said, storing it in `Any` means losing the concrete type. So if you want to determine the actual type later, you'll have to do the if-else logic again. Having worked on many databases, I suggest you rethink your strategy so you won't lose the concrete type. – Boss Man Oct 18 '20 at 20:39

1 Answer

> Is there a possibility to use the generic table information instead of the "Row"-struct to cast the results from the deserialization?

All generics are replaced with concrete types at compile time. If the types you need are only known at runtime, generics are not the right tool.
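
To illustrate the point, here is a minimal sketch using only the standard library. `parse_column` is a hypothetical helper, not part of the `csv` crate; the type parameter `T` has to be chosen by the caller in the source code, so a column type that arrives as user input at runtime cannot be plugged in as `T` directly.

```rust
use std::str::FromStr;

/// Hypothetical helper: parse every field of one column as `T`.
/// The compiler generates a separate copy of this function for each
/// concrete `T` it is called with.
fn parse_column<T: FromStr>(fields: &[&str]) -> Result<Vec<T>, T::Err> {
    fields.iter().map(|field| field.parse::<T>()).collect()
}

fn main() {
    // `T` is fixed to `u32` here, before the program ever runs.
    let keys: Vec<u32> = parse_column(&["1", "2"]).unwrap();
    println!("{:?}", keys);
}
```

A table whose column types are supplied at runtime therefore needs some form of runtime dispatch on top of this, for example a branch per supported type.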

> Is it possible to simply allocate memory according to the combined sizes of the column types and parse the records in? I would do something like this in C...

I suggest using `Box<dyn Any>` instead: it lets you store an owned value of any type and still find out at runtime which type it is.

The maintenance cost of this approach is fairly high: you have to handle every possible value type everywhere you use a cell's value. On the other hand, you do not need to re-parse the value on each access; a runtime type check is enough.

I have used `std::any::TypeId` to identify the type, but it cannot be used in `match` expressions. Consider using a custom enum as the type identifier instead.

use std::any::{Any, TypeId};
use std::io::Read;

use csv::Reader;

#[derive(Default)]
struct Table {
    name: String,
    headers: Vec<(String, TypeId)>,
    data: Vec<Vec<Box<dyn Any>>>,
}

impl Table {
    fn add_header(&mut self, header: String, _type: TypeId) {
        self.headers.push((header, _type));
    }

    fn populate_data<R: Read>(
        &mut self,
        rdr: &mut Reader<R>,
    ) -> Result<(), Box<dyn std::error::Error>> {
        for record in rdr.records() {
            let record = record?;
            let mut row: Vec<Box<dyn Any>> = vec![];
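            for ((name, type_id), value) in self.headers.iter().zip(record.iter()) {
                // Parse each cell once, according to its column's declared type.
                if *type_id == TypeId::of::<u32>() {
                    row.push(Box::new(value.parse::<u32>()?));
                } else if *type_id == TypeId::of::<String>() {
                    row.push(Box::new(value.to_owned()));
                } else {
                    // Fail loudly rather than silently dropping the cell,
                    // which would misalign the row with the headers.
                    return Err(format!("unsupported type for column '{}'", name).into());
                }
            }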
            self.data.push(row);
        }
        Ok(())
    }
}

impl std::fmt::Display for Table {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        writeln!(f, "Table: {}", self.name)?;
        for (name, _) in self.headers.iter() {
            write!(f, "{}, ", name)?;
        }
        writeln!(f)?;
        for row in self.data.iter() {
            for cell in row.iter() {
                if let Some(&value) = cell.downcast_ref::<u32>() {
                    write!(f, "{}, ", value)?;
                } else if let Some(value) = cell.downcast_ref::<String>() {
                    write!(f, "{}, ", value)?;
                }
            }
            writeln!(f)?;
        }
        Ok(())
    }
}

fn main() {
    let mut table: Table = Default::default();
    table.name = "Foo".to_owned();
    table.add_header("key".to_owned(), TypeId::of::<u32>());
    table.add_header("name".to_owned(), TypeId::of::<String>());
    table.add_header("comment".to_owned(), TypeId::of::<String>());
    let data = "\
key,name,comment
1,foo,foo comment
2,bar,bar comment
";
    let mut rdr = Reader::from_reader(data.as_bytes());
    table.populate_data(&mut rdr).unwrap();
    print!("{}", table);
}
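
As a possible variation on the code above, a custom enum can serve as the type identifier so that `match` becomes available. This is only a sketch under assumptions: `ColumnType` and `parse_cell` are hypothetical names that do not appear in the code above, and only two column types are covered.

```rust
use std::any::Any;
use std::error::Error;

/// Hypothetical type tag; unlike `TypeId`, its variants can be matched on.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ColumnType {
    U32,
    Text,
}

/// Parse one CSV field into a boxed value according to its column tag.
fn parse_cell(ty: ColumnType, field: &str) -> Result<Box<dyn Any>, Box<dyn Error>> {
    // The match is exhaustive: adding a variant to `ColumnType` makes this
    // function fail to compile until the new type is handled here.
    let cell: Box<dyn Any> = match ty {
        ColumnType::U32 => Box::new(field.parse::<u32>()?),
        ColumnType::Text => Box::new(field.to_owned()),
    };
    Ok(cell)
}

fn main() -> Result<(), Box<dyn Error>> {
    let key = parse_cell(ColumnType::U32, "7")?;
    let name = parse_cell(ColumnType::Text, "foo")?;
    // Reading a cell back still needs a downcast to the concrete type.
    println!("{:?} {:?}", key.downcast_ref::<u32>(), name.downcast_ref::<String>());
    Ok(())
}
```

With this tag, `headers` could hold `(String, ColumnType)` pairs instead of `(String, TypeId)`, and a forgotten type becomes a compile error rather than a silently skipped cell.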
outoftime
  • This is exactly what I was looking for. Surely there is "code" overhead here, but potentially better performance. I planned to hide the additional code complexity behind function pointers in the header. – C4st1el Oct 16 '20 at 06:37
  • `Any` should usually be avoided when possible. If the table can only contain a limited set of types (e.g. integers, floats or strings), then it's better to use an enum. This avoids an indirection, allows you to simplify some actions (e.g. you can implement `Display` for your enum), and makes sure you don't forget a type in code that uses the value. – Jmb Oct 16 '20 at 06:44
  • Thanks for the hint. I'll try out both. – C4st1el Oct 16 '20 at 07:06
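
The enum approach suggested in the comments could look roughly like the sketch below. It is not code from the thread: `Value` and `format_row` are hypothetical names, and the three variants are just an assumed set of supported column types.

```rust
use std::fmt;

/// Hypothetical cell type: a closed set of supported column types.
/// Numeric variants are stored inline, with no `Box` around them.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    U32(u32),
    F64(f64),
    Text(String),
}

impl fmt::Display for Value {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Exhaustive match: a new variant must be handled here to compile.
        match self {
            Value::U32(v) => write!(f, "{}", v),
            Value::F64(v) => write!(f, "{}", v),
            Value::Text(s) => write!(f, "{}", s),
        }
    }
}

/// Render one row as comma-separated text.
fn format_row(row: &[Value]) -> String {
    row.iter()
        .map(|cell| cell.to_string())
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    let row = vec![
        Value::U32(1),
        Value::Text("foo".to_owned()),
        Value::F64(2.5),
    ];
    println!("{}", format_row(&row)); // prints "1, foo, 2.5"
}
```

A table would then store `Vec<Vec<Value>>` instead of `Vec<Vec<Box<dyn Any>>>`, and printing a cell no longer needs a chain of `downcast_ref` checks.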