
I am attempting to relearn data science in Rust.

I have a Vec<String> whose fields are separated by the delimiter "|" and whose records end with the marker "!end".

What I'd like to end up with is a Vec<Vec<String>> that can be put into a 2D ndarray.

I have this Python code:

file = open('somefile.dat')
lst = []
for line in file:
    lst += [line.split('|')]
    
df = pd.DataFrame(lst)
SAMV2FinalDataFrame = pd.DataFrame(lst,columns=column_names)

And I've recreated it here in Rust:



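use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

// Read the whole file into memory, one String per line.
fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {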
    let file = File::open(filename).expect("no such file");
    let buf = BufReader::new(file);
    buf.lines()
        .map(|l| l.expect("Could not parse line"))
        .collect()
}

fn main() {
    let lines = lines_from_file(".dat");
    let mut new_arr = vec![];
    // Here I get an error about an immutable borrow of `lines`
    for line in lines {
        new_arr.push([*line.split("!end")]);
    }

    // Here I get "expected closure, found str"
    let x = lines.split("!end");

    let array = Array::from(lines);
}

What I have: ['1','1','1','!end','2','2','2','!end']
What I need: [['1','1','1'],['2','2','2']]

Edit: Also, why does the turbofish disappear when I type it on Stack Overflow?

MB-F
TrapLordOb
  • To answer your edit: SO interprets `< >` as HTML tags. It's good practice to wrap inline code snippets with single ` tick marks for proper formatting (see my edit to your post). – MB-F Feb 18 '22 at 19:35
  • Are you sure your equivalent Python code works as expected? I can't see any handling of `"!end"` tags. – MB-F Feb 18 '22 at 19:37
  • @MB-F Yes, the split in Python is pretty powerful. – TrapLordOb Feb 18 '22 at 19:53
  • Can you provide a sample input? – PitaJ Feb 18 '22 at 20:04
  • `*line.split("!end")` doesn't even compile. I'm not sure how this code would even produce a result. `error[E0614]: type std::str::Split<'_, &str> cannot be dereferenced` – cdhowie Feb 18 '22 at 20:12
  • 928338219||3HY83||A|Z5|20030917|20220713|20211110|20210114|FEDERAL HIGHWAY ADMINISTRATION||OFFICE OF ACQUISITIION AND GRANTS MANAGEMENT||1200 NEW JERSEY AVE SE||WASHINGTON|DC|20590|0001|USA|98|19720101|0930||2A|||0002|2R~NG|926120|0001|926120N|0000||N||1200 NEW JERSEY AVENUE, SE||WASHINGTON|20590|0001|USA|DC|TABITHA||LORTHRIDGE||1200 NEW JERSEY AVENUE SE|ROOM E65-314|WASHINGTON|20590|0001|USA|DC||||||||||||||||||||||JASON||JOHNSON||1200 NEW JERSEY AVENUE, SE||WASHINGTON DC|20590||USA|DC||||||||||||||||||||TABITHA||LORTHRIDGE||1200 NEW JERSEY AVENUE SE|ROOM E65-314|NUE SE||N||0000|||0000||!end – TrapLordOb Feb 18 '22 at 20:29
  • @MB-F On a second look at the output, there is a newline after the !end, so that might be where it is splitting in Python. – TrapLordOb Feb 18 '22 at 20:30
  • @cdhowie Yeah... I am honestly at a loss on how to write it. I need to split into a new Vec of Strings at !end, or some other delimiter. – TrapLordOb Feb 18 '22 at 20:31

1 Answer

I think part of the issue you ran into was due to how you worked with arrays. For example, Vec::push adds only a single element, so to append several elements at once you would want Vec::extend instead. I also ran into a few empty strings: splitting by "!end" leaves a trailing '|' on the ends of the substrings, and splitting those by '|' then produces empty strings. The errors were quite strange; I am not completely sure where the closure came from.

let lines = vec!["1|1|1|!end|2|2|2|!end".to_string()];
let mut new_arr = Vec::new();

// Iterate over &lines so we don't consume lines and it can be used again later
for line in &lines {
    new_arr.extend(line.split("!end")
        // Remove trailing empty string
        .filter(|x| !x.is_empty())
        // Convert each &str into a Vec<String>
        .map(|x| {
            x.split('|')
                // Remove empty strings from ends split (Ex split: "|2|2|2|")
                .filter(|x| !x.is_empty())
                // Convert &str into owned String
                .map(|x| x.to_string())
                // Turn iterator into Vec<String>
                .collect::<Vec<_>>()
    }));
}

println!("{:?}", new_arr);
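
To make the push versus extend point concrete, here is a minimal sketch. The helper names `push_row` and `extend_fields` are only illustrative and not part of the code above. `push` appends its argument as one element, while `extend` appends every item an iterator yields.

```rust
/// Appends the fields of `line` as ONE new row (a single `push`).
fn push_row(rows: &mut Vec<Vec<String>>, line: &str) {
    rows.push(line.split('|').map(String::from).collect());
}

/// Appends every field of `line` as its own element (`extend`).
fn extend_fields(fields: &mut Vec<String>, line: &str) {
    fields.extend(line.split('|').map(String::from));
}

fn main() {
    // Two calls to push give two rows of three fields each.
    let mut rows = Vec::new();
    push_row(&mut rows, "1|1|1");
    push_row(&mut rows, "2|2|2");
    println!("{:?}", rows);

    // Two calls to extend give one flat list of six fields.
    let mut fields = Vec::new();
    extend_fields(&mut fields, "1|1|1");
    extend_fields(&mut fields, "2|2|2");
    println!("{:?}", fields);
}
```

Which one you want depends on whether the target is a list of rows or a flat list of fields.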

I also came up with this other version, which should handle your use case better. The earlier approach drops all empty strings, whereas this one should preserve them and still handle the "!end" correctly.

use std::io::{self, BufRead, BufReader, Read, Cursor};

fn split_data<R: Read>(buffer: &mut R) -> io::Result<Vec<Vec<String>>> {
    let mut sections = Vec::new();
    let mut current_section = Vec::new();
    
    for line in BufReader::new(buffer).lines() {
        for item in line?.split('|') {
            if item != "!end" {
                current_section.push(item.to_string());
            } else {
                sections.push(current_section);
                current_section = Vec::new();
            }
        }
    }
        
    Ok(sections)
}

In this example, I used Read for easier testing, but it will also work with a file.

let sample_input = b"1|1|1|!end|2|2|2|!end";
println!("{:?}", split_data(&mut Cursor::new(sample_input)));
// Output: Ok([["1", "1", "1"], ["2", "2", "2"]])

// You can also use a file instead (this needs `use std::fs::File;`)
let mut file = File::open("somefile.dat").expect("no such file");
let solution: Vec<Vec<String>> = split_data(&mut file).unwrap();
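
Since the stated goal is a 2D array, the rows need to have equal lengths before they can be packed into one. The sketch below uses only the standard library. `to_flat_matrix` is a hypothetical helper name, not something from the answer or from any crate. It checks that the rows are rectangular and returns the shape together with a row-major flat Vec, which is the kind of input that ndarray's `Array2::from_shape_vec((rows, cols), flat)` takes.

```rust
/// Hypothetical helper: flattens rectangular rows into (rows, cols, flat data).
/// Returns an error message if any row differs in length from the first row.
fn to_flat_matrix(rows: Vec<Vec<String>>) -> Result<(usize, usize, Vec<String>), String> {
    let n_rows = rows.len();
    let n_cols = rows.first().map_or(0, |row| row.len());

    let mut flat = Vec::with_capacity(n_rows * n_cols);
    for (i, row) in rows.into_iter().enumerate() {
        if row.len() != n_cols {
            return Err(format!(
                "row {} has {} fields, expected {}",
                i,
                row.len(),
                n_cols
            ));
        }
        // Move the row's fields into the flat buffer in row-major order.
        flat.extend(row);
    }
    Ok((n_rows, n_cols, flat))
}

fn main() {
    let rows = vec![
        vec!["1".to_string(), "1".to_string(), "1".to_string()],
        vec!["2".to_string(), "2".to_string(), "2".to_string()],
    ];
    match to_flat_matrix(rows) {
        Ok((n_rows, n_cols, flat)) => println!("{} x {}: {:?}", n_rows, n_cols, flat),
        Err(e) => eprintln!("not rectangular: {}", e),
    }
}
```

Failing early on a ragged row is a design choice here; padding short rows would be another option if the data allows it.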

playground link
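
A note on the "expected closure" error from the question: `lines` is a Vec<String>, so `lines.split(...)` resolves to the slice method `split`, which takes a predicate closure rather than a string pattern (unlike `str::split`). That same slice method can do the grouping directly if the fields are already separate elements, as in the question's "what I have" example. A minimal sketch, where `group_by_end` is an illustrative name:

```rust
/// Groups already-separated fields into rows, cutting at each "!end" marker.
fn group_by_end(fields: &[String]) -> Vec<Vec<String>> {
    fields
        // slice::split takes a closure that returns true for separator elements
        .split(|field| field.as_str() == "!end")
        // A trailing "!end" leaves one empty group at the end. Note that this
        // filter also drops any record that is genuinely empty.
        .filter(|group| !group.is_empty())
        .map(|group| group.to_vec())
        .collect()
}

fn main() {
    let fields: Vec<String> = ["1", "1", "1", "!end", "2", "2", "2", "!end"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", group_by_end(&fields));
    // Output: [["1", "1", "1"], ["2", "2", "2"]]
}
```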

Locke