The problem here is that you do use to_string()
to turn each item into an owned value, this is done lazily. Since it's lazy, the value before to_string is used (a &str
) still exists in the returned iterator's state, and thus is invalid (since the source String is dropped as soon as your map
closure returns).
Naive solution
The simplest solution here is to remove the lazy evaluation for that part of the iterator, and simply allocate all tokens as soon as the line is allocated. This will not be as fast, and will involve an additional allocation, but has minimal changes from your current function, and keeps the same signature:
// Returns an iterator of an iterator. The use case is a very large file where
// each line is very long. The outer iterator goes over the file's lines.
// The inner iterator returns the words of each line.
pub fn tokens_from_path<P>(path_arg: P) -> BoxIter<BoxIter<String>>
where
P: AsRef<Path>
{
let reader = reader_from_path(path_arg);
let iter = reader.lines()
.filter_map(|result| result.ok())
.map(|s| {
let collected = tokens_from_str(&s).collect::<Vec<_>>();
Box::new(collected.into_iter()) as Box<Iterator<Item=String>>
});
Box::new(iter)
}
This solution will be fine for any small workload, and it will only allocate about twice as much memory for the line at the same time. There's a performance penalty, but unless you have 10mb+ lines, it probably won't matter.
If you do choose this solution, I'd recommend changing the function signature of tokens_from_path
to directly return a BoxIter<String>
:
pub fn tokens_from_path<P>(path_arg: P) -> BoxIter<String>
where
P: AsRef<Path>
{
let reader = reader_from_path(path_arg);
let iter = reader.lines()
.filter_map(|result| result.ok())
.flat_map(|s| {
let collected = tokens_from_str(&s).collect::<Vec<_>>();
Box::new(collected.into_iter()) as Box<Iterator<Item=String>>
});
Box::new(iter)
}
Alternative: decouple tokens_from_path
and tokens_from_str
The original code doesn't work because you are trying to return borrows to a String which you don't return.
We can fix that by returning the String instead - just hidden behind an opaque API. This is pretty similar to breeden's solution, but slightly different in execution.
use std::fs::{File};
use std::io::{BufRead, BufReader};
use std::path::Path;
type BoxIter<T> = Box<Iterator<Item=T>>;
/// Structure representing in our code a line, but with an opaque API surface.
pub struct TokenIntermediate(String);
impl<'a> IntoIterator for &'a TokenIntermediate {
type Item = String;
type IntoIter = Box<Iterator<Item=String> + 'a>;
fn into_iter(self) -> Self::IntoIter {
// delegate to tokens_from_str
tokens_from_str(&self.0)
}
}
fn tokens_from_str<'a>(text: &'a str) -> Box<Iterator<Item=String> + 'a> {
Box::new(text.lines().flat_map(|s|
s.split_whitespace().map(|s| s.to_string())
))
}
// Returns an iterator of an iterator. The use case is a very large file where
// each line is very long. The outer iterator goes over the file's lines.
// The inner iterator returns the words of each line.
pub fn token_parts_from_path<P>(path_arg: P) -> BoxIter<TokenIntermediate>
where
P: AsRef<Path>
{
let reader = reader_from_path(path_arg);
let iter = reader.lines()
.filter_map(|result| result.ok())
.map(|s| TokenIntermediate(s));
Box::new(iter)
}
fn reader_from_path<P>(path_arg: P) -> BufReader<File>
where P: AsRef<Path> {
let path = path_arg.as_ref();
let file = File::open(path).unwrap();
BufReader::new(file)
}
As you notice, there's no difference to tokens_from_str
, and tokens_from_path
just returns this opaque TokenIntermediate
struct. This will be just as usable as your original solution, all it does is push the ownership of your intermediate String
values onto the caller, so they can iterate the tokens in them.