3

With this sample code:

use std::fs::{File};
use std::io::{BufRead, BufReader};
use std::path::Path;

type BoxIter<T> = Box<Iterator<Item=T>>;

fn tokens_from_str<'a>(text: &'a str) 
-> Box<Iterator<Item=String> + 'a> {
    Box::new(text.lines().flat_map(|s|
        s.split_whitespace().map(|s| s.to_string())
    ))
}

// Returns an iterator of an iterator. The use case is a very large file where
// each line is very long. The outer iterator goes over the file's lines.
// The inner iterator returns the words of each line.
pub fn tokens_from_path<P>(path_arg: P) 
-> BoxIter<BoxIter<String>>
where P: AsRef<Path> {
    let reader = reader_from_path(path_arg);
    let iter = reader.lines()
        .filter_map(|result| result.ok())
        .map(|s| tokens_from_str(&s));
    Box::new(iter)
}

fn reader_from_path<P>(path_arg: P) -> BufReader<File>
where P: AsRef<Path> {
    let path = path_arg.as_ref();
    let file = File::open(path).unwrap();
    BufReader::new(file)
}

I get this compiler error message:

rustc 1.18.0 (03fc9d622 2017-06-06)
error: `s` does not live long enough
  --> <anon>:23:35
   |
23 |         .map(|s| tokens_from_str(&s));
   |                                   ^- borrowed value only lives until here
   |                                   |
   |                                   does not live long enough
   |
   = note: borrowed value must be valid for the static lifetime...

My questions are:

  • How can this be fixed (without changing the function signatures, if possible?)

  • Any suggestions on better function arguments and return values?

David J.
  • 31,569
  • 22
  • 122
  • 174

3 Answers3

2

One issue is that .split_whitespace() takes a reference, and doesn't own its content. So when you try to construct a SplitWhitespace object with an owned owned object (this happens when you call .map(|s| tokens_from_str(&s))), the string s is dropped while SplitWhitespace is still trying to reference it. I wrote a quick fix to this by creating a struct that takes ownership of the String and yields a SplitWhitespace on demand.

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;
use std::iter::IntoIterator;
use std::str::SplitWhitespace;

pub struct SplitWhitespaceOwned(String);

impl<'a> IntoIterator for &'a SplitWhitespaceOwned {
    type Item = &'a str;
    type IntoIter = SplitWhitespace<'a>;
    fn into_iter(self) -> Self::IntoIter {
        self.0.split_whitespace()
    }
}

// Returns an iterator of an iterator. The use case is a very large file where
// each line is very long. The outer iterator goes over the file's lines.
// The inner iterator returns the words of each line.
pub fn tokens_from_path<P>(path_arg: P) -> Box<Iterator<Item = SplitWhitespaceOwned>>
    where P: AsRef<Path>
{
    let reader = reader_from_path(path_arg);
    let iter = reader
        .lines()
        .filter_map(|result| result.ok())
        .map(|s| SplitWhitespaceOwned(s));
    Box::new(iter)
}

fn reader_from_path<P>(path_arg: P) -> BufReader<File>
    where P: AsRef<Path>
{
    let path = path_arg.as_ref();
    let file = File::open(path).unwrap();
    BufReader::new(file)
}

fn main() {
    let t = tokens_from_path("test.txt");

    for line in t {
        for word in &line {
            println!("{}", word);
        }
    }
}
breeden
  • 715
  • 1
  • 6
  • 19
2

Disclaimer: Frame Challenge

When dealing with huge files, the simplest solution is to use Memory Mapped Files.

That is, you tell the OS that you want the whole file to be accessible in memory, and it's up to it to deal with paging parts of the file in and out of memory.

Once this is down, then your whole file is accessible as a &[u8] or &str (at your convenience), and you can trivially access slices of it.

It may not always be the fastest solution; it certainly is the easiest.

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • Thanks for this comment; I appreciate the spirit of it. That said, given the format of stack overflow, this content, as currently written, doesn't squarely answer the original question. (Sorry to be a stickler, but I think the SO format is important to structure the content.) – David J. Jun 13 '17 at 21:24
  • @DavidJ.: I added a comment that I am challenging the frame of the question to make it more obvious that it's slightly different. I think it's important to consider alternatives in the design; and one alternative is to avoid intermediate owning structures (in this case, by having a single owning structure with intermediates only using slices). – Matthieu M. Jun 14 '17 at 08:18
0

The problem here is that you do use to_string() to turn each item into an owned value, this is done lazily. Since it's lazy, the value before to_string is used (a &str) still exists in the returned iterator's state, and thus is invalid (since the source String is dropped as soon as your map closure returns).

Naive solution

The simplest solution here is to remove the lazy evaluation for that part of the iterator, and simply allocate all tokens as soon as the line is allocated. This will not be as fast, and will involve an additional allocation, but has minimal changes from your current function, and keeps the same signature:

// Returns an iterator of an iterator. The use case is a very large file where
// each line is very long. The outer iterator goes over the file's lines.
// The inner iterator returns the words of each line.
pub fn tokens_from_path<P>(path_arg: P) -> BoxIter<BoxIter<String>>
where
    P: AsRef<Path>
{
    let reader = reader_from_path(path_arg);
    let iter = reader.lines()
        .filter_map(|result| result.ok())
        .map(|s| {
            let collected = tokens_from_str(&s).collect::<Vec<_>>();

            Box::new(collected.into_iter()) as Box<Iterator<Item=String>>
        });

    Box::new(iter)
}

This solution will be fine for any small workload, and it will only allocate about twice as much memory for the line at the same time. There's a performance penalty, but unless you have 10mb+ lines, it probably won't matter.

If you do choose this solution, I'd recommend changing the function signature of tokens_from_path to directly return a BoxIter<String>:

pub fn tokens_from_path<P>(path_arg: P) -> BoxIter<String>
where
    P: AsRef<Path>
{
    let reader = reader_from_path(path_arg);
    let iter = reader.lines()
        .filter_map(|result| result.ok())
        .flat_map(|s| {
            let collected = tokens_from_str(&s).collect::<Vec<_>>();

            Box::new(collected.into_iter()) as Box<Iterator<Item=String>>
        });

    Box::new(iter)
}

Alternative: decouple tokens_from_path and tokens_from_str

The original code doesn't work because you are trying to return borrows to a String which you don't return.

We can fix that by returning the String instead - just hidden behind an opaque API. This is pretty similar to breeden's solution, but slightly different in execution.

use std::fs::{File};
use std::io::{BufRead, BufReader};
use std::path::Path;

type BoxIter<T> = Box<Iterator<Item=T>>;

/// Structure representing in our code a line, but with an opaque API surface.
pub struct TokenIntermediate(String);

impl<'a> IntoIterator for &'a TokenIntermediate {
    type Item = String;
    type IntoIter = Box<Iterator<Item=String> + 'a>;

    fn into_iter(self) -> Self::IntoIter {
        // delegate to tokens_from_str
        tokens_from_str(&self.0)
    }
}

fn tokens_from_str<'a>(text: &'a str) -> Box<Iterator<Item=String> + 'a> {
    Box::new(text.lines().flat_map(|s|
        s.split_whitespace().map(|s| s.to_string())
    ))
}

// Returns an iterator of an iterator. The use case is a very large file where
// each line is very long. The outer iterator goes over the file's lines.
// The inner iterator returns the words of each line.
pub fn token_parts_from_path<P>(path_arg: P) -> BoxIter<TokenIntermediate>
where
    P: AsRef<Path>
{
    let reader = reader_from_path(path_arg);
    let iter = reader.lines()
        .filter_map(|result| result.ok())
        .map(|s| TokenIntermediate(s));

    Box::new(iter)
}

fn reader_from_path<P>(path_arg: P) -> BufReader<File>
where P: AsRef<Path> {
    let path = path_arg.as_ref();
    let file = File::open(path).unwrap();
    BufReader::new(file)
}

As you notice, there's no difference to tokens_from_str, and tokens_from_path just returns this opaque TokenIntermediate struct. This will be just as usable as your original solution, all it does is push the ownership of your intermediate String values onto the caller, so they can iterate the tokens in them.

daboross
  • 716
  • 10
  • 23