14

My question has been partially answered, so I've revised it in response to things I've learned from comments and additional experiments.

In summary, I want a fast I/O routine for programming contests, in which problems are solved with a single file and no external crates. It should read in a sequence of whitespace-separated tokens from a BufRead (either stdin or a file). The tokens may be integers, floats or ASCII words, separated by spaces and newlines, so it seems I should support FromStr types generically. A small minority of problems are interactive, meaning not all of the input is available initially, but it always comes in complete lines.

For context, here's the discussion that led me to post here. Someone wrote very fast custom code to parse integers directly from the &[u8] output of BufRead::fill_buf(), but it's not generic in FromStr.

Here is my best solution so far (emphasis on the Scanner struct):

use std::io::{self, prelude::*};

fn solve<B: BufRead, W: Write>(mut scan: Scanner<B>, mut w: W) {
    let n = scan.token();
    let mut a = Vec::with_capacity(n);
    let mut b = Vec::with_capacity(n);
    for _ in 0..n {
        a.push(scan.token::<i64>());
        b.push(scan.token::<i64>());
    }
    let mut order: Vec<_> = (0..n).collect();
    order.sort_by_key(|&i| b[i] - a[i]);
    let ans: i64 = order
        .into_iter()
        .enumerate()
        .map(|(i, x)| a[x] * i as i64 + b[x] * (n - 1 - i) as i64)
        .sum();
    writeln!(w, "{}", ans);
}

fn main() {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let reader = Scanner::new(stdin.lock());
    let writer = io::BufWriter::new(stdout.lock());
    solve(reader, writer);
}

pub struct Scanner<B> {
    reader: B,
    buf_str: String,
    buf_iter: std::str::SplitWhitespace<'static>,
}
impl<B: BufRead> Scanner<B> {
    pub fn new(reader: B) -> Self {
        Self {
            reader,
            buf_str: String::new(),
            buf_iter: "".split_whitespace(),
        }
    }
    pub fn token<T: std::str::FromStr>(&mut self) -> T {
        loop {
            if let Some(token) = self.buf_iter.next() {
                return token.parse().ok().expect("Failed parse");
            }
            self.buf_str.clear();
            self.reader
                .read_line(&mut self.buf_str)
                .expect("Failed read");
            self.buf_iter = unsafe { std::mem::transmute(self.buf_str.split_whitespace()) };
        }
    }
}

By avoiding unnecessary allocations, this Scanner is quite fast. If we didn't care about unsafety, it can be made even faster by, instead of doing read_line() into a String, doing read_until(b'\n') into a Vec<u8>, followed by str::from_utf8_unchecked().

However, I'd also like to know what's the fastest safe solution. Is there a clever way to tell Rust that what my Scanner implementation does is actually safe, eliminating the mem::transmute? Intuitively, it seems we should think of the SplitWhitespace object as owning the buffer until it's effectively dropped after it returns None.

All else being equal, I'd like a "nice" idiomatic standard library solution, as I'm trying to present Rust to others who do programming contests.

EbTech
  • 141
  • 5
  • 4
    *I'm guessing it has to do* — don't guess, **profile**. What does the profiler say? You are obviously going to be using a profiler if you are so concerned about runtime performance, yes? – Shepmaster Jun 04 '19 at 17:58
  • *with the extra `String` and `Vec` allocations, and UTF-8 validation* — none of these have anything to do with `FromStr`. If you started with a static string, presumably your implementation would be just as fast. – Shepmaster Jun 04 '19 at 18:02
  • Right, I'm not blaming ```FromStr```. To the contrary, seeing as the ```impl FromStr for i64``` defers to the ```as_bytes()``` representation anyway, I feel this must be possible to do efficiently. It might just be a matter of finding a more efficient way to obtain tokens from a ```BufRead```. You're right though, I oughta get some practice with a profiler... – EbTech Jun 04 '19 at 18:20
  • 1
    I know nearly nothing about Rust, but I wouldn't favor a program going into an endless loop on unexpected EOF. – Armali Jun 28 '19 at 13:13
  • I also would love sth like `cin>>x` or `(f)scanf(..)` in C++. I considered if to maybe rewrite text_io::read into short snippet tailored for programming contests https://github.com/oli-obk/rust-si/blob/master/src/lib.rs – Grzegorz Wierzowiecki Apr 24 '21 at 11:35
  • This looks like some direction to go, but not perfect to me : https://stackoverflow.com/questions/30355185/how-to-read-an-integer-input-from-the-user-in-rust-1-0/57200055#57200055 , I would love sth that reads word by word. – Grzegorz Wierzowiecki Apr 24 '21 at 11:36

2 Answers2

3

I'm so glad you asked, as I solved this exact problem in my LibCodeJam rust implementation. Specifically, reading raw tokens from a BufRead is handled by the TokensReader type as well as some tiny related helpers.

Here's the relevant excerpt. The basic idea here is to scan the BufRead::fill_buf buffer for whitespace, and copying non-whitespace characters into a local buffer, which is reused between token calls. Once a whitespace character is found, or the stream ends, the local buffer is interpreted as UTF-8 and returned as an &str.

#[derive(Debug)]
pub enum LoadError {
    Io(io::Error),
    Utf8Error(Utf8Error),
    OutOfTokens,
}

/// TokenBuffer is a resuable buffer into which tokens are
/// read into, one-by-one. It is cleared but not deallocated
/// between each token.
#[derive(Debug)]
struct TokenBuffer(Vec<u8>);

impl TokenBuffer {
    /// Clear the buffer and start reading a new token
    fn lock(&mut self) -> TokenBufferLock {
        self.0.clear();
        TokenBufferLock(&mut self.0)
    }
}

/// TokenBufferLock is a helper type that helps manage the lifecycle
/// of reading a new token, then interpreting it as UTF-8.
#[derive(Debug, Default)]
struct TokenBufferLock<'a>(&'a mut Vec<u8>);

impl<'a> TokenBufferLock<'a> {
    /// Add some bytes to a token
    fn extend(&mut self, chunk: &[u8]) {
        self.0.extend(chunk)
    }

    /// Complete the token and attempt to interpret it as UTF-8
    fn complete(self) -> Result<&'a str, LoadError> {
        from_utf8(self.0).map_err(LoadError::Utf8Error)
    }
}

pub struct TokensReader<R: io::BufRead> {
    reader: R,
    token: TokenBuffer,
}

impl<R: io::BufRead> Tokens for TokensReader<R> {
    fn next_raw(&mut self) -> Result<&str, LoadError> {
        use std::io::ErrorKind::Interrupted;

        // Clear leading whitespace
        loop {
            match self.reader.fill_buf() {
                Err(ref err) if err.kind() == Interrupted => continue,
                Err(err) => return Err(LoadError::Io(err)),
                Ok([]) => return Err(LoadError::OutOfTokens),
                // Got some content; scan for the next non-whitespace character
                Ok(buf) => match buf.iter().position(|byte| !byte.is_ascii_whitespace()) {
                    Some(i) => {
                        self.reader.consume(i);
                        break;
                    }
                    None => self.reader.consume(buf.len()),
                },
            };
        }

        // If we reach this point, there is definitely a non-empty token ready to be read.
        let mut token_buf = self.token.lock();

        loop {
            match self.reader.fill_buf() {
                Err(ref err) if err.kind() == Interrupted => continue,
                Err(err) => return Err(LoadError::Io(err)),
                Ok([]) => return token_buf.complete(),
                // Got some content; scan for the next whitespace character
                Ok(buf) => match buf.iter().position(u8::is_ascii_whitespace) {
                    Some(i) => {
                        token_buf.extend(&buf[..i]);
                        self.reader.consume(i + 1);
                        return token_buf.complete();
                    }
                    None => {
                        token_buf.extend(buf);
                        self.reader.consume(buf.len());
                    }
                },
            }
        }
    }
}

This implementation doesn't handle parsing strings into FromStr types— that's handled separately— but it does handle correctly accumulating bytes, separating them into whitespace-separated tokens, and interpreting those tokens as UTF-8. It does assume that only ASCII whitespace will be used to separate Tokens.

It's worth noting that FromStr cannot be used directly on the fill_buf buffer, because there's no guarantee that a token doesn't straddle the boundary between two fill_buf calls, and there's no way to force a BufRead to read more bytes until the existing buffer is fully consumed. I'm assuming it's pretty obvious that once you have an Ok(&str), you can perform FromStr on it at your leisure.

This implementation is not 0-copy, but is is (amortized) 0-allocation, and it minimizes unnecessary copying or buffering. It uses a single persistent buffer that is only resized if it's too small for a single token, and it reuses this buffer between tokens. Bytes are copied into this buffer directly from the input BufRead buffer, without extra intermediary copying.

Lucretiel
  • 3,145
  • 1
  • 24
  • 52
1

I use a struct which stores a BufRead and String. It can read to a vector or be passed to a macro which returns a tuple of parsed values.

Only one single String allocation is made.

use std::fs::File;
use std::io::{stdin, stdout, BufRead, BufReader, BufWriter, Write};
use std::str::{FromStr, SplitAsciiWhitespace};

struct Read<T: BufRead> {
    buf: T,
    line: String,
}

impl<T: BufRead> Read<T> {
    fn new(buf: T) -> Self {
        Read {
            buf,
            line: String::new(),
        }
    }

    fn line(&mut self) -> SplitAsciiWhitespace {
        self.line.clear();
        self.buf.read_line(&mut self.line).expect("Failed read");
        self.line.split_ascii_whitespace()
    }

    fn next_arr<U: FromStr>(&mut self) -> Vec<U> {
        self.line()
            .map(|s| s.parse().ok().expect("Failed parse"))
            .collect()
    }
}

macro_rules! scan {
    ( $read:expr, $( $type:ty ),* ) => {{
        let mut iter = $read.line();
        ($(
            iter.next()
                .expect("No next")
                .parse::<$type>()
                .expect("Couldn't parse")
        ),*)
    }}
}

This simple approach allows for little boilerplate and a short, clear syntax:

fn main() {
    // Open inputs and outputs
    let file = BufReader::new(File::open("in.txt").unwrap());
    let mut read = Read::new(stdin().lock()); // or file
    let mut out = BufWriter::new(stdout().lock());

    // Read different types from one line
    let (a, b, c, d) = scan!(read, bool, f32, u8, String);

    // Read a vector from one line
    let arr: Vec<i64> = read.next_arr();

    // Write output
    writeln!(out, "{} {} {} {}", a, b, c, d).unwrap();
    writeln!(out, "{:?}", arr).unwrap();
}
marcin42
  • 11
  • 2