
To infer the schema of a CSV file passed via `/dev/stdin` before converting the whole file to Parquet, I have implemented a wrapper that buffers the input and implements `Seek`, as required by the crate arrow2. This all works.

This wrapper is, however, unnecessary when a file is redirected to stdin, as in `my_binary /dev/stdin <test.csv`. I would like to use the wrapper only when it is really needed, as in `zstdcat test.csv.zst | my_binary /dev/stdin`. So I need to know whether the file I open returns an error on seek.

The following method I came up with seems to work. Is there a better way? Is this idiomatic for Rust?

fn main() {
    let mut file = std::fs::File::open("/dev/stdin").unwrap();
    let seekable = match std::io::Seek::rewind(&mut file) {
        Ok(_) => true,
        Err(_) => false,
    };
    println!("File is seekable: {:?}", seekable);
}

There is a similar question for C, "How to determine if a file descriptor is seekable?", but its solution doesn't transfer directly to Rust. Or is that effectively what `file.rewind()` does under the hood?

Cornelius Roemer
  • I would argue that if your program needs seekable access to a file, it should not be reading from standard input at all, or you should supply the name of the file from which the schema could be inferred as a separate argument to be opened internally (even if it ends up being the same file). – chepner Mar 04 '23 at 17:42
  • @chepner here's the use case where you do want to read from standard input: if you have a `zstd` compressed file that uncompressed would be 1TB, compressed only 10GB. You don't want to uncompress it to disk just so that the first 1000 lines or so can be read twice for schema inference. – Cornelius Roemer Mar 04 '23 at 19:35
  • Where does the seeking come in? Read the first 1000 lines, build the schema, then resume reading. – chepner Mar 04 '23 at 20:05
  • Crate `arrow-rs` expects Read and Seek to be implemented for `infer_file_schema`: https://docs.rs/arrow-csv/34.0.0/arrow_csv/reader/fn.infer_file_schema.html – Cornelius Roemer Mar 04 '23 at 20:08

1 Answer


There is a similar question for C, but the solution doesn't seem to transfer to Rust: How to determine if a file descriptor is seekable? - or is this effectively what file.rewind() does under the hood?

`rewind` actually performs an `lseek(fd, 0, SEEK_SET)`, so it has the side effect of, well, rewinding (hence the name) the fd's cursor. I assume the original uses `SEEK_CUR` to avoid moving the cursor on seekable files, for maximum generality.

If you want to match the original C solution exactly, use `seek(SeekFrom::Current(0))`. If moving the cursor doesn't matter, `rewind` is fine.
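To make the difference concrete, here is a small sketch that reads a few bytes from a regular file and then tries both probes. The scratch file name and the `demo` function are made up for illustration; only `seek`, `rewind` and `stream_position` come from `std::io::Seek`.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Reads 4 bytes from a scratch file, then returns the cursor position
/// after `seek(SeekFrom::Current(0))` and after `rewind()`.
/// (Hypothetical helper, only here to show the side effect.)
fn demo() -> io::Result<(u64, u64)> {
    // Scratch file in the temp dir; the name is an arbitrary choice.
    let path = std::env::temp_dir()
        .join(format!("seek_probe_demo_{}.csv", std::process::id()));
    {
        let mut f = File::create(&path)?;
        f.write_all(b"a,b\n1,2\n")?;
    }

    let mut file = File::open(&path)?;
    let mut buf = [0u8; 4];
    file.read_exact(&mut buf)?; // cursor is now at offset 4

    // A zero-length relative seek reports the position without moving it.
    let after_probe = file.seek(SeekFrom::Current(0))?;

    // rewind() is a seek to offset 0, so the cursor moves.
    file.rewind()?;
    let after_rewind = file.stream_position()?;

    std::fs::remove_file(&path)?;
    Ok((after_probe, after_rewind))
}

fn main() -> io::Result<()> {
    let (after_probe, after_rewind) = demo()?;
    println!("after seek(Current(0)): {after_probe}");
    println!("after rewind(): {after_rewind}");
    Ok(())
}
```

For a freshly opened file both probes leave the cursor at 0, so the distinction only matters if something has already been read.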

Additionally:

  • You don't need a `match`: just call `is_ok` on the result of `rewind` (or `seek`).
  • You don't need to write `std::io::Seek::rewind(&mut file)`: once the trait is in scope via `use std::io::Seek`, you can call its methods directly on any seekable object, such as a file.

So:

use std::io::{Seek, SeekFrom::Current};

fn main() {
    let mut file = std::fs::File::open("/dev/stdin").unwrap();
    let seekable = file.seek(Current(0)).is_ok();
    println!("File is seekable: {:?}", seekable);
}

matches the C answer exactly.
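For the use case in the question, the probe could be wrapped in a small helper and used to decide whether a wrapper is needed at all. This is only a sketch: `ReadSeek`, `is_seekable` and `open_seekable` are names invented here, and the `Cursor<Vec<u8>>` fallback is a stand-in for your own buffering wrapper (it reads the whole input into memory, which you would not want for very large inputs).

```rust
use std::fs::File;
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Combines Read and Seek so both branches can share one boxed type.
/// (Hypothetical helper trait, not part of std.)
trait ReadSeek: Read + Seek {}
impl<T: Read + Seek> ReadSeek for T {}

/// Returns true if a no-op seek succeeds on the stream.
fn is_seekable<S: Seek>(stream: &mut S) -> bool {
    stream.seek(SeekFrom::Current(0)).is_ok()
}

/// Opens `path` and hands the file back unchanged when it is seekable;
/// otherwise falls back to a buffered, seekable copy.
fn open_seekable(path: &str) -> io::Result<Box<dyn ReadSeek>> {
    let mut file = File::open(path)?;
    if is_seekable(&mut file) {
        Ok(Box::new(file))
    } else {
        // ASSUMPTION: placeholder for the question's buffering wrapper.
        // Reading everything into memory is only acceptable for small inputs.
        let mut bytes = Vec::new();
        file.read_to_end(&mut bytes)?;
        Ok(Box::new(Cursor::new(bytes)))
    }
}

fn main() -> io::Result<()> {
    let path = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "/dev/stdin".to_string());
    let mut input = open_seekable(&path)?;
    println!("opened {path} at offset {}", input.stream_position()?);
    Ok(())
}
```

Returning a boxed trait object keeps the calling code identical in both cases; an enum with two variants would avoid the allocation if that matters.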

For what it's worth, on my Mac the device files are seekable by default.

The only way I can get it to fail is to pipe (not redirect):

> ./test
File is seekable: true
> </dev/null ./test
File is seekable: true
> </dev/null cat | ./test
File is seekable: false
Masklinn