I have a 1.5GB UTF-16LE text file. I want to read it and test each line against a regex. Right now I use the following two crates:

encoding_rs = "0.8.31"
encoding_rs_io = "0.1.7"

The code to read the file looks like this:

use std::io::Read;

fn decode_utf16le(buf: Vec<u8>) -> String {
    let enc = encoding_rs::Encoding::for_label("utf-16le".as_bytes());
    let mut dec = encoding_rs_io::DecodeReaderBytesBuilder::new()
        .encoding(enc)
        .build(&buf[..]);
    let mut res = String::new();
    dec.read_to_string(&mut res).unwrap();
    res
}

let mut file = File::open("huge.text.file.txt").unwrap();
let mut buffer = Vec::new();

file.read_to_end(&mut buffer).unwrap();
let contents = decode_utf16le(buffer);

However, the call to decode_utf16le(buffer) alone takes almost 20 seconds. Is it possible to read the file directly and match against a regex?

Fajela Tajkiya
    You're doing multiple passes over the data... Firstly, you don't need a `String` to run a regex search. See the `regex::bytes` submodule. Secondly, you can build a `DecodeReaderBytesBuilder` directly from a `File` instead of reading the entire file to the heap and then transcoding data on the heap into yet another heap buffer. – BurntSushi5 Oct 10 '22 at 13:26

1 Answer

Is it possible to read the file directly and match against a regex?

The only way is if the regex engine supports searching UTF-16 directly. I'm not aware of any regex engine available in Rust that supports it. PCRE2 supports it, and there are PCRE2 bindings for Rust, but they don't expose its UTF-16 functionality.

(Note that my comment above tries to get to the heart of the matter as I believe the spirit of your question is "how to make this faster" instead of asking specifically about a particular optimization technique.)

BurntSushi5