I'm trying to write a build.rs script that generates an up-to-date HashMap mapping the first 6 characters of a MAC address to its corresponding vendor.

The map has 29231 key-value pairs, which makes cargo check spend more than 7 minutes on my source code; before this change it took less than 20 seconds. It also uses all 8 GB of RAM available on my laptop, so I cannot use the machine during those 7-8 minutes.

I think this is either a rustc/cargo bug or I am doing something wrong, and I'm pretty sure it is the latter. What is the correct way of generating code like this?

main.rs

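use std::collections::{HashMap, HashSet};
use std::hash::BuildHasherDefault; // needed for the CustomHasher alias
use lazy_static::lazy_static; // the generated file uses the lazy_static! macro
use rustc_hash::{FxHashMap, FxHashSet, FxHasher};
type CustomHasher = BuildHasherDefault<FxHasher>;
// Pull in the code generated by build.rs.
include!(concat!(env!("OUT_DIR"), "/map_oui.rs"));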

map_oui.rs

#[rustfmt::skip]
lazy_static! {
    static ref MAP_MACS: FxHashMap<&'static [u8; 6], &'static str> = {
    let mut map_macs = HashMap::with_capacity_and_hasher(29231, CustomHasher::default());
    map_macs.insert(b"002272", "American Micro-Fuel Device Corp.");
    map_macs.insert(b"00D0EF", "IGT");
//...

build.rs

use std::env;
use std::fs::File;
use std::io::prelude::*;
use std::io::{BufReader, BufWriter};
use std::path::Path;

fn main() {
    let out_dir = env::var_os("OUT_DIR").unwrap();
    let dest_path = Path::new(&out_dir).join("map_oui.rs");
    let handle = File::create(dest_path).unwrap();
    let mut writer = BufWriter::new(handle);
    let response = ureq::get("http://standards-oui.ieee.org/oui.txt")
        .call()
        .expect("Connection Error");
    let mut reader = BufReader::new(response.into_reader());
    let mut line = Vec::new();

    writer
        .write_all(
            b"#[rustfmt::skip]
lazy_static! {
    static ref MAP_MACS: FxHashMap<&'static [u8; 6], &'static str> = {
    let mut map_macs = HashMap::with_capacity_and_hasher(29231, CustomHasher::default());\n",
        )
        .unwrap();
    loop {
        match reader.read_until(b'\n', &mut line) {
            // Zero bytes read means end of input.
            Ok(0) => break,
            Ok(_) => {
                if line.get(12..=18).map_or(false, |s| s == b"base 16") {
                    let mac_oui = String::from_utf8_lossy(&line[0..6]);
                    let vendor = String::from_utf8_lossy(&line[22..]);
                    // write_all writes the whole buffer; write may stop early.
                    writer.write_all(b"    map_macs.insert(b\"").unwrap();
                    writer.write_all(mac_oui.as_bytes()).unwrap();
                    writer.write_all(b"\", \"").unwrap();
                    writer.write_all(vendor.trim().as_bytes()).unwrap();
                    writer.write_all(b"\");\n").unwrap();
                }
                line.clear();
            }
            // Fail the build on a read error instead of looping forever.
            Err(e) => panic!("failed to read the OUI list: {}", e),
        }
    }
    writer
        .write_all(
            b"    map_macs
    };
}
",
        )
        .unwrap();
    writer.flush().unwrap();
    println!("cargo:rerun-if-changed=build.rs");
}
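
For reference, the byte offsets used above (0..6, 12..=18, 22..) only make sense for one line layout. The sketch below isolates that parsing step in a hypothetical helper, `parse_oui_line`, which is not part of the script above. It assumes each relevant line of oui.txt consists of the six hex digits, five spaces, the text "(base 16)", two tabs, and then the vendor name; that layout is an assumption inferred from the offsets.

```rust
/// Hypothetical helper: extracts (OUI, vendor) from one line of oui.txt,
/// or returns `None` for lines that are not "base 16" entries.
fn parse_oui_line(line: &[u8]) -> Option<([u8; 6], String)> {
    // Bytes 12..=18 must spell "base 16"; `get` avoids panicking on short lines.
    if line.get(12..=18)? != b"base 16" {
        return None;
    }
    // Bytes 0..6 hold the six hex digits of the OUI as ASCII.
    let mut oui = [0u8; 6];
    oui.copy_from_slice(line.get(0..6)?);
    // The vendor name starts at byte 22 and runs to the end of the line.
    let vendor = String::from_utf8_lossy(line.get(22..)?).trim().to_owned();
    Some((oui, vendor))
}

fn main() {
    // Assumed layout: OUI, five spaces, "(base 16)", two tabs, vendor name.
    let line = b"002272     (base 16)\t\tAmerican Micro-Fuel Device Corp.\n";
    println!("{:?}", parse_oui_line(line));
}
```

Keeping the parsing in one function like this would also make it possible to unit-test the offsets without downloading the list.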
Shepmaster
Adrián Delgado
  • Maybe https://github.com/sfackler/rust-phf can help you? – user2722968 Jan 11 '21 at 17:05
  • If you need a workaround: you could embed the data file directly as a string. Sort it and zero-pad to the length of the longest string, so you can do a binary search inside it. Or, if you can afford the space, just stick those zero-padded strings in an array indexed on the 5 hex digits, which is at most 1M entries, leaving the unused entries blank. – Thomas Jan 11 '21 at 17:33
  • I wouldn't be surprised if creating a huge array of tuples ended up being faster to compile and execute: `[(key, value)]`. Another workaround would be to move all of this to a different crate completely; that way it should be built less frequently. – Shepmaster Jan 11 '21 at 17:51
  • Outputting a slice of tuples and compiling that takes 2 seconds and my shell reports it took ~145 MiB of RAM. – Shepmaster Jan 11 '21 at 18:24
  • `rust-phf` seemed promising but it is slightly slower. I followed @Thomas's and @Shepmaster's suggestions and it worked. Currently `build.rs` generates a `const MAP_MACS: [([u8; 6], &str); 29246]` and I wrote a wrapper function called `vendor_lookup` around a binary search of the array. Should I post the code as an answer for future reference? – Adrián Delgado Jan 11 '21 at 19:53
  • An answer is certainly appropriate. You might want to withhold accepting in case someone can answer the underlying question (which is IMHO still pertinent): why is the creation of this hash map at compile time so slow and memory intensive? – user4815162342 Jan 11 '21 at 19:57
  • So the question is “Why does compiling a function with 30,000 instructions use a lot of resources?” – mcarton Jan 11 '21 at 20:01
  • @mcarton maybe with the implied "compared to this other way of doing it (slice of tuples) that is way faster"? – Shepmaster Jan 11 '21 at 20:23

1 Answer

I followed @Thomas's and @Shepmaster's suggestions and it worked. Currently build.rs generates a const MAP_MACS: [([u8; 6], &str); 29246], and I wrote a wrapper function called vendor_lookup around a binary search of the array. However, it would still be good to know how to use a HashMap with a custom Hasher.

main.rs

include!(concat!(env!("OUT_DIR"), "/map_oui.rs"));

fn vendor_lookup(mac_oui: &[u8; 6]) -> Option<&'static str> {
    // build.rs sorts MAP_MACS by key, so a binary search is valid here.
    let idx = MAP_MACS
        .binary_search_by(|probe| probe.0.cmp(mac_oui))
        .ok()?; // `None` when the OUI is not in the table
    Some(MAP_MACS[idx].1)
}

fn main() {
    assert_eq!(vendor_lookup(b"4C3C16"), Some("Samsung Electronics Co.,Ltd"));
}
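
The lookup key is the first six hex digits of the MAC address as upper-case ASCII, so a real address has to be normalized before it is looked up. The following is a minimal sketch of that step; `oui_key` is a hypothetical helper that is not part of my actual code, and it leniently skips every character that is not a hex digit, which may or may not be the validation you want.

```rust
/// Hypothetical helper: takes a MAC address such as "4c:3c:16:aa:bb:cc" and
/// returns its first six hex digits as upper-case ASCII, e.g. b"4C3C16".
fn oui_key(mac: &str) -> Option<[u8; 6]> {
    let mut key = [0u8; 6];
    let mut len = 0;
    // Skip separators such as ':' or '-' and keep only hex digits.
    for b in mac.bytes().filter(|b| b.is_ascii_hexdigit()) {
        if len == 6 {
            break;
        }
        key[len] = b.to_ascii_uppercase();
        len += 1;
    }
    // Fewer than six hex digits means the input was not a usable address.
    if len == 6 {
        Some(key)
    } else {
        None
    }
}

fn main() {
    let key = oui_key("4c:3c:16:aa:bb:cc");
    assert_eq!(key, Some(*b"4C3C16"));
}
```

The resulting array can then be passed by reference to vendor_lookup.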

map_oui.rs

const MAP_MACS: [([u8; 6], &str); 29246] = [
    ([48, 48, 48, 48, 48, 48], "XEROX CORPORATION"),
    ([48, 48, 48, 48, 48, 49], "XEROX CORPORATION"),
    ([48, 48, 48, 48, 48, 50], "XEROX CORPORATION"),
    //---snip---
];

build.rs

use std::env;
use std::fs::File;
use std::io::prelude::*;
use std::io::{BufReader, BufWriter};
use std::path::Path;

fn main() {
    let response = ureq::get("http://standards-oui.ieee.org/oui.txt")
        .call()
        .expect("Connection Error");
    let mut reader = BufReader::new(response.into_reader());

    let mut data: Vec<(Vec<u8>, String)> = Vec::new();
    let mut line = Vec::new();
    while reader.read_until(b'\n', &mut line).unwrap() != 0 {
        if line.get(12..=18).map_or(false, |s| s == b"base 16") {
            let mac_oui = line[0..6].to_owned();
            let vendor = String::from_utf8_lossy(&line[22..]).trim().to_owned();
            data.push((mac_oui, vendor));
        }
        line.clear();
    }
    data.sort_unstable();

    let out_dir = env::var_os("OUT_DIR").unwrap();
    let dest_path = Path::new(&out_dir).join("map_oui.rs");
    let handle = File::create(dest_path).unwrap();
    let mut writer = BufWriter::new(handle);
    writeln!(
        &mut writer,
        "const MAP_MACS: [([u8; 6], &str); {}] = [",
        data.len()
    )
    .unwrap();
    for (key, value) in data {
        // {:?} on the String emits a quoted literal and escapes any `"` or `\` in the vendor name.
        writeln!(&mut writer, "    ({:?}, {:?}),", key, value).unwrap();
    }
    writeln!(&mut writer, "];").unwrap();
    writer.flush().unwrap();
    println!("cargo:rerun-if-changed=build.rs");
}
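
On the open point about a HashMap with a custom hasher: one option that might work is to keep the generated table as plain data and build the map from it at runtime, on first use, with a single loop instead of one generated insert call per entry. The sketch below is untested against the full list, so I cannot say how it affects compile time there. It uses only the standard library: the small `Fnv1a` hasher is a stand-in for FxHasher, the three-entry `MAP_MACS` is a stand-in for the generated table, and `mac_map` is a hypothetical name. `std::sync::OnceLock` requires Rust 1.70 or later.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};
use std::sync::OnceLock;

/// Stand-in for FxHasher: a tiny FNV-1a hasher, so no external crate is needed.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

type CustomHasher = BuildHasherDefault<Fnv1a>;
type MacMap = HashMap<[u8; 6], &'static str, CustomHasher>;

// Shortened stand-in for the table generated by build.rs.
const MAP_MACS: [([u8; 6], &str); 3] = [
    (*b"002272", "American Micro-Fuel Device Corp."),
    (*b"00D0EF", "IGT"),
    (*b"4C3C16", "Samsung Electronics Co.,Ltd"),
];

/// Builds the map on the first call by iterating over the table, then reuses it.
fn mac_map() -> &'static MacMap {
    static MAP: OnceLock<MacMap> = OnceLock::new();
    MAP.get_or_init(|| MAP_MACS.iter().copied().collect())
}

fn main() {
    assert_eq!(mac_map().get(b"00D0EF"), Some(&"IGT"));
    assert_eq!(mac_map().get(b"FFFFFF"), None);
}
```

With rustc_hash, swapping `Fnv1a` for `FxHasher` in the `CustomHasher` alias should be the only change needed, since `collect` works for any hasher builder that implements `Default`.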
Adrián Delgado