
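The obvious approach, `String::truncate(usize)`, fails because its argument is a byte length and it panics when that length does not fall on a character boundary; it doesn't consider Unicode characters (which is baffling considering Rust treats strings as Unicode).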

let mut s = "ボルテックス".to_string();
s.truncate(4);

thread '<main>' panicked at 'assertion failed: self.is_char_boundary(new_len)'

Additionally, `truncate` modifies the original string in place, which is not always desired.

The best I've come up with is to convert to chars and collect into a String.

fn truncate(s: String, max_width: usize) -> String {
    s.chars().take(max_width).collect()
}

For example:

fn main() {
    assert_eq!(truncate("ボルテックス".to_string(), 0), "");
    assert_eq!(truncate("ボルテックス".to_string(), 4), "ボルテッ");
    assert_eq!(truncate("ボルテックス".to_string(), 100), "ボルテックス");
    assert_eq!(truncate("hello".to_string(), 4), "hell");
}

However, this feels very heavy-handed.

Shepmaster
Peter Uhnak
  • 11
    Unicode is freaking complicated. Are you sure you want `char` (which corresponds to code points) as unit and not grapheme clusters? –  Jul 19 '16 at 14:35
  • 4
    Actually, the other direction is just as valid: Impose a limit on the number of *bytes* the UTF-8 encoding takes (you need some care to chop off whole characters — take as many `char`s as possible without going over N bytes). While this does not match people's perception of character counts, it is reasonable when the restriction is storage-motivated (e.g., the size of a database column). –  Jul 19 '16 at 14:50

1 Answer


Make sure you read and understand delnan's point:

> Unicode is freaking complicated. Are you sure you want `char` (which corresponds to code points) as unit and not grapheme clusters?

The rest of this answer assumes you have a good reason for using `char` and not graphemes.

> which is baffling considering Rust treats strings as Unicode

This is not correct; Rust treats strings as UTF-8. In UTF-8, every code point is mapped to a variable number of bytes. There's no O(1) algorithm to convert "6 characters" to "N bytes", so the standard library doesn't hide that from you.
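
As a rough illustration of that variable width, here is a minimal sketch that uses only standard-library calls (`char::len_utf8` and `str::is_char_boundary`). The helper name `utf8_widths` is hypothetical and exists only for this example:

```rust
/// Returns the UTF-8 encoded length, in bytes, of each `char` in `s`.
/// (`utf8_widths` is a hypothetical helper name used for illustration.)
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    let s = "ボルテックス";

    // Six code points, three bytes each: 18 bytes in total.
    assert_eq!(utf8_widths(s), vec![3; 6]);
    assert_eq!(s.len(), 18);

    // Byte offset 4 lies inside the second character, which is what
    // makes `String::truncate(4)` panic; offsets 3 and 6 are boundaries.
    assert!(!s.is_char_boundary(4));
    assert!(s.is_char_boundary(3) && s.is_char_boundary(6));

    // ASCII text uses one byte per character.
    assert_eq!(utf8_widths("hello"), vec![1; 5]);
}
```

Finding the byte offset of the Nth character therefore means walking the string from the start, which is what the code in the rest of this answer does.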

You can use char_indices to step through the string character by character and get the byte index of that character:

fn truncate(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        None => s,
        Some((idx, _)) => &s[..idx],
    }
}

fn main() {
    assert_eq!(truncate("ボルテックス", 0), "");
    assert_eq!(truncate("ボルテックス", 4), "ボルテッ");
    assert_eq!(truncate("ボルテックス", 100), "ボルテックス");
    assert_eq!(truncate("hello", 4), "hell");
}

This returns a slice, which you can copy into a new allocation if you need an owned `String`. You can also use it to truncate a `String` in place:

// May not be as efficient as inlining the code...
fn truncate_in_place(s: &mut String, max_chars: usize) {
    let bytes = truncate(&s, max_chars).len();
    s.truncate(bytes);
}

fn main() {
    let mut s = "ボルテックス".to_string();
    truncate_in_place(&mut s, 0);
    assert_eq!(s, "");
}
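
If the limit is really about storage, as suggested in the comments on the question, the same idea works with a byte budget instead of a character count. The following is only a sketch under that assumption; `truncate_to_bytes` is a hypothetical name, not a standard-library function:

```rust
/// Returns the longest prefix of `s` that is at most `max_bytes` bytes
/// long and ends on a `char` boundary.
/// (`truncate_to_bytes` is a hypothetical name, not a std function.)
fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back from `max_bytes` to the nearest char boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    // Each character of this string takes 3 bytes in UTF-8.
    assert_eq!(truncate_to_bytes("ボルテックス", 0), "");
    assert_eq!(truncate_to_bytes("ボルテックス", 4), "ボ");
    assert_eq!(truncate_to_bytes("ボルテックス", 6), "ボル");
    assert_eq!(truncate_to_bytes("ボルテックス", 100), "ボルテックス");
    assert_eq!(truncate_to_bytes("hello", 4), "hell");
}
```

Because a UTF-8 character is at most 4 bytes long, the loop steps back at most 3 times, so this variant does not need to scan the string from the start.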
Community
Shepmaster
  • How is using `char_indices()` different from my use of `chars()`? – Peter Uhnak Jul 20 '16 at 07:41
  • 2
    @Peter `chars` only returns the characters. `char_indices` is similar in concept to `chars().enumerate()` except it returns the actual index of the `u8` that character starts at in the original `str`. – Linear Jul 20 '16 at 11:05
  • 1
    `.skip(max_chars).next()` → `.nth(max_chars)`. – Veedrac Jul 20 '16 at 12:20
  • 2
    @Veedrac Every. Single. Time. I will never remember it! [Clippy feature request](https://github.com/Manishearth/rust-clippy/issues/1112)! – Shepmaster Jul 20 '16 at 12:48
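
The difference between `chars().enumerate()` and `char_indices()` described in the comments above can be seen in a small sketch. The helper names `char_positions` and `byte_positions` are hypothetical and exist only for this example:

```rust
/// Pairs each `char` with its position counted in characters.
/// (Hypothetical helper, for illustration only.)
fn char_positions(s: &str) -> Vec<(usize, char)> {
    s.chars().enumerate().collect()
}

/// Pairs each `char` with the byte offset at which it starts in `s`.
/// (Hypothetical helper, for illustration only.)
fn byte_positions(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    let s = "aボb";

    // `enumerate` counts items: 0, 1, 2.
    assert_eq!(char_positions(s), vec![(0, 'a'), (1, 'ボ'), (2, 'b')]);

    // `char_indices` reports byte offsets: 'ボ' is 3 bytes wide,
    // so 'b' starts at byte 4.
    assert_eq!(byte_positions(s), vec![(0, 'a'), (1, 'ボ'), (4, 'b')]);

    // Only the byte offsets are valid slice indices for `&s[..idx]`.
    assert_eq!(&s[..4], "aボ");
}
```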