Effective substring in Rust

June 10, 2024 1479 Words

Where is my substring?

Recently, I was working on a service of Peky which is written in Rust. I found that there is no built-in substring() in Rust, this is annoying. It leaves me no choice but to implement my own substring().

Implementations

There are so many approaches to implementing a substring() in Rust, to find out which version is the best and the most efficient, I also wrote a benchmark using Criterion.rs.

rust

fn benchmark(c: &mut Criterion) {
    let text = "Hello✨, 🎈 this 🎉 is a text.";
    let (start, end) = (1, 20);
    let expected = "ello✨, 🎈 this 🎉 is ";
    c.bench_function("...", |bencher| {
        bencher.iter(|| {
            assert_eq!("", substring_v1(text, start, end));
        })
    });
    c.bench_function("....", |bencher| {
        bencher.iter(|| {
            assert_eq!(expected, substring_v2(text, start, end));
        })
    });
    ...
}

0. String slices

This simply doesn't work with multi-byte characters, DON'T use it.

rust

pub fn substring_v0(text: &str, start: usize, end: usize) -> &str {
    &text[start..end]
}

1. Skip and take on chars with collect

To get a char-based substring, we need chars(). This version is the cleanest and one-liner. It skips to the start index and takes the chars we need, then collects them to a String.

rust

/// skip and take on [[std::str::Chars]
pub fn substring_v1(text: &str, start: usize, end: usize) -> String {
    text.chars().skip(start).take(end - start).collect()
}

Benchmark result:

skip take and collect   time:   [313.51 ns 315.61 ns 318.00 ns]

Note: 300ns = 0.0003ms

It seems not slow but I wouldn't say it's fast according to later implementations.

2. Calculate byte indices manually

If substring is all about byte skipping and selecting, can we just calculate the indices manually?

My final implementation comes up with a loop and char.len_utf8(), which sums the byte count of each char until the start and the end index are found, it finally creates a string slice using the byte indices.

rust

/// Find the byte indices manually
pub fn substring_v2(text: &str, start: usize, end: usize) -> &str {
    let mut curr_byte_idx = 0;
    let mut start_byte_idx = 0;
    let mut end_byte_idx = text.len();
    for (index, ch) in text.chars().enumerate() {
        if index == start {
            start_byte_idx = curr_byte_idx;
        }
        if index == end {
            end_byte_idx = curr_byte_idx;
            break;
        }
        curr_byte_idx += ch.len_utf8();
    }
    &text[start_byte_idx..end_byte_idx]
}

Benchmark result:

manual byte indices   time:   [43.970 ns 44.178 ns 44.407 ns]

This only took 95ns, it's about 2.2x faster! My guess on this is that collect() in the first implementation took much more time than a simple string slice.

3. Calculate byte indices using char_indices()

There is a built-in function char_indices(), which returns a tuple(byte_index, char) iterator of the string, so we don't need to calculate the byte indices manually. And the implementation will be much cleaner.

rust

/// Find the byte indices using [str::char_indices]
pub fn substring_v3(text: &str, start: usize, end: usize) -> &str {
    let start_byte_idx = text.char_indices().nth(start).map(|(i, _)| i).unwrap_or(0);
    let end_byte_idx = text.char_indices().nth(end).map(|(i, _)| i).unwrap_or(text.len());
    &text[start_byte_idx..end_byte_idx]
}

Benchmark result:

2 char_indices()   time:   [30.878 ns 31.016 ns 31.160 ns]

It is even about 10ns faster than the previous version! The only drawback of the version is we used two iterators, but this will be improved later.

4. Calculate byte indices using single char_indices()

Due to the implementation of nth() function: It advances the iterator by n elements.

rust

fn nth(&mut self, n: usize) -> Option<Self::Item> {
    self.advance_by(n).ok()?;
    self.next()
}

fn advance_by(&mut self, n: usize) -> Result<(), NonZero<usize>> {
    for i in 0..n {
        if self.next().is_none() {
            return Err(unsafe { NonZero::new_unchecked(n - i) });
        }
    }
    Ok(())
}

We can reuse the iterator to find the end_byte_idx after calling the first nth() call. The implementation is much like the previous version.

rust

/// Find the byte indices using single [str::char_indices]
pub fn substring_v4(text: &str, start: usize, end: usize) -> &str {
    if start == end {
        // Ensure we won't get an overflow from nth(end - start - 1)
        return "";
    }
    let mut iter = text.char_indices();
    let start_byte_idx = iter.nth(start).map(|(i, _)| i).unwrap_or(0);
    let end_byte_idx = iter.nth(end - start - 1).map(|(i, _)| i).unwrap_or(text.len());
    &text[start_byte_idx..end_byte_idx]
}

Benchmark result:

1 char_indices()   time:   time:   [31.925 ns 32.055 ns 32.189 ns]

The performance remains the same, likely. It's good even though I was expecting a better performance on this because it skips less (start + end vs start + (end - start - 1)) and only uses one iterator, maybe it's because my start index is just 1 in the benchmark.

5. Left-right char skipping

This approach is similar to the first approach. Instead of calling take() on the iterator, it calls next() to advance the iterator to the start and calls next_back() to remove elements until the end is met. Finally, it calls Chars.as_str() instead collect() to get the subsequence, this is much faster.

rust

/// Left-right char skipping on [std::str::Chars]
pub fn substring_v5(text: &str, start: usize, end: usize) -> &str {
    let mut iter = text.chars();
    let mut left = 0;
    let mut right = iter.clone().count();
    loop {
        if left < start {
            iter.next();
            left += 1;
        }
        if right > end {
            iter.next_back();
            right -= 1;
        }
        if left == start && right <= end {
            break;
        }
    }
    iter.as_str()
}

Benchmark result:

left-right char skipping   time:   [30.525 ns 30.667 ns 30.817 ns]

As expected, this is much faster than the first version, so collect() is the evil. It also has the same seed as the previous two versions. Plus, when working with large texts:

rust

fn benchmark(c: &mut Criterion) {
    let text = &("Hello✨, 🎈 this 🎉 is a text.".repeat(1000));
    let (start, end) = (1, 20000);
    let expected = "ello✨, 🎈 ...";
    ...
}

It's even faster than the previous two versions!

skip take and collect      time:   [50.435 µs 50.770 µs 51.132 µs]
manual byte indices        time:   [29.578 µs 29.742 µs 29.923 µs]
2 char_indices()           time:   [16.551 µs 16.659 µs 16.780 µs]
1 char_indices()           time:   [16.583 µs 16.664 µs 16.751 µs]
left-right char skipping   time:   [14.096 µs 14.338 µs 14.584 µs]

The drawback of this approach is we used two iterators and it's not as clean as the previous two versions.

Which version to use?

Version 4: My first recommendation would be Version 4, which is clean, efficient, and fast.

Version 3: Version 3 is even cleaner than Version 4 and has the same speed, it's safe to use too.

Version 5: When dealing with large texts, Version 5 can be a better choice, it's the fastest!

Version 1: If you are a one-liner and don't care about speed, using Version 1 is okay.

Conclusion

I wrote this blog because something really changed my thoughts about Rust recently, one of my Rust functions took 6000ms while the equivalent npm package only took 600ms, just because I used text.chars().skip(...).take(...).collect() on a large number of strings. So it's better to say Rust can be fast than Rust is fast.

There are also some limitations in the benchmark, such as not considering the position or length of the subsequence, the relative performance differences may change if the start and end differ. But in most cases, Version 3, 4, and 5 should be fine.

Thanks for reading this. You can find me on Twitter: @letmutex.

All source code is in this repo

Credits

  • nhellman on Hacker News pointed out that the return type of substring() should be &str.
  • usr1106 on Hacker News pointed out a case of misuse of 'Effective'.
©2024 letmutex.com