An Optimization That’s Impossible in Rust!

2024/08/30

data structure

#dynamically sized type #german string #optimization #reference counting #shared ownership #short string #small string #umbra string #unsafe

TL;DR: the code is published as the strumbra crate at https://crates.io/crates/strumbra.

As I’m studying database systems for my Master’s degree in Germany, this article Why German Strings are Everywhere immediately caught my attention. I was pretty excited to learn it’s the string data structure described in the paper Umbra: A Disk-Based System with In-Memory Performance, which had been introduced to me through another paper about its storage engine - LeanStore: In-Memory Data Management Beyond Main Memory. I was even more interested to learn that it has been implemented in many different data solutions like DuckDB, Apache Arrow, and Polars.

One thing that particularly stood out from the article and had me puzzled was the following claim:

This string implementation also allows for the very important “short string optimization”: A short enough string can be stored “in place,” i.e., we set a specific bit in the capacity field, and the remainder of capacity, as well as size and ptr, become the string itself. This way, we save on allocating a buffer and a pointer dereference each time we access the string. An optimization that’s impossible in Rust, by the way ;).

If I interpret the statement correctly, it's objectively wrong: there are already plenty of Rust crates that support this exact feature, e.g., compact_str, smartstring, and smol_str. Moreover, polars, an example from the article itself, is written in Rust and implements Umbra-style strings. I got nerd-sniped and started implementing a German string to show myself it was possible.

by xkcd licensed under CC BY-NC 2.5

Nerd sniping.


In this article, I'll describe how I implemented German strings and the challenges of doing so in Rust. Specifically, I'll examine how to enable shared ownership for such a data structure. Providing unique ownership is relatively trivial, and there is already a nicely written tutorial in the Rustonomicon teaching you how to implement a Vec<T>, which is not much different from a String. But first, let's talk about the concept of the German string.

don’t we have enough string types in Rust?

The term “German-style string” was coined by Andy Pavlo in one of his lectures on Advanced Database Systems. “German string,” “German-style string,” and “Umbra-style string” all refer to the same concept.

At a high level, strings are conceptually simple - just sequences of characters. However, there's more going on with strings than one may realize. Even the concept of a character is already complex in itself. Depending on the use case, how a string is represented in memory can differ widely. Just within Rust's standard library, there are several different owned string types, each with a corresponding borrowed version:

  1. String and &str for UTF-8 encoded text.
  2. OsString and &OsStr for platform-native strings.
  3. CString and &CStr for nul-terminated, C-compatible strings.
  4. PathBuf and &Path for filesystem paths.

Rust strings

String is the most commonly used type in Rust and thus will be the subject of our discussion. Internally, String is just a Vec<u8>, i.e., a sequence of bytes. Additionally, that sequence of bytes must be UTF-8 encoded, meaning a character is not always a single byte but takes between one and four bytes. The implementation of String is as follows, minus some details:

pub struct String {
    vec: Vec<u8>,
}

pub struct Vec<T> {
    buf: RawVec<T>,
    len: usize,
}

pub(crate) struct RawVec<T, A: Allocator = Global> {
    ptr: Unique<T>,
    cap: Cap,
    alloc: A,
}

pub struct Unique<T: ?Sized> {
    pointer: NonNull<T>,
    _marker: PhantomData<T>,
}

That looks too complicated. But, essentially, a String consists of three parts on the stack that take up 24 bytes:

Rust string layout.

A String is mutable and can be appended to when needed, so we need to keep track of the length and the capacity with the invariant that cap >= len. Whenever len grows larger than cap, a new buffer is created with enough capacity to hold the string with the new length, and the old buffer gets deallocated.
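We can confirm that those three parts add up to 24 bytes on a 64-bit target:

```rust
use std::mem::size_of;

fn main() {
    // On a 64-bit target, a String is three usize-sized parts:
    // a pointer to the buffer, the length, and the capacity.
    assert_eq!(size_of::<String>(), 24);
    // The buffer pointer is non-null, so Option<String> can use that
    // niche and needs no extra space for the discriminant.
    assert_eq!(size_of::<Option<String>>(), 24);
}
```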

German strings

Theoretically, a string can hold arbitrarily long text of any kind. Nonetheless, from real-world usages of strings, especially in databases, we can observe a few common characteristics:

  1. Most strings are short, often fitting within a dozen or so bytes.
  2. Strings are rarely modified after creation; they are mostly read and compared.
  3. Many operations, such as equality checks and sorting, only need to look at the first few bytes of a string.

Based on those observations, German string was invented to address these specific use cases, aiming to accelerate string processing performance. The basic idea follows what is called “short string optimization.” Instead of allocating a buffer on the heap, we can avoid the expensive allocation and store the bytes directly on the stack for short enough strings by repurposing cap and ptr. Having no allocation also means we don’t need to follow pointers when accessing short strings.

Internally, a German string takes only 16 bytes on the stack and has two different representations depending on the string length. A German string is also immutable, removing the need to track the buffer capacity.

short strings

German string layout for short strings.

Strings of at most 12 bytes are inlined and stored directly on the stack. Out of the 16 bytes, 4 are used to hold the string length, and 12 are used to hold the string content.

long strings

German string layout for long strings.

Strings over 12 bytes are heap-allocated like String and have a pointer to the heap-allocated buffer. But there are a few twists:

  1. The first 4 bytes of the string are copied into a prefix field on the stack, so many comparisons can be resolved without touching the heap.
  2. Because the string is immutable, the buffer is allocated exactly once with the right size, and no capacity field is needed.

another one

Now that we know the basic shape of a German string and its motivation, let's see how we can implement it in Rust. To start, we aim to implement an owned, immutable, UTF-8 encoded string that supports the optimization techniques described earlier. Moreover, the heap-allocated buffer will have shared owners through atomic reference counting, allowing us to reuse it even across threads.

There are two ways we can structure our type. The first layout that came to mind was this.

pub struct UmbraString {
    len: u32,
    data: Data,
}

union Data {
    buf: [u8; 12],
    ptr: ManuallyDrop<SharedDynBytes>,
}

struct SharedDynBytes {
    prefix: [u8; 4],
    ptr: NonNull<SharedDynBytesInner>,
    phantom: PhantomData<SharedDynBytesInner>,
}

Under Rust's layout rules, SharedDynBytes has a size of 16 bytes and an alignment of 8, resulting in UmbraString having a size of 24 bytes and an alignment of 8. This is different from the layout we want. To make UmbraString 16 bytes, we can lower SharedDynBytes's alignment to 4 and make everything correctly sized using #[repr(C, packed(4))]. However, clippy, Rust's linter, won't be happy when we reference the ptr field of SharedDynBytes because we might create a reference to an unaligned address, which is undefined behavior in Rust. While we never need to create such a reference, we are prevented from using some nice-to-have APIs on pointers.

UmbraString layout for long strings with fields padded to their alignment.
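The effect of packing can be sketched with simplified field types (a plain NonNull<u8> standing in for the real pointer):

```rust
use std::mem::{align_of, size_of};
use std::ptr::NonNull;

// Default layout: the 4-byte prefix is padded to the pointer's 8-byte
// alignment, so the struct grows to 16 bytes.
#[repr(C)]
struct Padded {
    prefix: [u8; 4],
    ptr: NonNull<u8>,
}

// packed(4) caps the alignment at 4, removing the padding.
#[repr(C, packed(4))]
struct Packed {
    prefix: [u8; 4],
    ptr: NonNull<u8>,
}

fn main() {
    assert_eq!((size_of::<Padded>(), align_of::<Padded>()), (16, 8));
    assert_eq!((size_of::<Packed>(), align_of::<Packed>()), (12, 4));
}
```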

A more considerable annoyance is that this layout complicates implementing string comparison and ordering. When comparing or ordering strings, we want to examine the prefix before anything else. With this layout, we must first check the string length to determine whether Data is a buf or a ptr before getting to the prefix. To overcome that, we could mark everything with #[repr(C)] and rely on the invariant that prefix is the first field in SharedDynBytes; we could then interpret data as a buf without checking the length and still be sound as long as we only access the first 4 bytes. A better approach follows, which doesn't rely on such an invariant and frees us from worrying about field order when providing structs with different ownership strategies for the allocated buffer.

#[repr(C)]
pub struct UmbraString {
    len: u32,
    prefix: [u8; 4],
    trailing: Trailing,
}

#[repr(C)]
union Trailing {
    buf: [u8; 8],
    ptr: ManuallyDrop<SharedDynBytes>,
}

#[repr(C)]
struct SharedDynBytes {
    ptr: NonNull<SharedDynBytesInner>,
    phantom: PhantomData<SharedDynBytesInner>,
}

Memory-wise, there's no difference compared to the #[repr(C, packed(4))] version. Instead of a 12-byte union, our new union, Trailing, is only 8 bytes, containing either the last 8 bytes of the string or a pointer to a heap-allocated buffer. Furthermore, we must annotate the structs with #[repr(C)] so the fields are laid out in declaration order; it is essential for UmbraString that the prefix field is followed by the trailing field. Here, the prefix is pulled out of the union for easy access, allowing Rust's layout rules to work their magic without much interference from us. Having the prefix as a 4-byte array also improves performance because comparing fixed-size arrays is much faster than comparing slices.
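A minimal sketch, with the inner pointee simplified to u8, confirms the 16-byte, 8-byte-aligned layout on a 64-bit target:

```rust
use std::mem::{align_of, size_of, ManuallyDrop};
use std::ptr::NonNull;

#[repr(C)]
struct SharedDynBytes {
    ptr: NonNull<u8>, // simplified pointee for this sketch
}

#[repr(C)]
union Trailing {
    buf: [u8; 8],
    ptr: ManuallyDrop<SharedDynBytes>,
}

#[repr(C)]
struct UmbraString {
    len: u32,
    prefix: [u8; 4],
    trailing: Trailing,
}

fn main() {
    // len and prefix take the first 8 bytes; the union takes 8 more.
    assert_eq!(size_of::<UmbraString>(), 16);
    assert_eq!(align_of::<UmbraString>(), 8);
}
```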

atomic reference counted bytes

SharedDynBytes, an atomic reference counted byte slice, behaves like an Arc. This begs the question - why don’t we just use Arc<[u8]>?

dynamically sized types

Most types in Rust have a statically known size and alignment. Dynamically sized types (DSTs) are the exception to this rule, and Rust exposes two major kinds:

  1. Slices such as [T] and str, whose length is only known at runtime.
  2. Trait objects such as dyn Trait, whose concrete type is only known at runtime.

Because DSTs don’t have a statically known size, they can’t be on the stack and must be accessed through a pointer. Consequently, every pointer to a DST is a fat pointer consisting of the pointer itself and additional information about the data being pointed to. In the case of a slice in which we’re interested, the extra information is the length. This results in Arc<[T]> having a size of 16 bytes instead of 8, causing UmbraString to take 24 bytes instead of 16.

UmbraString layout using Arc<[u8]>.

We know that UmbraString already keeps track of the length, so the extra length carried by the fat pointer is redundant. We need a way to turn a fat pointer into a thin pointer, and that's not supported by Arc<T>. Currently, there's no way to directly construct a value of a custom DST; we can only create one by making the type generic and performing an unsizing coercion.
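Both facts are easy to verify on a 64-bit target; the Inner struct below is a hypothetical stand-in illustrating the unsizing coercion:

```rust
use std::mem::size_of;
use std::sync::Arc;

struct Inner<T: ?Sized> {
    count: usize,
    data: T, // the last field may be unsized
}

fn main() {
    // Fat pointers carry the slice length alongside the address.
    assert_eq!(size_of::<&u8>(), 8);
    assert_eq!(size_of::<&[u8]>(), 16);
    assert_eq!(size_of::<Arc<[u8]>>(), 16);

    // The only way to get an Inner<[u8]> is to unsize a sized Inner<[u8; N]>.
    let sized: Box<Inner<[u8; 3]>> = Box::new(Inner { count: 1, data: [1, 2, 3] });
    let unsized_box: Box<Inner<[u8]>> = sized; // unsizing coercion
    assert_eq!(unsized_box.data.len(), 3);
    assert_eq!(unsized_box.count, 1);
}
```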

layout

With the problem laid out, here are the structs for our reference-counted byte slice, slightly different from the previous example.

#[repr(C)]
pub struct SharedDynBytes {
    ptr: NonNull<SharedDynBytesInner<[u8; 0]>>,
    phantom: PhantomData<SharedDynBytesInner<[u8; 0]>>,
}

#[repr(C)]
struct SharedDynBytesInner<T: ?Sized> {
    count: AtomicUsize,
    data: T,
}

A struct can be made a DST by having its last field be a DST. Here, SharedDynBytesInner holds a reference count and a generic value of a possibly unsized type. SharedDynBytesInner is always allocated on the heap, and the atomic count tracks the number of references currently given out. By making a single allocation for both the reference count and the data, we avoid an extra allocation and indirection. While the type is generic, we are only interested in two specializations that have the exact same struct alignment and field offsets: SharedDynBytesInner<[u8; 0]> and SharedDynBytesInner<[u8]>.

SharedDynBytes representation in memory

One cool trick is converting between a fat pointer of type *mut SharedDynBytesInner<[u8]> and a thin pointer of type *mut SharedDynBytesInner<[u8; 0]>. The conversion lets us add or remove the slice-length metadata depending on how we want to use the pointer. Going from fat to thin is easy - it's just a single cast.

let fat_ptr: *mut SharedDynBytesInner<[u8]> = _;
let thin_ptr = fat_ptr as *mut SharedDynBytesInner<[u8; 0]>;

But there's more dark magic to converting a thin pointer to a fat one. Instead of a direct cast, we must first turn the thin pointer into a pointer to a slice with a specific length, which tells Rust to embed the length into the pointer. Only then can the slice pointer be cast to a pointer to our custom DST.

let thin_ptr: *mut SharedDynBytesInner<[u8; 0]> = _;
let fake_slice = std::ptr::slice_from_raw_parts_mut(thin_ptr.cast::<u8>(), len);
let fat_ptr = fake_slice as *mut SharedDynBytesInner<[u8]>;
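Here's a runnable sketch of the same round trip, using a plain byte buffer in place of a real SharedDynBytesInner allocation:

```rust
fn main() {
    let mut data = [1u8, 2, 3, 4, 5];
    let thin: *mut u8 = data.as_mut_ptr();

    // Attach a length: thin pointer -> fat slice pointer.
    let fat: *mut [u8] = std::ptr::slice_from_raw_parts_mut(thin, 5);
    // The length now travels with the pointer itself.
    assert_eq!(unsafe { (*fat).len() }, 5);

    // Strip the length again: fat pointer -> thin pointer.
    let thin_again = fat as *mut u8;
    assert_eq!(thin_again, thin);
}
```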

allocating

Now that we can convert between a thin pointer and a fat pointer, we can start allocating memory for it. Unless we can coerce a statically sized type into a dynamically sized type, we can only create a value of a DST by manually allocating it. Before doing that, we must define a layout for the allocated value.

fn shared_dyn_bytes_inner_layout(len: usize) -> Layout {
    Layout::new::<SharedDynBytesInner<()>>()
        .extend(array_layout::<u8>(len))
        .expect("A valid layout for a SharedDynBytesInner")
        .0
        .pad_to_align()
}

The layout is constructed by first creating a layout for SharedDynBytesInner<()> where the field data has the zero-sized type () taking no space. We then extend that layout with the layout of a byte array, resulting in the layout for SharedDynBytesInner<[u8]>. With that layout, we can allocate just enough space for our atomic count and a byte array of arbitrary size.
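As a sanity check, we can compute the layout for a 5-byte string by hand. This sketch uses a simplified header struct in place of SharedDynBytesInner<()> and std's Layout::array in place of the article's array_layout helper:

```rust
use std::alloc::Layout;
use std::sync::atomic::AtomicUsize;

#[repr(C)]
struct Header {
    count: AtomicUsize, // stands in for SharedDynBytesInner<()>
}

fn main() {
    let len = 5;
    let (layout, data_offset) = Layout::new::<Header>()
        .extend(Layout::array::<u8>(len).expect("a valid byte-array layout"))
        .expect("a valid layout for the header plus data");
    let layout = layout.pad_to_align();

    // 8 bytes of header, then the bytes, padded up to the 8-byte alignment.
    assert_eq!(data_offset, 8);
    assert_eq!(layout.size(), 16);
    assert_eq!(layout.align(), 8);
}
```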

fn from(bytes: &[u8]) -> Self {
    let ptr = if bytes.is_empty() {
        NonNull::dangling()
    } else {
        let layout = shared_dyn_bytes_inner_layout(bytes.len());
        let nullable = unsafe { std::alloc::alloc(layout) };
        let nullable_fat_ptr = SharedDynBytesInner::<[u8]>::cast(nullable, bytes.len());
        let Some(fat_ptr) = NonNull::new(nullable_fat_ptr) else {
            std::alloc::handle_alloc_error(layout)
        };
        unsafe {
            std::ptr::write(
                std::ptr::addr_of_mut!((*fat_ptr.as_ptr()).count),
                AtomicUsize::new(1),
            );
            std::ptr::copy_nonoverlapping(
                bytes.as_ptr(),
                std::ptr::addr_of_mut!((*fat_ptr.as_ptr()).data).cast(),
                bytes.len(),
            );
        }
        fat_ptr.cast()
    };
    Self {
        ptr,
        phantom: PhantomData,
    }
}

We take a lazy approach to allocation - nothing is allocated if the given slice is empty. If it’s not empty, we do the following:

  1. Allocate using the layout we defined.
  2. Cast the returned pointer to a fat pointer to our DST.
  3. Copy data from the given slice into the data field.
  4. Cast the fat pointer back into a thin pointer and store it.

deallocating

Because our reference-counted type doesn't know the length of the array it holds, we can't implement Drop for it. Instead, we provide a method for the user of the type to manually deallocate it by passing the correct array length.

unsafe fn dealloc_unchecked(&self, len: usize) {
    if len > 0 {
        let inner = unsafe { &*self.ptr.as_ptr() };
        if inner.count.fetch_sub(1, atomic::Ordering::Release) == 1 {
            inner.count.load(atomic::Ordering::Acquire);
            unsafe {
                std::alloc::dealloc(
                    self.ptr.as_ptr().cast::<u8>(),
                    shared_dyn_bytes_inner_layout(len),
                );
            };
        }
    }
}

When called, the method first checks whether the given length is non-zero. If it's zero, no memory was allocated, and there's nothing to do. Otherwise, we decrease the reference count and check whether we were the last owner. The Acquire load after the Release decrement ensures that all writes from other owners happen before the deallocation, the same fence pattern used by the standard library's Arc. If we were the last owner, we deallocate using the layout from earlier.

cloning

Similarly, we can't implement Clone for our type and must rely on the user-provided length to know whether memory was allocated. Cloning copies no data; it just increments the reference count.

unsafe fn clone_unchecked(&self, len: usize) -> Self {
    let ptr = if len == 0 {
        NonNull::dangling()
    } else {
        let inner = unsafe { &*self.ptr.as_ptr() };
        inner.count.fetch_add(1, atomic::Ordering::Relaxed);
        self.ptr
    };

    Self {
        ptr,
        phantom: PhantomData,
    }
}

slicing

Getting a slice to the underlying array also relies on a user-provided length and is done in two steps:

  1. Cast the thin pointer to a fat pointer.
  2. Create a slice using a pointer to the data field and the user-provided length.

#[inline]
unsafe fn as_bytes_unchecked(&self, len: usize) -> &[u8] {
    if len == 0 {
        Default::default()
    } else {
        let fat_ptr = SharedDynBytesInner::<[u8]>::cast(self.ptr.as_ptr().cast::<u8>(), len);
        let ptr = unsafe { (*fat_ptr).data.as_ptr() };
        unsafe { std::slice::from_raw_parts(ptr, len) }
    }
}

putting things together

With all the essential components, we can now start implementing a German string. While much is needed to provide a usable string, we only present some basic functionalities.

allocating

Creating a string is straightforward. We validate the length, copy the prefix, and handle two cases: short strings are inlined into the struct, while long strings get a heap-allocated, reference-counted buffer.

fn new(s: &str) -> Result<UmbraString, Error> {
    let bytes = s.as_bytes();
    let len = bytes.len();
    if len > u32::MAX as usize {
        return Err(Error::TooLong);
    }
    let mut prefix = [0u8; 4];
    let trailing = if len <= 12 {
        let mut buf = [0u8; 8];
        if len <= 4 {
            prefix[..len].copy_from_slice(&bytes[..len]);
        } else {
            prefix.copy_from_slice(&bytes[..4]);
            buf[..len - 4].copy_from_slice(&bytes[4..]);
        }
        Trailing { buf }
    } else {
        prefix.copy_from_slice(&bytes[..4]);
        Trailing {
            ptr: ManuallyDrop::new(SharedDynBytes::from(bytes)),
        }
    };
    #[allow(clippy::cast_possible_truncation)]
    Ok(Self {
        len: len as u32,
        prefix,
        trailing,
    })
}

Drop

Deallocation can now be provided by implementing Drop for UmbraString because it has all the needed information. We see if there was any data to be deallocated by checking the length. If there’s data on the heap, we delegate to the dealloc_unchecked method introduced earlier.

impl Drop for UmbraString {
    fn drop(&mut self) {
        if self.len > 12 {
            unsafe {
                self.trailing.ptr.dealloc_unchecked(self.len as usize);
            }
        }
    }
}

Clone

Similarly, we can now implement Clone for UmbraString. We copy the bytes to the new instance when the string is inlined. Otherwise, we copy the prefix and delegate to the clone_unchecked method of SharedDynBytes.

impl Clone for UmbraString {
    fn clone(&self) -> Self {
        let trailing = if self.len <= 12 {
            unsafe {
                Trailing {
                    buf: self.trailing.buf,
                }
            }
        } else {
            unsafe {
                Trailing {
                    ptr: ManuallyDrop::new(self.trailing.ptr.clone_unchecked(self.len as usize)),
                }
            }
        };
        Self {
            len: self.len,
            prefix: self.prefix,
            trailing,
        }
    }
}

PartialEq

Comparison starts with the first 8 bytes, consisting of the length and the prefix. For most unequal strings, this check fails, and the method returns immediately. As a result, many comparisons never need to read the string beyond the prefix. Comparisons between fixed-size arrays are also very fast and well-optimized. Additionally, if both strings are short and inlined, we can compare the last 8 bytes without any pointer dereference. A pointer dereference can only happen inside the call to suffix() when the strings are not inlined.

impl PartialEq<UmbraString> for UmbraString {
    fn eq(&self, other: &UmbraString) -> bool {
        let lhs_first_qword = std::ptr::from_ref(self).cast::<u64>();
        let rhs_first_qword = std::ptr::from_ref(other).cast::<u64>();
        if unsafe { *lhs_first_qword != *rhs_first_qword } {
            return false;
        }
        if self.len <= 12 {
            return unsafe { self.trailing.buf == other.trailing.buf };
        }
        self.suffix() == other.suffix()
    }
}

PartialOrd

Ordering also takes similar execution paths to comparison. We return immediately if the order can be determined just from the prefix. The string suffixes are only compared when the prefixes are equal. Pointer dereferences are also avoided by checking if both strings are inlined.

impl PartialOrd<UmbraString> for UmbraString {
    fn partial_cmp(&self, other: &UmbraString) -> Option<cmp::Ordering> {
        let ordering = Ord::cmp(&self.prefix, &other.prefix).then_with(|| {
            if self.len <= 4 && other.len <= 4 {
                return Ord::cmp(&self.len, &other.len);
            }
            if self.len <= 12 && other.len <= 12 {
                let ordering = unsafe { Ord::cmp(&self.trailing.buf, &other.trailing.buf) };
                return ordering.then_with(|| Ord::cmp(&self.len, &other.len));
            }
            Ord::cmp(self.suffix(), other.suffix())
        });
        Some(ordering)
    }
}

afterword

I made the impossible possible! Implementing a German string in Rust requires some knowledge of Rust's memory model and type layout, but it's not that hard by itself. What complicated things was my decision to enable shared ownership for the type so buffers can be reused. Along the journey, I discovered some interesting things about Rust:

  1. How to define, allocate, and deallocate a custom dynamically sized type.
  2. How to convert between thin and fat pointers.
  3. How unions and #[repr(C)] give fine-grained control over a type's layout.

With that, thanks for reading! I hope you learned something new, just like I did! To see more of the code, you can visit https://crates.io/crates/strumbra.