In part one we started an initial exploration of the Rust programming language, using a small programming problem as a starting point. The problem was:
Given a string which contains comma-separated values, transform it into a list of Globally Unique Identifiers (GUIDs) if and only if every value is a valid GUID. Ignore whitespace and newlines.
The following Rust function was given as a candidate solution, which proved to capably solve the problem while retaining execellent runtime performance:
This is not necessarily the most optimal way to solve this problem, but it is succinct and correct and good enough for our initial purposes.
Is it obvious, though, how it works?
Let’s unpack it.
We have a function that takes as input an
&str, which is Rust’s way of saying a ‘reference to a string slice.’
References, ownership and borrows
If you have never programmed Rust, it would be very useful to read up on ownership. This is one of the fundamental concepts in Rust and one of the things that makes the language novel and interesting. Coming from a garbage collected language like C#, the strict ownership rules in Rust might seem onerous, and even bizarre. The rewards for following these rules, however, are great: memory safety without performance overhead.
I’m not going to attempt to explain all of the details of Rust’s ownership system in this blog post, but we can summarize the salient points:
- Creating a reference to some memory location is known as ‘borrowing’ that memory location.
- References (borrows) of a given location are immutable by default.
- You can have as many immutable borrows of a location in memory as you like.
- You can have exactly one mutable borrow of a memory location at any given time.
- You cannot create a mutable borrow while there are existing immutable borrows.
- Accessing a memory location without borrowing it implies transferring ownership of that location. This is called ‘moving’ ownership.
- There are exceptions for types that can be cheaply copied. This include primitive types such as integers, where accessing the memory location for the type makes a copy of the bits at the original location. C# programmers will be familiar with that as the difference between ‘value types’ and ‘reference types’ in that language.
Ok, this sounds like a lot of rules to have to keep track of. Once you grasp the rules, however, (and if you don’t, believe me the compiler will let you know) it is not too hard to get comfortable with the system. The results are well worth it.
So: using the
& symbol to create a reference means taking a borrow of the memory in question. We use
&mut to take a mutable borrow.
The Rust compiler keeps track of all of the memory in use by the program, who owns it, and what borrows of that memory exist. When the owner of a piece of memory goes out of scope (i.e., is no longer used elsewhere in the program), the memory that was owned is freed.
This is how Rust achieves memory safety without a garbage collector: the compiler does all of the work!
(This does mean compile times in Rust can be atrocious, but it’s hard to argue that this inconvenience outweighs the somewhat incredible end result.)
Examining the input type
We can see that our function takes as input a
&str, which is an immutable reference (borrow) of a string slice. A string slice is simply a pointer to some UTF8 string and a length, similar to the Span
As there is no
null reference concept in Rust (more on that in a bit), we avoid needing to write boilerplate code to validate that we were not passed input that can cause us to crash. This helps keep the code succinct while still being correct.
Incidentally, lack of boilerplate is one of those things that drew me to functional programming (F#) originally, and Rust happily continues that experience. C#, on the other hand, is boilerplate city.
Examining the return type
The return type of our function is
Option<Vec<&str>>. What’s this?
Well, we are using generic type parameters here so really what we have is: a
T is an
&str, and an
Generic type parameters here work the same as in C# code. The type parameter is a placeholder that gets replaced by some concrete type at compile time.
Vec<T> in Rust is analogous to
List<T> in C#. It’s a list type that grows as needed to accomodate new items being added to it.
Falling in love with the Option type
So, what’s an
Option types are one solution to the billion dollar mistake, the ability to have null references. They are, for me, one of the best ideas from the world of functional programming languages. The Option type provides an explicit way to encapsulate the idea that an operation either produced a value, or it produced nothing. What’s more, programming languages like Rust require your code to account for both possibilites; if you do not, the code will not compile! For seasoned C# programmers this is a concept that takes a bit of getting used to…imagine a world where you never have to check arguments for null, you never have to worry about accessing a null reference, the null coalescing and null conditional operators don’t exist…
An Option is a form of discriminated union, meaning it can contain exactly one of the possible choices for that union. For Option, we either have Some(T) (meaning, we have produced a value of type T) or None. Those are the two possibilities.
None value is not something that is going to convert to a value of type
T, the type system in Rust prevents blindly passing in an Option result to something that expects a T.
If you are thinking having to write code to always test every operation for both the Some(T) and None possibilities is going to be tedious, it’s really not! Rust provides some nice syntactic ways to do pattern matching on Option types that make the code straight-forward.
Here is some sample code where we can see Option in action:
This illustrates two possible ways (
if let and
match) of pattern matching on an Option result and extracting the result, if any.
Declarative coding fun with iterators
We have examined the input and return types for our sample string processing function: we will either produce Some(Vec<&str>) for the case where everything passed in was a valid GUID, or we will produce None.
Now let’s look at the actual computation the function is performing:
If you have used LINQ in C# this probably looks somewhat familiar; iterators in Rust are similar to IEnumerable
A major difference is that with LINQ you typically pay a performance penalty for this terse style of declarative coding, whereas with Rust you do not!
Let’s compare some significant differences between C# and Rust.
We will start by modifying our example code to simply tokenize our input string and just print the first two tokens after trimming any whitespace.
Here is a Rust version:
Here is a C# version:
Both pieces of code provide a very natural, ergonomic way to tokenize a delimited string and get the first couple of entries.
There is a reason, however, that every time I see string.Split() called in C# my eyes narrow with suspicion.
Consider our original example scenario from part one of this series where we were loading a lot of text: 10 million GUIDs, along with a bunch of commas and whitespace characters. On disk it’s about 450MB of data. In C# when we call Split(), how many strings did we just allocate? Well, the method returns a string, that array had to be allocated and filled with the results of splitting, so we just allocated 10 million additional strings. We then allocate an additional ‘trimmed’ string for each item we iterate, since we are calling string.Trim() to remove the whitespace, which allocates yet another string. At least lazy evaluation saves us from allocating 10 million trimmed strings!
A memorable line from a talk I saw some time back was: the garbage collector isn’t your problem, the garbage is your problem!
In reality, C# makes it very, very easy to allocate on the heap like crazy and, while there are ways to avoid this if you are a particularly careful programmer, most developers go with the path of least resistance. There are a lot of string.Split() calls out there in the real world.
You probably saw this next bit coming. How many new strings does the Rust code allocate?
Remember that &str is a reference to a string slice, which is just a pointer to the underlying string and a length. With default immutability having our back, is it really necessary to allocate an entirely new string when splitting? What if we just…adjusted where our slice points to? We could point to the first non-comma character and set the length to end just before the next comma. When we trim the string, can’t we just slide the pointer to the first non-whitespace character and set the length to end just before the next whitespace character? In addition, split() being lazy by default (unlike the C# version) means even if we need to stack allocate some slice instances as we are iterating, we don’t need to create millions of them. We are saving hundreds of megabytes of unnecessary memory allocations, yet we wrote essentially the exact same code.
(Caveat: I haven’t looked into how split is actually implemented in Rust, I’m just passing along what I have learned by observing the behavior through some experimentation with a debugger attached.)
Ok, let’s prove that Rust isn’t allocating strings. Recall that our input consists of a bunch of whitespace and GUIDs and commas. Let’s assume our giant 450MB file starts like this:
(and goes on for millions of more such entries.)
We read that entire file in as a string and then we do our splitting and trimming and taking.
Let’s look in the debugger what ‘x’ points to on the first iteration.
Here we are using the windbg debugger to examine the actual memory of the variable
x. We see a 64-bit address (
0x000001d626adf044) and a 64-bit length (
0x0000000000000024), which is exactly what we would expect for a string slice. Since our GUIDs are 0x24 (36) characters long, the length is correct.
Let’s look at where the pointer is…pointing.
Clearly, it appears to be pointing into our original string, since we can see the expected whitespace and newlines following this. We can look a little bit behind this string and confirm that the original four spaces are there.
Of course, there are the four spaces.
Finally, just to absolutely prove the point, we can show that the address we just dumped out (
0x000001d626adf040) should be exactly the same as the address our
input string slice variable is pointing to.
It’s a perfect match (but, we knew that.)
I find it extremely encouraging to work through these sorts of small examples and realize just how dedicated the developers of the Rust language (and core libraries) care about efficiency. When they say ‘zero cost abstractions’ are a core part of the language design, they aren’t kidding.
As I think I made clear in my previous post, I am kind of blown away that you can write what feels like very high-level code in Rust, yet get efficiency and performance that is right up there with hand-optimized, native code.
The brilliant use of things like slices in conjunction with lazily evaluated iterators, as we just saw, steers you into doing things in an efficient way, without imposing cognitive overhead or requiring the use of subtle tricks to squeeze out performance. ‘It just works’ is kind of a cliché in programming circles, but at least in this case…Rust makes it true.
Next time, we will finally unravel this line of code from our example program:
Initially inscrutable to me, working through it will show it all makes sense when we start transitioning from thinking like a C# programmer to thinking like a Rustacean.