Yep. The array index pattern is unsafe code without the unsafe keyword. Amazing how much trouble Rust people go through to make code "safe" only to undermine this safety by emulating unsafe code with safe code.
It’s not the same. The term “safe” has a specific meaning in Rust: memory safety. As in:
- no buffer overflows
- no use after free
- no data races
These problems lead to security vulnerabilities whose scope extends beyond your application. Buffer overflows have historically been the primary mechanism for taking over entire machines. If you emulate pointers with Rust indices and don’t use “unsafe”, those types of attacks are impossible.
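To make that concrete, here's a minimal sketch (names invented for illustration) of what an out-of-bounds access looks like when you use indices in safe Rust: you get a checked `Option` or a controlled panic, never a silent read of adjacent memory:

```rust
fn main() {
    let users = vec!["alice", "bob"];
    let stale_index = 7; // an index that doesn't point at a valid slot

    // In C, `users_ptr[7]` would read past the buffer: a classic
    // overflow primitive. In safe Rust the access is bounds-checked:
    match users.get(stale_index) {
        Some(name) => println!("found {name}"),
        None => println!("index {stale_index} is out of bounds"),
    }

    // Direct `users[stale_index]` indexing is also checked: it would
    // panic with a controlled abort rather than read arbitrary memory.
}
```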
What you’re referring to here is correctness. Safe Rust still allows you to write programs which can be placed in an invalid state, and that may have security implications for your application.
It would be great if the compiler could guarantee that invalid states are unreachable. But those types of guarantees exist on a continuum and no language can do all the work for you.
"Safe" also has a colloquial meaning: free from danger. The whole reason we care about memory safety is that memory errors become security issues. Rust does nothing to prevent memory leaks and deadlocks, but it does prevent memory errors from becoming arbitrary code execution.
Rust programs may still contain memory misuse (e.g. improper use of interior mutability or an out-of-bounds array access), but the language guarantees that these errors don't become security issues.
This is good.
When you start using array indices to manage objects, you give up some of the protections built into the Rust type system. Yes, you're still safe from some classes of vulnerability, but other kinds of vulnerabilities, ones you thought you abolished because "Rust provides memory safety!!!", reappear.
Rust is a last resort. Just write managed code. And if you insist on Rust, reach for Arc before using the array index hack.
Still, being free from GC is important in some domains. And Rust gives you more than the ability to attach types to scopes via lifetimes: it also provides runtime array bounds checks, reference-counted shared pointers, tagged unions, etc. These are the same techniques managed languages use to achieve memory safety and correctness!
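A quick sketch of those three tools in one place (the `Shape` type is invented for illustration): a tagged union the compiler forces you to match exhaustively, a reference-counted shared pointer, and a bounds-checked array access:

```rust
use std::rc::Rc;

// Tagged union: the compiler forces every case to be handled.
enum Shape {
    Circle { radius: f64 },
    Square { side: f64 },
}

fn area(s: &Shape) -> f64 {
    match s {
        Shape::Circle { radius } => std::f64::consts::PI * radius * radius,
        Shape::Square { side } => side * side,
    }
}

fn main() {
    // Reference-counted shared pointer: freed exactly once, when the
    // last owner drops it, with no GC pause.
    let shared = Rc::new(Shape::Square { side: 2.0 });
    let alias = Rc::clone(&shared);
    assert_eq!(area(&alias), 4.0);

    let circle = Shape::Circle { radius: 1.0 };
    assert!((area(&circle) - std::f64::consts::PI).abs() < 1e-12);

    // Runtime bounds check: an out-of-range access returns None (or
    // panics with `[]`) instead of reading past the allocation.
    let xs = [1, 2, 3];
    assert_eq!(xs.get(10), None);
}
```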
For me, Rust occupies an in-between space. It gives you more memory-safe tools to describe your problem domain than C. But it is less colloquially "safe" than managed languages because ownership is hard.
Your larger point with indices is true: using them throws away some benefits of lifetimes. The issue is granularity. The allocation assigned to the collection as a whole is governed by Rust ownership. The structures you choose to put inside that allocation are not. In your user ID example, the programmer of that system should have used a generational arena (e.g. the `slotmap` crate):
It solves exactly this problem. When you `free` an index, the arena bumps a generation counter stored alongside the slot, so the next allocation of that slot hands out a fresh index/generation pair and stale handles are rejected. If you want to avoid having to `free` manually, you'll have to devise a system using `Drop` and some combination of command queues, reference-counted cells, locks, whatever makes sense. Without a GC, you need to address allocating and freeing slots for objects within an allocation in some way.
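Here's a minimal sketch of that mechanism (crates like `slotmap` and `generational-arena` provide production versions): each handle carries the generation it was issued under, and `remove` bumps the slot's generation so every outstanding handle is invalidated:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct Handle {
    index: usize,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

struct Arena<T> {
    slots: Vec<Slot<T>>,
    free: Vec<usize>, // indices of vacated slots, ready for reuse
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new(), free: Vec::new() }
    }

    fn insert(&mut self, value: T) -> Handle {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index];
            slot.value = Some(value);
            Handle { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Handle { index: self.slots.len() - 1, generation: 0 }
        }
    }

    fn remove(&mut self, h: Handle) {
        if let Some(slot) = self.slots.get_mut(h.index) {
            if slot.generation == h.generation && slot.value.is_some() {
                slot.value = None;
                slot.generation += 1; // invalidate every outstanding handle
                self.free.push(h.index);
            }
        }
    }

    fn get(&self, h: Handle) -> Option<&T> {
        let slot = self.slots.get(h.index)?;
        if slot.generation == h.generation { slot.value.as_ref() } else { None }
    }
}

fn main() {
    let mut arena = Arena::new();
    let old = arena.insert("alice");
    arena.remove(old);
    let new = arena.insert("bob"); // reuses the slot with a bumped generation
    assert_eq!(arena.get(old), None);         // stale handle is rejected
    assert_eq!(arena.get(new), Some(&"bob")); // fresh handle works
}
```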
Much of the Rust ecosystem is libraries written by people who work hard to think through just these types of problems. They ask: "ok, we've solved memory-safety, now how can we help make code dealing with this other thing more ergonomic and correct by default?".
Absolutely. If I had to use an index model in Rust, I'd use that kind of generational approach. I just worry that people aren't going to be diligent enough to take precautions like this.
Even when you use array indices, I don't think you give those protections up. Maybe a few, sure, but the situation is still overall improved.
Many of the rules references have to live by also apply to arrays:
- Two parties cannot simultaneously hold mutable references to the same region of the array (unless the regions are disjoint)
- The array itself retains its Sync/Send traits, preserving thread safety
- The compiler can't apply provenance-based optimizations to plain indices, so it can't introduce undefined behavior that way; the other kinds of undefined behavior are still prevented
- Null dereferences and the other classes of pointer-specific errors still cannot occur
Logic errors and security issues will still exist of course, but Rust never claimed guarantees against them; only guarantees against undefined behavior.
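A quick sketch of the borrowing point: overlapping simultaneous mutable borrows of an array are rejected at compile time, while disjoint regions can both be borrowed mutably via `split_at_mut`:

```rust
fn main() {
    let mut data = [1, 2, 3, 4];

    // Two simultaneous mutable borrows of the same array are rejected:
    // let a = &mut data[0];
    // let b = &mut data[1]; // error: cannot borrow `data` as mutable twice
    // *a += *b;

    // Disjoint regions are fine, via `split_at_mut`:
    let (left, right) = data.split_at_mut(2);
    left[0] += 10;   // writes data[0]
    right[0] += 100; // writes data[2]
    assert_eq!(data, [11, 2, 103, 4]);
}
```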
I'm not going to argue against managed code. If you can afford a GC, you should absolutely use it. But, compared to C++, if you have to make that choice, safety-wise Rust is overall an improvement.
You can still have use-after-free errors when you use array indices.
This can happen if you implement a way to "free" elements stored in the vector.
"free" should be interpreted in a wide sense.
There's no way for Rust to prevent you from marking an array index as free and later using it.
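A sketch of that failure mode (names invented for illustration): a hand-rolled "free list" over `Vec<Option<T>>` happily hands a recycled slot to a stale index, and nothing in the type system flags it:

```rust
fn main() {
    let mut sessions: Vec<Option<&str>> = Vec::new();

    sessions.push(Some("alice"));
    let alice_idx = 0; // stored somewhere long-lived

    sessions[alice_idx] = None;        // "free" alice's slot
    sessions[alice_idx] = Some("bob"); // slot reused for a different user

    // The stale index still "works": a use-after-free at the
    // application level, in 100% safe code.
    assert_eq!(sessions[alice_idx], Some("bob"));
}
```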
> There's no way for Rust to prevent you from marking an array index as free and later using it.
I 2/3rds disagree with this. There are three different cases:
- Plain Vec<T>. In this case you just can't remove elements. (At least not without screwing up the indexes of other elements, so not in the cases we're talking about here.)
- Vec<Option<T>>. In this case you can make index reuse mistakes. However, this is less efficient and less convenient than...
- SlotMap<T> or similar. This uses generational indexes to solve the reuse problem, and it provides other nice conveniences. The only real downside is that you need to know about it and take a dependency.
The consequences of a use-after-free are different in the two languages, though.
In Rust it is a logic error, which leads to data corruption or program panics within your application. In C it leads to data corruption and is an attack vector for the entire machine.
And yes, while Rust itself doesn’t help you with this type of error, there are plenty of Rust libraries which do.
The semantics of a POSIX program are well-defined under arbitrary memory corruption too, just at a low level. Even with a busted heap, execution is deterministic, and every interaction with the kernel has defined behavior, even if that behavior is SIGSEGV.
Likewise, safe but buggy Rust might be well-defined at one level of abstraction but not another.
Imagine an array index scheme for logged-in-user objects. Suppose we grab an index to an unprivileged user and stuff it in some data structure, letting it dangle. The user logs out. The index is still around. Now a privileged user logs in and reuses the same slot. We do an access check against the old index stored in the data structure. Boom! Security problems of EXACTLY the sort we have in C.
It doesn't matter that the behavior is well-defined at the Rust level: the application still has an escalation of privilege vulnerability arising from a use-after-free even if no part of the program has the word u-n-s-a-f-e.
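A minimal sketch of that scenario (types and names invented for illustration): the slot is recycled, the stale index still resolves, and the access check now passes for the wrong principal:

```rust
#[derive(PartialEq, Debug, Clone, Copy)]
enum Role { Guest, Admin }

fn main() {
    // Slot 0 holds a logged-in guest; some subsystem stashes the index.
    let mut users: Vec<Option<Role>> = vec![Some(Role::Guest)];
    let stashed = 0usize;

    users[0] = None;              // guest logs out
    users[0] = Some(Role::Admin); // admin logs in, reusing the slot

    // The access check runs against the stale index and now sees Admin:
    let is_admin = users[stashed] == Some(Role::Admin);
    assert!(is_admin); // privilege escalation, with zero `unsafe` blocks
}
```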
Undefined behavior in C/C++ has a different meaning than you're using. If a compiler encounters a piece of code that does something whose behavior is undefined in the spec, it can theoretically emit code that does anything and still be compliant with the standards. This could include things like setting the device on fire and launching missiles, but more typically is something seemingly innocuous like ignoring that part of the code entirely.
An example I've seen in actual code:
You checked for null before dereferencing a pointer, but one code path dereferences it before the check. The compiler knows that dereferencing a null pointer is undefined, so it concludes the pointer can never be null and removes the null check from every path as an "optimization".
That's the C/C++ foot-gun of undefined behavior. It's very different from the memory safety and correctness issues you're conflating it with.
From the kernel's POV, there's no undefined behavior in user code. (If the kernel knew a program had violated C's memory rules, it could kill it and we wouldn't have endemic security vulnerabilities.) Likewise, in safe Rust, the access to that array might be well defined with respect to Rust's view of the world (just like even UB in C programs is well defined from the kernel POV), but it can still cause havoc at a higher level of abstraction --- your application. And it's hard to predict what kind of breakage at the application layer might result.
Sort of. But you still get guaranteed-unaliased references when you need them. And generational indexes (SlotMap etc) let you ask "has this pointer been freed" instead of just hoping you never get it wrong.
That's true only if you use Vec<T> instead of a specialized arena: either append-only (possibly growable) or generational, where invalidation is tracked for you on each access.
Yeah if you go with Vec, you have to accept that you can't delete anything until you're done with the whole collection. A lot of programs (including basically anything that isn't long running) can accept that. The rest need to use SlotMap or similar, which is an easy transition that you can make as needed.
> Array based data structures crush pointer based data structures in performance
`array[5]` and `*(array + 5)` generate the same code... Heap-based non-contiguous data structures definitely are slower than stack-based contiguous ones.
How you index into them is unrelated to performance.
Effectively, pointers are just indexes into the big array which is system memory... I agree with the parent: indices are effectively pointers without any of the checks references would give you.
> pointers are just indexes into the big array which is system memory...
I’m sure you are aware but for anyone else reading who might not be, pointers actually index into your very own private array.
On most architectures, the MMU is responsible for mapping pages in your private array to pages in system memory or pages on disk (a page is a subarray of fixed size, usually 4 KiB).
Usually you only get a crash if you access a page that is not currently allocated to your process. Otherwise you get the much more insidious behaviour of silent corruption.
I'll happily look at a benchmark showing that the size of the index has any significant performance impact compared to the work done on the data stored at said index, never mind the cost of actually loading that data.
I haven't looked closely at the disassembly, but I wouldn't be surprised if iterating through a contiguous data structure puts almost no pressure on the cache: beyond the first load it's largely just incrementing a register, with subsequent loads hitting lines that are already cached.
And if you aren't iterating sequentially you are likely blowing the cache regardless purely based on jumping around in memory.
This is an optimisation that may be premature.
EDIT:
> Also indices are trivially serializable, which cannot be said for pointers
Pointers are literally 64-bit ints... And converting them to an index is extremely quick if you want to store an offset instead when serialising.
I'm not sure if we are missing each other here. If you want an index then use indices. There is no performance difference when iterating through a data structure, there may be some for other operations but that has nothing to do with the fact they are pointers.
Back to the original parent that spurred this discussion... Replacing a reference (which is basically a pointer with some added sugar) with an index into an array is effectively just using raw pointers to get around the borrow checker.
> Pointers are literally 64-bit ints... And converting them to an index is extremely quick if you want to store an offset instead when serialising.
I'm not them, but they're saying pointer based structures are just less trivial to serialize. For example, to serialize a linked list, you basically need to copy them into an array of nodes, replacing each pointer to a node with a local offset into this array. You can't convert them into indices just with pointer arithmetic because each allocation was made individually. Pointer arithmetic assumes that they already exist in some array, which would make the use of pointers instead of indices inefficient and redundant.
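A sketch of that flattening step (types invented for illustration): walk the `Box`-linked nodes in order and emit an array where each next-pointer becomes an `Option<usize>` index into that same array:

```rust
// A pointer-based list: each node is a separate heap allocation.
struct Node {
    value: i32,
    next: Option<Box<Node>>,
}

// The flattened form: "pointers" are now indices, trivially serializable.
#[derive(Debug, PartialEq)]
struct FlatNode {
    value: i32,
    next: Option<usize>,
}

fn flatten(mut head: Option<Box<Node>>) -> Vec<FlatNode> {
    let mut flat = Vec::new();
    while let Some(node) = head {
        head = node.next;
        // The next node (if any) will land at the following slot.
        let next = if head.is_some() { Some(flat.len() + 1) } else { None };
        flat.push(FlatNode { value: node.value, next });
    }
    flat
}

fn main() {
    // 1 -> 2 -> 3
    let list = Some(Box::new(Node {
        value: 1,
        next: Some(Box::new(Node {
            value: 2,
            next: Some(Box::new(Node { value: 3, next: None })),
        })),
    }));

    let flat = flatten(list);
    assert_eq!(flat[0], FlatNode { value: 1, next: Some(1) });
    assert_eq!(flat[2], FlatNode { value: 3, next: None });
}
```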
I understand that entirely; a linked list is a non-contiguous, heap-based data structure.
What I am saying is if you store a reference to an item in a Vec or an index to an item to a Vec it is an implementation detail and looking up the reference or the index generates effectively the same machine code.
Specifically, in the case I'm guessing they are referring to, the optimisation used in patterns like ECS: the win is that the data is stored contiguously in memory, which makes it trivial to use SIMD or a GPU to operate on it.
In that case, whether you are storing a u32 or a size_t doesn't exactly matter, and on a 32-bit arch they are literally equivalent. It's going to be dwarfed by loading the data into cache if you are randomly accessing items, or by the actual operations done to the data, or both.
As I said, sure, use an index, but that wasn't the initial discussion. The discussion was doing it to get around the borrow checker, which effectively removes the borrow checker from the equation entirely; you may as well have used a different language.
The main benefit from contiguous storage is it can be a better match to the cache. Modern CPUs read an entire cache line in a burst. So if you're iterating through a contiguous array of items then chances are the data is already in the cache. Also the processor tends to prefetch cache lines when it recognizes a linear access pattern, so it can be fetching the next element in the array while it's working on the one before it.
So basically raw pointers with extra hoops to jump through.