Hacker News

> Array based data structures crush pointer based data structures in performance

array[5] and *(array + 5) generate the same code... Heap-based non-contiguous data structures definitely are slower than stack-based contiguous data structures.

How you index into them is unrelated to performance.
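To make the claim above concrete, here's a minimal Rust sketch (illustrative names, not from the thread) showing that indexing and raw-pointer offsetting read the same element; compilers lower both to the same base-plus-offset address computation:

```rust
// Read element 5 of a slice by indexing.
fn read_by_index(a: &[u32]) -> u32 {
    a[5]
}

// Read the same element as *(base + 5), the raw-pointer form.
// Caller must guarantee the slice has at least 6 elements.
fn read_by_pointer(a: &[u32]) -> u32 {
    unsafe { *a.as_ptr().add(5) }
}

fn main() {
    let array = [10u32, 20, 30, 40, 50, 60];
    assert_eq!(read_by_index(&array), read_by_pointer(&array));
}
```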

Effectively, pointers are just indexes into the big array which is system memory... I agree with the parent: indices are effectively pointers without any of the checks that references would give you.




> pointers are just indexes into the big array which is system memory...

I'm sure you are aware, but for anyone else reading who might not be: pointers actually index into your very own private array.

On most architectures, the MMU is responsible for mapping pages in your private array to pages in system memory or pages on disk (a page is a subarray of fixed size, usually 4 KiB).
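A tiny Rust sketch of the page arithmetic described above, assuming the common 4 KiB page size (the constant and function names are illustrative):

```rust
// A virtual address splits into a page number and an offset within
// that page; the MMU maps the page number, the offset is kept as-is.
const PAGE_SIZE: usize = 4096; // 4 KiB, the usual default

fn split(addr: usize) -> (usize, usize) {
    (addr / PAGE_SIZE, addr % PAGE_SIZE)
}

fn main() {
    let (page, offset) = split(0x2042);
    assert_eq!(page, 2);      // 0x2042 lies in the third page
    assert_eq!(offset, 0x42); // byte 0x42 within that page
}
```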

Usually you only get a crash if you access a page that is not currently allocated to your process. Otherwise you get the much more insidious behaviour of silent corruption.


>How you index into them is unrelated to performance.

Not true. If you store u32 indices, that can impose less memory/cache pressure than 64-bit pointers.

Also indices are trivially serializable, which cannot be said for pointers.
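The size argument can be checked directly. A Rust sketch (hypothetical node types), assuming a 64-bit target where pointers are 8 bytes:

```rust
// A node that links via a pointer: 8-byte pointer + 4-byte value,
// padded to 16 bytes on a 64-bit target.
struct PtrNode {
    next: *const PtrNode,
    value: u32,
}

// A node that links via a u32 index: 4 + 4 = 8 bytes on any target.
struct IdxNode {
    next: u32,
    value: u32,
}

fn main() {
    assert_eq!(std::mem::size_of::<IdxNode>(), 8);
    // Assumes a 64-bit target: the pointer-based node is twice as large,
    // so half as many nodes fit in a cache line.
    assert_eq!(std::mem::size_of::<PtrNode>(), 16);
}
```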


I'll happily look at a benchmark which shows that the size of the index has any significant performance impact compared with the work done on the data stored at said index, never mind the cost of the data actually stored there.

I haven't looked closely at the disassembly, but I wouldn't be surprised if iterating through a contiguous data structure adds no cache pressure at all and is just incrementing a register, with no load other than the first one.

And if you aren't iterating sequentially you are likely blowing the cache regardless purely based on jumping around in memory.

This is an optimisation that may be premature.

EDIT:

> Also indices are trivially serializable, which cannot be said for pointers

Pointers are literally 64-bit ints... And converting them to an index is extremely quick if you want to store an offset instead when serialising.

I'm not sure if we are talking past each other here. If you want an index, then use indices. There is no performance difference when iterating through a data structure; there may be some for other operations, but that has nothing to do with the fact that they are pointers.

Back to the original parent that spurred this discussion... Replacing a reference (which is basically a pointer with some added sugar) with an index into an array is effectively just using raw pointers to get around the borrow checker.


> Pointers are literally 64-bit ints... And converting them to an index is extremely quick if you want to store an offset instead when serialising.

I'm not them, but they're saying pointer-based structures are just less trivial to serialize. For example, to serialize a linked list, you basically need to copy the nodes into an array, replacing each pointer to a node with a local offset into that array. You can't convert the pointers into indices with pointer arithmetic alone, because each allocation was made individually. Pointer arithmetic assumes the nodes already live in some array, which would make using pointers instead of indices inefficient and redundant.
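A Rust sketch of the flattening pass described above (the types and the u32::MAX "no next" sentinel are my own illustration, not from the thread):

```rust
// A conventional pointer-based (well, Box-based) singly linked list.
struct ListNode {
    value: u32,
    next: Option<Box<ListNode>>,
}

// The serialized form: the pointer becomes a u32 index into the
// output array, with u32::MAX standing in for "no next node".
#[derive(Debug, PartialEq)]
struct FlatNode {
    value: u32,
    next: u32,
}

fn flatten(mut head: Option<&ListNode>) -> Vec<FlatNode> {
    let mut out = Vec::new();
    while let Some(node) = head {
        let next = match node.next {
            Some(_) => (out.len() + 1) as u32, // the next node lands here
            None => u32::MAX,
        };
        out.push(FlatNode { value: node.value, next });
        head = node.next.as_deref();
    }
    out
}

fn main() {
    let list = ListNode {
        value: 1,
        next: Some(Box::new(ListNode { value: 2, next: None })),
    };
    let flat = flatten(Some(&list));
    assert_eq!(flat, vec![
        FlatNode { value: 1, next: 1 },
        FlatNode { value: 2, next: u32::MAX },
    ]);
}
```

The resulting Vec<FlatNode> can be written to disk byte-for-byte, which the original Box pointers could not.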


I understand that entirely; a linked list is a non-contiguous, heap-based data structure.

What I am saying is that whether you store a reference to an item in a Vec or an index to an item in a Vec is an implementation detail, and looking up the reference or the index generates effectively the same machine code.

Specifically, in the case I'm guessing they are referring to, which is the optimisation used in patterns like ECS: the optimisation there is the fact that the data is stored contiguously in memory, so it is trivial to use SIMD or a GPU to do operations on it.
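A minimal sketch of that ECS-style point (names are illustrative): the win comes from the contiguous slice, and the loop looks the same whether callers hold references or indices into the Vec:

```rust
// One tight loop over contiguous component data. Because the slices
// are contiguous and there is no pointer chasing, the compiler can
// auto-vectorize this with SIMD.
fn integrate(positions: &mut [f32], velocities: &[f32], dt: f32) {
    for (p, v) in positions.iter_mut().zip(velocities) {
        *p += v * dt;
    }
}

fn main() {
    let mut positions = vec![0.0f32; 4];
    let velocities = vec![1.0, 2.0, 3.0, 4.0];
    integrate(&mut positions, &velocities, 0.5);
    assert_eq!(positions, vec![0.5, 1.0, 1.5, 2.0]);
}
```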

In that case, whether you are storing a u32 or a size_t doesn't exactly matter, and on a 32-bit arch they are literally equivalent. The difference is going to be dwarfed by loading the data into cache if you are randomly accessing the items, or by the actual operations done on the data, or both.

As I said, sure, use an index, but that wasn't the initial discussion. The discussion was doing it to get around the borrow checker, which effectively just removes the borrow checker from the equation entirely, and you may as well have used a different language.


The main benefit from contiguous storage is it can be a better match to the cache. Modern CPUs read an entire cache line in a burst. So if you're iterating through a contiguous array of items then chances are the data is already in the cache. Also the processor tends to prefetch cache lines when it recognizes a linear access pattern, so it can be fetching the next element in the array while it's working on the one before it.
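The two access patterns being contrasted can be sketched in Rust (types are my own illustration): the array sum walks memory linearly, so the prefetcher can run ahead, while the list sum performs a dependent load per element:

```rust
// Heap-allocated singly linked list for the non-contiguous case.
struct Node {
    value: u64,
    next: Option<Box<Node>>,
}

// Linear access: consecutive cache lines, prefetch-friendly.
fn sum_contiguous(values: &[u64]) -> u64 {
    values.iter().sum()
}

// Pointer chasing: each iteration must load a node before it even
// knows the address of the next one.
fn sum_linked(mut head: Option<&Node>) -> u64 {
    let mut total = 0;
    while let Some(node) = head {
        total += node.value;
        head = node.next.as_deref();
    }
    total
}

fn main() {
    let values: Vec<u64> = (1..=5).collect();
    let mut list = None;
    for &v in values.iter().rev() {
        list = Some(Box::new(Node { value: v, next: list }));
    }
    assert_eq!(sum_contiguous(&values), 15);
    assert_eq!(sum_linked(list.as_deref()), 15);
}
```

Both compute the same sum; the difference is purely in how the loads hit the cache.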


> Pointers are literally 64-bit ints... And converting them to an index is extremely quick if you want to store an offset instead when serialising.

This implies serialisation/deserialisation passes, so you can't really let bigger-than-RAM data live on your disk.



