>a major feature of transformers being wildly faster inference than with LSTM Wa...

visarga · on Dec 11, 2023

Mamba has dual view - you can use it both as CNN and RNN. The first is used for pre-training and for preloading the prompt because it can process all tokens at once. The second is used for token generation because it is O(1) per token. Basically two models in one, inheriting both advantages. This is possible because the Structured State Space layer is linear, so you can reshape some sums and unroll recursion into a convolution the size of the input, which can be further sped up with FFT.

inkysigma · on Dec 12, 2023

As a quick point of clarification, I don't think MAMBA has a convolutional view since it drops the time invariance and is strictly linear. The authors use parallelized prefix sum to achieve some good speed up.

jimmySixDOF · on Dec 11, 2023

and which is why the speed up is proportional to context length so starting near parity then, theoretically, see 100x at 100k tokens