>a major feature of transformers being wildly faster inference than with LSTM
Wasn't the main issue with RNNs that the forward pass during training can't be efficiently parallelized across the sequence?
Inference itself should normally be faster for an RNN than for a transformer, since the former runs in linear time in the input length while the latter is quadratic.
Mamba has a dual view: you can use it both as a CNN and as an RNN. The first is used for pre-training and for preloading the prompt, because it can process all tokens at once. The second is used for token generation, because it is O(1) per token. Basically two models in one, inheriting both advantages. This is possible because the Structured State Space layer is linear, so you can rearrange the sums and unroll the recursion into a convolution the size of the input, which can be sped up further with an FFT.
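To make the two views concrete, here is a minimal NumPy sketch for a plain time-invariant linear SSM (an S4-style layer with a single input channel, not Mamba's actual implementation). The matrices `A`, `B`, `C`, the sizes, and the function names are made up for illustration; the point is only that the step-by-step recurrence and the FFT convolution with kernel `K[k] = C A^k B` give the same outputs.

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """RNN view: x_t = A x_{t-1} + B u_t, y_t = C x_t, starting from x = 0."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t      # constant work per token
        ys.append(C @ x)
    return np.array(ys)

def ssm_convolutional(A, B, C, u):
    """CNN view: y = K * u (causal), with kernel K[k] = C A^k B."""
    L = len(u)
    K = np.empty(L)
    v = B.copy()
    for k in range(L):
        K[k] = C @ v             # C A^k B
        v = A @ v
    n = 2 * L                    # zero-pad so the circular FFT product does not wrap around
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(K, n), n)
    return y[:L]

# Hypothetical toy parameters, chosen only so the recurrence is stable.
rng = np.random.default_rng(0)
N, L = 4, 64
A = np.diag(rng.uniform(0.1, 0.9, N))
B = rng.standard_normal(N)
C = rng.standard_normal(N)
u = rng.standard_normal(L)

y_rec = ssm_recurrent(A, B, C, u)
y_conv = ssm_convolutional(A, B, C, u)
print(np.allclose(y_rec, y_conv))
```

Note that the kernel only exists because `A`, `B`, `C` are the same at every step; once they depend on the input, there is no single `K` to convolve with.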
As a quick point of clarification, I don't think Mamba has a convolutional view, since it drops time invariance and is strictly linear. The authors use a parallelized prefix sum (scan) to get a good speedup.
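For anyone wondering how a prefix sum applies to a recurrence whose coefficients change at every step, here is a rough sketch with a scalar state, `h_t = a_t * h_{t-1} + b_t`. This is not the authors' kernel (theirs is a fused, hardware-aware GPU implementation); it is a simple Hillis-Steele style scan in NumPy, and the function names are mine. The trick is that composing two steps `(a1, b1)` then `(a2, b2)` gives `(a1*a2, a2*b1 + b2)`, which is associative, so the sequence can be combined in about log2(L) vectorized rounds instead of L sequential steps.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0, one step at a time."""
    h = 0.0
    out = np.empty_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def parallel_scan(a, b):
    """Inclusive scan over pairs (a, b) using the associative combine
    (a1, b1) then (a2, b2) -> (a1 * a2, a2 * b1 + b2).
    Each round is a vectorized update; there are ceil(log2(L)) rounds."""
    a = a.copy()
    b = b.copy()
    L = len(b)
    shift = 1
    while shift < L:
        new_a = a.copy()
        new_b = b.copy()
        # Combine the segment ending at t - shift (left) with the one ending at t (right).
        new_b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        new_a[shift:] = a[shift:] * a[:-shift]
        a, b = new_a, new_b
        shift *= 2
    return b

rng = np.random.default_rng(1)
T = 37                              # deliberately not a power of two
a_t = rng.uniform(0.0, 1.0, T)      # time-varying coefficients
b_t = rng.standard_normal(T)

h_seq = sequential_scan(a_t, b_t)
h_par = parallel_scan(a_t, b_t)
print(np.allclose(h_seq, h_par))
```

This version does O(L log L) total work; a work-efficient (Blelloch-style) scan brings that down to O(L) while keeping the logarithmic depth.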