The *GLU-based activation functions like GEGLU and SwiGLU use 2 input values to produce 1 output value, which makes the layer sizes reported for these models confusing. In each input pair, one value goes through the GELU/SiLU activation function and is then multiplied by the other, "gate", value.
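The pairing can be sketched in a few lines of Python. This is a minimal scalar illustration of the GEGLU case, not any model's actual implementation; the function names are made up for this example:

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def geglu(pairs):
    # Each (value, gate) pair yields one output: gelu(value) * gate.
    # So 2n inputs become n outputs -- the 2:1 ratio described above.
    return [gelu(v) * g for v, g in pairs]
```

Swapping `gelu` for SiLU (`x * sigmoid(x)`) gives the SwiGLU variant.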
In the report, "hidden dim" matches the number of GEGLU inputs. In the config, "intermediate_size" matches the number of GEGLU outputs. Most *GLU models so far have used intermediate_size = 8/3 * d_model, as this gives the same number of matmul FLOPs & parameters as a 4x-expanded non-GLU model, and PaLM vaguely showed that 4x is better than a smaller expansion factor.
If one considers Llama-2-7B's FFN expansion factor to be ~5.33x, Gemma's expansion factor is 16x.
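The arithmetic behind the 8/3 rule and the two expansion factors can be checked directly. This is a sketch with hypothetical helper names; the d_model/intermediate_size values for Llama-2-7B (4096, 11008) and Gemma-7B (3072, 24576) are taken from their public configs as I recall them:

```python
def glu_ffn_params(d_model, intermediate_size):
    # Up + gate projections (d_model -> intermediate_size each)
    # plus the down projection (intermediate_size -> d_model): 3*d*i weights.
    return 3 * d_model * intermediate_size

def plain_ffn_params(d_model, expansion=4):
    # Up (d -> expansion*d) and down (expansion*d -> d): 2*expansion*d^2 weights.
    return 2 * expansion * d_model * d_model

def glu_input_expansion(d_model, intermediate_size):
    # Counting GEGLU *inputs* (2 per output) relative to d_model.
    return 2 * intermediate_size / d_model

d = 4096
i = round(8 / 3 * d)  # ~10923; real configs round to a hardware-friendly size
# glu_ffn_params(d, i) and plain_ffn_params(d) come out nearly identical,
# which is exactly why intermediate_size = 8/3 * d_model is the common choice.
```

With the config numbers above, `glu_input_expansion(4096, 11008)` is 5.375 (the ~5.33x for Llama-2-7B) and `glu_input_expansion(3072, 24576)` is 16.0 (Gemma).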