This paper presents a novel approach to incorporating memory into a transformer. However, it does not demonstrate that this approach works in a useful manner. While the approach is interesting, I’m skeptical that the RNN has enough capacity to encode the memory in its output. I would have liked to see more detail on the synthetic benchmark the authors used. Without that detail, it is hard to rule out that the memory component is learning the benchmark rather than a generalized feature.