Hacker News

I wonder why different blocks are summed even though they semantically represent very different things, e.g. positional encoding plus embedding. Intuitively, wouldn't you expect the architecture to work better if you kept those things as separate inputs to the next layer? Won't the network have blind spots where a given word is effectively invisible if it appears in the wrong position?
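For concreteness, the addition being asked about looks roughly like this. A minimal NumPy sketch of the sinusoidal scheme from the original Transformer paper; the shapes and variable names here are illustrative, not from any particular implementation:

```python
import numpy as np

def sinusoidal_pos_encoding(seq_len, d_model):
    # Standard sinusoidal positional encoding (Vaswani et al., 2017):
    #   PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

# Toy embeddings for a sequence of 4 tokens, model width 8.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))

# The two are simply summed, not concatenated: the next layer sees
# one vector per token mixing "what" (embedding) and "where" (position).
x = emb + sinusoidal_pos_encoding(4, 8)
print(x.shape)  # (4, 8)
```

Note the sum keeps the model width at d_model rather than doubling it; whether the network can still disentangle content from position is exactly the question raised above.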

