In RL the agent has two modes, "explore" and "exploit". In exploit mode it plays the best move it knows. In explore mode it doesn't always select the best known move; instead it selects a promising move for which it has less experience. This is how the surprising new strategies are discovered: in self-play there's no shame in losing, so trying an uncertain move costs nothing.
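
To make "promising but less experienced" concrete, here is a minimal sketch using a UCB1-style score, which is one common way to balance the two modes. It is an assumption on my part, not necessarily what any particular system uses. The names `ucb_select`, `greedy_select`, `wins`, `visits` and the constant `c` are all made up for illustration.

```python
import math

def greedy_select(wins, visits):
    """Pure exploit: pick the move with the best observed win rate."""
    rates = [w / n if n else 0.0 for w, n in zip(wins, visits)]
    return max(range(len(rates)), key=lambda m: rates[m])

def ucb_select(wins, visits, c=1.4):
    """Explore/exploit mix: win rate plus a bonus that grows for rarely tried moves.

    c is a hypothetical tuning constant; c=0 reduces to the greedy choice.
    """
    total = sum(visits)
    best_move, best_score = None, float("-inf")
    for move, (w, n) in enumerate(zip(wins, visits)):
        if n == 0:
            return move  # a move that was never tried gets tried first
        exploit = w / n                                # what we know so far
        explore = c * math.sqrt(math.log(total) / n)   # bonus for little experience
        score = exploit + explore
        if score > best_score:
            best_move, best_score = move, score
    return best_move

# Move 0 has the better record, but move 1 has been tried far less often.
wins, visits = [60, 5], [100, 10]
print("exploit picks move", greedy_select(wins, visits))
print("explore picks move", ucb_select(wins, visits))
```

With these made-up numbers the greedy rule sticks with move 0, while the UCB-style rule tries move 1, because its 50% win rate over only 10 games is still uncertain enough to be worth another look.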