Yes, I was expecting some software that listens directly to the mouse input, and watches a pixel on the screen. Messing around with cameras and counting frames introduces a whole bunch of other variables like the quality of the monitor and mouse and phone used for recording.
The camera, mouse, and monitor all stayed the same for the tests, but there was a significant difference in latency. Out of the 16 times the experiment was run, only once did Wayland have lower latency. It would be an amazing coincidence if the monitor, mouse, and/or camera were the reason for this.
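To put a rough number on that "amazing coincidence", here is a minimal sketch of an exact two-sided sign test on the 1-in-16 outcome. It assumes the 16 runs are independent, that there are no ties, and that under the null hypothesis either system is equally likely to come out faster in any given run. The function name `sign_test_p_value` is made up for this illustration, and this is not necessarily the test the experimenter used.

```python
from math import comb


def sign_test_p_value(wins: int, trials: int) -> float:
    """Exact two-sided sign test p-value.

    Null hypothesis: each trial is a fair coin flip (p = 0.5), i.e. neither
    system is systematically faster. Magnitudes of the differences are ignored.
    """
    # Use the smaller of the two counts so we always sum the lower tail.
    k = min(wins, trials - wins)
    # Probability of seeing k or fewer "wins" out of `trials` fair coin flips.
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Wayland had the lower latency in 1 of the 16 runs.
p = sign_test_p_value(wins=1, trials=16)
print(f"two-sided p = {p:.6f}")  # two-sided p = 0.000519
```

Under those assumptions, a split this lopsided would show up by chance roughly once in 2,000 experiments, and that is before taking the size of the differences into account.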
Figuring out why there's increased latency is a job for software tooling, but I think this guy's experiment is one of the best ways to measure what users care about: the time it takes for an input (such as mouse movement) to result in a change (such as cursor movement).
Note that this doesn't mean that the Wayland protocol itself is the reason for the higher latency. It may be GNOME's implementation (testing with a wlroots compositor might shed light on this). It may be differences in default configuration options. It may be that Wayland and X11 start up different services, and the Wayland helper processes increase load on the machine. But I seriously doubt the hardware explains the difference in latency, since the same hardware was used throughout the experiment.
I got results with a p-value of under 0.001. That should be enough to demonstrate that there's a real difference.
Using a camera allows the methodology to be identical between Wayland and X, while I don't know how to listen for mouse movements from software in a way that wouldn't introduce its own problems. What if the photons are emitted from the screen after the same number of milliseconds across X and Wayland, but Mutter is a few milliseconds slower about notifying my measurement application? Conversely, what if Mutter has more real latency than X, but the latency is introduced after the stage where my measurement application sits?
The variables you mention are identical between the Wayland test and the X test. It is admittedly a challenge for reproducibility, but doesn't affect these results.
The mouse and monitor don't matter here. Unless their delay completely dominates the timing (it doesn't here) they can be ignored because the setup is constant between the tests. We're interested in the difference, not the absolute numbers.
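A quick way to see why the constant hardware delay drops out, as a sketch that assumes the hardware and software delays simply add up. All the millisecond figures below are invented for illustration and are not taken from the experiment; `measured_latency_ms` is a hypothetical helper.

```python
def measured_latency_ms(hardware_ms: int, software_ms: int) -> int:
    """Total latency the camera would see, assuming the delays are additive."""
    return hardware_ms + software_ms


# Made-up software latencies, purely for illustration.
X11_SOFTWARE_MS = 10
WAYLAND_SOFTWARE_MS = 16

# Try a fast, a middling, and a slow mouse/monitor combination.
for hardware_ms in (5, 20, 40):
    x11 = measured_latency_ms(hardware_ms, X11_SOFTWARE_MS)
    wayland = measured_latency_ms(hardware_ms, WAYLAND_SOFTWARE_MS)
    # The absolute numbers change with the hardware; the difference does not.
    print(f"hardware={hardware_ms} ms  x11={x11} ms  wayland={wayland} ms  "
          f"difference={wayland - x11} ms")
```

The difference is the same on every row because the hardware term appears in both measurements and cancels. A slower monitor or mouse would make a fixed difference harder to spot among the noise, but it would not create one.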