Consider reading the article, which addresses all of the points you raise.
It's directly stated in the post that the entire test is meant to be humorous, not taken seriously, only that is has vaguely followed model performance to date. The author also writes that this new result shows that trend has broken..