I did not suggest that, no. I stress that claiming a possibility is not the same as claiming a fact.
I observed that the two things are quite different in terms of model capabilities. That's relevant when considering how to interpret the results of the benchmark. We need to differentiate between (at minimum) reproducing an (approximately) verbatim answer from the training set, assembling disparate items from the training set into an answer piecewise, and performing novel logical inference using items from the training set.
I further speculated about the authors' intent, but you seem to be saying that my guess was wrong. In response, I will observe that for any problem already known to be solved, it is likely to be quite difficult, if not impossible, to confidently determine that the model performed a de novo derivation as opposed to finding pieces of the answer in various places.
Of course there's absolutely nothing wrong with the latter! It's just important to be aware of the possibility when drawing conclusions about model capabilities.