n^2 isn't a setting someone chose, it's a mathematical consequence of what attention is.
Here's what attention does: every token looks at every other token to decide what's relevant. If you have n tokens, and each one looks at n others, you get n * n = n^2 operations.
Put another way: n^2 is when every token gets to look at every other token. What would n^3 be? n^10?
(sibling comment has same interpretation as you, then handwaves transformers can emulate more complex systems)
There are lots more complicated operations than comparing every token to every other token & the complexity increases when you start comparing not just token pairs but token bigrams, trigrams, & so on. There is no obvious proof that all those comparisons would be equivalent to the standard attention mechanism of comparing every token to every other one.