Maybe impressive in one way, but I'm also pretty sure a simple n-gram Markov model (a la Niall on the Amiga) would have a lower loss on the test set.
Transformers don't scale down very well, in my experience - I used to train local models all the time as new ones were released, and as I recall transformers were the first architecture I couldn't get better results out of with my limited training data and GPU.
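For anyone curious what I mean by the n-gram baseline: a minimal sketch, assuming a character-level trigram model with add-one smoothing, scored by average negative log-likelihood (the same cross-entropy loss you'd compare against a small transformer). All names and parameters here are mine, not from any particular library.

```python
import math
from collections import Counter, defaultdict

def train_ngram(text, n=3):
    """Count n-gram continuations: context (n-1 chars) -> next-char counts."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        ctx, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[ctx][nxt] += 1
    return counts

def avg_nll(counts, text, n=3, alpha=1.0, vocab_size=256):
    """Average negative log-likelihood (nats/char) on held-out text,
    with add-alpha smoothing so unseen continuations get nonzero mass."""
    total, m = 0.0, 0
    for i in range(len(text) - n + 1):
        ctx, nxt = text[i:i + n - 1], text[i + n - 1]
        c = counts.get(ctx, Counter())
        p = (c[nxt] + alpha) / (sum(c.values()) + alpha * vocab_size)
        total += -math.log(p)
        m += 1
    return total / m

# Toy data just to show the mechanics; real comparisons need real corpora.
train_text = "the quick brown fox jumps over the lazy dog " * 50
test_text = "the lazy fox jumps over the quick brown dog "
model = train_ngram(train_text, n=3)
print(round(avg_nll(model, test_text, n=3), 3))
```

With a tiny training set like this, the counts table is the whole model - no gradients, no GPU - which is exactly why it's hard for a small transformer to beat on limited data.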