BWT is prediction by partial matching (PPM) in disguise.
Consider the sorted rotations of "bananarama":
"abananaram"
"amabananar"
"ananaramab"
"anaramaban"
"aramabanan"
"bananarama"
"mabananara"
"nanaramaba"
"naramabana"
"ramabanana"
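The table above is exactly what the naive BWT construction produces: sort all rotations, then read off the last column. A minimal sketch (the function name is mine):

```python
def bwt(s):
    """Burrows-Wheeler transform via the naive sorted-rotations construction.

    Real implementations use suffix arrays instead of materializing all
    rotations, but the output is the same: the last column of the sorted
    rotation table.
    """
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt("bananarama"))  # -> mrbnnaaaaa
```

Note how the five a's end up adjacent in the output: rows whose contexts start alike are sorted next to each other, so the symbols they predict cluster.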
The last symbol on each line gets its context from the first symbols of the same line; that is a consequence of rotation. But, due to sorting, the contexts for the (last) predicted character are no longer contiguous, and long-range dependencies are broken. Those broken long dependencies are why MTF, which implicitly transforms raw symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.
[1] https://en.wikipedia.org/wiki/Zipf%27s_law
Given that, the author may find PPM*-based compressors to perform better compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: the "durilka-bububu" compressor, which uses PPM, fares better than BWT-based ones by almost a third.
[2] http://mattmahoney.net/dc/text.html