Since we're talking about defining our own processor, that means we need to define one with cheaper traps.
Expanding on what I wrote above about "bits of hardware acceleration", maybe adding a few primitives to the instruction set that make page table walking easier would help.
And with a trusted compiler architecture you don't need to keep the ISA stable between iterations, since it's assumed that all code gets compiled at the last minute for the current ISA.
Lots of fun things to experiment with.
Taking this to an extreme, the whole idea of a TLB sounds like hardware protection too?
As a thought experiment, imagine an extremely simple ISA and memory interface where you would do address translation or even cache management in software if you needed it... the different cache tiers could just be different NUMA zones that you manage yourself.
You might end up with something that looks more like a GPU or super-ultra-hyper-threading to get throughput masking the latency of software-defined memory addressing and caching?