Agreed, and point number two is the tricky one. Creating a list of tasks is easy; evaluating them is not. You need a consistent task set, a "clean slate" control (i.e., Claude Code without memory), and evaluation criteria that distinguish "uses fewer tokens" from "produces better results." Otherwise you end up with vendors evaluating their own work.
I'm currently building a repeatable test harness for PMB: a fixed task, run with and without memory, repeated N times, reporting tokens, turns, and pass/fail, plus a subjective quality score. Happy to share the task set and evaluation criteria so they can be run against anyone else's memory server or clean-slate control, not just mine.
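
To make the shape of that concrete, here is a rough Python sketch of how such a comparison could be structured. This is not the actual PMB harness: `RunResult`, `run_trials`, `summarize`, `compare`, and the stub runner are all hypothetical names made up for illustration, and the runner is assumed to be whatever function actually drives the agent and returns the measurements. The one design point it tries to show is keeping token cost and result quality as separate columns, so "cheaper" never gets silently counted as "better."

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    """Measurements from one run of one task (hypothetical schema)."""
    tokens: int      # total tokens consumed
    turns: int       # number of agent turns
    passed: bool     # did the task's objective check pass?
    quality: float   # subjective score, e.g. 1-5 from a rubric or reviewer


# A runner takes (task, use_memory) and returns one RunResult.
# How it drives the agent is outside this sketch.
Runner = Callable[[str, bool], RunResult]


def run_trials(runner: Runner, task: str, use_memory: bool, n: int) -> List[RunResult]:
    """Run the same fixed task n times under one condition."""
    return [runner(task, use_memory) for _ in range(n)]


def summarize(results: List[RunResult]) -> Dict[str, float]:
    """Aggregate one condition; cost and quality stay in separate fields."""
    return {
        "runs": len(results),
        "mean_tokens": statistics.mean([r.tokens for r in results]),
        "mean_turns": statistics.mean([r.turns for r in results]),
        "pass_rate": sum(r.passed for r in results) / len(results),
        "mean_quality": statistics.mean([r.quality for r in results]),
    }


def compare(runner: Runner, task: str, n: int) -> Dict[str, Dict[str, float]]:
    """Run the clean-slate control and the memory condition on the same task."""
    return {
        "control": summarize(run_trials(runner, task, use_memory=False, n=n)),
        "memory": summarize(run_trials(runner, task, use_memory=True, n=n)),
    }


if __name__ == "__main__":
    # Stub runner with made-up numbers, only so the sketch executes end to end.
    def stub_runner(task: str, use_memory: bool) -> RunResult:
        return RunResult(
            tokens=800 if use_memory else 1000,
            turns=4 if use_memory else 5,
            passed=True,
            quality=3.0,
        )

    for condition, stats in compare(stub_runner, "example task", n=3).items():
        print(condition, stats)
```

Swapping in a different memory server would then only mean supplying a different runner; the task, the number of repeats, and the summary stay the same across vendors.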