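The matrix in question is tiny (2x2). There's no hope a GPU would ever outperform a CPU on that: the fixed cost of launching a kernel and moving the data dwarfs the dozen or so arithmetic operations involved.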
It only gets interesting if you need large matmuls or millions of small matmuls in parallel.
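
To make the "millions of small matmuls" case concrete, here is a minimal sketch in NumPy (on the CPU) of what batching looks like: the 2x2 matrices are stacked into arrays of shape `(N, 2, 2)` and multiplied in a single call. The helper names `batched_matmul_2x2` and `flops_per_matmul`, and the batch size, are made up for illustration. This shows only the shape of the work, not a benchmark.

```python
import numpy as np


def batched_matmul_2x2(a, b):
    """Multiply N pairs of 2x2 matrices in one call.

    a and b are arrays of shape (N, 2, 2); np.matmul treats the leading
    axis as a batch dimension and multiplies the trailing 2x2 blocks.
    """
    return np.matmul(a, b)


def flops_per_matmul(n):
    """Arithmetic operations in one naive n x n matmul.

    There are n*n output entries, each needing n multiplies and n-1 adds.
    """
    return n * n * (2 * n - 1)


# Hypothetical workload: one million independent 2x2 products.
batch = 1_000_000
rng = np.random.default_rng(0)
a = rng.standard_normal((batch, 2, 2))
b = rng.standard_normal((batch, 2, 2))

c = batched_matmul_2x2(a, b)

print(flops_per_matmul(2))          # 12 operations for a single 2x2 product
print(batch * flops_per_matmul(2))  # 12000000 operations for the whole batch
print(c.shape)                      # (1000000, 2, 2)
```

A single 2x2 product is only 12 operations, so there is nothing for a GPU to parallelise. Stacked a million times over, the same layout becomes one large, uniform job, which is the kind of workload where handing the arrays to a GPU array library can start to pay off.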