The whole point of this benchmark is that it asks the model to work in a modality it was not trained on and does not understand well, so the result is largely meaningless. This is just like the people who are endlessly surprised that a raw LLM handles numbers badly or miscounts letters. In short, this test benchmarks the intelligence of the person running it, not of the model.
The rasterised SVG is just a different representation of the same data. A sufficiently advanced LLM may not need to 'see' the rasterised image to be able to draw a good picture. A human could draw a very basic image in raw SVG just by mentally plotting points.
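To make the "mentally plotting points" idea concrete, here is a rough sketch in Python. The shapes, coordinates, and the helper names `polygon` and `house_svg` are all made up for illustration. The only thing it is meant to show is that the SVG text already contains every coordinate a rasteriser would use, so the picture can be composed without ever rendering it.

```python
# Hypothetical example: a simple house drawn from hand-picked coordinates.
# Nothing is rendered here; the output is plain SVG text.

def polygon(points, fill):
    """Return an SVG <polygon> element for a list of (x, y) pairs."""
    coords = " ".join(f"{x},{y}" for x, y in points)
    return f'<polygon points="{coords}" fill="{fill}"/>'


def house_svg():
    """Build a 100x100 SVG of a house from points chosen by hand."""
    # SVG's y axis points down, so y=90 is near the bottom of the canvas.
    wall = polygon([(20, 50), (80, 50), (80, 90), (20, 90)], "tan")
    roof = polygon([(15, 50), (50, 20), (85, 50)], "brown")
    door = polygon([(45, 65), (55, 65), (55, 90), (45, 90)], "black")
    body = "\n  ".join([wall, roof, door])
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">\n'
        f"  {body}\n"
        "</svg>"
    )


if __name__ == "__main__":
    print(house_svg())
```

Saving the printed text to a `.svg` file and opening it in a browser should show the house. Every coordinate was picked by reasoning about a 100x100 grid, not by looking at pixels.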