I'm not an expert; I've just been "vibe-R&D"-ing computer vision for a bit now, but I'll guarantee you SSIM is not suitable for this purpose. I've been dabbling in basically this area (comparing small, potentially low-resolution images), and in my experience SSIM produces a lot of false negatives and some false positives.
I would recommend template matching using normalized cross-correlation (TM_CCOEFF_NORMED in OpenCV).
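To make that concrete, here is a rough pure-Python sketch of the score I understand TM_CCOEFF_NORMED to compute (zero-mean normalized cross-correlation), restricted to the case where both images are the same size. The function name ncc_score, the toy pixel data, and the choice to return 0.0 for a flat image are my own assumptions for illustration, not anything from OpenCV; in practice you would call OpenCV's template matching rather than roll your own.

```python
import math


def ncc_score(a, b):
    """Zero-mean normalized cross-correlation of two equal-size grayscale images.

    a and b are lists of rows of pixel intensities. Returns a value in [-1, 1],
    where 1 means the images match up to a brightness/contrast change.
    """
    if not a or len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must be non-empty and have the same shape")

    # Flatten both images to 1-D lists of floats.
    xs = [float(p) for row in a for p in row]
    ys = [float(p) for row in b for p in row]
    if not xs:
        raise ValueError("images must contain at least one pixel")

    # Subtract each image's mean so a uniform brightness offset cancels out.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    # Correlate, then normalize by the energy of both images so that
    # a uniform contrast (gain) change cancels out as well.
    num = sum(u * v for u, v in zip(dx, dy))
    den = math.sqrt(sum(u * u for u in dx) * sum(v * v for v in dy))
    if den == 0.0:
        # A completely flat image has no structure to correlate.
        # Treating that as "no match" is an assumption of this sketch.
        return 0.0
    return num / den


# Toy data (made up): the same patch with different brightness/contrast,
# and the same patch with inverted intensities.
patch = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
brighter = [[2 * p + 15 for p in row] for row in patch]
inverted = [[255 - p for p in row] for row in patch]

print(ncc_score(patch, brighter))  # close to 1: same structure
print(ncc_score(patch, inverted))  # close to -1: opposite structure
```

The mean subtraction and normalization are why the score ignores uniform brightness and contrast differences between the two images, which is a large part of why I'd reach for it when comparing small crops.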
Also this paper from Nvidia critically scrutinizing SSIM may be relevant: https://research.nvidia.com/publication/2020-07_Understandin...