Maybe this is about parsing the video? Even so, that part should be done with OpenCV, and the decisions then made algorithmically...
The YouTuber sentdex has a whole series on parsing the game's screen image and playing GTA: https://www.youtube.com/playlist?list=PLQVvvaa0QuDeETZEOy4Vd...