Fine-tuning these models (at least with PPO or an equivalent) requires even more VRAM than inference does, potentially 2-3 times as much.
You could use PEFT? Operating on only a subset of weights is fairly standard practice nowadays …
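To put rough numbers on the PEFT suggestion, here is a back-of-the-envelope sketch. It assumes LoRA as the PEFT method (the thread does not say which one), where the original weights stay frozen and only two small low-rank matrices per layer are trained. The layer size, the rank, and the fp32 Adam assumption are all made up for illustration, and the function names are hypothetical helpers, not part of any library.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter on the same matrix.

    LoRA trains A (rank x d_in) and B (d_out x rank); the base weight is frozen.
    """
    return rank * (d_in + d_out)


def adam_state_bytes(n_trainable: int, bytes_per_value: int = 4) -> int:
    """Adam keeps two moment estimates per trainable parameter (fp32 assumed)."""
    return 2 * n_trainable * bytes_per_value


# Hypothetical sizes, chosen only for illustration.
D_IN, D_OUT, RANK = 4096, 4096, 8

full = full_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)

print(f"full fine-tune: {full:,} trainable params, "
      f"{adam_state_bytes(full) / 2**20:.1f} MiB of Adam state")
print(f"LoRA (r={RANK}): {lora:,} trainable params, "
      f"{adam_state_bytes(lora) / 2**20:.1f} MiB of Adam state")
print(f"trainable fraction: {lora / full:.4%}")
```

This only counts optimizer state for a single matrix. The frozen base weights and the activations still have to fit in VRAM, so it is not an estimate of total memory use, and it says nothing about the extra models a PPO setup may keep around.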