If I understand it correctly, the analogy could be:
Let's say we have an expert low-level programmer and we try to teach him algebra. Either we:
- (SFT): give him an algebra book with new nomenclature, definitions, and syntax, or
- (RL): let him learn algebra using C syntax.
I don't think so.
Fine-tuning works on an input/output basis: you are rewarded for producing a plausible output _now_.
RL rewards you only later, once you arrive at the right output. So you have to learn to generate a lot of activity, but you are only rewarded if you end up in the right place.
In SFT you are rewarded for generating tokens plausible given the proof, one token at a time. In RL you are expected to generate an entire proof, and you are rewarded or punished only once the proof is done.
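A toy sketch of that difference (all numbers and the `check_proof` verifier are made up for illustration): SFT gets one loss term per token, so feedback is dense and immediate; RL gets a single scalar reward only after the whole sequence is generated.

```python
import math

# Toy "model": probability the model assigns to each reference token
# of a 3-token proof (made-up numbers).
ref_token_probs = [0.9, 0.7, 0.8]

# SFT: dense, per-token signal. Every position gets its own
# cross-entropy loss term, so feedback arrives at each step.
sft_losses = [-math.log(p) for p in ref_token_probs]
print("SFT per-token losses:", [round(l, 3) for l in sft_losses])

# RL: sparse, episode-level signal. Generate the entire proof first,
# then score it once; every token shares that single scalar reward.
def check_proof(proof_tokens):
    # Hypothetical verifier: reward 1 only if the finished proof
    # ends correctly, 0 otherwise. Nothing is rewarded mid-generation.
    return 1.0 if proof_tokens and proof_tokens[-1] == "QED" else 0.0

proof = ["step1", "step2", "QED"]
reward = check_proof(proof)  # one number for the whole sequence
print("RL reward for the finished proof:", reward)
```

So a proof that wanders but ends at "QED" gets full reward, while a proof with many plausible-looking tokens that never closes gets nothing; that is the "rewarded only at the end" dynamic.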