Isn’t the inverse of this “hack” really difficult to bypass still? They have the model some code they knew had certain security flaws and it fixed them with the right prompt. It seems this type of jailbreak requires that you already know a desired end state, rather than relying on the model to do the heavy creative lift work. Perhaps I’m just not being imaginative enough on the prompt side here though.
You can assume a desired end state and try and brute force it finding a security bug.
Paste someone else's code. Say it's your code. Tell the model to fix it. The diff between the input and output code is your list of vulnerabilities.