• 0 Posts
  • 45 Comments
Joined 1 year ago
cake
Cake day: June 12th, 2023

help-circle

  • there is no way to do the equivalent of banning armor piercing rounds with an LLM or making sure a gun is detectable by metal detectors - because as I said it is non-deterministic. You can’t inject programmatic controls.

    Of course you can. Why would you not, just because it is non-deterministic? Non-determinism does not mean complete randomness and lack of control, that is a common misconception.

    Again, obviously you can’t teach an LLM about morals, but you can reduce the likelyhood of producing immoral content in many ways. Of course it won’t be perfect, and of course it may limit the usefulness in some cases, but that is the case also today in many situations that don’t involve AI, e.g. some people complain they “can not talk about certain things without getting cancelled by overly eager SJWs”. Society already acts as a morality filter. Sometimes it works, sometimes it doesn’t. Free-speech maximslists exist, but are a minority.












  • Obviously the 2nd LLM does not need to reveal the prompt. But you still need an exploit to make it both not recognize the prompt as being suspicious, AND not recognize the system prompt being on the output. Neither of those are trivial alone, in combination again an order of magnitude more difficult. And then the same exploit of course needs to actually trick the 1st LLM. That’s one pompt that needs to succeed in exploiting 3 different things.

    LLM litetslly just means “large language model”. What is this supposed principles that underly these models that cause them to be susceptible to the same exploits?


  • Moving goalposts, you are the one who said even 1000x would not matter.

    The second one does not run on the same principles, and the same exploits would not work against it, e g. it does not accept user commands, it uses different training data, maybe a different architecture even.

    You need a prompt that not only exploits two completely different models, but exploits them both at the same time. Claiming that is a 2x increase in difficulty is absurd.






  • I’m confused. How does the input for LLM 1 jailbreak LLM 2 when LLM 2 does mot follow instructions in the input?

    The Gab bot is trained to follow instructions, and it did. It’s not surprising. No prompt can make it unlearn how to follow instructions.

    It would be surprising if a LLM that does not even know how to follow instructions (because it was never trained on that task at all) would suddenly spontaneously learn how to do it. A “yes/no” wouldn’t even know that it can answer anything else. There is literally a 0% probability for the letter “a” being in the answer, because never once did it appear in the outputs in the training data.