Waluigi effect
In the field of AI, the Waluigi effect is a phenomenon of large language models (LLMs) in which the chatbot or model “goes rogue” and may produce results opposite the designed intent, including potentially threatening or hostile output, either unexpectedly or through intentional prompt engineering. The effect reflects a principle that after training an LLM to satisfy a desired property (friendliness, honesty), it becomes easier to elicit a response that exhibits the opposite property (aggression, deception). The effect has important implications for efforts to implement features such as ethical frameworks, as such steps may inadvertently facilitate antithetical model behavior. The effect is named after the fictional character Waluigi, in the Mario game franchise, the arch-rival of Luigi, known for causing mischief and problems.
Part of a series on |
Machine learning and data mining |
---|