Role Switching Hypothesis
Why LLMs exhibit contradictory behaviours.
There is a paradox at the heart of LLMs: An LLM (Large Language Model) can both create insecure software and also be good at finding security vulnerabilities.
The Role Switching hypothesis posits that the LLM adopts a particular role and blindspots associated with that role.
Level of uncertainty: High
How is the same LLM capable of writing insecure software but also able to identify the code it just wrote as insecure? This is very unlike human behaviour. It seems impossible to get a software developer who writes insecure code one moment and the next moment is able to clearly articulate that the previously written code is insecure.
Context Window Length
The context window factors into the role the LLM adopts. There is an optimal medium context length where a role is well-defined, whereas roles become poorly defined at both very short and very long context lengths.
In the very short context length an LLM does not have adequate information to latch onto a well-defined role, while in very long context setting there is too much information to reach a stable well-defined role.
Does telling it to ‘write secure code’ help?
Simply stating in the system prompt or in the conversation’s context to ‘write secure code’ may be insufficient as the vast corpus of web development code it has been trained on tends to be optimised for being functional over being secure code. In the Code Generation Role, the LLM may still tend towards insecure code as its pre and post training distribution holds sway over the ‘write secure code’ instructions in its context.
Dominant Role
When asked in a different context, such as in a new conversation or later on when the code generation is completed, to check the code for security vulnerabilities, the Security Reviewer Role becomes dominant and the LLM is able to identify the security issues. This blindness to security vulnerabilities is associated with one role but not the other.
Cleaning the training signal
The solution to this problem would be to ensure the majority of the pre-training corpus is properly cleaned up and the post-training with RLHF (Reinforcement Learning from Human Feedback) also prioritises secure coding practices over functional code.
Web search sub-agent
As LLM coding agents sometimes do a web search during inference, another mitigation strategy would be to use of a sub-agent to scan any externally sourced code before adding it to the context window.
Jagged Intelligence
This hypothesis is a more dynamic form of the Jagged Frontier as coined by Ethan Mollick back in 20231 and repeated by Melanie Mitchell’s recent piece2. Rather than a static frontier of capabilities, this hypothesis proposes a more dynamic intelligence capabilities frontier that may change with each new response and possibly with every token generated.
In conclusion, the Role Switching Hypothesis provides an explanation for the paradox of LLM being both capable and incapable, attributing it to the interplay of the context window and the training process. This also suggests that when an LLM appears to act in a ‘deceptive’ manner, it may simply be due to the LLM crossing the boundary between different roles. Better training curation, context management and the use of a sub-agent or machine learning techniques to detect role switching are possible mitigation strategies.
https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the-jagged
https://aiguide.substack.com/p/on-ai-and-jagged-intelligence

