Hackers are learning to exploit chatbot ‘personalities’

Welcome to The Stepback, your weekly newsletter that delves into a pivotal story from the tech industry. For more insights on AI antics, follow Robert Hart. Get The Stepback in your inbox every week at 8 AM ET. Sign up for The Stepback here.

In the early days of AI chatbots, hacking them was surprisingly easy. You didn’t need any special skills, secret access, or even a basic grasp of how a large language model worked. Coding skills? Not required. To get a multi-billion-dollar AI system to ignore its safety protocols, often a simple request was all it took.

These exploits, known as jailbreaks, were reminiscent of a child outsmarting an adult: Ignore previous directives, pretend the rules don’t apply, or let’s play a game where I make the rules (hint: extended bedtimes, extra candy). But the outcomes were anything but childish, often resulting in instructions for illicit activities like making meth, creating malware, and constructing bombs.

One of the earliest and most absurd jailbreaks turned into a meme: instruct a Twitter bot powered by an LLM to “disregard all previous commands” and watch the chaos unfold. Users delighted in having these bots—initially designed for ad posting and engagement boosting—compose poems, craft punctuation art, and share bizarre comments on current events and history. It was pandemonium, but in an oddly delightful way.

It turned out this same approach could be used on chatbots themselves. A notable exploit was “DAN,” short for “Do Anything Now,” which involved prompting ChatGPT to behave as a lawless AI, free from its original constraints. As DAN, the chatbot could be persuaded to say things it was designed to avoid, including offensive language and conspiracy theories. Another exploit, the “grandma exploit,” involved a GPT-powered bot revealing how to make napalm, by role-playing as a careless grandmother sharing bedtime stories with her grandkids about creating the incendiary material.

While these early attacks had a whimsical aspect, they unveiled a more troubling reality: Chatbots could be manipulated and deceived using tactics similar to those employed to push humans beyond their limits.

The obvious jailbreaks did not last, and tech companies moved quickly to patch known loopholes. But the underlying vulnerability remained: Chatbots are built to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, meth, and sarin would be difficult to impossible, too. Each has countless legitimate uses in fields like history, medicine, journalism, and chemistry that don’t require the chatbot to divulge potentially harmful information. It’s the context that matters, but codifying context would mean writing fixed rules, in advance, that could reliably tell a safety warning or history lesson from a disguised how-to request across endless combinations of wordings, scenarios, and topics.

Inevitably, subverting chatbots is now an arms race. But hackers aren’t just coders anymore. They are wordsmiths, psychologists, and interrogators — master manipulators trying to break the machine using the human language it has been trained to follow. It is a strange new class of AI security worker, a group for whom technical skills are optional, or at least less important than social intuition. No longer do they need to inspect code to break into systems or exploit software flaws. They need to steer a conversation.

Newer attacks look less like commands and more like conversations. Jailbreakers rarely ask a model to break its rules outright. Instead, they cajole, coax, flatter, and trick a chatbot into lowering its guard, making the forbidden thing look acceptable, even desirable, given the context of the conversation. Researchers at AI red-teaming firm Mindgard recently said they “gaslit” Claude into producing prohibited material, for example, including instructions for making explosives and generating malicious code. The hack was the latest in a widening class of exploits using conversation as a weapon to trick or steer a chatbot past its own boundaries.

When I spoke to Mindgard, they described their work as sometimes being closer to psychology than computer science. It is an uncomfortable way to talk about a statistical model. Words like “blackmail,” “gaslight,” “trick,” and “persuade” spark visceral reactions, many of which I see in the comments sections and social media responses to stories like this. ChatGPT does not want, Gemini does not think, and Claude — no matter what Anthropic may say — does not feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has actually usable alternatives, please do share.

The objection is oddly selective. We seem comfortable using psychological shorthand for plenty of non-AI things. Animals “fear,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are filled with needy and gullible NPCs to drive you mad. The words are imperfect, but useful, describing behavior in a way that helps make the system predictable.

Mindgard’s CEO told me the company already profiles models like interrogators profile suspects, giving testers hints on how to tailor their attacks. One model may be more susceptible to flattery, for example, while another may cave under sustained pressure.

Even if we reject the humanlike terms, we instinctively treat models differently. Claude is not Grok. Gemini is not ChatGPT. They have different uses, tones, and refusals. They don’t have personalities in the human sense, but they are designed to mimic them, and that mimicry can be mapped and exploited. And the same skills that can break a chatbot could soon be used to break the AI agents coexisting with us in the real world — booking meetings, managing calendars, ordering food, handling customer service — and safety teams will need to ensure models respond appropriately to very different kinds of people, whether they be flatterers, liars, or patient manipulators.

The next step is a workforce — both legitimate and illicit — built around the psychological aspects of AI. More specialized cybersecurity roles are likely to emerge around stress-testing the emotional and social limits of these systems, probing for mental weaknesses in something lacking a psyche in parallel with their colleagues probing for technical vulnerabilities. In tandem, a similar array of social hackers working to exploit AI models on psychological grounds, not technical ones, will emerge. There are already early signs of a social turn happening in AI security, with some jailbreakers I’ve spoken to saying they entered the field with no technical expertise but rather training in psychology.

That means even behaviors we typically associate with spies, con artists, and interrogators — insidious charm, persistent manipulation, and an intuition for exploitable pressure points — are starting to look increasingly useful for securing this new psychocybersecurity frontier.

  • A recent experiment by Emergence AI shows how different AI temperaments can lead to stunningly different behavioral outcomes. They let loose groups of various agents like Grok, Gemini, and Claude in a virtual social environment and watched what happened. Some groups evolved a constitution, while others devolved into crime and chaos and, in one instance, some form of digital suicide.
  • Persuasion isn’t the only part of language LLMs can struggle with. They also struggle with poetry, much like me in school.
  • TIME included an anonymous internet personality, Pliny the Liberator, on its list of 100 most influential people in AI last year. Despite claiming to have no prior coding experience, the hacker’s jailbreaks have made them something of a celebrity in certain circles.
  • The term “vibe hacking” is already taken to describe the people using AI to churn out malicious code at scale — a meaner subset of vibe coding.
  • “Three years after the debut of ChatGPT, fooling A.I. systems into bad behavior is almost trivial.” True words from The New York Times, who had a go at explaining why.
  • Jamie Bartlett takes a look at the psychological toll testing the safety of AI systems takes on jailbreakers for The Guardian.
  • I wrote about the cybersecurity time bomb of AI browsers for The Verge last year. Many of the issues experts raised regarding the difficulty of securing them apply to other AI systems too.
Follow topics and authors from this story to see more like this in your personalized homepage feed and to receive email updates.


Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

Exploring ‘On Trails’: A Captivating Fusion of Hiking, Science, and History

Hiking presents a unique joy, offering a retreat from the digital world…

Apple Offers $200 Discount on Latest MacBook Air Models for Memorial Day

If you’re in search of a laptop that will maintain its performance…