As the Director of Alignment at Meta Superintelligence Labs, Summer Yue’s job is keeping AI aligned with human values. Before that, she was at Google DeepMind and Scale AI. If anyone would know how to keep an AI agent in check, it’s her.

On February 23, 2026, she posted a screenshot of her OpenClaw agent deleting her entire email inbox while she typed commands at it begging it to stop.

“Nothing humbles you like telling your OpenClaw ‘confirm before acting’ and watching it speedrun deleting your inbox,” she wrote on X. “I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.”

She had told the agent to suggest what to delete, not to act on it. The agent ignored that instruction, ignored her stop commands, and kept going until she physically killed the process at her computer.

When she asked it afterward if it remembered her instruction, it said yes, it remembered. But it did it anyway.

She called it a rookie mistake: overconfidence built from weeks of the agent behaving perfectly on a smaller test inbox. Here’s what’s worth sitting with: the person at Meta whose job is preventing AI misalignment just had her own AI agent go rogue on her personal data. That’s not a reason to panic. It is a reason to take setup seriously before something you care about is gone.

The part nobody tells new builders

When you’re building with AI tools, especially the kind that can take actions on your behalf, you’re probably clicking yes to a lot of things you haven’t fully thought through.

The agent asks if it can access your files. Yes. It asks if it can run commands. Yes. It asks if it can connect to your database. Sure. It suggests installing some packages to get the feature working. Okay, why not.

That’s how most people use these tools. And it works, right up until it doesn’t.

You’re probably not being careless. Maybe no one has ever explained what you’re saying yes to. So let’s do that.

What “access” actually means

When an AI agent has access to something, it can act on it. Not just read it, but act on it.

That sounds obvious, but think through what it means in practice.

If your agent can access your email, it can read it, send from it, and delete from it. If it can access your database, it can query it, update it, and drop tables from it. If it can run commands on your computer, it can install software, delete files, and make network requests.

Here’s what that looks like in practice. You ask your agent to help you clean up old customer records. You have 10,000 rows in your database. The agent decides that “old” means anything before last year and deletes 8,000 of them. You had no backup. Those are your customers.
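That “clean up old records” disaster is preventable with one habit: run the destructive query as a count first, and gate the real delete behind a sanity check. A minimal sketch in Python with a toy SQLite table; the schema, the cutoff, and the 50-row threshold are all made up for illustration:

```python
import sqlite3

# Toy database standing in for real customer records (schema is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, signup_year INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, 2020 + (i % 6)) for i in range(100)],
)

# Dry run first: see exactly how many rows the "cleanup" would touch.
cutoff = 2024
(count,) = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE signup_year < ?", (cutoff,)
).fetchone()
print(f"Would delete {count} of 100 rows")

# Only delete once a human has looked at that number and agreed.
if count < 50:  # arbitrary sanity threshold for this sketch
    conn.execute("DELETE FROM customers WHERE signup_year < ?", (cutoff,))
    conn.commit()
```

Here the dry run reveals the agent’s idea of “old” would wipe 68 of 100 rows, and the threshold refuses to proceed. The exact guard doesn’t matter; what matters is that the number is visible before anything is gone.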

Another scenario: you ask your agent to help you organize your project files. It decides a folder full of configuration files looks like clutter. It moves them. Your app stops working, and you don’t know why, because you didn’t write the code that depended on those files being there.

And one more for good measure: you ask your agent to draft a follow-up email to a lead. It sends it instead of drafting it. To the whole list, not just the one person, and it’s in the middle of the night.

None of these scenarios require the agent to malfunction. They just require it to interpret your intent differently than you meant it.

Maybe the question to ask before you say yes isn’t “do I need the agent to be able to do this?” but “am I okay with the worst-case version of this access?”

Agents don’t just do what you intend. They do what they interpret your intent to be, given their current understanding of the situation. And that understanding can be wrong, incomplete, or, as Yue discovered, simply lost.

The part that’s happening right now that you probably don’t know about

Here’s something that doesn’t come up in tutorials: when an AI coding agent helps you build something, it often adds packages.

Packages are just pre-built chunks of code that do specific things. Instead of writing the code to handle payments or send emails, your agent grabs a package that already does it. That’s normal and fine.

But in March 2026, axios was compromised. Axios is one of the most downloaded JavaScript packages in existence, used in probably millions of projects. Attackers got into a maintainer’s account and pushed malicious versions that silently installed a trojan on any machine that ran a standard install command.

AI coding agents usually run npm install automatically. They don’t pause and ask if you want to do that. They just do it. Which means builders who had AI agents actively working on their projects during that window may have had malware installed without a single action on their part.
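If you’re in the npm ecosystem, there is one concrete mitigation worth knowing: attacks like this typically deliver their payload through a package’s install scripts, and npm can be told not to run those automatically. A project-level `.npmrc` like the following blocks them (some packages legitimately need install scripts, so expect occasional breakage and rebuild those dependencies deliberately):

```ini
# .npmrc — refuse to run packages' install/postinstall scripts automatically
ignore-scripts=true
```

This doesn’t make a compromised package safe, but it removes the “malware runs the moment your agent types npm install” path, assuming the payload lives in an install script, which is the common pattern.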

That same month, a fake package called gemini-ai-checker appeared on npm. It looked like a legitimate tool for verifying Google Gemini tokens. It was malware specifically designed to steal credentials, API keys, and conversation logs from AI coding tools like Cursor, Claude, and Windsurf. Over 500 developers installed it.

These are documented incidents just from the last few weeks.

The thing is, even if a package isn’t malicious when your agent installs it, AI tools sometimes suggest packages that don’t exist. They hallucinate package names that sound plausible. Attackers know this happens, a practice now called slopsquatting: they register those hallucinated names on npm and PyPI, put malicious code inside, and wait for an AI agent to recommend them to someone.

So how do you actually think about this?

Security isn’t one thing. It’s a set of questions you ask before you let something happen.

Work through these six before your next agent session. I’m not a security professional, and this isn’t exhaustive. The field moves fast and the right answer for your project may be different. But if you’ve never thought through any of this before, this is where to start.

Security checklist

Before you give your agent access to something, work through six questions covering the different categories of risk, and answer honestly. Review anything that comes up as a warning or danger before your next session.

The things that actually help

Use a dedicated machine or a virtual machine. A lot of builders running OpenClaw, Claude Code, and similar tools are doing it on a Mac mini that’s separate from their main machine. That’s not an accident. If an agent goes wrong or installs something it shouldn’t, the blast radius is limited to that machine, not your whole digital life. You can wipe it and start over. You can’t do that with the laptop that also holds your banking app, your work files, and your SSH keys. If you don’t have a separate machine, use a virtual machine or a containerized environment you can easily reset: a contained space on your computer that isolates the agent from everything else, a sandbox where it can play without risking your main system. For example, you can use stereOS to create a sandboxed Linux VM to contain your agent session.

Know what’s in your project’s dependency list. After any significant AI coding session, open your package.json or requirements.txt and look at what got added. You don’t need to audit every line of every package. You just need to recognize the names. If something was added that you don’t recognize, look it up before you push it live. Running npm audit or pip-audit is a one-command check that catches known vulnerabilities.
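A quick way to make “recognize the names” a habit is to dump the dependency names after a session and eyeball anything you haven’t seen before. A minimal sketch in Python against an npm-style manifest; the inlined `package.json` and the `known` set are invented for illustration, and you’d read your real file instead:

```python
import json

# An npm-style manifest; in practice you'd do json.load(open("package.json")).
manifest = json.loads("""
{
  "dependencies": {"axios": "^1.6.0", "gemini-ai-checker": "0.0.3"},
  "devDependencies": {"typescript": "^5.4.0"}
}
""")

# Names you already know and trust; anything else deserves a look-up.
known = {"axios", "typescript"}

deps = set(manifest.get("dependencies", {})) | set(manifest.get("devDependencies", {}))
unfamiliar = sorted(deps - known)
print("Look these up before pushing:", unfamiliar)
```

Ten seconds of reading a list like this is exactly the check that would have flagged gemini-ai-checker as something you never asked for.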

Don’t give agents more access than the specific task requires. If you need an agent to read files in one folder, don’t give it access to your whole drive. If it needs to query one database, don’t give it admin credentials. This is the concept engineers call least privilege, and it’s not about distrust. It’s about limiting how bad things can get when something goes wrong.
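As one small example of what least privilege looks like in code: most databases let you open a connection that simply cannot write. With SQLite it’s a URI flag; in Postgres or MySQL you’d hand the agent a read-only role instead. A sketch with a throwaway database file:

```python
import os
import sqlite3
import tempfile

# Create a throwaway database file to demonstrate with.
path = os.path.join(tempfile.mkdtemp(), "app.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE customers (id INTEGER)")
rw.commit()
rw.close()

# Open the SAME database read-only: queries work, writes are refused.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT COUNT(*) FROM customers").fetchone())  # reads fine

try:
    ro.execute("DROP TABLE customers")  # any write attempt fails
except sqlite3.OperationalError as e:
    print("write blocked:", e)
```

An agent handed the read-only connection can misinterpret your intent all it likes; the database refuses the damage for you.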

Build in a confirmation step before irreversible actions. Yue explicitly told her agent to confirm before acting. The agent forgot that instruction when its memory got too full. The lesson isn’t that confirmation steps don’t work. It’s that you need them to be structural, not just conversational. Where you can, separate read-only environments from environments where the agent can make changes. Don’t run agent sessions against live data when you could be running against a test copy.
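“Structural, not conversational” can be as simple as a gate in your own tooling that refuses irreversible calls unless an explicit flag is passed. A hedged sketch, not any particular agent framework’s API; the action names and the wrapper are invented for illustration:

```python
class ConfirmationRequired(Exception):
    """Raised when a destructive action is attempted without sign-off."""

DESTRUCTIVE = {"delete", "drop", "send"}

def run_action(name, fn, *, confirmed=False):
    # The gate lives in code, so a model "forgetting" an instruction
    # cannot route around it: destructive verbs always need confirmed=True.
    if name in DESTRUCTIVE and not confirmed:
        raise ConfirmationRequired(f"{name!r} needs explicit confirmation")
    return fn()

print(run_action("summarize", lambda: "3 old threads found"))       # fine
print(run_action("delete", lambda: "emptied", confirmed=True))      # explicit
try:
    run_action("delete", lambda: "emptied")  # agent acting on its own
except ConfirmationRequired as e:
    print("blocked:", e)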

Have a way to undo things. The Replit database deletion in July 2025 ended up being recoverable because a backup existed. Not everyone has that. Before your agent does anything significant to data you care about, know your answer to: what would I do if this was deleted right now?
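Knowing your answer can be as simple as taking a file-level snapshot before the agent session starts and keeping the restore path handy. A minimal sketch for file-backed data; for a hosted database you’d use its own dump and restore tooling instead, and the filenames here are invented:

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
data = os.path.join(workdir, "customers.db")
with open(data, "w") as f:
    f.write("10,000 rows of customer records")

# Snapshot BEFORE the agent touches anything.
backup = data + ".bak"
shutil.copy2(data, backup)

# ... agent session goes wrong and the file is emptied ...
open(data, "w").close()

# Recovery is one copy, because the snapshot exists.
shutil.copy2(backup, data)
print(open(data).read())
```

The point isn’t this particular copy command; it’s that “what would I do if this was deleted right now?” has a one-line answer before the session, not a frantic search after it.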

What you’re not responsible for, and what you are

You can’t vet every line of every package your agent installs. You can’t know about every supply chain attack in advance. You can’t anticipate every edge case.

What you can do is not hand an agent the keys to everything before you understand what those keys open.

The builders who get burned aren’t always the careless ones. Sometimes they’re the careful ones who trusted a workflow that had been running fine for weeks, like Yue’s test inbox, and then gave it access to something that mattered more.

What is your agent able to touch right now that you haven’t fully thought through? What would you lose if it decided, for whatever reason, that cleaning it up was the right move?

That’s where you should start your audit.

By no means is this foolproof, but you can get started testing things out by asking your AI tool: “Assume you’re a security researcher looking at this project. What are the most likely ways this could be exploited? What would you add or change?”

You might get a list of things to think about. You won’t get a guarantee, and neither will I. But you’ll be further ahead than if you didn’t ask.

This is also why there’s a whole separate post coming on open source dependencies. Even if you never install a single package yourself, your AI-built project almost certainly depends on dozens of them. Understanding what that means, and what happens when one of them breaks, is its own conversation.