AI Safety

Trust the tool. Verify the output. Back everything up.

Claude is remarkably trustworthy. It asks before deleting files. It warns you before running destructive commands. It flags uncertainty. It's more careful than most humans with a terminal. But it is not infallible, and neither are you. This page is about the space between trust and verification.

The trust model

When you give Claude access to your files, your databases, your email — you're granting real permissions to a system that can take real actions. This is different from a chatbot. A chatbot can give you a wrong answer. An agent with tool use can give you a wrong answer and act on it.

Claude's safety design is good. It asks for confirmation before destructive actions. It won't run rm -rf (a command that permanently and recursively deletes files) without checking with you first. It won't push to a remote repository without asking. It won't send an email without showing you the draft first. These guardrails are real and they work.

But guardrails are not a substitute for understanding what's happening.

Rule 1: Know what you're changing

If Claude is about to modify a file, read the diff (the before-and-after comparison of what changed). If it's about to run a command, understand what the command does. If you don't understand — ask it to explain before you approve.

You don't need to understand the syntax. You need to understand the intent. "This command deletes all files matching *.tmp in your project directory" is enough. "This command does something with files" is not.

The most dangerous moment is when you're moving fast and approving actions without reading them. Slow down at the approval step. That's where the human judgment lives.
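Understanding intent can be as simple as a dry run: see exactly what a cleanup would touch, then decide. A minimal sketch in Python — the function name and pattern are illustrative, not part of any Claude tooling:

```python
from pathlib import Path

def preview_delete(directory: str, pattern: str) -> list[str]:
    """List exactly which files a cleanup would remove,
    without removing anything. Approve only after reading this list."""
    return sorted(p.name for p in Path(directory).glob(pattern))
```

If the list surprises you, that is the moment to stop and ask for an explanation before approving anything.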

Rule 2: If you're not sure, keep the stakes low

Learning a new tool? Practice on something that doesn't matter. Don't start by pointing Claude at your production database. Start with a test directory. A scratch project. A copy of the data.

The case studies on this site were built against:

None of these are irreversible. That's not an accident. Design your workflow so mistakes are cheap.

Rule 3: Keep versions of your work

Before Claude changes a file, you should be able to get back to the version before the change. That's it. That's the whole rule.

How you do it is up to you:

Git is better. It tracks every change with a message explaining why, lets you compare versions side by side, and makes recovery instant. The memoir was built in a git repository, and when a restructuring broke the reading order, the fix took seconds instead of hours.

But "better" doesn't mean "required." If proposal_v5_final_FINAL2.docx is your system and it works for you, that's fine. The point is not the tool. The point is that when something goes wrong — and eventually it will — you can get back to where you were. Any versioning is better than no versioning.
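If numbered copies are your system, even that step can be scripted so the copy reliably happens before each change. A minimal sketch — keep_version is a hypothetical helper, not a built-in:

```python
import shutil
from pathlib import Path

def keep_version(path: str) -> Path:
    """Copy a file to file.v1, file.v2, ... before anything modifies it,
    so the previous version is always one copy away."""
    n = 1
    while Path(f"{path}.v{n}").exists():
        n += 1
    backup = Path(f"{path}.v{n}")
    shutil.copy2(path, backup)  # copy2 preserves timestamps as well
    return backup
```

Call it once before letting any tool touch the file; recovery is then an ordinary file copy back.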

Rule 4: Back everything up

Git tracks changes. Backups survive disasters. They are not the same thing.

The memoir project is backed up to S3. The working directory syncs to OneDrive. The git repository is the source of truth, but it lives on a local disk that can fail, be stolen, or be wiped by a corporate IT policy you didn't read.

Backup options that cost nothing or nearly nothing:

The rule: your work should exist in at least two places that can fail independently. If your laptop dies, can you recover? If yes, you're fine. If no, fix that before you do anything else.
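The "two independent places" rule can be checked mechanically. A minimal sketch, where the destinations are stand-ins for a second disk or a locally synced cloud folder:

```python
import shutil
from pathlib import Path

def back_up(source: str, destinations: list[str]) -> None:
    """Mirror a working directory to every destination.
    Two copies that can fail independently is the minimum."""
    for dest in destinations:
        target = Path(dest) / Path(source).name
        # dirs_exist_ok lets repeated runs refresh an existing backup
        shutil.copytree(source, target, dirs_exist_ok=True)
```

Run it on a schedule, and test the recovery path once: a backup you have never restored from is a hope, not a backup.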

Rule 5: Understand false termination

This is the one that matters most, and it's the one nobody talks about.

Claude can be wrong. Not confused, not uncertain — confidently, cleanly wrong. It will return a well-formatted, definitive answer to a question that has no answer, and it will not flag the error. This is called false termination, and it happens because no computing system can reliably detect every case where it should say "I don't know" instead of giving an answer. It's not a bug in Claude. It's a limitation of all computing systems.

The defense is not to distrust Claude. The defense is to know when you're in territory where errors are possible and think harder at those moments.

When is Claude most likely to be wrong?

When is Claude almost always right?

The pattern: Claude is excellent at execution and unreliable at judgment about novel edge cases. That's why YOU++ puts judgment in the human's hands. Not as a philosophy. As a safety practice.

Rule 6: Review before sending

If Claude drafts an email, read it before you send it. If Claude generates a report, review it before you share it. If Claude fills out a form, check the fields before you submit.

This is not about distrust. This is about the fact that Claude doesn't know your audience, your relationships, your context, or your tone as well as you do. The draft is a draft. You are the editor. The Stephen King rule: write with the door closed, rewrite with the door open.

Rule 7: Know whose name is on it

There's a difference between editing and authoring. If you direct the AI, review the output, and approve the result — you're the editor. If the AI wrote the prose and you didn't substantially rewrite it — the AI is the author. These are different claims. Don't confuse them.

When the stakes are low and the value is the content, not the writing, attribution is simple. Emails, internal reports, status updates, data summaries — nobody cares who typed the sentences. The value is the information. Send it under your name. That's fine.

When the value is the writing — creative work, published essays, literary prose, anything where the quality of the sentences is the point — the rules change completely. If Claude wrote the prose and you directed, edited, and approved it, say so. If you wrote every word yourself and Claude handled production, say that instead. The reader deserves to know which one happened.

The test: if someone praises the writing, who earned it? If the answer is Claude, and you accept the compliment, you just took credit for someone else's work. The fact that "someone else" is a machine doesn't change the ethics. It changes the awkwardness.

Three levels, honestly applied:

The memoir I am bill? is level one — every word of the narrative is Bill's. Claude built the production infrastructure. This website is level two — the ideas are Bill's, Claude generated the prose, Bill reviewed and approved every word. Both are disclosed, on every page, because the reader has a right to know what they're reading.

Attribution also tells the reader what to trust. This connects directly to false termination — the AI has specific, predictable failure modes, and they're different from human failure modes. Claude is stateless: it has no memory between sessions, no clock, no calendar. If Claude wrote a paragraph that includes a date, a timeline, or a claim about when something happened, that fact deserves extra scrutiny — because the AI is guessing, not remembering. But if Claude wrote a paragraph explaining how a JavaScript function works, that's drawn from deep training data and is almost certainly correct.

When you disclose who wrote what, you're giving the reader a map of where to be skeptical. "Bill wrote the narrative" means the dates and personal details are trustworthy — he was there. "Claude wrote the technical explanation" means the code examples are solid but check the version numbers. "Claude wrote this entire page under Bill's direction" means the ideas are vetted but the specific claims should be verified against primary sources. Different authors, different failure modes, different trust calibration. The reader can only do that calibration if you tell them who held the pen.

Disclosure is not an apology. It's not "I was too lazy to write this myself." It's the opposite — it's showing your work. Saying "I had Claude draft this and I reviewed it" tells people you used a powerful tool intelligently and you're confident enough to say so. Hiding it tells people you're ashamed of your own process, or worse, that you're hoping they'll assume you did work you didn't do.

In practice, disclosure sounds like this:

Each of those tells the recipient exactly what to trust, what to verify, and how much of your judgment is behind it. That's not weakness. That's the most useful thing you can attach to a deliverable: a reliability map.

This isn't a legal question. It's a character question. The same kind of character question that the book's license is designed to surface. If you can't be honest about who wrote a paragraph, what else are you going to be dishonest about?

Rule 8: Permissions are real

Both Claude Code and Claude Desktop will ask for permissions when they want to take actions. These prompts are not bureaucracy. They're checkpoints.

You can configure auto-approval for low-risk operations and require confirmation for high-risk ones. Set this up thoughtfully. The default settings are conservative for a reason.

Rule 9: Put your boundaries in the project, not in the chat


Here's the difference between a rule and a boundary.

A rule is something you tell someone: "Don't drop the production table." A boundary is something the system enforces: the account you're logged in as literally cannot delete tables. Not because you remembered the rule, but because the identity you're operating under doesn't have the capability.

This is how real infrastructure works. On the Snowflake data platform that runs the six-country audit, there's a role called AI_SAFEGUARD. It's a distinct identity — a separate login with its own permissions. It can read tables. It can run queries. It cannot delete data. It cannot modify the structure of the database. It cannot grant itself more access. The boundary isn't a note in a readme. It's enforced by the system. The role literally cannot do the dangerous thing, regardless of what anyone types into the prompt.

Build your AI workflows the same way.

Claude supports project identities — configuration files that define who Claude is in this context, what it's allowed to do, and what it cannot do. In Claude Code, this is a CLAUDE.md file in the top-level folder of your project. In Claude Desktop, it's the project instructions panel. These are loaded automatically at the start of every conversation. They don't depend on you remembering to say them.

This matters because you will forget. At 2 AM, deep in a problem, you will not type "don't push to main." You won't remember to say "this directory has patient data, don't send it over the network." You'll be focused on the problem and the guardrails will slip. Humans are bad at remembering rules under pressure. That's not a character flaw. That's why we invented roles and permissions in the first place.

So don't tell Claude the rules. Give Claude an identity where the rules are built in.

Example from the project that built this site:

# Project identity
- You are working on the YOU++ educational site
- Never push to remote without asking
- Never delete files without showing what will be deleted
- All external-facing text must include authorship disclosure
- Use aws profile "personal" for all S3 operations
- Do not access or reference any files outside ~/Documents/persona/

That's not a sticky note. That's a role definition. It loads every session. It doesn't drift. It doesn't get forgotten at 2 AM. It's the difference between telling someone "be careful" and giving them an account that can't break things.

Where possible, set up a separate login for AI tools that can only read data, not change or delete it. For example: a database account that can run reports but can't modify tables, or cloud storage settings that allow downloads but not deletions. Push the boundary into the infrastructure, not the instructions. Instructions can be ignored. Permissions can't.
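The same boundary works at any scale. A minimal sketch using SQLite's read-only open mode as a stand-in for a reporting-only database account — the write fails at the database layer, no matter what anyone types into the prompt:

```python
import sqlite3

def open_readonly(db_path: str) -> sqlite3.Connection:
    """Open a database under an identity that can read but not write.
    mode=ro is enforced by SQLite itself, not by instructions."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
```

On this connection a SELECT works normally, while any INSERT, UPDATE, or DELETE raises sqlite3.OperationalError. That is a boundary, not a rule.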

The summary

  1. Know what you're changing. Read the diff. Understand the command.
  2. Keep stakes low while learning. Practice on things that don't matter.
  3. Keep versions. Git is best. Numbered file copies work too. Any versioning beats no versioning.
  4. Back up. Two independent copies minimum.
  5. Understand false termination. Claude can be confidently wrong. Think hardest when the answer looks cleanest.
  6. Review before sending. You are the editor, not the audience.
  7. Know whose name is on it. If the AI wrote the prose, say so. Emails and reports are fine. Creative work and published writing require honest attribution. If someone praises the writing and the answer is Claude, don't accept the compliment.
  8. Respect the permission prompts. They exist for you, not against you.
  9. Build identities, not instructions. Give Claude a project file that defines its role and permissions. Where your platform supports it, use real service accounts with limited access. Push boundaries into infrastructure. Instructions can be ignored. Permissions can't.

Claude is the most trustworthy AI tool available. This page is not about fear. It's about the same discipline you'd apply to any powerful tool — a table saw, a car, a database with production access. Respect the tool. Understand the tool. Use the tool. Don't sleepwalk through the parts that matter.

Disclosure: This page was generated by Claude (Anthropic) under Bill's direction. The irony of an AI writing its own safety manual is not lost on either of us. Bill reviewed and approved every word.