AI Safety
Trust the tool. Verify the output. Back everything up.
Claude is remarkably trustworthy. It asks before deleting files. It warns you before running destructive commands. It flags uncertainty. It's more careful than most humans with a terminal. But it is not infallible, and neither are you. This page is about the space between trust and verification.
The trust model
When you give Claude access to your files, your databases, your email — you're granting real permissions to a system that can take real actions. This is different from a chatbot. A chatbot can give you a wrong answer. An agent with tool use can give you a wrong answer and act on it.
Claude's safety design is good. It asks for confirmation before destructive actions. It won't run rm -rf (a command that permanently deletes files and directories) without checking with you first. It won't push to a remote repository without asking. It won't send an email without showing you the draft first. These guardrails are real and they work.
But guardrails are not a substitute for understanding what's happening.
Rule 1: Know what you're changing
If Claude is about to modify a file, read the diff (the before-and-after comparison of what changed). If it's about to run a command, understand what the command does. If you don't understand — ask it to explain before you approve.
You don't need to understand the syntax. You need to understand the intent. "This command deletes all files matching *.tmp in your project directory" is enough. "This command does something with files" is not.
The most dangerous moment is when you're moving fast and approving actions without reading them. Slow down at the approval step. That's where the human judgment lives.
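The habit is worth building before an AI is in the loop at all. A minimal sketch of the keep-a-copy, read-the-diff loop (the /tmp scratch directory and filenames are illustrative):

```shell
# Keep a copy, let the change happen, then read the diff before accepting it.
mkdir -p /tmp/diff-demo && cd /tmp/diff-demo
printf 'Revenue grew 4%% in Q2.\n' > report.md
cp report.md report.md.bak                        # snapshot the "before"
printf 'Revenue grew 14%% in Q2.\n' > report.md   # stand-in for an AI edit
diff report.md.bak report.md || true              # the before-and-after, line by line
```

The point is the `diff` step: one glance tells you whether the change matches the intent you approved.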
Rule 2: If you're not sure, keep the stakes low
Learning a new tool? Practice on something that doesn't matter. Don't start by pointing Claude at your production database. Start with a test directory. A scratch project. A copy of the data.
The case studies on this site were built against:
- Markdown files (text — easy to recover)
- Amazon cloud storage with versioning turned on (every file change is saved, so you can roll back)
- Git repositories (a system that tracks every change to every file — explained in Rule 3 below)
- Email drafts reviewed before sending
None of these are irreversible. That's not an accident. Design your workflow so mistakes are cheap.
Rule 3: Keep versions of your work
Before Claude changes a file, you should be able to get back to the version before the change. That's it. That's the whole rule.
How you do it is up to you:
- Copy the file first. report_v1.html, report_v2.html, report_v3.html. Simple. Everybody understands it. Works fine.
- Use a cloud drive with version history. OneDrive, Google Drive, and Dropbox all keep previous versions of files automatically. You don't have to do anything — just make sure your work is in the sync folder.
- Use git. If you're comfortable with a terminal, git is the professional tool for this. Every change is tracked. Every state is recoverable. Four commands to learn: git init, git add, git commit, git log. Claude can teach you in five minutes.
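Here is that whole loop in a scratch directory (the directory name and commit identity are placeholders — use your own):

```shell
mkdir -p /tmp/git-demo && cd /tmp/git-demo
git init -q                                   # start tracking this directory
printf 'first draft\n' > report.md
git add report.md                             # stage the change
git -c user.name=you -c user.email=you@example.com \
    commit -qm "first draft of the report"    # snapshot it, with a why
git log --oneline                             # every snapshot, newest first
```

Recovery is then one command away: `git restore report.md` throws away uncommitted changes and brings back the last snapshot.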
Git is better. It tracks every change with a message explaining why, lets you compare versions side by side, and makes recovery instant. The memoir was built in a git repository, and when a restructuring broke the reading order, the fix took seconds instead of hours.
But "better" doesn't mean "required." If proposal_v5_final_FINAL2.docx is your system and it works for you, that's fine. The point is not the tool. The point is that when something goes wrong — and eventually it will — you can get back to where you were. Any versioning is better than no versioning.
Rule 4: Back everything up
Git tracks changes. Backups survive disasters. They are not the same thing.
The memoir project is backed up to S3. The working directory syncs to OneDrive. The git repository is the source of truth, but it lives on a local disk that can fail, be stolen, or be wiped by a corporate IT policy you didn't read.
Backup options that cost nothing or nearly nothing:
- S3 — Upload your project folder to Amazon's cloud storage with one command. Pennies per GB per month.
- GitHub or GitLab — online services that store a backup copy of your project. Free for private (non-public) projects.
- Cloud drive — OneDrive, Google Drive, iCloud. Put your project directory inside the sync folder.
- External drive — old school, still works.
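Whichever option you pick, the mechanics are simple: make a second copy, then verify it. A sketch using plain cp, with the S3 equivalent as a comment (paths, bucket name, and profile are placeholders):

```shell
# Two copies, then verify they match.
mkdir -p /tmp/project /tmp/backup
printf 'chapter one\n' > /tmp/project/memoir.md
cp -r /tmp/project/. /tmp/backup/        # second copy on another "disk"
diff -r /tmp/project /tmp/backup         # silence means the copies match
# The S3 equivalent (assumes the aws CLI and a configured profile):
# aws s3 sync /tmp/project s3://your-backup-bucket/project/ --profile personal
```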
The rule: your work should exist in at least two places that can fail independently. If your laptop dies, can you recover? If yes, you're fine. If no, fix that before you do anything else.
Rule 5: Understand false termination
This is the one that matters most, and it's the one nobody talks about.
Claude can be wrong. Not confused, not uncertain — confidently, cleanly wrong. It will return a well-formatted, definitive answer to a question that has no answer, and it will not flag the error. This is called false termination, and it happens because no computing system can reliably detect every case where it should say "I don't know" instead of giving an answer. It's not a bug in Claude. It's a limitation of all computing systems.
The defense is not to distrust Claude. The defense is to know when you're in territory where errors are possible and think harder at those moments.
When is Claude most likely to be wrong?
- Counting things (especially ambiguous things)
- Reasoning about its own reasoning
- Questions where the answer feels too clean
- Novel combinations of familiar concepts
- Anything where "it depends" is the honest answer but Claude gives a definite one
When is Claude almost always right?
- Following instructions you've clearly specified
- Writing code in well-documented languages
- Formatting, converting, and transforming data
- Searching files, reading documentation, summarizing content
- Executing workflows it's done before (skills, briefings, audits)
The pattern: Claude is excellent at execution and unreliable at judgment about novel edge cases. That's why YOU++ puts judgment in the human's hands. Not as a philosophy. As a safety practice.
Rule 6: Review before sending
If Claude drafts an email, read it before you send it. If Claude generates a report, review it before you share it. If Claude fills out a form, check the fields before you submit.
This is not about distrust. This is about the fact that Claude doesn't know your audience, your relationships, your context, or your tone as well as you do. The draft is a draft. You are the editor. The Stephen King rule: write with the door closed, rewrite with the door open.
Rule 7: Know whose name is on it
There's a difference between editing and authoring. If you direct the AI, review the output, and approve the result — you're the editor. If the AI wrote the prose and you didn't substantially rewrite it — the AI is the author. These are different claims. Don't confuse them.
When the stakes are low and the value is the content, not the writing, attribution is simple. Emails, internal reports, status updates, data summaries — nobody cares who typed the sentences. The value is the information. Send it under your name. That's fine.
When the value is the writing — creative work, published essays, literary prose, anything where the quality of the sentences is the point — the rules change completely. If Claude wrote the prose and you directed, edited, and approved it, say so. If you wrote every word yourself and Claude handled production, say that instead. The reader deserves to know which one happened.
The test: if someone praises the writing, who earned it? If the answer is Claude, and you accept the compliment, you just took credit for someone else's work. The fact that "someone else" is a machine doesn't change the ethics. It changes the awkwardness.
Three levels, honestly applied:
- You wrote it, AI assisted. You wrote the prose. The AI handled structure, formatting, production, fact-checking, or cleanup. Attribution: yours, with an optional mention of AI tooling.
- You directed, AI wrote. The ideas and structure are yours. The AI generated the prose. You edited and approved. Attribution: disclose that AI wrote the prose under your direction.
- AI did most of the work. The AI wrote the prose, did the research, and shaped the structure. You prompted and approved. Attribution: the AI is the primary author. You are the director. Say so.
The memoir I am bill? is level one — every word of the narrative is Bill's. Claude built the production infrastructure. This website is level two — the ideas are Bill's, Claude generated the prose, Bill reviewed and approved every word. Both are disclosed, on every page, because the reader has a right to know what they're reading.
Attribution also tells the reader what to trust. This connects directly to false termination — the AI has specific, predictable failure modes, and they're different from human failure modes. Claude is stateless: it has no memory between sessions, no clock, no calendar. If Claude wrote a paragraph that includes a date, a timeline, or a claim about when something happened, that fact deserves extra scrutiny — because the AI is guessing, not remembering. But if Claude wrote a paragraph explaining how a JavaScript function works, that's drawn from deep training data and is almost certainly correct.
When you disclose who wrote what, you're giving the reader a map of where to be skeptical. "Bill wrote the narrative" means the dates and personal details are trustworthy — he was there. "Claude wrote the technical explanation" means the code examples are solid but check the version numbers. "Claude wrote this entire page under Bill's direction" means the ideas are vetted but the specific claims should be verified against primary sources. Different authors, different failure modes, different trust calibration. The reader can only do that calibration if you tell them who held the pen.
Disclosure is not an apology. It's not "I was too lazy to write this myself." It's the opposite — it's showing your work. Saying "I had Claude draft this and I reviewed it" tells people you used a powerful tool intelligently and you're confident enough to say so. Hiding it tells people you're ashamed of your own process, or worse, that you're hoping they'll assume you did work you didn't do.
In practice, disclosure sounds like this:
- "Claude drafted this report. I reviewed the analysis and approved the recommendations."
- "I wrote the narrative. Claude handled formatting and production."
- "This came from Claude — looks solid but you might want to verify the dates against the source data."
- "AI-generated summary. The numbers are from the database; the interpretation is mine."
Each of those tells the recipient exactly what to trust, what to verify, and how much of your judgment is behind it. That's not weakness. That's the most useful thing you can attach to a deliverable: a reliability map.
This isn't a legal question. It's a character question. The same kind of character question that the book's license is designed to surface. If you can't be honest about who wrote a paragraph, what else are you going to be dishonest about?
Rule 8: Permissions are real
Both Claude Code and Claude Desktop will ask for permissions when they want to take actions. These prompts are not bureaucracy. They're checkpoints.
- Read-only operations (searching files, reading code) — generally safe to approve broadly.
- Write operations (editing files, creating files) — safe in a git repo, review otherwise.
- Execution operations (running scripts, shell commands) — read the command. Understand it. Then approve.
- Network operations (sending email, pushing to remote, API calls) — these leave your machine. Always review.
You can configure auto-approval for low-risk operations and require confirmation for high-risk ones. Set this up thoughtfully. The default settings are conservative for a reason.
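In Claude Code, that tiering lives in a project settings file. A hedged sketch — the `permissions` schema below reflects the documented format at the time of writing, but check the current docs before copying it, and note that /tmp/claude-demo stands in for your project root:

```shell
# In a real project this file lives at <project root>/.claude/settings.json.
mkdir -p /tmp/claude-demo/.claude
cat > /tmp/claude-demo/.claude/settings.json <<'EOF'
{
  "permissions": {
    "allow": ["Read(**)", "Bash(git diff:*)"],
    "ask": ["Bash(git push:*)"],
    "deny": ["Bash(rm -rf:*)"]
  }
}
EOF
```

Read-only operations are pre-approved, network operations prompt every time, and the most destructive command is refused outright.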
Rule 9: Put your boundaries in the project, not in the chat
Here's the difference between a rule and a boundary.
A rule is something you tell someone: "Don't drop the production table." A boundary is something the system enforces: the account you're logged in as literally cannot delete tables. You can't drop the table. Not because you remembered the rule. Because the identity you're operating under doesn't have the capability.
This is how real infrastructure works. On the Snowflake data platform that runs the six-country audit, there's a role called AI_SAFEGUARD. It's a distinct identity — a separate login with its own permissions. It can read tables. It can run queries. It cannot delete data. It cannot modify the structure of the database. It cannot grant itself more access. The boundary isn't a note in a readme. It's enforced by the system. The role literally cannot do the dangerous thing, regardless of what anyone types into the prompt.
Build your AI workflows the same way.
Claude supports project identities — configuration files that define who Claude is in this context, what it's allowed to do, and what it cannot do. In Claude Code, this is a CLAUDE.md file in the top-level folder of your project. In Claude Desktop, it's the project instructions panel. These are loaded automatically at the start of every conversation. They don't depend on you remembering to say them.
This matters because you will forget. At 2 AM, deep in a problem, you will not type "don't push to main." You won't remember to say "this directory has patient data, don't send it over the network." You'll be focused on the problem and the guardrails will slip. Humans are bad at remembering rules under pressure. That's not a character flaw. That's why we invented roles and permissions in the first place.
So don't tell Claude the rules. Give Claude an identity where the rules are built in. A good project file spells out:
- Who Claude is in this project and what the project is
- What Claude is allowed to do without asking
- What Claude must always ask before doing
- What Claude must never do, period
- What conventions to follow (naming, formatting, commit style)
Example from the project that built this site:
```
# Project identity
- You are working on the YOU++ educational site
- Never push to remote without asking
- Never delete files without showing what will be deleted
- All external-facing text must include authorship disclosure
- Use aws profile "personal" for all S3 operations
- Do not access or reference any files outside ~/Documents/persona/
```
That's not a sticky note. That's a role definition. It loads every session. It doesn't drift. It doesn't get forgotten at 2 AM. It's the difference between telling someone "be careful" and giving them an account that can't break things.
Where possible, set up a separate login for AI tools that can only read data, not change or delete it. For example: a database account that can run reports but can't modify tables, or cloud storage settings that allow downloads but not deletions. Push the boundary into the infrastructure, not the instructions. Instructions can be ignored. Permissions can't.
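On a single machine, the same principle applies at the filesystem level. A minimal sketch (the directory is a hypothetical stand-in for your source data):

```shell
mkdir -p /tmp/source-data
printf 'id,amount\n1,100\n' > /tmp/source-data/ledger.csv
chmod -R a-w /tmp/source-data     # remove write permission for every user
ls -l /tmp/source-data/ledger.csv
```

Any tool pointed at that directory can read the data but cannot modify it, no matter what it's told. It's a soft boundary — the owner can flip the bit back with `chmod -R u+w`, and a true hard boundary needs a separate account — but it stops the 2 AM mistake.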
The summary
- Know what you're changing. Read the diff. Understand the command.
- Keep stakes low while learning. Practice on things that don't matter.
- Keep versions. Git is best. Numbered file copies work too. Any versioning beats no versioning.
- Back up. Two independent copies minimum.
- Understand false termination. Claude can be confidently wrong. Think hardest when the answer looks cleanest.
- Review before sending. You are the editor, not the audience.
- Know whose name is on it. If the AI wrote the prose, say so. Emails and reports are fine. Creative work and published writing require honest attribution. If someone praises the writing and the answer is Claude, don't accept the compliment.
- Respect the permission prompts. They exist for you, not against you.
- Build identities, not instructions. Give Claude a project file that defines its role and permissions. Where your platform supports it, use real service accounts with limited access. Push boundaries into infrastructure. Instructions can be ignored. Permissions can't.
Claude is the most trustworthy AI tool available. This page is not about fear. It's about the same discipline you'd apply to any powerful tool — a table saw, a car, a database with production access. Respect the tool. Understand the tool. Use the tool. Don't sleepwalk through the parts that matter.
Disclosure: This page was generated by Claude (Anthropic) under Bill's direction. The irony of an AI writing its own safety manual is not lost on either of us. Bill reviewed and approved every word.