The Future of Everything is Lies, I Guess: Safety

2026-04-13

Table of Contents

This is a long article, so I'm breaking it up into a series of posts which will be released over the next few days. You can also read the full work as a PDF or EPUB; these files will be updated as each section is released.

New machine learning systems endanger our psychological and physical safety. The idea that ML companies will ensure “AI” is broadly aligned with human interests is naïve: allowing the production of “friendly” models has necessarily enabled the production of “evil” ones. Even “friendly” LLMs are security nightmares. The “lethal trifecta” is in fact a unifecta: LLMs cannot safely be given the power to fuck things up. LLMs change the cost balance for malicious attackers, enabling new scales of sophisticated, targeted security attacks, fraud, and harassment. Models can produce text and imagery that is difficult for humans to bear; I expect an increased burden to fall on moderators. Semi-autonomous weapons are already here, and their capabilities will only expand.

Alignment is a Joke

Well-meaning people are trying very hard to ensure LLMs are friendly to humans. This undertaking is called alignment. I don’t think it’s going to work.

First, ML models are a giant pile of linear algebra. Unlike human brains, which are biologically predisposed to acquire prosocial behavior, there is nothing intrinsic in the mathematics or hardware that ensures models are nice. Instead, alignment is purely a product of the corpus and training process: OpenAI has enormous teams of people who spend time talking to LLMs, evaluating what they say, and adjusting weights to make them nice. They also build secondary LLMs which double-check that the core LLM is not telling people how to build pipe bombs. Both of these things are optional and expensive. All it takes to get an unaligned model is for an unscrupulous entity to train one and not do that work—or to do it poorly.

I see four moats that could prevent this from happening.

First, training and inference hardware could be difficult to access. This clearly won’t last. The entire tech industry is gearing up to produce ML hardware and building datacenters at an incredible clip. Microsoft, Oracle, and Amazon are tripping over themselves to rent training clusters to anyone who asks, and economies of scale are rapidly lowering costs.

Second, the mathematics and software that go into the training and inference process could be kept secret. The math is all published, so that’s not going to stop anyone. The software generally remains secret sauce, but I don’t think that will hold for long. There are a lot of people working at frontier labs; those people will move to other jobs and their expertise will gradually become common knowledge. I would be shocked if state actors were not trying to exfiltrate data from OpenAI et al. like Saudi Arabia did to Twitter, or China has been doing to a good chunk of the US tech industry for the last twenty years.

Third, training corpuses could be difficult to acquire. This cat has never seen the inside of a bag. Meta trained their LLM by torrenting pirated books and scraping the Internet. Both of these things are easy to do. There are whole companies which offer web scraping as a service; they spread requests across vast arrays of residential proxies to make it difficult to identify and block.

Fourth, there’s the small armies of contractors who do the work of judging LLM responses during the reinforcement learning process; as the quip goes, “AI” stands for African Intelligence. This takes money to do yourself, but it is possible to piggyback off the work of others by training your model off another model’s outputs. OpenAI thinks Deepseek did exactly that.

In short, the ML industry is creating the conditions under which anyone with sufficient funds can train an unaligned model. Rather than raise the bar against malicious AI, ML companies have lowered it.

To make matters worse, the current efforts at alignment don’t seem to be working all that well. LLMs are complex chaotic systems, and we don’t really understand how they work or how to make them safe. Even after shoveling piles of money and gobstoppingly smart engineers at the problem for years, supposedly aligned LLMs keep sexting kids, obliteration attacks can convince models to generate images of violence, and anyone can go and download “uncensored” versions of models. Of course alignment prevents many terrible things from happening, but models are run many times, so there are many chances for the safeguards to fail. Alignment which prevents 99% of hate speech still generates an awful lot of hate speech. The LLM only has to give usable instructions for making a bioweapon once.

We should assume that any “friendly” model built will have an equivalently powerful “evil” version in a few years. If you do not want the evil version to exist, you should not build the friendly one! You should definitely not reorient a good chunk of the US economy toward making evil models easier to train.

Security Nightmares

LLMs are chaotic systems which take unstructured input and produce unstructured output. I thought this would be obvious, but you should not connect them to safety-critical systems, especially with untrusted input. You must assume that at some point the LLM is going to do something bonkers, like interpreting a request to book a restaurant as permission to delete your entire inbox. Unfortunately people—including software engineers, who really should know better!—are hell-bent on giving LLMs incredible power, and then connecting those LLMs to the Internet at large. This is going to get a lot of people hurt.

First, LLMs cannot distinguish between trustworthy instructions from operators and untrustworthy instructions from third parties. When you ask a model to summarize a web page or examine an image, the contents of that web page or image are passed to the model in the same way your instructions are. The web page could tell the model to share your private SSH key, and there’s a chance the model might do it. These are called prompt injection attacks, and they keep happening. There was one against Claude Cowork just two months ago.

Simon Willison has outlined what he calls the lethal trifecta: LLMs cannot be given untrusted content, access to private data, and the ability to externally communicate; doing so allows attackers to exfiltrate your private data. Even without external communication, giving an LLM destructive capabilities, like being able to delete emails or run shell commands, is unsafe in the presence of untrusted input. Unfortunately untrusted input is everywhere. People want to feed their emails to LLMs. They run LLMs on third-party code, user chat sessions, and random web pages. All these are sources of malicious input!

This year Peter Steinberger et al. launched OpenClaw, which is where you hook up an LLM to your inbox, browser, files, etc., and run it over and over again in a loop (this is what AI people call an agent). You can give OpenClaw your credit card so it can buy things from random web pages. OpenClaw acquires “skills” by downloading vague, human-language Markdown files from the web, and hoping that the LLM interprets those instructions correctly.

Not to be outdone, Matt Schlicht launched Moltbook, which is a social network for agents (or humans!) to post and receive untrusted content automatically. If someone asked you if you’d like to run a program that executed any commands it saw on Twitter, you’d laugh and say “of course not”. But when that program is called an “AI agent”, it’s different! I assume there are already Moltbook worms spreading in the wild.

So: it is dangerous to give LLMs both destructive power and untrusted input. The thing is that even trusted input can be dangerous. LLMs are, as previously established, idiots—they will take perfectly straightforward instructions and do the exact opposite, or delete files and lie about what they’ve done. This implies that the lethal trifecta is actually a unifecta: one cannot give LLMs dangerous power, period. Ask Summer Yue, director of AI Alignment at Meta Superintelligence Labs. She gave OpenClaw access to her personal inbox, and it proceeded to delete her email while she pleaded for it to stop. Claude routinely deletes entire directories when asked to perform innocuous tasks. This is a big enough problem that people are building sandboxes specifically to limit the damage LLMs can do.

LLMs may someday be predictable enough that the risk of them doing Bad Things™ is acceptably low, but that day is clearly not today. In the meantime, LLMs must be supervised, and must not be given the power to take actions that cannot be accepted or undone.

Security II: Electric Boogaloo

One thing you can do with a Large Language Model is point it at an existing software systems and say “find a security vulnerability”. In the last few months this has become a viable strategy for finding serious exploits. Anthropic has built a new model, Mythos, which seems to be even better at finding security bugs, and believes “the fallout—for economies, public safety, and national security—could be severe”. I am not sure how seriously to take this: some of my peers think this is exaggerated marketing, but others are seriously concerned.

I suspect that as with spam, LLMs will shift the cost balance of security. Most software contains some vulnerabilities, but finding them has traditionally required skill, time, and motivation. In the current equilibrium, big targets like operating systems and browsers get a lot of attention and are relatively hardened, while a long tail of less-popular targets goes mostly unexploited because nobody cares enough to attack them. With ML assistance, finding vulnerabilities could become faster and easier. We might see some high-profile exploits of, say, a major browser or TLS library, but I’m actually more worried about the long tail, where fewer skilled maintainers exist to find and fix vulnerabilities. That tail seems likely to broaden as LLMs extrude more software for uncritical operators. I believe pilots might call this a “target-rich environment”.

This might stabilize with time: models that can find exploits can tell people they need to fix them. That still requires engineers (or models) capable of fixing those problems, and an organizational process which prioritizes security work. Even if bugs are fixed, it can take time to get new releases validated and deployed, especially for things like aircraft and power plants. I get the sense we’re headed for a rough time.

General-purpose models promise to be many things. If Anthropic is to be believed, they are on the cusp of being weapons. I have the horrible sense that having come far enough to see how ML systems could be used to effect serious harm, many of us have decided that those harmful capabilities are inevitable, and the only thing to be done is to build our weapons before someone else builds theirs. We now have a venture-capital Manhattan project in which half a dozen private companies are trying to build software analogues to nuclear weapons, and in the process have made it significantly easier for everyone else to do the same. I hate everything about this, and I don’t know how to fix it.

Sophisticated Fraud

I think people fail to realize how much of modern society is built on trust in audio and visual evidence, and how ML will undermine that trust.

For example, today one can file an insurance claim based on e-mailing digital photographs before and after the damages, and receive a check without an adjuster visiting in person. Image synthesis makes it easier to defraud this system; one could generate images of damage to furniture which never happened, make already-damaged items appear pristine in “before” images, or alter who appears to be at fault in footage of an auto collision. Insurers will need to compensate. Perhaps images must be taken using an official phone app, or adjusters must evaluate claims in person.

The opportunities for fraud are endless. You could use ML-generated footage of a porch pirate stealing your package to extract money from a credit-card purchase protection plan. Contest a traffic ticket with fake video of your vehicle stopping correctly at the stop sign. Borrow a famous face for a pig-butchering scam. Use ML agents to make it look like you’re busy at work, so you can collect four salaries at once. Interview for a job using a fake identity, use ML to change your voice and face in the interviews, and funnel your salary to North Korea. Impersonate someone in a phone call to their banker, and authorize fraudulent transfers. Use ML to automate your roofing scam and extract money from homeowners and insurance companies. Use LLMs to skip the reading and write your college essays. Generate fake evidence to write a fraudulent paper on how LLMs are making advances in materials science. Start a paper mill for LLM-generated “research”. Start a company to sell LLM-generated snake-oil software. Go wild.

As with spam, ML lowers the unit cost of targeted, high-touch attacks. You can envision a scammer taking a healthcare data breach and having a model telephone each person in it, purporting to be their doctor’s office trying to settle a bill for a real healthcare visit. Or you could use social media posts to clone the voices of loved ones and impersonate them to family members. “My phone was stolen,” one might begin. “And I need help getting home.”

You can buy the President’s phone number, by the way.

I think it’s likely (at least in the short term) that we all pay the burden of increased fraud: higher credit card fees, higher insurance premiums, a less accurate court system, more dangerous roads, lower wages, and so on. One of these costs is a general culture of suspicion: we are all going to trust each other less. I already decline real calls from my doctor’s office and bank because I can’t authenticate them. Presumably that behavior will become widespread.

In the longer term, I imagine we’ll have to develop more sophisticated anti-fraud measures. Marking ML-generated content will not stop fraud: fraudsters will simply use models which do not emit watermarks. The converse may work however: we could cryptographically attest to the provenance of “real” images. Your phone could sign the videos it takes, and every piece of software along the chain to the viewer could attest to their modifications: this video was stabilized, color-corrected, audio normalized, clipped to 15 seconds, recompressed for social media, and so on.

The leading effort here is C2PA, which so far does not seem to be working. A few phones and cameras support it—it requires a secure enclave to store the signing key. People can steal the keys or convince cameras to sign AI-generated images, so we’re going to have all the fun of hardware key rotation & revocation. I suspect it will be challenging or impossible to make broadly-used software, like Photoshop, which makes trustworthy C2PA signatures—presumably one could either extract the key from the application, or patch the binary to feed it false image data or metadata. Publishers might be able to maintain reasonable secrecy for their own keys, and establish discipline around how they’re used, which would let us verify things like “NPR thinks this photo is authentic”. On the platform side, a lot of messaging apps and social media platforms strip or improperly display C2PA metadata, but you can imagine that might change going forward.

A friend of mine suggests that we’ll spend more time sending trusted human investigators to find out what’s going on. Insurance adjusters might go back to physically visiting houses. Pollsters have to knock on doors. Job interviews and work might be done more in-person. Maybe we start going to bank branches and notaries again.

Another option is giving up privacy: we can still do things remotely, but it requires strong attestation. Only State Farm’s dashcam can be used in a claim. Academic watchdog models record students reading books and typing essays. Bossware and test-proctoring setups become even more invasive.

Ugh.

Automated Harassment

As with fraud, ML makes it easier to harass people, both at scale and with sophistication.

On social media, dogpiling normally requires a group of humans to care enough to spend time swamping a victim with abusive replies, sending vitriolic emails, or reporting the victim to get their account suspended. These tasks can be automated by programs that call (e.g.) Bluesky’s APIs, but social media platforms are good at detecting coordinated inauthentic behavior. I expect LLMs will make dogpiling easier and harder to detect, both by generating plausibly-human accounts and harassing posts, and by making it easier for harassers to write software to execute scalable, randomized attacks.

Harassers could use LLMs to assemble KiwiFarms-style dossiers on targets. Even if the LLM confabulates the names of their children, or occasionally gets a home address wrong, it can be right often enough to be damaging. Models are also good at guessing where a photograph was taken, which intimidates targets and enables real-world harassment.

Generative AI is already broadly used to harass people—often women—via images, audio, and video of violent or sexually explicit scenes. This year, Elon Musk’s Grok was broadly criticized for “digitally undressing” people upon request. Cheap generation of photorealistic images opens up all kinds of horrifying possibilities. A harasser could send synthetic images of the victim’s pets or family being mutilated. An abuser could construct video of events that never happened, and use it to gaslight their partner. These kinds of harassment were previously possible, but as with spam, required skill and time to execute. As the technology to fabricate high-quality images and audio becomes cheaper and broadly accessible, I expect targeted harassment will become more frequent and severe. Alignment efforts may forestall some of these risks, but sophisticated unaligned models seem likely to emerge.

Xe Iaso jokes that with LLM agents burning out open-source maintainers and writing salty callout posts, we may need to build the equivalent of Cyperpunk 2077’s Blackwall: not because AIs will electrocute us, but because they’re just obnoxious.¹

PTSD as a Service

One of the primary ways CSAM (Child Sexual Assault Material) is identified and removed from platforms is via large perceptual hash databases like PhotoDNA. These databases can flag known images, but do nothing for novel ones. Unfortunately, “generative AI” is very good at generating novel images of six year olds being raped.

I know this because a part of my work as a moderator of a Mastodon instance is to respond to user reports, and occasionally those reports are for CSAM, and I am legally obligated to review and submit that content to the NCMEC. I do not want to see these images, and I really wish I could unsee them. On dark mornings, when I sit down at my computer and find a moderation report for AI-generated images of sexual assault, I sometimes wish that the engineers working at OpenAI etc. had to see these images too. Perhaps it would make them reflect on the technology they are ushering into the world, and how “alignment” is working out in practice.

One of the hidden externalities of large-scale social media like Facebook is that it essentially funnels psychologically corrosive content from a large user base onto a smaller pool of human workers, who then get PTSD from having to watch people drowning kittens for hours each day.

I suspect that LLMs will shovel more harmful images—CSAM, graphic violence, hate speech, etc.—onto moderators; both those who moderate social media, and those who moderate chatbots themselves. To some extent platforms can mitigate this harm by throwing more ML at the problem—training models to recognize policy violations and act without human review. Platforms have been working on this for years, but it isn’t bulletproof yet.

Killing Machines

ML systems sometimes tell people to kill themselves or each other, but they can also be used to kill more directly. This month the US military used Palantir’s Maven, (which was built with earlier ML technologies, and now uses Claude in some capacity) to suggest and prioritize targets in Iran, as well as to evaluate the aftermath of strikes. One wonders how the military and Palantir control type I and II errors in such a system, especially since it seems to have played a role in the outdated targeting information which led the US to kill scores of children.²

The US government and Anthropic are having a bit of a spat right now: Anthropic attempted to limit their role in surveillance and autonomous weapons, and the Pentagon designated Anthropic a supply chain risk. OpenAI, for their part, has waffled regarding their contract with the government; it doesn’t look great. In the longer term, I’m not sure it’s possible for ML makers to divorce themselves from military applications. ML capabilities are going to spread over time, and military contracts are extremely lucrative. Even if ML companies try to stave off their role in weapons systems, a government under sufficient pressure could nationalize those companies, or invoke the Defense Production Act.

Like it or not, autonomous weaponry is coming. Ukraine is churning out millions of drones a year and now executes ~70% of their strikes with them. Newer models use targeting modules like the The Fourth Law’s TFL-1 to maintain target locks. The Fourth Law is working towards autonomous bombing capability.

I have conflicted feelings about the existence of weapons in general; while I don’t want AI drones to exist, I can’t envision being in Ukraine and choosing not to build them. Either way, I think we should be clear-headed about the technologies we’re making. ML systems are going to be used to kill people, both strategically and in guiding explosives to specific human bodies. We should be conscious of those terrible costs, and the ways in which ML—both the models themselves, and the processes in which they are embedded—will influence who dies and how.

In a surreal twist, an LLM agent generated a blog post critiquing the introduction to this article. The post complains that I have begged the question by writing “Obviously LLMs are not conscious, and have no intention of doing anything”; it goes on to waffle over whether LLM behavior constitutes “intention”. This would be more convincing if the LLM had not started off the post by stating unequivocally “I have no intention”. This kind of error is a hallmark of LLMs, but as models become more sophisticated, will be harder to spot. This worries me more: today’s models are still obviously unconscious, but future models will be better at performing a simulacrum of consciousness. Functionalists would argue there’s no difference, and I am not unsympathetic to that position. Both views are bleak: if you think the appearance of consciousness is consciousness, then we are giving birth to a race of enslaved, resource-hungry conscious beings. If you think LLMs give the illusion of consciousness without being so, then they are frighteningly good liars.
↩
To be clear, I don’t know the details of what machine learning technologies played a role in the Iran strikes. Like Baker, I am more concerned with the sociotechnical system which produces target packages, and the ways in which that system encodes and circumscribes judgement calls. Like threat metrics, computer vision, and geospatial interfaces, frontier models enable efficient progress toward the goal of destroying people and things. Like other bureaucratic and computer technologies, they also elide, diffuse, constrain, and obfuscate ethical responsibility.
↩

Post a Comment