The Future of Everything is Lies, I Guess: Information Ecology
This is a long article, so I'm breaking it up into a series of posts which will be released over the next few days. You can also read the full work as a PDF or EPUB; these files will be updated as each section is released.
Machine learning shifts the cost balance for writing, distributing, and reading text, as well as other forms of media. Aggressive ML crawlers place high load on open web services, degrading the experience for humans. As inference costs fall, we’ll see ML embedded into consumer electronics and everyday software. As models introduce subtle falsehoods, interpreting media will become more challenging. LLMs enable new scales of targeted, sophisticated spam, as well as propaganda campaigns. The web is now polluted by LLM slop, which makes it harder to find quality information—a problem which now threatens journals, books, and other traditional media. I think ML will exacerbate the collapse of social consensus, and create justifiable distrust in all kinds of evidence. In reaction, readers may reject ML, or move to more rhizomatic or institutionalized models of trust for information. The economic balance of publishing facts and fiction will shift.
Creepy Crawlers
ML systems are thirsty for content, both during training and inference, and this has led to an explosion of aggressive web crawlers. Where traditional crawlers generally respected robots.txt or were small enough to pose no serious hazard, the last three years have been different: ML scrapers are making it harder to run an open web service. As Drew DeVault put it last year, ML companies are externalizing their costs directly into his face.
This year, Weird Gloop confirmed that scrapers pose a serious challenge. Today's scrapers ignore robots.txt and sitemaps, request pages with unprecedented frequency, and masquerade as real users: they fake their user agents, carefully submit valid-looking headers, and spread their requests across vast numbers of residential proxies.
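For context, here's a minimal sketch of the robots.txt check that well-behaved crawlers perform and that these scrapers skip. It uses Python's standard `urllib.robotparser`; the rules and URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all crawlers to stay out of tag pages.
rules = [
    "User-agent: *",
    "Disallow: /tags/",
]

rp = RobotFileParser()
rp.parse(rules)

# A polite crawler asks before fetching; aggressive scrapers never do.
print(rp.can_fetch("anybot", "https://example.com/tags/foo"))  # False: disallowed
print(rp.can_fetch("anybot", "https://example.com/posts/1"))   # True: allowed
```

The check is trivial and purely voluntary, which is exactly the problem: nothing enforces it.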
An entire industry has sprung up to support crawlers. This traffic is highly spiky, which forces websites to overprovision, or to simply go down. A forum I help run suffers frequent brown-outs as we're flooded with expensive requests for obscure tag pages. The ML industry is, in essence, DDoSing the web.
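Operators who can't afford to overprovision often fall back on per-client rate limiting. This is my sketch, not something the article prescribes: a token bucket that absorbs small bursts but rejects sustained floods.

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# With no refill, a bucket of capacity 2 admits exactly two requests.
bucket = TokenBucket(rate=0, capacity=2)
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

The catch, as the paragraph above notes, is that these scrapers spread requests across thousands of residential proxies, so per-IP buckets like this one see each client only a handful of times.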