All that data that we’ll never use

Soon, we will be producing thousands of zettabytes a year. It’s a tsunami of data every day, every hour of every day, every minute of every hour, every second of every minute. As a result, important data that genuinely does need storing is getting lost. In academic research, for example, we are flooding our research environments with low-quality, often AI-produced, paper garbage. Storing all this stuff is becoming more and more expensive, so research repositories are disappearing and a lot of good research is being lost with them. In a thousand years, there may be more quality data artifacts on the Maya and Inca than on our digital generation. Digital is fragile, transient, and it will sink in its own crap.

We’ve never had more data, and yet we’ve never had less information architecture skill. That’s because organizations don’t want to invest in the hard and vital work of professionally organizing and managing data. AI is making things much worse because it is feeding the idea that humans no longer need to worry about how we create and organize our data: that AI will look after all that. It won’t. AI is a great big lying, great big crap-producing machine.

Teachers are finding that students, brought up on Google search, don’t even know what a file is, let alone where it is saved or how to organize it in a classification hierarchy with other files. For the Google generation, “the concept of file folders and directories, essential to previous generations’ understanding of computers, is gibberish to many modern students,” one professor said.

Archiving data can significantly reduce overall data pollution because the most important decision in archiving is what to delete. Bob Clark, director of archives at the US Rockefeller Archive Center, said that less than 5% of stuff is worth saving in any situation, while a representative from Library and Archives Canada told me that only 1% to 3% of information in any department has archival or historical value.

“Don’t make me think” has long been a mantra of modern design and user experience. It’s a wonderful idea in the right context: helping people navigate complex environments. When used to do the design work itself, however, it is deeply flawed. Buy this technology and software, the pitch goes: it does the thinking for you, it does the data organizing for you. And it’s always on, always available. Store everything, and no matter what time of day or night it is, you can get exactly what you want instantly. In the data center industry, they call it 99.99% uptime, and chasing those last decimal places exacts the same kind of environmental cost as refining silicon to 99.99% purity. This whole technology-first approach simply doesn’t work because, no matter what the tech bros say, quality data management requires human skill and years of human experience: knowing what to delete and how to organize and classify what is left.
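To put that 99.99% in perspective, here is a back-of-the-envelope sketch in Python. The figures are simple arithmetic, not measurements from any particular facility: each extra nine of promised uptime shrinks the downtime a data center is allowed in a year, and it is that shrinking margin that all the redundancy exists to protect.

```python
# Back-of-the-envelope arithmetic: how little downtime each "nine" of uptime permits.
# Illustrative numbers only, not measurements from any particular data center.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, availability in [("99%", 0.99), ("99.9%", 0.999), ("99.99%", 0.9999)]:
    allowed_downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label} uptime allows roughly {allowed_downtime:,.0f} minutes of downtime a year")

# 99%    -> roughly 5,256 minutes (about three and a half days)
# 99.9%  -> roughly 526 minutes (under nine hours)
# 99.99% -> roughly 53 minutes
```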

Instead, we treat the problem as one of storage and access. In a typical data center, “only 6 to 12 percent of energy consumed is devoted to active computational processes,” data expert Steven Gonzalez has estimated. “The remainder is allocated to cooling and maintaining chains upon chains of redundant fail-safes to prevent costly downtime.” Perhaps this has changed somewhat because of the voracious processing demand from AI, but the basic point remains true: if only about a tenth of the energy does the computing, then guaranteeing our convenience, and our access to all that badly organized crap data we’re never going to look at again, costs 90% more mining, 90% more materials, 90% more electricity, 90% more water, 90% more waste. All so that we can “potentially” access a photo or file that there is a 99.99% chance we will never look at. A data center is like the grid before the start of a Formula 1 race: all these high-performance, energy-intense cars revving and revving for a race most of them will never run. Here we are. This is us. This is civilization, modernity, progress, innovation. Spending so much energy to create and store crap we’ll never use again.
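For what it’s worth, here is the simple arithmetic behind that figure, sketched in Python and assuming only the 6 to 12 percent estimate quoted above; the numbers are illustrative, nothing more.

```python
# The arithmetic behind the "90% more" claim, assuming the 6-12% estimate
# quoted above. Illustrative figures only, not measurements from any facility.

for compute_share in (0.06, 0.12):
    overhead_share = 1 - compute_share  # cooling, redundancy, fail-safes, idling
    print(f"If {compute_share:.0%} of the energy does the computing, "
          f"about {overhead_share:.0%} goes on keeping everything instantly available.")

# If 6% of the energy does the computing, about 94% goes on keeping everything instantly available.
# If 12% of the energy does the computing, about 88% goes on keeping everything instantly available.
```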