Making Sense of Data Voids

Reckoning with Mis/Disinformation in 2024

August 7, 2024


In 2016, Michael Golebiewski (a project manager at Microsoft’s Bing) and I were watching a new type of adversarial search engine optimization (SEO) emerge, and trying to make sense of it. I was reading through online forums where people were talking about how to “pwn” (slang for dominate or defeat) Google by capturing terms that few people used. Over at Bing, Michael was concerned with how networks of conspiracy theorists, supplement salesmen, and hatemongers had started iterating on these techniques to capture search queries for attention, money, and ideological conversion. While we were both familiar with new brands’ tendency to seek out a unique URL and search term they could squat on, what we were seeing now appeared a bit different. Those who were building digital traces around novel terms weren’t aiming to drive people to a single website, nor did they seek fame as the creators of the terms. Instead, they were trying to create an ecology around the terms, as if inviting people to come down the rabbit hole with them. Recognizing that what was being exploited was an absence of available content, and that a black hole seemed to open up as a result, Michael labeled this phenomenon “data voids.” (Neither of us realized at the time that the term rhymed with my name.) 

In our report Data Voids: Where Missing Data Can Easily Be Exploited, Michael and I tried to map out the adversarial moves that were leveraging gaps in the data infrastructure. What intrigued us was how data voids appeared to be a security vulnerability in the collective knowledge fabric that search engines depend on. I had yet to discover the scholarship on “agnotology,” an academic term for the study of ignorance. Scholars from that lineage identify three types of ignorance: that which we do not yet know, that which is forgotten or lost, and that which is strategically polluted. Search engines aren’t designed to grapple with ignorance; they are designed to make knowledge available. Thus, a fourth type of ignorance is notable: that which is not in a machine-readable, publicly available digital format. When a search engine seeks to give people viable data in response to their query, this type of ignorance becomes a vulnerability. At the time, search engines never said “there is not enough information here.” They were designed to try to return something, anything.

What search is (and isn’t) for

People approach search engines for a variety of reasons. Depressingly, the most common use case is equivalent to typing in a URL. For example, the most popular search term is “YouTube” – people search for that rather than typing http://www.youtube.com/ into the browser. But people also regularly seek out information through search. Another popular search term is “weather.” Presumably, people aren’t searching for the weather in Galveston, Texas, on September 9, 1900. Perhaps they are looking for a general-purpose weather site. More likely, they’re looking for the current weather in their location. This motivates search engines to keep tabs on current weather information from around the world and also prompts companies to try to discern a searcher’s location. Search engines are designed to be useful — and so if a user wants the weather, the search engine’s designers want to be able to provide it.

Most people think of search engines as a tool for seeking information. Indeed, students search for information they need for schoolwork. A wide range of people might search for recipes to cook with, places to stay on vacation, and services that might be useful for them. Some might turn to search engines to make sense of something they heard in passing, like a reference on the radio or a concept they encountered on a different website. People also search out of curiosity or boredom, wandering from page to page on the internet, where their path might be shaped by what they see. There is a tendency to focus on the negative dynamics involved in this: people staying up long past their bedtime because they can’t resist exploring; people “doomscrolling” or traveling down “rabbit holes” to find more and more dubious information. And, of course, people stumbling their way through toxic content and disinformation, presumably “radicalized” by their journey. These things do happen, and we documented some of them in our report — and in others at Data & Society. But what makes these explorative search dynamics complicated is the intersection between the structure of the available content, the algorithms people encounter (such as with search), and the state of mind of the person who is taking a tour of the internet.

Media effects and their limits

With the development of every new technology — from radio to TV to video games — there has been a moral panic about how a particular technology is causing harm. This is often referred to as a theory of “media effects.” Debates rage within scholarly communities about the potency or accuracy of media effects theory. On one hand, people don’t typically become serial murderers after watching a documentary about them. On the other, people are clearly influenced by some of the content to which they are exposed. 

Perhaps the most notorious studies of media effects concern suicide ideation. Long before the internet, it became clear that when someone was contemplating suicide, exposing them to news articles about people who died by suicide could mentally push them over the edge. Journalists developed practices to avoid amplifying coverage that could serve as such a trigger. But the problem never went away, and countless stories of suicide completion can be linked back to the publicized deaths of celebrities or the broadcasting of certain TV shows that included content about suicide. 

More recently, scholars have repeatedly asked questions about the role that social media and search engines play in suicide ideation and completion: Should a search engine be held liable for content that someone encounters that might increase their likelihood of dying by suicide? 

While suicide is the most extreme version of this conversation, there are also important questions to be asked about the process of “radicalization” through online exposure. Sociologists have discovered that when people feel isolated, lonely, disillusioned, or otherwise disconnected from the social fabric around them, they are more vulnerable to joining gangs, cults, and terrorist organizations. This same state of mind also prompts people to turn to religion, multi-level marketing, and sports. In short, people seek a sense of belonging — and where they find that varies tremendously, based in no small part on what’s available to them and who reaches out when they are feeling low. 

What worried us as we were watching conspiracy theorists and white nationalists discuss strategies to exploit data voids was that they were targeting vulnerable populations who were seeking a sense of community. And they were using content that might be innocuous in some contexts, but downright dangerous in others. 

Structural vulnerabilities

The #techlash was already in full swing when Michael and I were trying to articulate the structural vulnerabilities we saw in the web of information — and to describe the ways adversarial actors could and did exploit them. In trying to make sense of data voids, we weren’t just worried about how conspiracy theorists might be amplifying their messages; we were worried about how they might proselytize and recruit. We were also concerned by how quickly they could jump on breaking news situations and weave a conspiracy before anyone with actual knowledge knew enough to properly report on the situation. 

We had good reason to be concerned. At the time, QAnon was gaining traction and people were turning to Q forums to find community and “facts.” The toxic combination of lonely people encountering conspiracy theories alongside a vibrant community would only get exponentially worse. Moreover, with each news cycle, we saw new waves of manufactured content responding to real-time incidents. Journalists couldn’t compete with those hellbent on using breaking news moments to provide fabricated information. 

While the well-known and web-savvy conspiracy theorist Alex Jones would eventually be held liable in multiple defamation suits for the web of conspiracies he wove, attention during the #techlash quickly shifted from the actors who manipulated technical systems to the companies that created the algorithms. Companies were called on to “fix” the algorithms that led people to toxic content and radical communities. 

In 2024, María Angel and I described how policymakers embraced the logics of technological solutionism that were endemic to the tech industry. They turned these around to demand technological solutionism for good, or what we called “techno-legal solutionism.” Since algorithms were blamed for radicalizing people, policymakers started calling on tech companies to design better algorithms that would not radicalize people. Unfortunately, this effort was rooted in false logic about media effects.

While the term “sociotechnical” is increasingly being used (and misused) in conversations about technology and society, we do not yet have robust frameworks for envisioning sociotechnical accountability. Conversations tend to shift from focusing on what tech companies must do, to what lawmakers should do, to user responsibility. None of it offers a satisfying way to grapple with problems in which social and technical systems are so deeply entwined. 

Sociotechnical exploitation

Michael and I were fascinated by how media manipulators actively exploited a system that was vulnerable because it was designed with certain assumptions about how content generated outside of its purview is structured. Search engines were never designed to be exploited in this way, but addressing this vulnerability was not as simple as changing the algorithm. After all, data voids are a type of sociotechnical exploitation that stems from the entanglement between people, practices, technology, and companies.

A search engine is powerful because the public comes to rely on it; it is socially constructed to be powerful. And once it is powerful, those who are seeking to shape the arrangement of power target the system for their own agendas. They find and exploit vulnerabilities, triggering a game of whack-a-mole with those who are responsible for the system. Meanwhile, as users approach that system with all sorts of expectations and intentions (not to mention mental states), they become enrolled in these competing efforts to control the information. Everyone has a role to play in this script, but it makes creating a stable and healthy state nearly impossible. 

With the data voids project, Michael and I explored different ways to address the problem. We briefed product teams inside tech companies to support them in developing strategies to tackle data voids; many have tried. (Google went so far as to rename the problem “information gap” to avoid using a name created by a Microsoft employee.) We hosted events with comedians, content creators, and civil society advocates to help create new content to address specific data voids. We also tried to shed light on how the problem is not simply one of bad algorithms, but indicative of how any sociotechnical arrangement can and will be exploited. Given how much of the conversation has devolved into “tech is good” vs. “tech is bad,” I think we’ve been less successful on this front. As the #techlash unfolded, most critics wanted to see any negative activity involving tech platforms as the product of malfeasant technologists rather than the interaction between imperfect systems and manipulative actors hellbent on getting their way. No one wanted to contend with the breeding grounds of conspiracy theorists and media manipulators, let alone the complex sociotechnical entanglements involved in data voids.

Manipulation of data infrastructures and algorithmic systems continues to be an endemic problem, affecting search engines as well as social media, AI, and all aspects of algorithmic decision-making. The current wave of generative AI systems is complicating long-standing algorithmic systems, including search engines, creating new challenges for those in charge of maintaining the algorithms as well as those who simply want a good result. We can’t simply ignore these dynamics — or blame them exclusively on those who are designing the technology. The exploitation of data voids is one type of sociotechnical attack. The “data voids” project shows one way to grapple with the entanglement of algorithmic systems, data, and media manipulators. I hope that those who are trying to understand where systems go terribly awry as people exploit them will build off of the work we did.