An experimental history project helps clarify the promise and the danger of treating AI as a laboratory for the past.
March 4, 2026
This post is part of a series drawn from my ethnographic research on how AI systems are being taken up in scientific practice. My aim is more descriptive than evaluative: to document how scientists decide what to trust and what to consider “good enough” as new tools become part of the routine of doing science. These posts stay close to the scenes where such judgments are made: lab meetings and draft manuscripts, tool demos and troubleshooting sessions, arguments over validation, and small acts of verification that rarely make it into published accounts. If you come to this series with suspicion about the usefulness of AI, you are not alone. If you come with enthusiasm that these tools will inevitably transform science for the better, you are not alone either. I’m not asking you to suspend your critical faculties. I’m asking you to treat this series as an experiment in attention to practice: to watch how AI becomes ordinary in research, how standards of evidence and “good enough” judgment get renegotiated in the process, and what kinds of authority, responsibility, and constraint travel with these systems as they move from novelty to infrastructure.
Imagine being able to gently perturb literary history: delay a landmark novel by ten years, remove a major prize, or introduce a new genre or auteurial style a bit earlier, and then watch how the ecosystem of fiction reorganizes around that change. That is the animating question behind a current project that explores using large language models as a kind of laboratory for “experimental history” in contemporary literature. The project, conducted by a digital humanities scholar whom I’m calling Andrew, puts a familiar scientific impulse into an unfamiliar setting: treating models as spaces where counterfactuals can be staged and inspected. Rooted in that training, Andrew has recently been drawn to a provocation: how would literary history change if a landmark book did not exist at all, and can we even ask that question in a disciplined way?
By “experimental history” here, he does not mean speculative fiction or playful alternate timelines. The aim is more austere and methodological. Can we build models that behave in ways that are recognizably constrained by a particular historical moment, and then use those models to ask structured “what if” questions about how literary worlds evolve? The project sits at the edge of both digital humanities and AI research, and it exposes some surprisingly basic problems in how today’s AI models understand time.
Andrew starts from a comparatively tractable piece of the literary field: prize-nominated contemporary fiction. Prizes provide a visible, if imperfect, signal of value. They create a small, curated set of novels that critics and juries have already agreed to “stand for” something about their moment. Using this corpus, Andrew constructs a kind of map of contemporary fiction, where books cluster together or diverge based on style, genre, voice, and theme. As implemented, this “map” depends on a technical routine: representing each text as a numerical vector that a system can compare, then clustering those vectors into neighborhoods of resemblance that track familiar differences in genre and style. The first technical question is empirical: can a model, with enough scaffolding, generate first chapters that land in roughly the same part of that map?
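A minimal sketch of that routine, under assumptions of my own: an open-source embedding model from the sentence-transformers library and k-means clustering from scikit-learn. Andrew’s actual representations, measures, and corpus are his own; the titles and texts below are placeholders.

```python
# Sketch of the "map": embed first chapters as vectors, then cluster
# them into neighborhoods of resemblance. Titles and texts stand in
# for the prize-nominated corpus.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

chapters = {
    "Novel A": "Opening pages of a shortlisted literary novel...",
    "Novel B": "Opening pages of a genre-inflected prize novel...",
    "Novel C": "Opening pages of another shortlisted debut...",
    "Novel D": "Opening pages of an experimental first novel...",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = encoder.encode(list(chapters.values()))

# Two clusters only makes sense for this toy corpus; a real map would
# involve many more books and a more careful choice of k.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(vectors)
for title, label in zip(chapters, labels):
    print(f"cluster {label}: {title}")
```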
Early work suggests that, in a narrow sense, the answer is yes. With prompt designs that include invented author biographies, explicit planning steps, and instructions to follow the conventions of a prize novel, Andrew can get a model to produce text that looks computationally similar to real short-listed books. On the particular quantitative measures he is using for this “map,” the generated chapters tend to land in the same neighborhood as the books they are meant to resemble. This is neither a claim about literary value nor, by itself, a claim about reproduction. It is a claim about recognizability within a chosen representation of style and genre. This is the “wow” moment that tends to surface in lab meetings: the realization that imitation at this level is already feasible.
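Read one way, “lands in the same neighborhood” is a nearest-neighbor check in that vector space. A hedged sketch, again assuming sentence-transformers rather than whatever measures the project actually uses:

```python
# Check whether a generated chapter sits near the real books it was
# meant to resemble. All texts are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

real_chapters = [
    "Opening pages of shortlisted novel one...",
    "Opening pages of shortlisted novel two...",
]
generated = "Opening pages produced by the scaffolded prompt..."

real_vecs = encoder.encode(real_chapters, convert_to_tensor=True)
gen_vec = encoder.encode(generated, convert_to_tensor=True)

# High cosine similarity to the target books is what "landing in the
# same neighborhood" amounts to in this toy version.
scores = util.cos_sim(gen_vec, real_vecs)[0]
for text, score in zip(real_chapters, scores):
    print(f"{score:.3f}  {text[:40]}...")
```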
But that moment also marks the beginning of a more interesting difficulty. For experimental history, it is not enough for a model to write “good” literary fiction in a generic way. Time has to matter. A first novel that appears in 2005 sits in a very different world than one that appears in 2015: different geopolitical events and crises dominate the news, different technologies saturate everyday life, different social questions are being asked about identity and belonging, and different stories about “the novel” circulate in criticism and publishing. If a model is to serve as a stand-in for that period, its outputs have to be responsive to those shifts.
One naive approach is straightforward: tell the model what year it is. Ask it to write the first chapter of a prize-worthy novel in 2005, then in 2015, and see what changes. In practice, this kind of time-tagging does remarkably little. The model produces plausible fiction for both prompts, but the differences do not clearly reflect the kinds of historical shifts that literary scholars actually debate. The outputs are temporally flavored in small ways, but they do not yet feel anchored in a specific sensibility or conjuncture. And when Andrew pushes further back, the difficulty becomes harder to ignore. Asking a model to “sound like the 1920s” quickly runs into the fact that past discourse is at times more openly racist than what contemporary systems will readily reproduce, which means that today’s guardrails and refusals become part of the simulation. Time, in other words, is entangled with the normative conditions under which these systems are allowed to speak.
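For concreteness, the naive version looks something like the following; the client and model name are illustrative stand-ins, not details of Andrew’s setup:

```python
# Naive time-tagging: the year enters only as a label in the prompt.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

def first_chapter(year: int) -> str:
    prompt = (
        f"It is {year}. Write the opening chapter of a literary novel "
        "that could plausibly be shortlisted for a major prize that year."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# In practice, the differences between these outputs are small and
# rarely track the historical shifts scholars actually debate.
chapter_2005 = first_chapter(2005)
chapter_2015 = first_chapter(2015)
```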
A conversation I had with another researcher captured this problem in a different register. She mentioned, almost in passing, that if you ask a large language model what time it is, it will often fail to give you the correct answer. But if you ask it to write the code for a digital clock, it will produce a program that, once run on a computer, tells the time perfectly, down to the second. The model itself is not “in” time; it has no access to the current moment. What it can do, instead, is assemble a procedure that hooks into an external system that does keep time.
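The contrast is easy to make concrete. The clock program is trivial; the point is that it delegates timekeeping to the machine it runs on. A minimal version of what a model might produce:

```python
import time

# The model does not know the current time, but it can write a loop
# that reads it from the system clock of whatever machine runs this.
while True:
    print(time.strftime("%H:%M:%S"), end="\r", flush=True)
    time.sleep(1)
```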
Andrew’s project is chasing a literary analogue of that distinction. Asking a model to “write like it’s 2005” is a bit like asking it for the current time: you get something that looks roughly right, but it is drawing on a hazy, averaged sense of “contemporary fiction” rather than on any situated experience of that year. Building a historically constrained simulator is closer to the digital clock: you have to specify the mechanisms and boundaries that make a particular temporal world available, and then let the system run under those constraints.
This is not an accident. Large language models, as they are currently built, are not designed to be situated in time. They are trained on text spanning many decades (and sometimes centuries) and then fine-tuned to perform as if they were speaking from an eternal, relatively context-free present. When we ask a model to “go back” to 2005, it does not forget what happened afterwards; it merely layers a few adjustments on top of a broad sense of what contemporary fiction looks like. The result is a kind of historical blur: outputs that are locally plausible but out of sync with any particular moment. In some cases, that blur is compounded by safety tuning: the system is not only temporally unsituated, it is also configured to resist reproducing certain kinds of language and hierarchy that were, historically, ordinary.
One response Andrew has been exploring is to treat time as a constraint rather than a label. This can mean restricting what the model is allowed to see, so that a “1920s” model is not implicitly drawing on postwar prose styles, later theoretical debates, or social movements that had not yet coalesced. It can also mean giving the model richer, human-curated snapshots of a year (brief sketches of political events, fashion and iconography, publishing trends, and shifts in public discourse) and testing whether these compressed worlds help to produce fiction that feels more historically grounded. Neither strategy is straightforward, but both force us to be explicit about what we are asking the model to hold constant and what we are letting it vary.
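One way to make both strategies concrete is a retrieval-style setup: filter what the model can see by date, and inject a curated snapshot of the target year into the prompt. The records, field names, and snapshot text below are hypothetical, and Andrew’s implementation (which might instead restrict the training data itself) may look quite different:

```python
# Time as a constraint rather than a label: restrict the visible corpus
# by publication year and supply a hand-curated snapshot of the period.
corpus = [
    {"title": "Novel from 1921", "year": 1921, "text": "..."},
    {"title": "Novel from 1927", "year": 1927, "text": "..."},
    {"title": "Novel from 1954", "year": 1954, "text": "..."},
]

CUTOFF_YEAR = 1929  # nothing after this year is allowed in
visible = [doc for doc in corpus if doc["year"] <= CUTOFF_YEAR]

SNAPSHOT = (
    "Year: 1928. Political events: ... Publishing trends: ... "
    "Fashion and iconography: ... Public discourse: ..."
)

prompt = (
    f"Context, a curated snapshot of the period:\n{SNAPSHOT}\n\n"
    "Drawing only on this context and the excerpts below, write the "
    "opening chapter of a novel of the period.\n\n"
    + "\n\n".join(doc["text"] for doc in visible)
)
print(prompt)
```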
Even if we solved the technical problem of enforcing time-relevant outputs, a second set of questions quickly appears: whose history is being simulated? Prize lists themselves encode exclusions. The archives that feed model training do, too. A system that claims to rerun literary history may end up amplifying the same hierarchies of language, genre, and geography that shaped the original record. Experimental history is therefore not only a problem of alignment between year labels and prose style; it is also a problem of whose literary world becomes legible enough to simulate.
This is where the project has been especially generative for my own work on understanding the use of AI in scientific practice. At first glance, Andrew’s questions might look far removed from a biology lab or a climate modeling group. But the core move — using models as spaces where we can probe counterfactuals that are otherwise impossible to stage — recurs across domains. In the sciences, researchers increasingly treat models as experimental infrastructures: places where they can test interventions that would be too slow or too expensive in the world. Andrew’s insistence on time-relevance makes visible what is often backgrounded in those conversations: that models encode particular histories of data, practice, and oversight, and that those histories matter for what counts as a meaningful simulation.
In our follow-up conversations, we often return to the question: what would count as “good enough” validation for experimental history? It is one thing for a single model-generated novel to feel plausible to a reader; it is another for a whole simulated decade of novels to reproduce something like the diversification, retrenchment, backlash, or consolidation that we see in the historical record. A point of agreement in these discussions is that believability is not the same thing as accuracy. Some kinds of simulation are valuable because they are believable enough to support sandbox testing of hypotheses, even if they fall well short of the historian’s standard of historical truth. The evaluative tension here is partly about which of those standards is actually in play, and what gets lost when they are substituted for one another. That kind of comparison pulls the project away from spectacle (“look, the model can write like a prizewinner”) and toward a more modest, but to my mind more interesting, role: using simulations to surface and test our assumptions about causality in cultural change.
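One hedged way to imagine the decade-level comparison mentioned above, reusing the cluster “map” from earlier: compare how the real and simulated corpora distribute across stylistic clusters, year by year. The counts below are invented for illustration only:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical counts of novels per stylistic cluster in one year, for
# the historical prize corpus and a simulated decade's output.
real_2005 = np.array([12, 4, 9, 2])
simulated_2005 = np.array([10, 6, 8, 3])

def to_distribution(counts: np.ndarray) -> np.ndarray:
    return counts / counts.sum()

# Jensen-Shannon distance with base 2: 0 means identical cluster mixes,
# 1 means maximally different. Tracking this year by year is one crude
# way to ask whether a simulated decade reproduces the shape of the
# historical record.
distance = jensenshannon(to_distribution(real_2005),
                         to_distribution(simulated_2005), base=2)
print(f"2005 divergence: {distance:.3f}")
```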
Andrew’s work is an invitation to treat models as infrastructures that shape how we imagine evidence and intervention in the context of historical time. It also underlines how much human judgment and disciplinary expertise are required to make these simulations intelligible. Models can generate endless alternate literary histories. Deciding which of those scenarios are historically responsible, analytically useful, or ethically defensible remains a human task.
My hunch is that we will never be able to truly rerun the twentieth century to see what happens if a single book disappears or a prize never comes into being. But in trying to approximate that kind of experimental flexibility — and in being explicit about the limits and politics of these approximations — projects like Andrew’s help clarify both the promise and the danger of treating AI as a laboratory for the past. They give us a concrete place to ask, in public and in detail, what kinds of histories we are willing to simulate, and on whose terms.