Simulated Flies: How Behavior Models Become Scientific Instruments

“If the model appears to ‘care’ about a particular cue, can that cue be manipulated in a real fly and the predicted shift observed?”

February 18, 2026

This post is part of  a series drawn from my ethnographic research on how AI systems are being taken up in scientific practice. My aim is more descriptive than evaluative: to document how scientists decide what to trust and what to consider “good enough” as new tools become part of the routine of doing science. These posts stay close to the scenes where such judgments are made: lab meetings and draft manuscripts, tool demos and troubleshooting sessions, arguments over validation, and small acts of verification that rarely make it into published accounts. If you come to this series with suspicion about the usefulness of AI, you are not alone. If you come with enthusiasm that these tools will inevitably transform science for the better, you are not alone either. I’m not asking you to suspend your critical faculties. I’m asking you to treat this series as an experiment in attention to practice: to watch how AI becomes ordinary in research, how standards of evidence and “good enough” judgment get renegotiated in the process, and what kinds of authority, responsibility, and constraint travel with these systems as they move from novelty to infrastructure.

Neuroethology is, at its core, a discipline built around understanding how nervous systems generate natural behavior in the settings where that behavior matters. It asks how an organism’s sensory world, bodily constraints, and ecological pressures co-produce the patterns we recognize as competence: navigation, pursuit, avoidance, courtship, rest. The animating question behind ongoing research conducted by someone I’m calling Rachel, a computational biologist working at the intersection of neuroscience and machine learning, turns this tradition sideways, and toward the current moment in AI for science. Her work asks: if we can now train artificial neural networks that behave in ways that look recognizably animal-like, what would it mean to do neuroethology on these networks themselves? 

Rachel’s work is a particularly sharp illustration of how AI is reshaping scientific practice, because she treats generative modeling as the beginning of her inquiry. A behavior model becomes a scientific instrument only if it can be interrogated in something like the way neuroethologists interrogate animals: by asking what it “knows,” what it systematically ignores, what internal variables it constructs to act, and which failures reveal the boundaries of its competence. Instead of using AI as a tool to classify animal behavior, Rachel is exploring whether simulated behavior can be used to generate hypotheses about real behavior that can be tested back on living animals. In her framing, the model becomes a kind of simulated organism, complete with a designed sensory world and an internal organization that has to be studied in its own right. In what follows, I trace a recurring arc in her work: collect and compress behavior at scale, train forecasting models that can generate realistic trajectories of movement and interaction, and then confront the harder problem of evaluation and interpretation — the work required to make simulated competence answerable to experimental questions. 

The difficulty is that “animal-like” is an unstable achievement. Behavioral prediction is one of those problems where it is surprisingly easy to look right for the wrong reasons. If a fly is walking forward, a next-step predictor can score well by learning inertia. If a social interaction has long stretches of relative stasis punctuated by short bursts of pursuit or contact, a generative model can appear realistic by reproducing the dominant tempo while missing the decision-relevant moments. This is why Rachel often reiterates a consequential distinction: “models can be competent without being illuminating.” Across computational ethology, and across the newer wave of generative modeling for behavior in particular, researchers have gotten better at training systems that generate plausible trajectories of animal movement. They are not yet as good at making those systems explain themselves in ways that can drive experiment.

Rachel’s chosen partner species is the fruit fly. On paper, flies are modest creatures: roughly 100,000 neurons packed into something you can barely see, running on power similar to a watch battery. In practice, they navigate cluttered environments, find food, avoid predators, court mates, and learn from experience. This contrast between tiny nervous systems and robust competence is central to the question of what it would take to explain this competence without flattening behavior into a handful of statistics. 

Forecasting Behavior

Over the past decade or two, the disciplinary community Rachel works in has already transformed the recording side of this problem. Once, characterizing what a fly did meant an observer had to manually label its behavior. Now pose-estimation pipelines track dozens of body points across thousands of videos, and classifiers label behaviors automatically. In Rachel’s lab, at least for flies, these tools “mostly work.” The bottleneck has changed. They no longer struggle to get enough data; they struggle to know what to do with the data they have. Rachel described collecting tens of thousands of videos and then compressing them into trajectories, only to realize that even condensed, the data remain “more than we know what to do with.” The classic data-science move of reducing complexity into population-level statistics and 2D plots felt to her like “losing a lot of the level of detail that’s in this data.” I have encountered this frustration in other labs as well: visualization becomes a proxy for what counts as signal.

Her lab’s response has been behavior forecasting: training a model to predict what an animal will do next, given its recent past. Rachel emphasizes a deliberate analogy to large language models. Instead of predicting the next word from a prior sequence of words, they predict the next “frame” of behavior from a prior sequence of sensory and bodily states. The model does not get omniscient access to the scene. It gets a handful of coarse signals: a rough sense of what’s ahead (vision), whether it’s touching something (contact), where it is in the arena (position), and what its own body is doing from moment to moment (an internal sense of motion and posture). 

In practice, that analogy is instructive because it breaks in all the places that matter. Words are (mostly) discrete, while trajectories are continuous. Language comes with culturally stabilized units (words, sentences), while behavior arrives on multiple timescales at once. And the “good predictor” problem is more acute in motion data, because predicting the next frame as “same as the last frame” can look like success while teaching an observer nothing about the animal. A model can get “99 percent of the signal right” by simply repeating, which is exactly the kind of performance that is technically impressive and scientifically empty at the same time. In effect, the lab can produce simulated flies that walk, gesture, and even exhibit known behavior such as courtship or finding food. But what would it mean for this to be useful to a neuroscientist or theorist, rather than merely a convincing demonstration?

This is when evaluation becomes central, and where Rachel’s work offers a good anchor example. Consider a clip in which a simulated fly passes straight through a live fly. On one level, it is a simple tell that the model is not grounded in physics. But Rachel frames it more subtly: the “real” flies cannot perceive the simulated fly, so the collision is also an artifact of mismatched interactional worlds. The generative model is being asked to inhabit a social ecology in which one side is blind to the other. In that setting, failure can be the product of experimental arrangement, not simply model deficiency.

Demonstrating Scientific Usefulness

When the obvious question of being able to tell the difference between real and simulated flies comes up, Rachel answers candidly. Biologists (including herself) can sometimes tell the difference, but trained discriminators can do better. Behavior classifiers can compare rates of chasing or other motifs. But even as she lists these options, she marks evaluation as “an open question for a lot of these generative models,” which is another way of saying that the field does not yet have a shared account of what should count as a convincing demonstration of scientific usefulness. This is also where her project differs from research where believable outputs are sufficient for sandbox exploration. Rachel needs something closer to accuracy in the strong sense, because the wager is that whatever the model “cares about” can be tested back on real flies.

Rachel shared this slide during a presentation that made the focus on accuracy explicit. On the left, the outline looks like a conventional pipeline: collect a big behavior dataset, train models that behave “like” animals, figure out what the model gets right, interrogate the model to “understand” how it is doing this, formulate hypotheses, design experiments to test them. But the right side punctures any sense that this is a smooth path forward. Next to each step Rachel adds a “confidence that we know how to do this” score, and the numbers drop sharply as soon as the work shifts from building models to making them explanatory: 95/100 for collecting a dataset, 50/100 for training behavior-like models, then 10/100 for figuring out what the model gets right and 10/100 for interrogating how it does it, before falling to 5/100 for hypothesis formation and 5/100 for experiment design.

What Rachel is describing through this slide is a recursive loop in which evaluation and interpretation force a return upstream, changing what gets recorded and what constitutes a meaningful test. In this sense, her slide is also an argument about scientific labor. The community is becoming adept at producing simulated behavior. It is much less prepared for the interpretive and experimental craft that would turn that behavior into a mechanism that can be named and a hypothesis that can be tested.

This is where “neuroethology of artificial neural networks” becomes more than a metaphor. Classical neuroethology relies on an experimental posture; behavior is not simply observed and declared realistic. The animal’s world and body are varied and then the changes are tracked. Sensation is constrained. Affordances are altered. Noise is introduced. The goal is to locate the conditions under which a behavioral pattern holds, and the conditions under which it breaks. Rachel’s claim is that a trained model should be approached in the same spirit. If the model is a simulated organism, then the scientific question is not only whether it can generate fly-like behavior, but what cues it depends on to do so, and what internal structure is doing this work.

This posture immediately reframes the role of interpretability in AI systems. In many AI discussions, interpretability is treated as a way of narrating what a black box “might be doing.” Rachel treats it more like experimental instrumentation. To interrogate the model is to develop practical tests that can make its internal workings testable: methods for measuring what information it carries internally, deliberately altering those internal signals, and then checking whether the model’s behavior changes in a predictable way. In plain terms: the question is what is “inside” the model, and whether changing that internal variable changes what the model does.

Building Shared Evidentiary Norms

Exemplifying this approach, researchers at Anthropic have experimented with methods that can isolate an internal feature in a large model that reliably activates for an everyday, unambiguous concept such as the Golden Gate Bridge. The point of choosing something like the bridge was to ensure that the concept is stable enough for readers to agree on what it refers to. When tracked as a feature inside the model, it reliably turns on across many different prompts that mention the bridge, even across languages. And when it is deliberately boosted, the model’s outputs shift in predictable ways: it becomes more likely to talk about the bridge or to weave it into a response. In effect, the Golden Gate example shows what it would mean to treat interpretability as a set of tests that can detect and manipulate an internal variable. 

There is a disciplinary translation that sits underneath this approach. In machine learning, these readouts are often discussed as “linear probes.” In neuroscience, closely related moves have long been called “decoders.” The naming difference hints at what makes this work hard: there is not yet a shared, cross-field standard for what counts as evidence that an internal representation corresponds to a scientifically meaningful variable. A probe that correlates with “distance to other fly” may satisfy an ML audience as interpretability. For computational ethologists, it remains a story unless that representation can be manipulated to produce a predictable shift in behavior. Neuroethology of artificial neural networks, in this sense, is as much about building shared evidentiary norms as it is about inventing new tools.

Seen this way, Rachel’s simulated flies are a testbed for a new kind of inquiry. The point is to ask whether an artificial neural network trained on coarse sensory cues develops internal variables that resemble the intermediate constructs biologists already debate: attraction, arousal, orientation, threat sensitivity, social distance, or something else entirely. And to make those variables operational in a way that feeds back into experiment. If the model appears to “care” about a particular cue, can that cue be manipulated in a real fly and the predicted shift observed? 

If there is a broader lesson here for AI for Science, it is that data collection and model training are increasingly infrastructural, while the frontier is increasingly epistemic. The challenge is not only building systems that can generate plausible behavior, but building the interpretive and experimental arrangements that make those systems answerable to the questions scientists actually care about. Neuroethology of artificial neural networks points to a future in which “understanding the model” becomes an experimental method for converting simulated competence into hypotheses worth testing on animals. It also points to something more mundane and more difficult: the work of building shared standards for what counts as evidence, across fields that have different instincts about causality, correlation, mechanism, and explanation — and the labor of making those standards usable inside real research workflows.