Labs are rethinking AI research workflows as AI-driven autoresearch yields measurable gains and raises questions about autonomous systems.
In recent weeks, a viral experiment from Andrej Karpathy has turned autoresearch from a niche idea into a central talking point in the AI research community.
The origins of Karpathy’s autoresearch concept
Earlier this month, Andrej Karpathy, a prominent AI researcher, one of the founding employees of OpenAI, and former head of AI at Tesla, shared a striking experiment on X. He now works independently while running Eureka Labs, a project building a new kind of school for the AI era.
Karpathy, who has 1.9 million followers on X, is influential enough that almost any comment on AI spreads rapidly. This latest post, however, stood out because it showcased a hands-on system he built for automated research, which he dubbed “autoresearch”. The idea quickly captured the imagination of both practitioners and theorists.
In the experiment, Karpathy deployed an AI coding agent to run a sequence of tests aimed at improving the training of a small language model. Over two continuous days, the agent executed 700 experiments, systematically exploring training configurations to find better setups.
Across those experiments, the agent discovered 20 optimizations that improved training efficiency. Moreover, when Karpathy applied the same 20 tweaks to a larger, though still relatively small, language model, training ran 11% faster. This concrete gain underscored the practical potential of his approach.
From lab demo to potential new research paradigm
Karpathy described the framework as a general research engine for code and model optimization. Crucially, he emphasized that the autoresearch agent was not tuning itself but rather adjusting the training code and initial neural network parameters of a different, smaller AI model. That distinction matters for safety discussions, even if the implications for research workflows are profound.
He argued that such tools could reshape how leading labs run AI research. “All LLM frontier labs will do this. It’s the final boss battle,” Karpathy wrote on X. However, he acknowledged that scaling the idea from a 630-line Python project to a frontier model codebase that is orders of magnitude larger introduces major complexity.
Karpathy still framed the challenge as an engineering problem rather than a conceptual barrier. In his view, labs will spin up a swarm of agents, have them collaborate to tune smaller models, then progressively promote the most promising ideas to larger scales. Humans, he suggested, will “optionally” contribute at the edges, guiding and evaluating rather than hand-coding every modification.
Today, his implementation focuses on a single agent that iteratively improves a codebase along one path. In the future, though, he expects multiple AI agents to explore different hypotheses and experiments in parallel. He wrote that the next step for autoresearch is to become an asynchronous, massively collaborative environment for agents, designed to emulate a research community rather than a single PhD student.
Industry reaction and the Shopify test
The experiment quickly moved beyond theory when Tobias Lütke, cofounder and CEO of Shopify, decided to try the setup on company data. Lütke reported on X that he used the system to optimize an internal AI model, instructing the agent to improve both quality and speed. This made the concept tangible for enterprise applications.
According to Lütke, after letting the process run overnight, the agent conducted 37 experiments and delivered a 19% performance gain. He did not publish full technical details, but the result was impressive enough to fuel further excitement and speculation about commercial impact.
Karpathy later remarked that any metric that is reasonably efficient to evaluate can be targeted by such an agent swarm. Moreover, he noted that if a metric has a cheaper proxy, such as training a smaller network instead of a large one, it can still be incorporated. He urged technologists to consider whether their own optimization problems fall into this bucket.
Links to the dream and fear of self-improving AI
What truly captured public attention was how close this looked to the long-discussed idea of self-improving AI. Science fiction has often portrayed systems that rewrite their own code, while some modern researchers aspire to such capabilities and others fear them. The notion of recursive self-improvement has particular resonance in AI safety circles.
In those discussions, a key worry is that an AI could continually optimize its own architecture and training data in a loop. Over many cycles, this might trigger what some safety researchers call a “hard takeoff” or an “intelligence explosion.” In such a scenario, an AI could quickly surpass human cognitive abilities, making it challenging or impossible to retain meaningful control.
Karpathy’s setup, however, falls short of that idealized or alarming picture. The agent he used is not modifying its own training pipeline or changing its own internals. Instead, it is rewriting the training code and neural network settings of a different, simpler model. This separation keeps the current system within a more conventional optimization paradigm, though the direction of travel is clear.
Nevertheless, many observers interpreted the work as a preview of how labs might eventually orchestrate more autonomous systems. Moreover, by making agent-driven experimentation look both accessible and effective, the project could accelerate adoption of similar architectures, including more advanced agentic system optimization loops.
The Karpathy Loop and generalized agent patterns
Some analysts highlighted that the core pattern behind the project can be abstracted and reused. Janakiram MSV, principal analyst at Janakiram & Associates, wrote in tech outlet The New Stack that Karpathy had effectively defined a reusable loop. He labeled it “the Karpathy Loop”, suggesting a template for broader agent systems.
According to Janakiram, the loop has three essential elements. First, an agent must have access to a single file that it can freely modify. Second, it needs a single, objectively testable metric to optimize. Third, there must be a fixed time limit for each experiment, constraining how long the agent can run a given trial before reporting results.
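The three-element pattern can be sketched as a simple harness. The code below is a toy, self-contained illustration only: the "agent" is a random hyperparameter mutation rather than an LLM, the metric is a made-up objective, and the fixed per-experiment time limit is stood in for by a trial budget. None of the names or values come from Karpathy's actual implementation.

```python
import random

# Toy sketch of the "Karpathy Loop" pattern: one freely modifiable
# artifact, one objectively testable metric, a fixed experiment budget.
# The real system uses an LLM agent editing a training script; here the
# "agent" simply perturbs a numeric setting, purely for illustration.

def propose_edit(config: dict, history: list) -> dict:
    """Stand-in for the agent: mutate the editable artifact."""
    candidate = dict(config)
    candidate["lr"] *= random.uniform(0.5, 2.0)
    return candidate

def evaluate(config: dict) -> float:
    """Stand-in for a training run: a single scalar metric.
    Peaks at lr == 0.1, mimicking a tunable training setup."""
    return -abs(config["lr"] - 0.1)

def karpathy_loop(num_trials: int = 200, seed: int = 0):
    random.seed(seed)
    best = {"lr": 1.0}                     # the sole modifiable artifact
    best_score = evaluate(best)
    history = [("baseline", best_score)]
    for trial in range(num_trials):        # fixed budget per run
        candidate = propose_edit(best, history)
        score = evaluate(candidate)
        history.append((trial, score))
        if score > best_score:             # keep only measured improvements
            best, best_score = candidate, score
    return best, best_score

best, best_score = karpathy_loop()
```

The essential design choice is that the agent never grades itself: every candidate edit is accepted or rejected by the external metric alone.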
He also stressed that the instructions Karpathy embedded in his configuration file provide a strong model for how to talk to any AI agent. The plain text file carefully specified what the agent should do, which constraints applied, what it must not touch, and the stopping criteria. Moreover, it defined exactly how long each loop should run and when the agent must halt and summarize outcomes.
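An instruction file in this spirit might look something like the following. The wording, file names, and thresholds below are hypothetical, composed to match the structure Janakiram describes, and are not Karpathy's actual file.

```
GOAL: Reduce wall-clock training time of train.py while keeping
      validation loss within 0.5% of the current baseline.
YOU MAY: edit train.py only.
YOU MUST NOT: modify the evaluation script, the dataset, or this file.
BUDGET: each experiment runs for at most 10 minutes.
STOP: when the budget expires, record the metric, summarize what was
      changed and why, then begin the next experiment.
```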
Commentators argued that this style of precise prompt engineering is becoming a crucial skill. While the underlying models grow more powerful, effective control still relies on humans writing clear, structured directives that align the agent’s autonomy with concrete goals and boundaries.
Autoresearch versus existing AutoML approaches
Not everyone agreed that Karpathy’s work represented a breakthrough. Some critics said he had effectively rediscovered components of AutoML, a set of techniques that Google, Microsoft, and other AI labs have used for years. AutoML frameworks also run iterative experiments in search of better data, architectures, and hyperparameters.
Classic AutoML systems rely heavily on automated optimization loops and search strategies. They explore model architectures, tune hyperparameters, and sometimes select training data using random variations or evolutionary algorithms. However, they generally do not involve an AI agent that can read research papers, design new hypotheses, and write arbitrary code changes in response.
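That classic style of search can be illustrated in a few lines of random hyperparameter search. The objective function and parameter ranges below are invented stand-ins for a validation score, not any specific AutoML product; the point is that there is no agent reasoning, only blind sampling against a fixed metric.

```python
import random

# Toy illustration of a classic AutoML-style loop: random search over
# hyperparameters against a fixed objective, with no agent in the loop.

def validation_score(lr: float, batch_size: int) -> float:
    """Hypothetical proxy for model quality; peaks near lr=0.01, batch=64."""
    return -((lr - 0.01) ** 2) - ((batch_size - 64) / 256) ** 2

def random_search(num_trials: int = 500, seed: int = 0):
    random.seed(seed)
    best_cfg, best = None, float("-inf")
    for _ in range(num_trials):
        cfg = {
            "lr": 10 ** random.uniform(-4, 0),  # log-uniform learning rate
            "batch_size": random.choice([16, 32, 64, 128, 256]),
        }
        score = validation_score(cfg["lr"], cfg["batch_size"])
        if score > best:                        # no memory beyond the best
            best_cfg, best = cfg, score
    return best_cfg, best

best_cfg, best = random_search()
```

Unlike an LLM agent, this loop cannot explain why a configuration won, carry lessons across trials, or rewrite the training code itself, which is the gap Karpathy's critics and defenders are arguing over.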
Karpathy pushed back on comparisons that minimized the difference. He pointed to methods like neural architecture search, which emerged as a way to automate model design. In his view, earlier forms of this technique were weak compared to an agent that can reason over code, learn from past trials, and pull information from the internet.
He described historical neural architecture search as “such a weak version of this that it’s in its own category of totally useless by comparison.” Moreover, he emphasized that his system uses a large language model to write arbitrary code, interpret results from previous experiments, and adapt strategies on the fly, making it far more flexible than traditional AutoML and neural architecture search pipelines.
Looking ahead to agent swarms and broader impact
As attention builds, some researchers are exploring how the ideas behind Karpathy’s autoresearch experiment could be scaled up into full agent swarms. The vision is a network of specialized agents that divide tasks, cross-check results, and propose novel approaches, all while humans set high-level objectives and guardrails. This could transform both academic and industrial AI workflows.
However, scaling agent swarms raises open questions about safety, reliability, and governance. Observers concerned about recursive self-improvement risks warn that as these systems gain greater autonomy and influence over critical infrastructure, careful oversight will be essential. It will be crucial to maintain robust evaluation metrics and human review at each promotion step.
For now, Karpathy’s project remains a relatively contained illustration of how language models can conduct autonomous research experiments on modest codebases. Yet the reaction from figures like Lütke and analysts across the industry suggests that the underlying pattern may spread quickly, blurring the line between human researchers and autonomous agent collectives.
In summary, Karpathy’s autoresearch work demonstrates that a single well-configured agent can discover measurable performance gains in days, not months. Moreover, as labs push these techniques toward larger models and multi-agent swarms, they may unlock powerful new capabilities while also intensifying long-standing debates about autonomy, control, and the future direction of AI research.
en.cryptonomist.ch