The Problem With One-Off Agents

LLM-based agents are increasingly deployed to handle streaming tasks in production environments, but the dominant deployment pattern treats each task as an isolated problem. When a task completes, the knowledge generated during that interaction is discarded rather than stored in a form the agent can retrieve and apply later. This one-off behavior means agents repeat the same reasoning steps across similar tasks, wasting compute and failing to improve over time.

Existing attempts to address this limitation fall into three categories, each with its own shortcomings. Manual skill curation requires human intervention that does not scale. Heuristic skill operations prescribe fixed rules for when to store or retrieve knowledge, and those rules cannot adapt to novel task distributions. Existing training-based approaches focus on short-horizon skill operations and struggle to learn complex curation policies from indirect, delayed feedback, in which the value of storing a skill becomes apparent only several tasks later [1].

What SkillOS Is

SkillOS introduces a two-component architecture that separates skill application from skill management. The first component is a frozen agent executor, an LLM that retrieves skills from an external repository and applies them to incoming tasks. Because this component is frozen, its weights are not updated during training, which keeps the executor stable and allows the framework to isolate the curation problem.
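As a rough illustration of the executor side, the sketch below retrieves a few skills and prepends them to the prompt of a model that is only ever called, never updated. Everything here is an assumption made for illustration: the `Skill` type, the word-overlap retriever, and the prompt layout are hypothetical and are not taken from the SkillOS paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Skill:
    name: str
    body: str  # Markdown text of the skill


def retrieve(skills: List[Skill], task: str, k: int = 2) -> List[Skill]:
    """Rank skills by word overlap with the task (a stand-in retriever)."""
    task_words = set(task.lower().split())

    def overlap(skill: Skill) -> int:
        return len(task_words & set(skill.body.lower().split()))

    ranked = sorted(skills, key=overlap, reverse=True)
    # Keep at most k skills, and only those that share a word with the task.
    return [s for s in ranked[:k] if overlap(s) > 0]


def run_executor(llm: Callable[[str], str], skills: List[Skill], task: str) -> str:
    """Build a prompt from retrieved skills and call the frozen model."""
    selected = retrieve(skills, task)
    context = "\n\n".join(s.body for s in selected)
    prompt = f"Relevant skills:\n{context}\n\nTask: {task}"
    # The model is treated as an opaque callable: no weights are touched here.
    return llm(prompt)
```

Because `llm` is just a callable, any hosted or local model can stand in for it, which mirrors the idea that the executor stays fixed while the repository around it changes.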

The second component is a trainable skill curator, a separate model that observes the executor’s trajectories and decides how to update the external SkillRepo. The SkillRepo itself is an external store of structured Markdown files, each encoding a skill distilled from past experience. Over time, these files evolve to encode higher-level meta-skills as the curator learns which abstractions transfer across tasks [1].
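To make the repository concrete, here is a minimal sketch of a directory of Markdown skill files with store, merge, and discard operations. The class name `SkillRepo` follows the article, but the file layout, method names, and heading format are hypothetical choices, not the format used by the authors.

```python
from pathlib import Path
from typing import List


class SkillRepo:
    """Directory of Markdown skill files (hypothetical layout)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        return self.root / f"{name}.md"

    def store(self, name: str, body: str) -> None:
        """Write (or overwrite) one skill as a Markdown file."""
        text = f"# {name}\n\n{body.strip()}\n"
        self._path(name).write_text(text, encoding="utf-8")

    def read(self, name: str) -> str:
        return self._path(name).read_text(encoding="utf-8")

    def discard(self, name: str) -> None:
        """Delete a skill; a missing file is ignored (Python 3.8+)."""
        self._path(name).unlink(missing_ok=True)

    def merge(self, sources: List[str], target: str, body: str) -> None:
        """Replace several skills with one more general skill."""
        self.store(target, body)
        for name in sources:
            if name != target:
                self.discard(name)

    def names(self) -> List[str]:
        return sorted(p.stem for p in self.root.glob("*.md"))
```

In this sketch a curator would call `merge` when two narrow skills can be expressed as one broader one, which is one plausible reading of how files could grow toward meta-skills over time.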

How the RL Training Recipe Works

Generating useful learning signals for skill curation is the central technical challenge SkillOS addresses. The value of a curation decision, such as whether to store, merge, or discard a skill, is not apparent until a related task arrives later and either benefits or fails to benefit from that decision. This delayed feedback structure makes standard reward assignment difficult.

SkillOS handles this through two mechanisms. First, it uses composite rewards that combine signals from both the immediate quality of a skill update and the downstream performance of the executor on tasks that rely on those updates. Second, it trains on grouped task streams organized around skill-relevant task dependencies. Within each group, earlier trajectories update the SkillRepo, and later related tasks evaluate those updates. This grouping creates a structured credit assignment path that connects curation decisions to observable outcomes, giving the reinforcement learning signal enough resolution to train a long-horizon curation policy [1].
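The two mechanisms above can be sketched together as follows. This is a simplified illustration, not the paper's training loop: the linear weighting in `composite_reward`, the fixed split point `n_curation`, and the `curate` and `execute` callables are all assumptions, and the source does not state how the two signals are actually combined.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Repo = Dict[str, str]  # skill name -> Markdown body (in-memory stand-in)


@dataclass
class GroupResult:
    immediate: float   # mean quality score of the skill updates themselves
    downstream: float  # executor success rate on later related tasks
    reward: float      # blended value that would serve as the RL signal


def composite_reward(immediate: float, downstream: float, weight: float = 0.5) -> float:
    """Blend the two signals; the linear weighting is an assumption."""
    return weight * immediate + (1.0 - weight) * downstream


def score_group(
    tasks: Sequence[str],
    n_curation: int,
    curate: Callable[[Repo, str], float],
    execute: Callable[[Repo, str], bool],
    weight: float = 0.5,
) -> GroupResult:
    """Earlier tasks in the group update the repo; later ones evaluate it."""
    repo: Repo = {}
    # Phase 1: the curator edits the repo and each edit gets a quality score.
    quality = [curate(repo, t) for t in tasks[:n_curation]]
    # Phase 2: related tasks run against the updated repo.
    outcomes = [execute(repo, t) for t in tasks[n_curation:]]
    immediate = sum(quality) / len(quality) if quality else 0.0
    downstream = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return GroupResult(immediate, downstream,
                       composite_reward(immediate, downstream, weight))
```

Scoring a whole group at once is what ties a curation decision to the later tasks it affects: the reward for the first phase depends on how the second phase turns out.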

Performance Against Baselines

Across both multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms two classes of baselines: memory-free agents that receive no accumulated context, and strong memory-based agents that use existing skill storage methods without RL-trained curation [1].

The gains appear in both effectiveness, measured by task success rates, and efficiency, measured by the computational resources required to reach a given performance level. The researchers attribute the efficiency improvements in part to more targeted skill use: the learned curator selects and applies skills with greater precision than heuristic-based alternatives, reducing irrelevant context that would otherwise be passed to the executor.

Generalization and Transferability

A key practical question for any learned curation policy is whether it remains useful when the underlying executor changes or when the task domain shifts. The authors report positive results on both. The trained skill curator generalizes across different executor backbones, meaning a curator trained alongside one LLM can operate with a different LLM at inference time without retraining [1].
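In code terms, swapping backbones is straightforward if the curator only ever sees trajectories and never the model behind them. The sketch below shows that decoupling with a structural interface. The `Executor` protocol, the two toy executors, and the `Curator` stub are hypothetical, and the example illustrates the interface boundary only, not the reported generalization result.

```python
from typing import List, Protocol


class Executor(Protocol):
    def run(self, task: str, skills: List[str]) -> str:
        """Return a text trajectory for the task."""
        ...


class ExecutorA:
    def run(self, task: str, skills: List[str]) -> str:
        return f"[A] {task} using {len(skills)} skills"


class ExecutorB:
    def run(self, task: str, skills: List[str]) -> str:
        return f"[B] {task} ({len(skills)} skills available)"


class Curator:
    """Stand-in for a trained curator; it reads trajectory text only."""

    def update(self, skills: List[str], trajectory: str) -> List[str]:
        note = f"Lesson from: {trajectory}"
        return skills if note in skills else skills + [note]


def serve(executor: Executor, curator: Curator, tasks: List[str]) -> List[str]:
    """Run tasks in order, letting the curator update skills after each one."""
    skills: List[str] = []
    for task in tasks:
        trajectory = executor.run(task, skills)
        skills = curator.update(skills, trajectory)
    return skills
```

The same `Curator` instance can be passed to `serve` with either executor, since nothing in it depends on which backbone produced the trajectory.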

The curator also transfers across task domains, suggesting that the curation policy learns generalizable principles about skill quality and relevance rather than overfitting to the specific task distribution used during training. The skills stored in SkillRepo were observed to evolve into more richly structured Markdown files over the course of training, indicating that the curator progressively organizes knowledge at higher levels of abstraction [1].

Implications for Agent Engineering

For practitioners building agents intended to operate over long deployment horizons, SkillOS offers a concrete training recipe for moving beyond stateless inference. The separation of the frozen executor from the trainable curator means the curation capability can be added to existing agent pipelines without modifying the core LLM. The external SkillRepo is inspectable and editable, which preserves operator visibility into what the agent has learned.

The cross-backbone generalization finding is particularly relevant for teams that anticipate swapping underlying models as newer versions become available. A curator trained on one executor generation would not necessarily require full retraining when the executor is upgraded, reducing the operational cost of maintaining a self-evolving agent system [1].

FAQ

Q. Does deploying SkillOS require modifying the weights of the base LLM executor?
No. The executor is frozen during training, meaning its weights remain unchanged. Only the skill curator is trained, which allows SkillOS to be layered onto existing LLM deployments without altering the base model [1].

Q. What format does the SkillRepo use, and can operators inspect its contents?
Skills are stored as structured Markdown files in the external SkillRepo. Because the repository is external and human-readable, operators can inspect, audit, or manually edit stored skills without interfering with the curator’s training process [1].

Q. How does the grouped task stream training strategy handle tasks that have no clear skill-relevant dependencies?
The grouping is based on skill-relevant task dependencies, so tasks without strong dependency relationships would not be grouped together. The research does not specify a fallback mechanism for fully independent tasks, which represents an open question for practitioners with highly heterogeneous task distributions [1].

Q. Does the skill curator need to be retrained if the executor backbone is swapped for a newer model?
Benchmark results indicate the learned curator generalizes across different executor backbones, suggesting full retraining may not be required when upgrading the underlying LLM. However, the degree of performance retention across very different model architectures has not been fully characterized in the published findings [1].

Q. On what types of tasks has SkillOS been evaluated?
Evaluations cover both multi-turn agentic tasks, which require sequential decision-making across multiple steps, and single-turn reasoning tasks. SkillOS outperforms memory-free and memory-based baselines on both categories [1].

Key takeaways

  • SkillOS separates skill application (frozen executor) from skill management (trainable curator), allowing curation to be trained independently of the base LLM.
  • Composite rewards combined with grouped task streams provide the credit assignment needed to train long-horizon curation policies under delayed feedback.
  • The framework outperforms both memory-free and memory-based baselines on multi-turn agentic and single-turn reasoning benchmarks in effectiveness and efficiency.
  • The trained skill curator generalizes across executor backbones and task domains, reducing retraining costs when models or deployments change.
  • Skills in SkillRepo evolve into higher-level meta-skill representations over time, stored as inspectable Markdown files that operators can audit directly.
