Harness-1: Teaching Search Agents to Offload State

Search agents and the state problem

A search agent is an AI system that answers a question by iterating through multiple searches. Unlike a one-shot retrieval lookup, it reads intermediate results, adjusts its search strategy, compares candidate documents, and checks whether specific claims are actually supported by what it found. Tasks like analyzing financial filings, tracing multi-hop facts across sources, or interpreting complex regulations need this kind of iterative work. A single query won’t get you there.

LLM Observability Without LangSmith: Five Open-Source Tools Compared

At some point in building LLM applications or agents, you need to know why a call failed, what the tool invocation looked like, or why the agent got stuck in a loop. LangSmith, LangChain’s commercial observability platform, has been the default answer for this: it covers trace visualization, prompt versioning, and evaluation in one place. Its usage-based pricing and cloud-hosted architecture are what push teams to start looking for alternatives.

DeepSWE: A Benchmark for Long-Horizon Coding Agents

SWE-bench has been the default coding-agent leaderboard for a while, but it has well-known weaknesses. Most tasks come from existing public issues and PR patches, so a high score might partly reflect memorization. Most tasks are also single-file bug fixes, which is not representative of the multi-file, long-horizon work that a coding agent does in practice.

Mixture of Agents: How Layering Open-Source LLMs Beat GPT-4 Omni

Instead of scaling a single model up, what happens when you stack multiple models in layers and have each one refine the previous layer’s output? Together AI’s research team answered that in June 2024 with arXiv:2406.04692. Using only open-source models, their Mixture of Agents (MoA) configuration scored 65.1% on AlpacaEval 2.0, versus 57.5% for GPT-4 Omni.

Open Knowledge Format: A Shared Vocabulary for Agent Knowledge

When AI agents fail in production, the model is often not the problem. The missing context is. Table schemas, metric definitions, runbooks, join paths between systems, and API deprecation notices are scattered across catalog vendors, internal wikis, code comments, and personal notes. Every agent developer solves the same context assembly problem from scratch.

Qwen3.6-35B-A3B: Community Reviews, Uncensored Variants, and MTP Benchmarks

Alibaba released Qwen3.6-35B-A3B in April 2026: a 35B-parameter MoE model with around 3B parameters active per token, a 262K native context, and an official SWE-bench score of 73.4%. Two months in, it’s the most widely tested 35B-class model in the local LLM community.

Robot Learning: A Tutorial (From Classical Robotics to Generalist Policies)

“Robot Learning: A Tutorial” (arXiv:2510.12403) is a paper-length tutorial by Francesco Capuano, Caroline Pascal, Adil Zouitine, Thomas Wolf, and Michel Aractingi, from the University of Oxford and Hugging Face. It covers the full arc of robot learning methods, from classical dynamics-based control through reinforcement learning, imitation learning, and generalist vision-language-action models, using the Hugging Face LeRobot library throughout.

VibeThinker-3B: Packing Verifiable Reasoning into 3 Billion Parameters

“Small model beats big model” papers appear regularly. Usually the claim holds on a specific benchmark under specific conditions, not across the board. WeiboAI’s VibeThinker-3B, published June 15, 2026, follows a similar structure but draws a clearer boundary: the claim is not that a 3B model replaces a frontier generalist. The claim is that verifiable reasoning can be compressed into a small model, while open-domain knowledge and general dialogue still benefit from more parameters.

Future AGI: Evaluate, Observe, and Improve AI Agents in One Place

If you have shipped an AI agent, this will sound familiar. The demo runs fine. Then it hits production, the hallucinations start, and you can’t tell what went wrong or why. So you bolt on one tool for evals, another for tracing, another for guardrails. The real problem is that none of them talk to each other, so the loop you need to actually fix things never closes.
