Blog Posts

DeepSWE: A Benchmark for Long-Horizon Coding Agents

SWE-bench has long been the default coding-agent leaderboard, but it has well-known weaknesses. Most tasks come from existing public issues and PR patches, so a high score may partly reflect memorization. Most are also single-file bug fixes, which do not represent the multi-file, long-horizon work that a coding agent does in practice.

Mixture of Agents: How Layering Open-Source LLMs Beat GPT-4 Omni

Instead of scaling up a single model, what happens when you stack multiple models in layers and have each layer refine the output of the one before it? Together AI’s research team took up that question in June 2024 (arXiv:2406.04692). Using only open-source models, their Mixture of Agents (MoA) configuration scored 65.1% on AlpacaEval 2.0, versus 57.5% for GPT-4 Omni.

Open Knowledge Format: A Shared Vocabulary for Agent Knowledge

When AI agents fail in production, the model is often not the problem. The missing context is. Table schemas, metric definitions, runbooks, join paths between systems, and API deprecation notices are scattered across catalog vendors, internal wikis, code comments, and personal notes. Every agent developer ends up solving the same context-assembly problem from scratch.

Qwen3.6-35B-A3B: Community Reviews, Uncensored Variants, and MTP Benchmarks

Alibaba released Qwen3.6-35B-A3B in April 2026: a 35B-parameter MoE model with around 3B active per token, a 262K native context, and an official SWE-bench score of 73.4%. Two months in, it’s the most widely tested 35B-class model in the local LLM community.

Robot Learning: A Tutorial (From Classical Robotics to Generalist Policies)

“Robot Learning: A Tutorial” (arXiv:2510.12403) is a paper-length tutorial by Francesco Capuano, Caroline Pascal, Adil Zouitine, Thomas Wolf, and Michel Aractingi, from the University of Oxford and Hugging Face. It covers the full arc of robot learning methods, from classical dynamics-based control through reinforcement learning, imitation learning, and generalist vision-language-action models, using the Hugging Face LeRobot library throughout.

Secret Voting Architecture with FHE, SP1, and Groth16

On-chain secret voting creates three tensions at once. Votes must stay hidden while still being tallied. Off-chain computation cannot be trusted without proof, yet results need to land on-chain. And the EVM cannot run heavy cryptographic operations natively, but it still needs to verify them. FHE (Fully Homomorphic Encryption), SP1 zkVM, and Groth16 each take on one of these.

VibeThinker-3B: Packing Verifiable Reasoning into 3 Billion Parameters

“Small model beats big model” papers appear regularly. Usually the claim holds on a specific benchmark under specific conditions, not across the board. WeiboAI’s VibeThinker-3B, published June 15, 2026, follows a similar structure but draws a clearer boundary: the claim is not that a 3B model replaces a frontier generalist. The claim is that verifiable reasoning can be compressed into a small model, while open-domain knowledge and general dialogue still benefit from more parameters.

Future AGI: Evaluate, Observe, and Improve AI Agents in One Place

If you have shipped an AI agent, this will sound familiar. The demo runs fine. Then it hits production, the hallucinations start, and you can’t tell what went wrong or why. So you bolt on one tool for evals, another for tracing, another for guardrails. The real problem is that none of them talk to each other, so the loop you need to actually fix things never closes.