Qwen3.6-35B-A3B: Community Reviews, Uncensored Variants, and MTP Benchmarks

Table of Contents

Alibaba released Qwen3.6-35B-A3B in April 2026: a 35B-parameter MoE model with around 3B active per token, a 262K native context, and an official SWE-bench score of 73.4%. Two months in, it’s the most widely tested 35B-class model in the local LLM community.

This post compiles what people have found on Reddit, HackerNews, and personal blogs. These are individual experiences; results depend heavily on hardware, driver version, and settings. Independent replication is needed before drawing firm conclusions.

Specs

FieldValue
ArchitectureMoE (256 experts, 8 routed + 1 shared per token)
Parameters35B total / ~3B active
Context262K (extendable to 1M via YaRN)
ModalitiesText, image, video
LicenseApache 2.0

Official benchmarks (Alibaba-reported): SWE-bench 73.4%, GPQA 86.0, LiveCodeBench v6 80.7, MMLU-Pro 85.2, AIME 2026 92.7.

Uncensored variants

The base model has significant refusal behavior on certain topics. Several community variants strip or reduce that.

HauhauCS Aggressive

According to the r/hermesagent definitive variant guide, by June 2026 the HauhauCS Aggressive variant had over 1.22 million downloads and 761 likes, making it the most tested uncensored option. The creator reports 0 refusals across 465 tests with no capability loss. VRAM requirements are roughly 22 GB for Q4_K_P and 20 GB for IQ4_XS. The creator acknowledges sporadic topic drift in long agentic loops.

The community consensus from that guide is that the model answers exactly what it is asked and produces unusual outputs only when given unusual inputs — in short, behavior tracks the user’s prompts rather than the model’s own tendencies (see r/hermesagent guide).

Other variants

VariantTechniqueDownloadsNotes
Wasserstein (LuffyTheFox)Embedding-space Wasserstein distance455KDifferent uncensoring path; edge-case behavior may differ
heretic (llmfan46)Abliteration + decensor hybrid53KKL divergence 0.0015, 88% fewer refusals
huihui-ai AbliteratedPure abliteration19KCreator describes it as “crude, proof-of-concept”

Hermes Agent compatibility

What works well

From the r/hermesagent guide and the r/LocalLLM tool-calling test thread:

  • Tool calling: Improved stability over Qwen3.5. One independently measured MCPMark score of 37.0.
  • Coding: Codebase-wide analysis and modification gets consistently positive marks.
  • Reasoning: Deep reasoning traces praised for complex problems.
  • Value: Several users describe frontier-level performance locally at around 21 GB VRAM.

Simon Willison wrote on his blog that “on my laptop, Qwen3.6 drew a better pelican than Claude Opus 4.7”. A comment on HN reported solving 11 out of 98 Power Ranking tasks. Both are individual data points.

Known issues

Reported consistently across threads (see HackerNoon’s overview and the r/hermesagent guide):

  • Tool-call loops: The most-cited problem. The model repeatedly calls the same tool.
  • Topic drift: Shows up in long agentic runs; the creator of HauhauCS Aggressive acknowledges this.
  • Temperature sensitivity: temp=1.0 is widely recommended to reduce repetition and looping. Default 0.6-0.8 produces more loop behavior.
  • Code recall: Distilled variants show lower CodeNeedle scores, with potential for mistakes when reproducing code.
temp=1.0, top_k=20, presence_penalty=1.5, top_p=0.95
--jinja --reasoning-budget 4096 --spec-type draft-mtp  # if MTP enabled
enable_thinking: false  # if it interferes with tool call parsing

MTP acceleration benchmarks

Multi-Token Prediction (MTP) predicts multiple tokens at once to increase generation speed. Enable it in llama.cpp with --spec-type draft-mtp. Results differ substantially based on available VRAM.

12 GB VRAM (RTX 4070 Super)

From the r/LocalLLaMA K_P quants thread: settings -fitt 1536, --spec-draft-n-max 2, -ctk/-ctv q8_0.

  • Result: 70-82 tok/s at 128K context
  • Acceptance rate: 0.69 to 0.95 depending on the task
  • This combination makes a 35B-class model with 128K context viable at 12 GB VRAM.

16 GB VRAM (RTX 5080)

SetupSpeed
Q4_K_XL + MTP74 tok/s (acceptance ~79.5%)
Q4_K_XL, no MTP, short context97 tok/s
Q4_K_XL, no MTP, 128K context56 tok/s

At 128K context, prompt processing runs around 1,584 tok/s (about 81 seconds). MTP only helps when the full model fits in VRAM. If the MTP compute buffer forces MoE expert layers onto CPU, that bottleneck can make MTP slower than running without it, even with a high acceptance rate.

These numbers are from individual community members. Hardware, drivers, and settings all affect results.

Variant summary

VariantTypeDownloadsVRAMMTPNotes
Qwopus v1Reasoning Distilled299K~22 GBNot releasedtemp=1.0 recommended
lordx64 Opus 4.7Reasoning Distilled158K~22 GBVia APEXCleanest reasoning traces
hesamation Opus 4.6Reasoning Distilled206K~22 GBVia APEXMMLU-Pro 75.71% (70 questions)
HauhauCS AggressiveUncensored1.22M~22 GBNot releasedMost downloads, most tested
hereticAbliterated54K~22 GBBuilt-inKL divergence 0.0015
unsloth MTPVanilla+MTP548K~23 GBBuilt-inReference MTP implementation
mudler APEX MTPAPEX+MTP33K~18 GBBuilt-inBest quality-per-byte for MoE

Limitations and open questions

Community data means independent replication is needed before drawing strong conclusions. MTP and Vision (--mmproj) cannot run in parallel in llama.cpp. Qwopus + MTP and HauhauCS Balanced/Moderate variants are frequently requested but not yet released. MTP performance data for 24 GB+ GPUs is limited.

Further reading

References

Share :