license: apache-2.0 base_model:

  • Qwen/Qwen3.6-35B-A3B language:
  • en tags:
  • GGUF
  • llama.cpp
  • qwen3.6
  • qwen
  • quantization
  • turboquant
  • tq3_4s
  • multimodal
  • Mixture of Experts
  • conversational pipeline_tag: image-text-to-text

thumbnail

Qwen3.6-35B-A3B-TQ3_4S

GGUF quantization of Qwen/Qwen3.6-35B-A3B using TQ3_4S with mixed-precision MoE compression — 2-bit experts, 4-bit attention.

Files

File Description
Qwen3.6-35B-A3B-TQ3_4S.gguf Main model (12.4 GiB, 3.07 BPW)
mmproj-BF16.gguf Multimodal projector (BF16)

Quantization

MoE experts tolerate aggressive compression because only 8/256 are active per token. This quantization exploits that asymmetry:

Component Quant Rationale
Expert MLP gate/up Q2_K 98% of params, MoE-tolerant
Expert MLP down Q3_K Write-back sensitivity
Attention Q/K/V/O TQ3_4S WHT-protected
Embeddings + output Q6_K Quality anchor

Runtime Requirement

This model requires the public TurboQuant runtime fork:

./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek

With vision:

./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja --no-mmproj-offload \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek

Performance (RTX 5060 Ti 16GB)

Metric Value
PP512 1832 tok/s
TG128 107 tok/s
Size 12.4 GiB
BPW 3.07
ngl 99 (full GPU)

Fits entirely in 16GB VRAM — no CPU offload needed.

Quality

10/10 correct on standard QA benchmark (capital of France, 2+2, Python reverse string, gravity, WW2, primes, boiling point, Shakespeare, Jupiter, hello→Hola).

Base Model

License

Apache 2.0 — same as the base model.

Tool Call Validation

Tested with --jinja on both --reasoning off and --reasoning on --reasoning-budget 2048:

Test reasoning off reasoning on
Basic tool call trigger
Tool response → final answer (no loop)
Correct tool selection from multiple
No tool call for simple questions
Multi-step tool use
Nested quote escaping retry (no loop)
Total 10/10 10/10
--jinja --reasoning off --reasoning-budget 0 --reasoning-format deepseek

Avoid --presence-penalty above 0.5 for tool-use — high values diversify reasoning tokens but don't improve structured JSON output, and can cause repeated near-identical tool calls in agent loops.

If using --reasoning on, ensure your agent framework detects consecutive identical tool calls and breaks after 2-3 retries.

Run tests yourself

chmod +x test_tool_calls.sh
./test_tool_calls.sh 8085