Qwen3.6-35B-A3B-TQ3_4S

GGUF quantization of Qwen/Qwen3.6-35B-A3B using TQ3_4S with mixed-precision MoE compression — 2-bit experts, 4-bit attention.

Files

File	Description
`Qwen3.6-35B-A3B-TQ3_4S.gguf`	Main model (12.4 GiB, 3.07 BPW)
`mmproj-BF16.gguf`	Multimodal projector (BF16)

Quantization

MoE experts tolerate aggressive compression because only 8/256 are active per token. This quantization exploits that asymmetry:

Component	Quant	Rationale
Expert MLP gate/up	Q2_K	98% of params, MoE-tolerant
Expert MLP down	Q3_K	Write-back sensitivity
Attention Q/K/V/O	TQ3_4S	WHT-protected
Embeddings + output	Q6_K	Quality anchor

Runtime Requirement

This model requires the public TurboQuant runtime fork:

https://github.com/turbo-tan/llama.cpp-tq3

Recommended Settings (16GB VRAM)

./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek

With vision:

./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja --no-mmproj-offload \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek

Performance (RTX 5060 Ti 16GB)

Metric	Value
PP512	1832 tok/s
TG128	107 tok/s
Size	12.4 GiB
BPW	3.07
ngl	99 (full GPU)

Fits entirely in 16GB VRAM — no CPU offload needed.

Quality

10/10 correct on standard QA benchmark (capital of France, 2+2, Python reverse string, gravity, WW2, primes, boiling point, Shakespeare, Jupiter, hello→Hola).

Base Model

Qwen/Qwen3.6-35B-A3B
Source: unsloth/Qwen3.6-35B-A3B-GGUF (Q8_0)

License

Apache 2.0 — same as the base model.

Tool Call Validation

Tested with --jinja on both --reasoning off and --reasoning on --reasoning-budget 2048:

Test	reasoning off	reasoning on
Basic tool call trigger	✅	✅
Tool response → final answer (no loop)	✅	✅
Correct tool selection from multiple	✅	✅
No tool call for simple questions	✅	✅
Multi-step tool use	✅	✅
Nested quote escaping retry (no loop)	✅	✅
Total	10/10	10/10

Recommended settings for tool-use / agentic workflows

--jinja --reasoning off --reasoning-budget 0 --reasoning-format deepseek

Avoid --presence-penalty above 0.5 for tool-use — high values diversify reasoning tokens but don't improve structured JSON output, and can cause repeated near-identical tool calls in agent loops.

If using --reasoning on, ensure your agent framework detects consecutive identical tool calls and breaks after 2-3 retries.

Run tests yourself

chmod +x test_tool_calls.sh
./test_tool_calls.sh 8085