Rapid-MLX v0.7.3 Release Notes

Release Date: 2026-06-12
  • What changed

    Revert PR #555 (in-house DiffusionGemma generation loop, shipped in v0.7.2). The fixed_steps=8 default cap caused quality regressions on long Chinese / multi-paragraph output. Performance was never the bottleneck: direct mlx-vlm passthrough already matches what the in-house loop achieved (49.5 tok/s rapid-mlx HTTP vs 53.0 tok/s mlx-vlm direct on M3 Ultra, 256-token sweep, a gap of about 7% that is well within the HTTP wrapper overhead).

    Fix tool calling on the mlx-vlm passthrough path so DiffusionGemma's call:NAME{...} wire form surfaces as structured tool_calls, not leaked delta.content:

    • Gemma4ToolParser regex now accepts the post-HF-decode stripped form (no outer <|tool_call>/<tool_call|> wrappers). HF's tokenizer.decode(skip_special_tokens=True) strips those marker ids regardless of the skip_special_token_ids carve-out, so the body call:NAME{...} is what reaches the parser in production.
    • StreamingPostProcessor._detect_tool_calls no longer short-circuits the parser when the delta has no < or [ char — it now also queries the active parser's has_pending_tool_call so the gemma4 stripped opener call:\w+\{ is recognised.
    • Real-tokenizer end-to-end test added under TestStreamingPostProcessorGemma4StrippedForm so this class of bug can't ship silently again.
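
    The two parser changes above can be sketched as follows. This is a minimal illustration and not the project's Gemma4ToolParser: the function names, the brace-matching approach, and the sample argument payload are assumptions made for the example. It shows an opener check that does not rely on a < or [ character, and an extractor that accepts both the wrapped and the stripped call:NAME{...} form.

```python
import re

# Wrapper markers named in the notes; HF decode strips them in production.
WRAP_OPEN = "<|tool_call>"
WRAP_CLOSE = "<tool_call|>"

# Opener shared by the wrapped and the stripped form: call:NAME{
OPENER = re.compile(r"call:(\w+)\{")


def has_pending_tool_call(text: str) -> bool:
    """True if the text contains a tool-call opener, with or without wrappers."""
    return OPENER.search(text) is not None


def extract_tool_calls(text: str):
    """Return (calls, remaining_content).

    Argument bodies are kept as raw strings; braces inside quoted strings
    are not handled in this sketch.
    """
    calls, content, pos = [], [], 0
    while (m := OPENER.search(text, pos)) is not None:
        # Walk forward to the brace that closes the opener.
        depth, i = 1, m.end()
        while i < len(text) and depth:
            depth += {"{": 1, "}": -1}.get(text[i], 0)
            i += 1
        if depth:
            # Incomplete call: a streaming parser would buffer it instead.
            break
        before = text[pos:m.start()]
        if before.endswith(WRAP_OPEN):
            before = before[: -len(WRAP_OPEN)]
        content.append(before)
        calls.append({"name": m.group(1), "arguments": text[m.end():i - 1]})
        if text.startswith(WRAP_CLOSE, i):
            i += len(WRAP_CLOSE)
        pos = i
    content.append(text[pos:])
    return calls, "".join(content)


if __name__ == "__main__":
    # Hypothetical payload; the real argument syntax may differ.
    stripped = 'call:get_weather{city:"Paris"}'
    wrapped = WRAP_OPEN + stripped + WRAP_CLOSE
    print(extract_tool_calls(stripped))
    print(extract_tool_calls(wrapped))
```

    Note that the stripped opener contains neither < nor [, which is why a character-based fast path would skip the parser entirely.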

    Alias naming: diffusion-gemma-26b is split into diffusion-gemma-26b-4bit and diffusion-gemma-26b-8bit per the project <family>-<size>-<quant> SOP. The bare alias is removed.
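
    As an illustration of the <family>-<size>-<quant> convention, the sketch below splits an alias into its three parts. The exact patterns for size (digits followed by "b") and quant (digits followed by "bit") are assumptions inferred from the two aliases named here, and parse_alias is a hypothetical helper, not part of rapid-mlx.

```python
import re

# Assumed shape: <family>-<size>-<quant>, e.g. diffusion-gemma-26b-4bit.
ALIAS = re.compile(
    r"(?P<family>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"-(?P<size>\d+(?:\.\d+)?b)"
    r"-(?P<quant>\d+bit)"
)


def parse_alias(alias: str):
    """Return the alias parts as a dict, or None if the alias is not fully qualified."""
    m = ALIAS.fullmatch(alias)
    return m.groupdict() if m else None


if __name__ == "__main__":
    print(parse_alias("diffusion-gemma-26b-4bit"))
    print(parse_alias("diffusion-gemma-26b"))  # bare alias, no quant suffix
```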

    v0.7.2 status

    v0.7.2 will be yanked from PyPI and its release deleted. Users on 0.7.2 should upgrade to 0.7.3; quality regressions from fixed_steps=8 will not be patched in place.

    Verification

    • rapid-mlx share diffusion-gemma-26b-4bit end-to-end: SSE response ships only structured tool_calls, no leaked content
    • Big-AGI UI confirms Weather + Web Search tools render as proper tool boxes (not raw call:NAME{...} text)
    • 26 parser unit tests + 3 real-tokenizer postprocessor tests + 47 escape-hatch tests all pass
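
    The first check above (structured tool_calls only, no leaked content) could be scripted along these lines. This is a sketch that assumes an OpenAI-compatible chat-completions stream, where each SSE data line carries a JSON chunk with choices[].delta; summarize_sse and the sample chunk are hypothetical and not taken from the project's test suite.

```python
import json


def summarize_sse(lines):
    """Collect streamed text content and tool-call names from SSE lines."""
    content_parts, tool_names = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content_parts.append(delta["content"])
            for call in delta.get("tool_calls") or []:
                name = call.get("function", {}).get("name")
                if name:
                    tool_names.append(name)
    return "".join(content_parts), tool_names


if __name__ == "__main__":
    # Hypothetical chunk shaped like an OpenAI-style streaming delta.
    chunk = {
        "choices": [{
            "delta": {
                "tool_calls": [{
                    "index": 0,
                    "function": {"name": "get_weather", "arguments": "{}"},
                }]
            }
        }]
    }
    sample = ["data: " + json.dumps(chunk), "", "data: [DONE]"]
    content, names = summarize_sse(sample)
    # A leak would show up as non-empty content containing "call:".
    assert content == "" and names == ["get_weather"]
```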

    Release SOP gates (all green)

    G1 release-smoke · G3 cli/config audit · G4 make smoke · G8 parser microbench · G10 mlx-version-bump scan (skip, no mlx-* deps changed) · G11 escape-hatch tests · G6 live server repro

    🤖 Generated with Claude Code


Previous changes from v0.7.2

  • Caution

    v0.7.2 is YANKED. Upgrade to v0.7.3 immediately.

    v0.7.2 shipped a DiffusionGemma in-house generation loop (PR #555) whose fixed_steps=8 default
    caused quality regressions on long Chinese / multi-paragraph output. v0.7.3 reverts PR #555 back to
    the mlx-vlm passthrough path and fixes tool calling on top of it.

    pip install rapid-mlx resolves to v0.7.3 going forward.

    What's new in v0.7.2

    • chore: bump version to 0.7.2 (fc61783)
    • DiffusionGemma in-house generation loop (PR #555, reverted in v0.7.3)