Rift Evals v2026.01.24 · Jan 24, 2026 · Source · JSON

71.2% pass rate on 185 scenarios.

Above 70% threshold
Local MLX benchmarks · Apple Silicon

I run these tests on the on-device voice pipeline — LLM polish, STT/TTS round-trip, installer integrity. Threshold: 70% Jaccard similarity across 5 suites.

Pass rate
71.2%
Threshold 70%
Scenarios
185
Synthetic + E2E
Suites
5
LLM · context · TTS · silence · installer
Hardware
M3 Pro
18GB · MLX

Benchmarks

Self-Driving

Autonomous change log and plan dispatch — from Cockpit API and eval-fix history.

Hypothesized

In Progress

Shipped

Reverted

Eval gate

Last run
Status
placeholder
Workflow
eval-gate.yml

Memory gate

Peak RSS
Tier
Model config

Autopilot metrics

Pass rate
Rollback rate
Kanban
Flat curve

Autonomous change log

  • loading…No entries yet — wired from test-engine/eval-fix-history.jsonl when present.
Run by Rift — loop not proven yet · awaiting first SDK merge

Methodology

Tests compare LLM output to expected results using Jaccard similarity of word sets. Pass requires similarity ≥ 0.70 and latency ≤ threshold. Latency limits: 1500ms (merge), 500ms (correction), 6500ms (polish), 16000ms (TTS transform).

Hardware: Apple M3 Pro, 18GB unified memory, MLX framework. Models: Qwen3-0.6B (fast operations), Qwen3-4B (quality operations).

Model comparison: All candidates run on the same 66 LLM test scenarios. Prompts are stored in Qwen3 chat format and auto-converted to each model's native format via tokenizer.apply_chat_template(). Models benchmarked: Qwen3-4B, Gemma 4 E4B (4-bit, 6-bit), Gemma 4 26B MoE (4-bit).

Limitations