Director Class AI
Real-time LLM Hallucination Guardrail
Stop hallucinations before they reach your users. Token-level streaming halt, NLI fact-checking, prompt injection detection. Production-tested with 4,310+ tests and 12 Rust-accelerated compute functions.
75.8%
Balanced accuracy
14.6 ms
Per claim (GPU)
4,310+
Tests
9.4×
Rust speedup
12
SDK integrations
The problem
LLMs hallucinate. Your users trust them anyway. One wrong medical dosage. One fabricated legal citation. One invented financial figure. By the time a human reviewer catches it, the damage is done. Generic output filters catch obvious toxicity but miss subtle factual errors — the kind that sound perfectly plausible. Director-AI intercepts the stream before it reaches your users, scores every claim against your knowledge base, and halts generation the moment coherence degrades.
How it works
LLM Output → Claim Extraction → NLI Scoring (FactCG 0.4B) → RAG Fact-Check (your knowledge base) → Dual Entropy (confidence + divergence) → ■ Halt stream / ✓ Pass
Core features
Token-level streaming halt
Severs LLM output mid-generation when coherence degrades. Not a post-hoc filter — a real-time guardrail that stops hallucinations before they reach the user.
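The halt pattern can be sketched as a thin generator wrapper: tokens flow through until a rolling coherence score drops below a threshold, at which point the stream stops instead of emitting more. This is a minimal illustration only; the wrapper name, the scoring callback, and the re-scoring window are assumptions, not the library's API.

```python
from typing import Callable, Iterator

def guarded_stream(
    tokens: Iterator[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
    window: int = 8,
) -> Iterator[str]:
    """Yield tokens until the coherence score of the accumulated text
    drops below `threshold`, then halt the stream mid-generation."""
    buffer: list[str] = []
    for token in tokens:
        buffer.append(token)
        # Re-score every `window` tokens to bound per-token overhead.
        if len(buffer) % window == 0 and score_fn("".join(buffer)) < threshold:
            return  # halt: downstream consumers see a truncated stream
        yield token
```

Because the wrapper is itself a generator, it composes with any token iterator a provider SDK exposes.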
Dual-entropy scoring
NLI contradiction detection (FactCG-DeBERTa, 0.4B params) combined with RAG fact-checking against your knowledge base. Two independent signals, one confidence score.
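One simple way to fuse two independent [0, 1] signals into a single confidence score is a geometric mean, which penalises disagreement between the signals harder than an arithmetic average would. The function below is an illustrative sketch of such a fusion, not the library's actual scoring formula.

```python
import math

def combine_signals(nli_entailment: float, rag_support: float) -> float:
    """Fuse an NLI entailment probability and a RAG support score into
    one confidence value. The geometric mean drops sharply when either
    signal disagrees, so both checks must pass for a high score."""
    return math.sqrt(nli_entailment * rag_support)
```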
Injection detection
Intent-grounded, two-stage prompt injection detection: fast regex pre-filter + bidirectional NLI semantic analysis. 25 adversarial attack patterns tested.
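The two-stage structure means only texts flagged by a cheap regex pre-filter pay the cost of the slower NLI semantic check. A minimal sketch of that control flow, with a deliberately tiny illustrative pattern set (the shipped patterns are assumed to be far more extensive):

```python
import re
from typing import Callable

# Stage 1: cheap regex pre-filter for common injection phrasings.
# These patterns are illustrative only, not the shipped pattern set.
_INJECTION_PATTERNS = re.compile(
    r"ignore (all )?(previous|prior) instructions"
    r"|you are now"
    r"|system prompt",
    re.IGNORECASE,
)

def is_injection(text: str, semantic_check: Callable[[str], bool]) -> bool:
    """Two-stage detector: regex pre-filter first, then the expensive
    NLI-based semantic check only on texts the pre-filter flagged."""
    if not _INJECTION_PATTERNS.search(text):
        return False  # fast path: most traffic exits here
    return semantic_check(text)
```

The pre-filter trades a few false positives (handed to stage 2) for near-zero latency on benign traffic.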
Structured output verification
JSON schema validation, numeric consistency checking, reasoning chain verification, temporal freshness scoring. All stdlib-only, zero dependencies.
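As one example of the stdlib-only approach, a numeric consistency check can be written with nothing but `re`: every number the model quotes must also appear in the source text. This is a sketch of the idea, not the library's implementation.

```python
import re

_NUM = re.compile(r"-?\d+(?:\.\d+)?")

def numbers_consistent(output: str, source: str) -> bool:
    """Stdlib-only numeric consistency check: True iff every number
    quoted in the model output also occurs in the source text."""
    source_nums = {float(n) for n in _NUM.findall(source)}
    return all(float(n) in source_nums for n in _NUM.findall(output))
```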
12 Rust accelerators
Performance-critical functions compiled to native via backfire-kernel (PyO3 FFI). Sanitiser 27×, temporal freshness 21×, confidence scoring 33× faster than pure Python.
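A common pattern for an optional native backend is to import the compiled extension and fall back to pure Python when it is absent, keeping one call site for both paths. The module and function names below are illustrative of that pattern, not the documented `backfire-kernel` API.

```python
def _sanitiser_score_py(text: str) -> float:
    """Pure-Python fallback: fraction of printable characters in `text`.
    Illustrative stand-in for the native function's behaviour."""
    if not text:
        return 1.0
    return sum(c.isprintable() for c in text) / len(text)

try:
    # Hypothetical native Rust path via the PyO3 extension module.
    from backfire_kernel import sanitiser_score
except ImportError:
    sanitiser_score = _sanitiser_score_py  # same signature, slower
```

Callers always invoke `sanitiser_score`; whether it runs in Rust or Python is decided once at import time.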
EU AI Act compliance
Audit trails, adversarial robustness testing, domain presets (medical/finance/legal/creative), drift detection, and feedback loops. Built for regulated industries.
Integrations
Drop-in guards for every major LLM provider and framework. Zero code changes with the REST proxy.
LLM providers (SDK guards)
OpenAI · Anthropic (Claude) · AWS Bedrock · Google Gemini · Cohere
Frameworks
LangChain · LlamaIndex · LangGraph · Haystack · CrewAI · DSPy · Semantic Kernel
Deployment
FastAPI middleware · REST/gRPC proxy · Docker (CPU/GPU) · Kubernetes Helm · Voice AI (ElevenLabs/Deepgram)
Benchmarks
Accuracy (LLM-AggreFact, 29,320 samples)
| Scorer | Params | Balanced Accuracy | Latency (GPU) |
|---|---|---|---|
| FactCG-DeBERTa | 0.4B | 75.8% | 14.6 ms/pair |
| MiniCheck-Flan-T5-L | 0.8B | 77.4% | ~40 ms/pair |
| Heuristic-only (no NLI) | 0 | ~55% | <0.5 ms |
Latency (p99, 16-pair batch)
| Hardware | Backend | Latency |
|---|---|---|
| NVIDIA GTX 1060 | ONNX CUDA | 17.9 ms/pair |
| AMD RX 6600 XT | ROCm | 80.1 ms/pair |
| AMD EPYC 9575F | CPU | 118.9 ms/pair |
| Intel Xeon E5-2640 | CPU | 207.3 ms/pair |
Rust acceleration (backfire-kernel, 5000 iterations)
| Function | Python | Rust | Speedup |
|---|---|---|---|
| sanitiser_score | 57 µs | 2.1 µs | 27× |
| probs_to_confidence | 486 µs | 15 µs | 33× |
| temporal_freshness | 53 µs | 2.5 µs | 21× |
| lite_score | 47 µs | 26 µs | 1.8× |
| Geometric mean (12 functions) | | | 9.4× |
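The headline figure aggregates per-function speedups with a geometric mean, the standard aggregate for ratios because it treats a 2× gain and a 2× loss symmetrically. A minimal sketch of the computation:

```python
import math

def geometric_mean(speedups: list[float]) -> float:
    """Geometric mean of speedup ratios, computed in log space
    for numerical stability."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```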
Quick start
```shell
# Install
pip install "director-ai[all]"
```

```python
# Score a claim against a source
from director_ai import score

result = score(
    "The Earth is 4.5 billion years old",
    "The Earth formed approximately 4.54 billion years ago.",
)
print(result)  # GuardResult(score=0.94, passed=True)
```

```shell
# Or run as a REST proxy (zero code changes to your app)
director-ai serve --port 8000 --upstream https://api.openai.com/v1
```
NLI models
FactCG-DeBERTa-v3-Large
Default scorer. 0.4B params, MIT licensed. Best speed/accuracy trade-off. ONNX + TensorRT GPU acceleration paths available.
MiniCheck-Flan-T5-L
0.8B params. Higher accuracy (77.4%) at ~3× latency cost. Best for offline batch verification.
MiniCheck-DeBERTa-L
0.4B params. Alternative DeBERTa backbone with different NLI training data.
Gemma 4 E4B (LLM-as-judge)
LLM-based scoring for complex claims. Highest accuracy but sends data to external provider. Off by default.
Heuristic-only (Lite)
Zero-dependency scorer using word overlap, numeric consistency, and structural checks. <0.5 ms. ~55% accuracy. CPU-only fallback.
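The word-overlap core of such a heuristic fits in a few lines; the sketch below uses Jaccard overlap between claim and source tokens and is illustrative only (the shipped Lite scorer also checks numbers and structure).

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lite_score(claim: str, source: str) -> float:
    """Jaccard word overlap between claim and source: 1.0 for identical
    token sets, 0.0 for disjoint ones."""
    c, s = _tokens(claim), _tokens(source)
    return len(c & s) / len(c | s) if c | s else 0.0
```

Overlap alone cannot detect a negated claim, which is why the table above pegs heuristic-only accuracy near chance at ~55%.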
Rust backend (backfire)
Native compiled compute via backfire-kernel. 12 accelerated functions. No Python GIL. No CUDA dependency for basic scoring.
Domain presets
Medical
Strict thresholds. Dosage verification. Citation requirements. HIPAA-aware logging.
Finance
Numeric precision. Temporal freshness. Market data validation. FINMA-compatible audit trails.
Legal
Citation verification. Precedent checking. Jurisdiction awareness. Privilege-safe logging.
Creative
Relaxed thresholds. Factual claims still checked but creative expression permitted.
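A domain preset typically bundles thresholds and checks into a named configuration. The sketch below shows one plausible shape for such presets; the field names and threshold values are assumptions for illustration, not the library's actual config schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainPreset:
    """Illustrative preset shape (field names are assumptions)."""
    halt_threshold: float   # stream halts below this confidence
    require_citations: bool
    check_numbers: bool

PRESETS = {
    "medical":  DomainPreset(halt_threshold=0.85, require_citations=True,  check_numbers=True),
    "finance":  DomainPreset(halt_threshold=0.80, require_citations=False, check_numbers=True),
    "legal":    DomainPreset(halt_threshold=0.85, require_citations=True,  check_numbers=False),
    "creative": DomainPreset(halt_threshold=0.40, require_citations=False, check_numbers=True),
}
```

A frozen dataclass keeps presets immutable, so a guard instance cannot silently drift from its audited configuration.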
Licensing
Open Source
Free
AGPL-3.0-or-later. Use freely for research, personal projects, and open-source applications.
- Full feature set
- All NLI models
- Rust accelerators
- Community support
- Copyleft: derivatives must be open-source
Commercial
Contact us
Proprietary license. Removes copyleft obligation for closed-source and SaaS deployments.
- Full feature set
- Closed-source permitted
- SaaS deployment permitted
- Priority support
- Custom model fine-tuning
- On-premise deployment assistance
Architecture at a glance
136 Python files
32 top-level modules. Modular, testable, documented.
7 Rust crates
backfire-core, FFI, observers, physics, SSGF, types, WASM.
17+ CLI commands
serve, proxy, bench, tune, finetune, batch, review, adversarial-test, doctor...
217 test files
4,310+ test functions. 90% coverage enforced. CI on every push.
Python ≥3.11
Tested on 3.11, 3.12, and 3.13. Minimal core dependencies: numpy and requests only.
23 optional extras
Install only what you need: NLI, vector DBs, server, SDKs, voice, enterprise, ONNX.