Simple pricing. No surprises.
Start free. No credit card required.
Free
For individual researchers and small experiments.
- ✓ 3 diagnoses per month
- ✓ Manual log paste (web UI)
- ✓ 15-type failure classification
- ✓ Prescriptive fix output
- ✓ 7-day history
- ✓ 1 seat
Team
For ML teams running regular training jobs.
- Everything in Free, plus:
- ✓ Unlimited diagnoses
- ✓ Up to 64 GPUs monitored
- ✓ Slack + email alerts
- ✓ iMessage/SMS notifications (Twilio)
- ✓ Multi-rank cascade analysis
- ✓ Cross-run comparison (last 5 runs)
- ✓ Team knowledge base (shared fixes)
- ✓ 5 seats
- ✓ 90-day history
Scale
For scale-ups and serious training infrastructure.
- Everything in Team, plus:
- ✓ Up to 512 GPUs monitored
- ✓ On-premise agent (logs never leave your cluster)
- ✓ Silent data corruption (SDC) detection
- ✓ Straggler and gray failure detection
- ✓ Zombie process detection + auto-kill
- ✓ Checkpoint weight delta analysis (per-layer instability trace)
- ✓ Cross-run comparison (unlimited run history)
- ✓ Version compatibility database (PyTorch × CUDA × cuDNN)
- ✓ Checkpoint integrity validation
- ✓ Unlimited seats
- ✓ Priority support (4-hr SLA)
- ✓ 1-year history
Data Center
For GPU cloud providers and enterprise data centers. Custom contracts available.
- Everything in Scale, plus:
- ✓ Unlimited GPUs
- ✓ White-label and OEM options
- ✓ Multi-tenant deployment
- ✓ Dedicated Customer Success Manager
- ✓ 99.9% uptime SLA with credits
- ✓ GDPR, HIPAA, SOC 2 Type II compliance
- ✓ Log PII/PHI masking (configurable)
- ✓ Custom knowledge base ingestion
- ✓ Integration with SLURM, Ray, Kubernetes schedulers
- ✓ Predictive failure scoring
- ✓ Auto-remediation engine
- ✓ Custom contracts, invoicing, and procurement
Frequently asked questions
What failure types do you detect?
We classify 15 common failure types, each with a prescriptive fix. They include CUDA OOM, NCCL timeouts, gradient explosions, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, stragglers, zombie processes, weight delta anomalies, and more.
How accurate is the diagnosis?
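For common failure patterns, Denpex matches your logs against known regex signatures, which gives high accuracy on failures we have seen before. Novel failures that match no signature get an LLM-inferred diagnosis with a confidence score, so you can judge how much to rely on it.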
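To make the regex-first approach concrete, here is a minimal sketch of signature matching. The pattern table, the failure-type names, and the `classify` function are illustrative assumptions, not Denpex's actual signatures or API. A result of `None` stands for the case where no known signature matches and the log would go to the LLM path instead.

```python
import re
from typing import Optional

# Hypothetical signature table: failure type -> regex.
# Real signatures would be broader and maintained per framework version.
SIGNATURES = {
    "cuda_oom": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "nccl_timeout": re.compile(r"NCCL.*(timeout|timed out)", re.IGNORECASE),
    "gradient_explosion": re.compile(r"\bloss\b.*\b(nan|inf)\b", re.IGNORECASE),
}


def classify(log_text: str) -> Optional[str]:
    """Return the first matching failure type, or None if nothing matches."""
    for failure_type, pattern in SIGNATURES.items():
        if pattern.search(log_text):
            return failure_type
    return None  # unknown failure: hand off to the fallback diagnosis path


if __name__ == "__main__":
    print(classify("RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB"))
    print(classify("epoch 3 finished, all ranks healthy"))
```

A deterministic table like this gives the same answer for the same log line on every run. That is the main reason to try known signatures before an inferred diagnosis.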
Do you store our logs?
Logs are deleted once diagnosis completes, and we don't store raw training data. Diagnosis metadata (failure types, frequency) is retained to help improve our pattern database.
What frameworks do you support?
PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support is on the roadmap.
How fast is the diagnosis?
Most diagnoses complete in under 12 seconds. Paste your logs, get the root cause and fix recommendation immediately.
Your training logs contain your IP. We treat them that way.
PII/PHI Masking
Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
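As a rough illustration of pattern-based masking, the sketch below replaces email addresses and US-format SSNs with the `[MASKED]` token. The two regexes and the `mask_log` function are assumptions made for this example, not the masking engine's real rule set. Names and medical record patterns generally need more than simple regexes and are left out here.

```python
import re

MASK = "[MASKED]"

# Illustrative patterns only (assumed, not the product's actual rules).
PII_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email address
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN written as 123-45-6789
]


def mask_log(text: str) -> str:
    """Replace every match of every pattern with the mask token."""
    for pattern in PII_PATTERNS:
        text = pattern.sub(MASK, text)
    return text


if __name__ == "__main__":
    print(mask_log("user alice@example.com ssn 123-45-6789"))
```

Running the substitution on the client, before any upload, means the unmasked values never reach the network.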
On-Premise Option
Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.
Encryption
All data is encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on the Data Center plan.
Compliance
We are working toward SOC 2 Type II certification. GDPR-ready data processing agreements are available, and a HIPAA BAA is available on the Data Center plan.