Pricing — Denpex

Simple pricing. No surprises.

Start free. No credit card required.

Free

$0/month

For individual researchers and small experiments.

✓ 3 diagnoses per month
✓ Manual log paste (web UI)
✓ 15-type failure classification
✓ Prescriptive fix output
✓ 7-day history
✓ 1 seat

Best for most teams

Team

$499/month

For ML teams running regular training jobs.

Everything in Free, plus:
✓ Unlimited diagnoses
✓ Up to 64 GPUs monitored
✓ Slack + email alerts
✓ iMessage/SMS notifications (Twilio)
✓ Multi-rank cascade analysis
✓ Cross-run comparison (last 5 runs)
✓ Team knowledge base (shared fixes)
✓ 5 seats
✓ 90-day history

Scale

$2,499/month

For scale-ups and serious training infrastructure.

Everything in Team, plus:
✓ Up to 512 GPUs monitored
✓ On-premise agent (logs never leave your cluster)
✓ Silent data corruption (SDC) detection
✓ Straggler and gray failure detection
✓ Zombie process detection + auto-kill
✓ Checkpoint weight delta analysis (per-layer instability trace)
✓ Cross-run comparison (unlimited run history)
✓ Version compatibility database (PyTorch × CUDA × cuDNN)
✓ Checkpoint integrity validation
✓ Unlimited seats
✓ Priority support (4-hr SLA)
✓ 1-year history

Data Center

$9,999/month

For GPU cloud providers and enterprise data centers. Custom contracts available.

Everything in Scale, plus:
✓ Unlimited GPUs
✓ White-label and OEM options
✓ Multi-tenant deployment
✓ Dedicated Customer Success Manager
✓ 99.9% uptime SLA with credits
✓ GDPR, HIPAA, SOC 2 Type II compliance
✓ Log PII/PHI masking (configurable)
✓ Custom knowledge base ingestion
✓ Integration with SLURM, Ray, Kubernetes schedulers
✓ Predictive failure scoring
✓ Auto-remediation engine
✓ Custom contracts, invoicing, and procurement

Frequently asked questions

What failure types do you detect?

We detect 15 common failure types, each with a prescriptive fix. They include CUDA OOM, NCCL timeouts, gradient explosions, checkpoint corruption, import errors, version mismatches, device-side asserts, memory fragmentation, silent data corruption, stragglers, zombie processes, and weight delta anomalies.

How accurate is the diagnosis?

For common failure patterns, Denpex uses regex-based matching, which gives high accuracy on known error signatures. Novel failures get an LLM-inferred diagnosis with a confidence score, so you know how much to trust it.
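As a rough illustration of the pattern-first approach described above, here is a minimal Python sketch. It is not Denpex's implementation: the pattern table, the `Diagnosis` type, and the `classify` function are hypothetical names, and the regexes cover only a few well-known error strings. A line that matches a known signature is classified directly; anything else is flagged for a slower, inference-based path that would carry its own confidence score.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern table: a few well-known error signatures.
# Real messages vary by framework version, so treat these as examples only.
KNOWN_FAILURES = {
    "cuda_oom": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "nccl_timeout": re.compile(r"NCCL.*(timeout|timed out)", re.IGNORECASE),
    "device_assert": re.compile(r"device-side assert triggered", re.IGNORECASE),
    "import_error": re.compile(r"\b(ModuleNotFoundError|ImportError)\b"),
}


@dataclass
class Diagnosis:
    failure_type: Optional[str]  # None when no known pattern matched
    source: str                  # "pattern" or "needs_inference"
    evidence: Optional[str]      # the log line that triggered the match


def classify(log_text: str) -> Diagnosis:
    """Return the first known failure found in the log, scanning top to bottom."""
    for line in log_text.splitlines():
        for name, pattern in KNOWN_FAILURES.items():
            if pattern.search(line):
                return Diagnosis(name, "pattern", line.strip())
    # Nothing matched: hand off to a fallback (for example, an LLM) that
    # should report a confidence score alongside its answer.
    return Diagnosis(None, "needs_inference", None)
```

Scanning top to bottom returns the earliest matching line, which is often closer to the root cause than the errors that cascade after it.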

Do you store our logs?

Logs are processed and deleted after diagnosis. We don't store raw training data. Diagnosis metadata (failure types, frequency) helps improve our pattern database.

What frameworks do you support?

PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support is on the roadmap.

How fast is the diagnosis?

Most diagnoses complete in under 12 seconds: paste your logs and get the root cause and a fix recommendation.

Your training logs contain your IP. We treat them that way.

◈

PII/PHI Masking

Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
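To make the masking step concrete, here is a minimal Python sketch of client-side redaction. It is an assumption-laden illustration rather than Denpex's engine: the pattern list, the `MRN` medical-record format, and the `mask_line` helper are all hypothetical. It covers emails, US SSNs, and one made-up record-number shape; personal names generally need more than a regex and are left out.

```python
import re

MASK = "[MASKED]"

# Hypothetical patterns for illustration; a production list would be
# configurable and far more thorough.
SENSITIVE_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email address
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                           # US SSN, dashed form
    re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),          # assumed record-number format
]


def mask_line(line: str) -> str:
    """Replace every sensitive match in one log line with the mask token."""
    for pattern in SENSITIVE_PATTERNS:
        line = pattern.sub(MASK, line)
    return line


def mask_log(log_text: str) -> str:
    """Mask a whole log before it is transmitted anywhere."""
    return "\n".join(mask_line(line) for line in log_text.splitlines())
```

For example, `mask_line("user=jane.doe@example.com ssn=123-45-6789")` returns `user=[MASKED] ssn=[MASKED]`. Running the masking before any network call is what keeps matched content inside your environment.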

⬡

On-Premise Option

Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.

🔒

Encryption

All data is encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on the Data Center plan.

☐

Compliance

We are working toward SOC 2 Type II certification. GDPR-ready data processing agreements are available. A HIPAA BAA is available on the Data Center plan.

Frequently asked questions

How does Denpex get access to my training logs?

Will adding Denpex slow down my training?

Can it diagnose failures that already happened?

What if Denpex can't diagnose it?

Does it work with spot instances and preemptible GPUs?

How does the team knowledge base work?

Can Denpex compare my current failed run against a previous successful one?

What's on your roadmap?

Can I self-host the entire platform?