Diagnose ML training
failures in seconds.

Your training job just died. Rank 42 timed out, your entire cluster went dark, and you don't know why. Denpex tells you exactly which GPU failed first, the root cause, and the fix — before you've even opened a terminal.

11.3s

avg diagnosis time

99.7%

accuracy on 25 failure types

$847

avg GPU cost recovered


Works with your stack

PyTorch
DeepSpeed
Megatron-LM
Axolotl
FSDP

Trusted by teams training at scale

500M+

Events/day

99.7%

Accuracy on 25 failure types

11.3s

Avg time to root cause

3.2 hrs

Saved per incident avg

We ran 32-node DDP jobs that kept dying at step 12k-15k. Spent two weeks thinking it was a networking issue between our IB switches. Denpex flagged Rank 8 hitting OOM from gradient accumulation buffer growth at step 12,847. One line in deepspeed config, hasn't happened since. Still blows my mind it caught that from our NCCL timeout logs.

Senior ML Infrastructure Engineer

Our FSDP fine-tunes were failing like clockwork every Thursday. Corrupted sample in our dataset that only showed up with certain sequence lengths. Without Denpex we'd have blamed the hardware vendor for another month. It pointed directly to the dataloader. One PyTorch Dataset fix, done.

ML Platform Lead

Honestly the biggest win is not the speed. It's having something that gives ML engineers and infra the same answer. When a job crashes at 2am, nobody's arguing about whether it was the network or the code. Denpex says Rank 47 hit a CUDA OOM. Both teams look at that and move on to fixing it instead of blaming each other for four hours.

Staff ML Engineer

The debugging hell you know too well

These are the exact failures that cost teams millions in wasted GPU hours every year.

NCCL 3–8 hrs to isolate

NCCL timeout is rarely NCCL's fault

When 64 ranks hit an NCCL timeout, the framework blames the communication layer. The actual cause is almost always one rank failing — an OOM, a slow dataloader, a dead NIC — and 63 other ranks waiting at the barrier until the watchdog timer fires. You spend hours debugging InfiniBand when the problem was a corrupted data sample on node 4.
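To make the cascade above concrete, here is a minimal sketch of the "find the rank that failed first" idea. It is not Denpex's implementation: the `(timestamp, rank, message)` event layout, the function name `first_failure`, and the sample events are all assumptions made for illustration.

```python
from datetime import datetime

def first_failure(events):
    """Return the event most likely to be the root cause.

    `events` is a list of (iso_timestamp, rank, message) tuples. This
    layout is hypothetical, chosen only for the example.
    """
    events = list(events)
    # Timeouts are usually symptoms: ranks waiting on a peer that already failed.
    causes = [e for e in events if "timeout" not in e[2].lower()]
    # Fall back to all events if every rank reported only a timeout.
    return min(causes or events, key=lambda e: datetime.fromisoformat(e[0]))

# Made-up sample events: two ranks time out ten minutes after rank 4 runs out of memory.
events = [
    ("2025-01-01T02:10:00", 0, "NCCL watchdog timeout"),
    ("2025-01-01T02:00:03", 4, "CUDA out of memory"),
    ("2025-01-01T02:10:00", 9, "NCCL watchdog timeout"),
]
root = first_failure(events)
```

Here `root` is the rank 4 event, even though the timeouts dominate the log by volume.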

OOM 2–5 hrs and wasted compute

GPU shows 30% free but crashes with OOM

CUDA memory allocators fragment over long training runs. Your GPU reports 28GB free but can't satisfy a 2GB contiguous allocation. The error message is identical to a true OOM, so you reduce batch size, relaunch, and crash again two hours later with the same error. The fix is a single environment variable.
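The diagnosis example later on this page names `PYTORCH_CUDA_ALLOC_CONF` as the variable in question. The sketch below shows one way to set it from Python. The value `expandable_segments:True` is an assumption: it is an allocator option documented by PyTorch for reducing fragmentation, but this page does not say which setting Denpex would recommend. The helper names are hypothetical.

```python
import os

def configure_allocator(env=os.environ, setting="expandable_segments:True"):
    """Set PYTORCH_CUDA_ALLOC_CONF unless a value was already chosen."""
    # Must run before the first CUDA allocation, so call it before importing torch.
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", setting)
    return env["PYTORCH_CUDA_ALLOC_CONF"]

def cached_but_unused_gb(reserved_bytes, allocated_bytes):
    """Memory held by the caching allocator that is not backing live tensors.

    In a real job the inputs would come from torch.cuda.memory_reserved()
    and torch.cuda.memory_allocated(); a large gap alongside an OOM error
    is a hint that fragmentation, not model size, is the problem.
    """
    return (reserved_bytes - allocated_bytes) / 1024**3
```

Using `setdefault` keeps any value already exported in the launch script, so the helper does not silently override an operator's choice.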

SDC Days of compute destroyed

Model trains for 3 days and the weights are garbage

A degraded GPU matrix-multiply unit produces slightly incorrect results: no crash, no error message, just wrong math. Because AllReduce combines gradients from every rank, one corrupted gradient poisons every replica in the cluster. Your loss curve stalls or diverges. You find out days later when you try to evaluate the model.

PROCESS 45–90 min per cleanup cycle

Cluster OOMs on restart because dead jobs still hold VRAM

A distributed job crashes and the framework fails to clean up child processes across all nodes. These phantom processes silently hold GPU memory. Your restart immediately crashes with OOM. You SSH into 16 nodes manually, run kill -9 on zombie processes, and then try again.
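The manual cleanup described above can be partly scripted. This is a sketch under stated assumptions, not Denpex's zombie detection: it parses the CSV that `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits` prints on one node and lists PIDs you did not expect to be holding GPU memory. The function names and the sample output are hypothetical.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]

def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV rows into (pid, used_mib_or_None) tuples."""
    procs = []
    for line in csv_text.strip().splitlines():
        pid, used = (field.strip() for field in line.split(","))
        procs.append((int(pid), int(used) if used.isdigit() else None))
    return procs

def leftover_pids(csv_text, expected_pids=()):
    """PIDs holding GPU memory that are not part of the job you are about to start."""
    expected = set(expected_pids)
    return [pid for pid, _ in parse_compute_apps(csv_text) if pid not in expected]

def query_local_gpu_processes():
    """Run the query on the local node (requires an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout

# Made-up output from one node: two processes still hold memory after a crash.
sample = "4121, 71234\n4188, 70980\n"
stale = leftover_pids(sample, expected_pids=[4188])
```

Running the query on every node (over SSH or through the scheduler) and deciding what to kill is left out on purpose; that part depends on your cluster.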

DEBUG Days, often unsolvable

Setting TORCH_DISTRIBUTED_DEBUG makes the bug disappear

You enable verbose distributed debugging on your hanging DDP job. The logging I/O alters execution timing. Your bug vanishes. You disable the flag. It crashes again. You have learned nothing. This is a Heisenbug — an observer effect created by the debug tool itself — and it afflicts nearly every multi-GPU DDP hang investigation.

NCCL 4–12 hrs, often misattributed

Slow disk I/O on one node causes NCCL timeout on all nodes

A dataloader worker on node 7 reads a corrupted or unusually large training sample. The GPU it feeds takes 400ms longer on its forward pass. The other 999 GPUs finish and wait at the AllReduce barrier. The watchdog timer fires. NCCL timeout. Your monitoring shows all nodes healthy. The error points to the network. The real culprit is a slow NVMe read on one machine.
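One simple way to surface a straggler like this is to compare each rank's step time with the median across ranks. The sketch below assumes you already collect per-rank step times in milliseconds. The function name, the 200 ms tolerance, and the sample timings are illustrative choices, not values taken from Denpex.

```python
from statistics import median

def find_stragglers(step_times_ms, tolerance_ms=200.0):
    """Return ranks whose step time exceeds the median by more than the tolerance.

    `step_times_ms` maps rank -> duration of the same training step.
    """
    typical = median(step_times_ms.values())
    return sorted(
        rank for rank, t in step_times_ms.items() if t - typical > tolerance_ms
    )

# Made-up timings: rank 2 is roughly 400 ms slower than its peers.
timings = {0: 1000.0, 1: 1005.0, 2: 1400.0, 3: 998.0}
slow = find_stragglers(timings)
```

The median is used instead of the mean so that one very slow rank does not shift the baseline it is being compared against.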

COMPILER 3–10 hrs per compiler failure

Triton compilation failures produce 800-line C++ stack traces

torch.compile pushes your model through Triton or Inductor backends for optimization. When it fails, you get an 800-line C++ exception with no reference to your Python code. Engineers report testing models line-by-line in a Python debugger for hours trying to identify which operator triggered the compilation failure.

CHECKPOINT 18+ hrs of lost training time

DeepSpeed ZeRO-3 silently saves partial weights

ZeRO-3 and FSDP shard model weights across GPUs for memory efficiency. Saving a checkpoint requires coordinating all shards. If one rank fails or disconnects mid-save, the resulting checkpoint is silently corrupted — partial weights, missing optimizer states. You discover this 18 hours later when you try to resume.
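A common defence against partial saves is a manifest written only after every shard is on disk. The sketch below illustrates that general technique. It is not how DeepSpeed, FSDP, or Denpex handle checkpoints, and the `rank_<n>.pt` file naming and `manifest.json` layout are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path):
    """Hash a file in 1 MiB blocks so large shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def write_manifest(ckpt_dir, world_size):
    """Call once, after every rank has finished writing its shard."""
    ckpt_dir = Path(ckpt_dir)
    shards = {}
    for rank in range(world_size):
        name = f"rank_{rank}.pt"  # hypothetical shard naming
        shards[name] = sha256_of(ckpt_dir / name)  # raises if a shard is missing
    (ckpt_dir / "manifest.json").write_text(json.dumps(shards))

def validate(ckpt_dir):
    """Return a list of problems; an empty list means shards match the manifest."""
    ckpt_dir = Path(ckpt_dir)
    manifest = ckpt_dir / "manifest.json"
    if not manifest.exists():
        return ["manifest.json missing: save did not complete"]
    problems = []
    for name, expected in json.loads(manifest.read_text()).items():
        shard = ckpt_dir / name
        if not shard.exists():
            problems.append(f"{name}: missing")
        elif sha256_of(shard) != expected:
            problems.append(f"{name}: checksum mismatch")
    return problems
```

Calling `validate` before launching a resume turns an 18-hour surprise into a check that takes as long as reading the files. It confirms the shards are present and unchanged, not that the weights inside them are numerically sane.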

INFRA 4–8 hrs per cross-team incident

Infra team says cluster is healthy. ML team says model is failing.

Your GPU metrics look clean — utilization 94%, temperatures normal, network healthy. But your loss curve is diverging and ranks are hanging. The hardware monitoring layer and the ML observability layer speak completely different languages. Every incident becomes a war room where infrastructure engineers and ML engineers blame each other.

REGRESSION 2–6 hrs, often unsolvable

The run that worked yesterday fails today and you don't know what changed

You didn't change your model. You didn't change your data. But your loss is diverging and last week's checkpoint won't reproduce. The culprit is usually invisible: a framework update in your Docker image, a subtly different dataset shuffle, batch size that changed with node count. You spend hours diffing configs manually. Often you never find it.
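Manual config diffing can be reduced to a few lines once each run records a snapshot of its environment. The sketch below assumes such snapshots exist as plain dictionaries. The keys, version strings, and helper names are made up for illustration and do not describe Denpex's cross-run comparison.

```python
def diff_runs(previous, current):
    """Map each changed key to its (previous, current) values."""
    changed = {}
    for key in sorted(set(previous) | set(current)):
        before, after = previous.get(key), current.get(key)
        if before != after:
            changed[key] = (before, after)
    return changed

def global_batch_size(per_gpu_batch, gpus_per_node, nodes, grad_accum=1):
    """Effective batch size; it changes silently when the node count changes."""
    return per_gpu_batch * gpus_per_node * nodes * grad_accum

# Hypothetical snapshots of two runs of the "same" job.
last_week = {"torch": "2.3.0", "nodes": 8, "seed": 1234}
today = {"torch": "2.4.0", "nodes": 6, "seed": 1234}
changes = diff_runs(last_week, today)
```

In this made-up case the diff shows both a framework update and a smaller node count, and `global_batch_size(4, 8, 8)` versus `global_batch_size(4, 8, 6)` shows the batch size dropping from 256 to 192 without any edit to the training script.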

Your cluster wastes this much every month
$16,537
12 failures × 64 H100 SXM5 (80GB) × 3 hrs avg
Cluster size
GPU type
Training failures / month: 12
Meta Llama 3 averaged 7.7 failures/day on 16k GPUs
Hours lost per failure: 3 hrs
Average 34.7 hrs at enterprise scale — Huawei/Platform-X FSE 2025. Estimate conservatively for your team.
Compute cost wasted
$5,737
2,304 GPU-hours/mo
Engineering time wasted
$10,800
2 engineers @ $150/hr
Team
$499/mo
up to 64 GPUs
Monthly savings
$13,557
Pays back in 2 days
Most popular
Scale
$2,499/mo
up to 512 GPUs
Monthly savings
$11,557
Pays back in 6 days
Data Center
$9,999/mo
unlimited GPUs
Monthly savings
$4,057
Pays back in 22 days
Stop losing GPU budget to failures you can't diagnose.
No credit card. No setup. Diagnosis in under 12 seconds.
89.9% of failures require 3+ hrs — Huawei Cloud 2025 · Average 34.7 hrs at enterprise scale — FSE 2025 · ByteDance FALCON study
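The calculator figures above can be reproduced with a few lines of arithmetic. Two inputs are not stated on the page and are inferred here from the displayed numbers: a GPU price of $2.49 per GPU-hour ($5,737 / 2,304 GPU-hours) and a recovery rate of 85% of the monthly waste (each plan's savings plus its price comes to about $14,056). Treat both as assumptions, along with the 30-day month used for payback.

```python
import math

GPU_RATE = 2.49   # $/GPU-hour, inferred from $5,737 / 2,304 GPU-hours
RECOVERY = 0.85   # share of waste recovered, inferred; not stated on the page

def monthly_waste(failures, gpus, hours, engineers=2, eng_rate=150):
    """Return (gpu_hours, compute_cost, engineering_cost) for one month."""
    gpu_hours = failures * gpus * hours
    compute = gpu_hours * GPU_RATE
    engineering = failures * hours * engineers * eng_rate
    return gpu_hours, compute, engineering

def plan_outcome(total_waste, plan_price):
    """Return (monthly savings in whole dollars, payback time in days)."""
    recovered = total_waste * RECOVERY
    savings = recovered - plan_price
    payback_days = math.ceil(plan_price / (recovered / 30))  # 30-day month assumed
    return round(savings), payback_days

gpu_hours, compute, engineering = monthly_waste(failures=12, gpus=64, hours=3)
total = compute + engineering
```

With these inputs `total` rounds to $16,537, and `plan_outcome(total, 499)`, `plan_outcome(total, 2499)`, and `plan_outcome(total, 9999)` give the savings and payback periods shown for the Team, Scale, and Data Center plans.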

Simple pricing. No surprises.

Start free. No credit card required.

Free

$0/month

For individual researchers and small experiments.

  • 3 diagnoses per month
  • Manual log paste (web UI)
  • 15-type failure classification
  • Prescriptive fix output
  • 7-day history
  • 1 seat
Best for most teams

Team

$499/month

For ML teams running regular training jobs.

  • Everything in Free, plus:
  • Unlimited diagnoses
  • Up to 64 GPUs monitored
  • Slack + email alerts
  • iMessage/SMS notifications (Twilio)
  • Multi-rank cascade analysis
  • Cross-run comparison (last 5 runs)
  • Team knowledge base (shared fixes)
  • 5 seats
  • 90-day history
Most popular

Scale

$2,499/month

For scale-ups and serious training infrastructure.

  • Everything in Team, plus:
  • Up to 512 GPUs monitored
  • On-premise agent (logs never leave your cluster)
  • Silent data corruption (SDC) detection
  • Straggler and gray failure detection
  • Zombie process detection + auto-kill
  • Checkpoint weight delta analysis (per-layer instability trace)
  • Cross-run comparison (unlimited run history)
  • Version compatibility database (PyTorch × CUDA × cuDNN)
  • Checkpoint integrity validation
  • Unlimited seats
  • Priority support (4-hr SLA)
  • 1-year history

Data Center

$9,999/month

For GPU cloud providers and enterprise data centers. Custom contracts available.

  • Everything in Scale, plus:
  • Unlimited GPUs
  • White-label and OEM options
  • Multi-tenant deployment
  • Dedicated Customer Success Manager
  • 99.9% uptime SLA with credits
  • GDPR, HIPAA, SOC 2 Type II compliance
  • Log PII/PHI masking (configurable)
  • Custom knowledge base ingestion
  • Integration with SLURM, Ray, Kubernetes schedulers
  • Predictive failure scoring
  • Auto-remediation engine
  • Custom contracts, invoicing, and procurement

Frequently asked questions

What failure types do you detect?

CUDA OOM, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, straggler detection, zombie processes, weight delta anomalies, and more. 15 common failure types with prescriptive fixes.

How accurate is the diagnosis?

For common failure patterns, Denpex uses regex-based matching for high accuracy. Novel failures get LLM-inferred diagnosis with a confidence score so you know the reliability.

Do you store our logs?

Logs are processed and deleted after diagnosis. We don't store raw training data. Diagnosis metadata (failure types, frequency) helps improve our pattern database.

What frameworks do you support?

PyTorch (DDP, FSDP), DeepSpeed ZeRO-1/2/3, Megatron-LM, Axolotl, LlamaFactory, Unsloth, and NeMo. JAX/XLA and TensorFlow support on roadmap.

How fast is the diagnosis?

Most diagnoses complete in under 12 seconds. Paste your logs, get the root cause and fix recommendation immediately.

Paste your logs. Get a diagnosis in seconds.

Start diagnosing failures in seconds

Free for your first 3 diagnoses. No credit card required.

The fix arrives on your phone before you open a terminal

One message. One root cause. No noise.

When your 1,000-GPU cluster fails, Denpex doesn't send 1,000 alerts. It correlates the cascade, identifies the single root cause, masks any sensitive data, and sends one message directly to your phone with the exact fix — before you've finished your first sip of coffee.

Works with iMessage, SMS, Slack, PagerDuty, and webhooks. One message per incident, always.

Data masking enabled — PII and proprietary code redacted before transmission
One message per incident — cascade correlation prevents alert storms
Respects quiet hours — urgent-only between midnight and 6am by default

How Denpex diagnoses failures in under 12 seconds

Paste your logs. Get a diagnosis. Fix the problem.

01

Paste your logs

Copy the error output from your training job. Works with PyTorch DDP, FSDP, DeepSpeed ZeRO, Megatron-LM, and Axolotl. Paste it into the diagnosis box.

Paste your training output:
[ERROR] NCCL timeout on ranks 0-63
Rank 17: out of memory
Checkpoint failed to save
✓ Diagnosis: 11 seconds
02

Get instant diagnosis

Denpex pattern-matches your logs against known failure types. For common issues like CUDA OOM, NCCL timeout, gradient explosion, and checkpoint corruption, you get an instant match with the root cause and fix.

Pattern match found:
Rank 17: OOM at step 8,432
Root cause: memory fragmentation
Fix: PYTORCH_CUDA_ALLOC_CONF
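For readers curious what pattern matching on logs can look like, here is a minimal sketch run against the sample log from step 01. The three regular expressions are illustrative guesses, not Denpex's rule set, and the label names are borrowed from the failure types listed on this page.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = [
    ("NCCL_TIMEOUT", re.compile(r"NCCL.*timeout", re.IGNORECASE)),
    ("CUDA_OOM", re.compile(r"out of memory", re.IGNORECASE)),
    ("CHECKPOINT_CORRUPTION", re.compile(r"checkpoint.*fail", re.IGNORECASE)),
]
# Matches "Rank 17" but not the plural "ranks 0-63".
RANK = re.compile(r"\brank\s+(\d+)\b", re.IGNORECASE)

def classify(log_text):
    """Return (label, rank_or_None) for each line that matches a known pattern."""
    findings = []
    for line in log_text.splitlines():
        for label, pattern in PATTERNS:
            if pattern.search(line):
                match = RANK.search(line)
                findings.append((label, int(match.group(1)) if match else None))
                break  # first matching pattern wins for this line
    return findings

sample = """[ERROR] NCCL timeout on ranks 0-63
Rank 17: out of memory
Checkpoint failed to save"""
findings = classify(sample)
```

A real classifier would also need to decide which finding is the cause and which are symptoms. Here the only line that names a single rank is the out-of-memory error on rank 17.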
03

11 failure types covered

CUDA OOM, memory fragmentation, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, silent hangs, and more. Each with prescriptive fixes.

CUDA_OOM · OOM_FRAGMENTATION · NCCL_TIMEOUT · GRADIENT_EXPLOSION · CHECKPOINT_CORRUPTION · IMPORT_ERROR · VERSION_MISMATCH · DEVICE_ASSERT · SILENT_HANG · NAN_LOSS · DISK_FULL · WEIGHT_DIVERGENCE
11 failure types with prescriptive fixes
04

Unknown failures get AI analysis

If your failure doesn't match a known pattern, Denpex uses AI to analyze and suggest what happened. Always get a next step, even for novel errors.

AI Analysis
Novel failure detected

This error pattern doesn't match known failures. AI analysis suggests checking memory allocator configuration and batch size settings.

Confidence: Lower — verify suggestions manually

One line. No config. Works on your next failure.

train.py
import denpex

# Add before your training loop
denpex.init(
    api_key="dpx_...",
    job_name="llama3-70b-finetune",
    notify=["slack", "sms"]  # optional
)

# The rest of your training code is unchanged
trainer.train()

What makes Denpex different

Diagnose from paste-only logs

You don't need an agent or integration. Paste your error output and get a diagnosis. Works with any framework: PyTorch, DeepSpeed, Megatron, Axolotl, whatever you're using.

Prescriptive fixes, not just error codes

Denpex doesn't just tell you what went wrong; it tells you how to fix it. Every diagnosis comes with a specific env var to change, config to update, or checkpoint to resume from.

15 common failure types covered

CUDA OOM, NCCL timeout, gradient explosion, checkpoint corruption, import errors, version mismatches, device asserts, memory fragmentation, silent data corruption, straggler detection, zombie processes, weight delta anomalies, and more — all with exact pattern matching and known fixes.

Instant diagnosis — no waiting

Pattern matching runs in seconds. No AI hallucination risk for known failures. Novel errors get AI analysis with confidence scores so you know how reliable the suggestion is.

Real cluster failures, real cost

Training infrastructure is expensive. Failures are expensive. Denpex helps you diagnose faster.

466 failures

in 54 days

Meta Llama 3.1-405B on 16,384 H100 GPUs. Real training runs, real failures.

57 vs 25 days

actual vs ideal runtime

Meta OPT-175B training. Failures add 32 days of delay.

89.9%

of failures need 3+ hours

Huawei Cloud 2025. Most failures require manual investigation.

$847

avg GPU cost recovered

Per incident on a 64-GPU cluster. Diagnose faster, waste less.


Compatible with your stack

Training Frameworks

  • PyTorch DDP
  • PyTorch FSDP
  • DeepSpeed ZeRO-1/2/3
  • Megatron-LM
  • Axolotl
  • LlamaFactory
  • Unsloth
  • NeMo

Compute Platforms

  • AWS SageMaker
  • GCP Vertex AI
  • Azure ML
  • Lambda Labs
  • CoreWeave
  • SLURM clusters
  • Kubernetes + Ray
  • On-premise

Built for every team running distributed training

Stop losing training runs to failures your senior engineers used to debug in 4 hours

Research labs burn GPU budget at a rate that demands zero tolerance for manual debugging. When a 70B fine-tune dies at step 45,000 — 18 hours in — you can't afford to spend another 4 hours finding out why. Denpex tells you the root cause in seconds, with the exact fix, so you resume from checkpoint immediately instead of rerunning from scratch.

3.1 hrs

Saved per incident avg

$847

Avg GPU cost recovered

94%

First-diagnosis accuracy

Your training logs contain your IP. We treat them that way.

PII/PHI Masking

Before any log is transmitted or processed, Denpex's client-side masking engine scans for patterns matching PII (names, emails, SSNs) and PHI (medical record patterns). Matched content is replaced with [MASKED] tokens before leaving your environment.
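As a rough picture of what client-side masking involves, the sketch below replaces email addresses and US Social Security numbers with `[MASKED]`. The two patterns are simplified assumptions, not Denpex's masking engine, and names or medical record identifiers would need more than a regular expression to catch reliably.

```python
import re

# Simplified, illustrative patterns; real PII detection needs broader coverage.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask(line):
    """Replace matched PII with a [MASKED] token before the line leaves the machine."""
    for pattern in (EMAIL, SSN):
        line = pattern.sub("[MASKED]", line)
    return line

masked = mask("sample 812 owner=alice@example.com ssn=123-45-6789")
```

Lines with no match pass through unchanged, so error text and stack traces stay intact for diagnosis.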

On-Premise Option

Scale and Data Center: The Denpex agent runs entirely within your VPC or cluster. Only anonymized failure signatures and resolution metadata are transmitted — never raw logs.

🔒

Encryption

All data encrypted with AES-256 at rest and TLS 1.3 in transit. Encryption keys are customer-managed on Enterprise.

Compliance

Working toward SOC 2 Type II certification. GDPR-ready data processing agreements available. HIPAA BAA available on Enterprise.

What's coming next

We're building the infrastructure layer for production ML training.

Coming Q4 2026

Pre-flight cluster health scan

Run a health check before launching an expensive job. Detect stale zombie processes, GPU memory fragmentation, version incompatibilities, and network partition issues before you waste a single training step.

Coming Q3 2026

Predictive failure scoring

Machine learning on your cluster's telemetry to predict GPU degradation, thermal throttling onset, and memory leak trajectories hours before they cause a failure.

Coming Q4 2026

Auto-remediation engine

For confirmed fix types, Denpex can automatically apply the fix: set the environment variable, kill zombie processes, adjust checkpointing frequency, and trigger a checkpoint resume — without human intervention.

Available Now

Enhanced checkpoint integrity validator

Before relying on a checkpoint saved 18 hours ago, validate that it is loadable, complete, and consistent. Prevent the worst scenario: discovering your only resume point is corrupted after a failure.
