Postmortems/2026-05-05 Jellyfin XE GPU Hang
As of: 2026-07-05 — migrated from `postmortems/2026-05-05-jellyfin-xe-gpu-hang.md`
Postmortem: Jellyfin AV1 Transcoding Causes System Hard Hang
- Date: 2026-05-05
- Duration: ~2 minutes (20:28 → hard reset → 20:31 back online)
- Severity: High (full system halt, required physical hard reset)
- Status: Mitigated (HW transcoding disabled)
Summary
Pressing play on a video in Jellyfin triggered hardware-accelerated AV1 encoding via Intel QuickSync (QSV) on the Intel Arc B580 (Battlemage). The xe kernel driver entered a GPU execution queue timeout loop (guc_exec_queue_timedout_job) on both GT0 and GT1 and could not recover, and the system hard-locked. Recovery required a physical power reset.
Timeline
| Time (CST) | Event |
|---|---|
| 20:28:53 | Jellyfin starts ffmpeg transcoding Omega Doom (1996).mp4 using av1_qsv encoder via VAAPI/QSV |
| 20:30:09 | Kernel floods journal with sysrq: HELP messages (system in distress) |
| 20:30:40 | xe driver reports guc_exec_queue_timedout_job resets on GT0 and GT1 repeatedly (dozens of attempts, all failing) |
| ~20:30:45 | System hard-locks. No response to input. |
| ~20:31:28 | Hard reset performed. System boots. All services recover normally. |
Root Cause
Jellyfin was configured to use Intel QuickSync (QSV) hardware acceleration. When transcoding the H.264 source file, ffmpeg selected av1_qsv as the output encoder. AV1 encoding on the Arc B580 via the xe kernel driver proved unstable on this system (kernel 6.18.1): the GPU execution queues timed out, the driver attempted repeated resets without recovering, and the system hard-locked.
The xe driver (Intel's replacement for i915 for Arc/Battlemage GPUs) is still maturing. AV1 hardware encoding is the newest and least-tested codec path on this driver. The system has no watchdog capable of recovering from this class of GPU hang.
ffmpeg command at crash:
<syntaxhighlight lang="text">
/usr/lib/jellyfin-ffmpeg/ffmpeg
  -init_hw_device vaapi=va:/dev/dri/renderD128,driver=iHD
  -init_hw_device qsv=qs@va
  -hwaccel vaapi -hwaccel_output_format vaapi
  -codec:v:0 av1_qsv ...
</syntaxhighlight>
Kernel error (repeated ~50+ times):
<syntaxhighlight lang="text">
xe 0000:0b:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
xe 0000:0b:00.0: [drm] Tile0: GT1: trying reset from guc_exec_queue_timedout_job [xe]
</syntaxhighlight>
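To put a number on "repeated ~50+ times", the reset attempts can be tallied per GT. The following is a minimal sketch, not part of the original investigation: `count_gt_resets` is a hypothetical helper name, and it assumes the log lines keep the exact message format quoted above.

```shell
# Hypothetical helper: count xe reset attempts per GT from kernel log
# lines supplied on stdin. Assumes the message format quoted above.
count_gt_resets() {
    grep -o 'GT[0-9]*: trying reset from guc_exec_queue_timedout_job' \
        | cut -d: -f1 \
        | sort \
        | uniq -c \
        | awk '{ print $2 "=" $1 }'
}

# Example against the previous boot (needs permission to read the journal):
#   journalctl -k -b -1 --no-pager | count_gt_resets
```

Each output line has the form `GT<n>=<count>`. An empty result means no matching reset messages were found in the input.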
What Was NOT the Cause
- This is unrelated to the previous `enable_dc=0` fix (that addressed display C-state timeouts on monitor switch, a separate xe instability)
- Not an OOM event: 62GB RAM, only 13GB used
- Not a filesystem or RAID issue
Fix Applied
Disabled hardware transcoding in Jellyfin:
- Dashboard → Admin → Playback → Transcoding → Hardware acceleration → None
ffmpeg now uses software (CPU) encoding. The Ryzen 5 3600X (12 threads) is capable of handling typical transcoding load in software.
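One way to confirm the change took effect is to inspect the command line of a running transcode and check that no hardware flags remain. This is a rough sketch under assumptions: `classify_ffmpeg_cmdline` is a hypothetical helper, and the flag patterns are taken from the crash-time command above, so they may not cover every hardware path.

```shell
# Hypothetical helper: classify ffmpeg command lines read from stdin.
# Patterns come from the crash-time command (-init_hw_device, -hwaccel,
# *_qsv / *_vaapi codec names); this is not an exhaustive list.
classify_ffmpeg_cmdline() {
    cmdline=$(cat)
    if [ -z "$cmdline" ]; then
        echo "no-ffmpeg"
    elif printf '%s\n' "$cmdline" \
            | grep -Eq -e '-init_hw_device|-hwaccel|_(qsv|vaapi)'; then
        echo "hardware"
    else
        echo "software"
    fi
}

# Example while a video is playing in Jellyfin:
#   pgrep -a ffmpeg | classify_ffmpeg_cmdline
```

A result of `software` during playback suggests the CPU encoder is in use. `no-ffmpeg` only means no transcode was running at that moment.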
Diagnostic Commands
<syntaxhighlight lang="bash">
# Check previous boot's GPU errors
journalctl -b -1 | grep "xe\|guc\|drm"

# Check current boot for xe issues
dmesg -T | grep -i "xe\|guc\|drm\|hang\|reset"

# See what Jellyfin's ffmpeg was doing at crash time
journalctl -u jellyfin --since "2026-05-05 20:25" --until "2026-05-05 20:32"
</syntaxhighlight>
Hardware Context
| Component | Detail |
|---|---|
| GPU | Intel Arc B580 (Battlemage), PCI 0b:00.0, device e20b |
| Kernel driver | xe (replacing i915 for Battlemage) |
| Kernel | 6.18.1-061801-generic |
| CPU | AMD Ryzen 5 3600X (6c/12t) |
| RAM | 62GB |
| Existing xe fix | options xe enable_dc=0 in modprobe.d (display C-state fix, unrelated) |
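The driver binding and the existing module option in the table can be re-checked from a shell. This is a small sketch: `bound_driver` is a hypothetical helper that pulls the "Kernel driver in use" field out of `lspci -k` output, and the PCI address is the one listed in the table.

```shell
# Hypothetical helper: extract the bound kernel driver from
# `lspci -k` output supplied on stdin.
bound_driver() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Examples (PCI address taken from the table above):
#   lspci -k -s 0b:00.0 | bound_driver
#   grep -rs 'enable_dc' /etc/modprobe.d/
```

On this system the first command would be expected to print `xe`, and the second should show the file that carries the `enable_dc=0` option.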
Options for Restoring Hardware Transcoding
Option A: Replace GPU with NVIDIA RTX 3060/4060
- NVENC/NVDEC drivers are rock-solid on Linux
- RTX 3060: AV1 decode only (no AV1 encode)
- RTX 4060: AV1 encode + decode, ~$300
- Drop-in PCIe x16, works with AM4/X570 platform
- No driver instability issues for media workloads
Option B: Replace GPU with Intel Arc A770 (Alchemist gen)
- Uses `i915` driver (mature, not `xe`)
- Full AV1 encode + decode
- ~$200–250
- Lower risk than Battlemage on xe
Option C: Wait for xe driver to mature / upgrade kernel
- xe driver improves rapidly; a future kernel may handle this gracefully
- No cost, but timeline is unknown
- Risky — hangs are full system halts, not graceful crashes
Option D: Keep B580, disable AV1 encode in Jellyfin only
- Keep VAAPI HW accel but force output codec to H.264 or H.265
- H.264/H.265 VAAPI paths on xe are more stable (but not guaranteed crash-free)
- Still some risk of xe driver hang on other codec paths
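If Option D is tried, it would help to confirm what Jellyfin has actually persisted before playing anything. The sketch below reads single-line elements from the encoding configuration. It makes several assumptions: `xml_value` is a hypothetical helper, the path `/etc/jellyfin/encoding.xml` matches the usual Debian/Ubuntu package layout, and the element names `HardwareAccelerationType` and `AllowAv1Encoding` are as recalled from recent Jellyfin releases. All of these should be checked against the file on the server.

```shell
# Hypothetical helper: print the text of a simple one-line XML element,
# e.g. <Name>value</Name>, from XML supplied on stdin.
xml_value() {
    sed -n "s:.*<$1>\\(.*\\)</$1>.*:\\1:p"
}

# Examples (path and element names are assumptions, verify locally):
#   xml_value HardwareAccelerationType < /etc/jellyfin/encoding.xml
#   xml_value AllowAv1Encoding < /etc/jellyfin/encoding.xml
```

For Option D the AV1 setting would be expected to read `false` while the acceleration type stays on a hardware value. This only inspects configuration; it does not show that the remaining codec paths are stable on xe.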
Recommendation: Option A (RTX 4060) if budget allows — eliminates the entire class of xe driver instability and gives better AV1 support. Option B (Arc A770) is the budget pick if you want to stay Intel.
Follow-up Actions
| Action | Priority | Status |
|---|---|---|
| Disable HW transcoding in Jellyfin | Critical | Done |
| Decide on GPU replacement | Medium | Pending |
| Add stronghold hardware specs to INFRASTRUCTURE_INVENTORY.md | Low | Pending |