
Postmortems/2026-05-05 Jellyfin XE GPU Hang

From Stronghold Wiki

As of: 2026-07-05 — migrated from `postmortems/2026-05-05-jellyfin-xe-gpu-hang.md`

Postmortem: Jellyfin AV1 Transcoding Causes System Hard Hang

Date: 2026-05-05
Duration: ~2 minutes (20:28 → hard reset → 20:31 back online)
Severity: High (full system halt, required physical hard reset)
Status: Mitigated (HW transcoding disabled)


Summary

Pressing play on a video in Jellyfin triggered hardware-accelerated AV1 encoding via Intel QuickSync (QSV) on the Intel Arc B580 (Battlemage). The xe kernel driver entered a GPU execution queue timeout loop (guc_exec_queue_timedout_job) on both GT0 and GT1 and could not recover, and the system hard-locked. A physical power reset was required to recover.


Timeline

Time (CST) | Event
20:28:53 | Jellyfin starts ffmpeg transcoding Omega Doom (1996).mp4 using av1_qsv encoder via VAAPI/QSV
20:30:09 | Kernel floods journal with sysrq: HELP messages (system in distress)
20:30:40 | xe driver reports guc_exec_queue_timedout_job resets on GT0 and GT1 repeatedly (dozens of attempts, all failing)
~20:30:45 | System hard-locks. No response to input.
~20:31:28 | Hard reset performed. System boots. All services recover normally.
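The intervals between timeline entries can be checked with a small helper. This is a minimal sketch; `to_seconds` and `elapsed` are hypothetical helper names, and it assumes both timestamps fall on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight (awk reads "08" as 8, not octal).
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# elapsed START END: seconds between two same-day timestamps.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Usage:
#   elapsed 20:28:53 20:30:40   # transcode start to the logged reset attempts
#   elapsed 20:28:53 20:31:28   # transcode start to hard reset
```

By this calculation the reset attempts were logged 107 seconds after the transcode started, and the hard reset followed at 155 seconds.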

Root Cause

Jellyfin was configured to use Intel QuickSync (QSV) hardware acceleration. When transcoding the H.264 source file, it launched ffmpeg with av1_qsv as the output encoder. AV1 encoding on the Arc B580 via the xe kernel driver proved unstable: the GPU execution queues timed out, the driver attempted repeated resets without recovering, and the kernel hung.

The xe driver (Intel's successor to i915, used for Battlemage GPUs) is still maturing. AV1 hardware encoding is the newest and least-tested codec path on this driver. The system has no watchdog capable of recovering from this class of GPU hang.

ffmpeg command at crash:

<syntaxhighlight lang="text">
/usr/lib/jellyfin-ffmpeg/ffmpeg
  -init_hw_device vaapi=va:/dev/dri/renderD128,driver=iHD
  -init_hw_device qsv=qs@va
  -hwaccel vaapi -hwaccel_output_format vaapi
  -codec:v:0 av1_qsv
  ...
</syntaxhighlight>

Kernel error (repeated ~50+ times):

<syntaxhighlight lang="text">
xe 0000:0b:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]
xe 0000:0b:00.0: [drm] Tile0: GT1: trying reset from guc_exec_queue_timedout_job [xe]
</syntaxhighlight>
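To quantify how often each GT attempted a reset, the kernel log can be tallied per GT. The following is a minimal sketch; `count_gt_resets` is a hypothetical helper name, and it assumes the log lines have the same shape as the excerpt above.

```shell
# Count xe reset attempts per GT from kernel log text on stdin.
# Prints one "GTn=count" line per GT; prints nothing if no line matches.
count_gt_resets() {
    grep -o 'GT[0-9]*: trying reset from guc_exec_queue_timedout_job' \
        | cut -d: -f1 \
        | sort \
        | uniq -c \
        | awk '{ print $2 "=" $1 }'
}

# Usage on the affected host (kernel messages from the previous boot):
#   journalctl -k -b -1 --no-pager | count_gt_resets
```

Running this against the boot that hung should show whether GT0 and GT1 failed at similar rates, which may help when comparing against a future kernel.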


What Was NOT the Cause

  • This is unrelated to the previous enable_dc=0 fix (that addressed display C-state timeouts on monitor switch, a separate xe instability)
  • Not an OOM event — 62GB RAM, only 13GB used
  • Not a filesystem or RAID issue

Fix Applied

Disabled hardware transcoding in Jellyfin:

  • Dashboard → Admin → Playback → Transcoding → Hardware acceleration → None

ffmpeg now uses software (CPU) encoding. The Ryzen 5 3600X (12 threads) is capable of handling typical transcoding load in software.
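One way to confirm the fix took effect is to inspect the command line of a running transcode and check that no QSV or VAAPI encoder is selected. This is a rough sketch; `uses_hw_encoder` is a hypothetical helper, and it assumes hardware encoders are identifiable by a `_qsv` or `_vaapi` suffix, as in the crash command above.

```shell
# Succeeds (exit 0) if the given ffmpeg command line selects a video
# encoder whose name ends in _qsv or _vaapi, e.g. "-codec:v:0 av1_qsv".
uses_hw_encoder() {
    printf '%s\n' "$1" \
        | grep -Eq -- '-(codec|c):v(:[0-9]+)? +[a-z0-9]+_(qsv|vaapi)( |$)'
}

# Usage while a video is playing (pgrep -a lists PID and full command line):
#   pgrep -a ffmpeg | while read -r line; do
#       uses_hw_encoder "$line" && echo "HW encoder still in use: $line"
#   done
```

No output during playback suggests ffmpeg is running with a software encoder.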


Diagnostic Commands

<syntaxhighlight lang="bash">
# Check previous boot's GPU errors
journalctl -b -1 | grep "xe\|guc\|drm"

# Check current boot for xe issues
dmesg -T | grep -i "xe\|guc\|drm\|hang\|reset"

# See what Jellyfin's ffmpeg was doing at crash time
journalctl -u jellyfin --since "2026-05-05 20:25" --until "2026-05-05 20:32"
</syntaxhighlight>


Hardware Context

Component | Detail
GPU | Intel Arc B580 (Battlemage), PCI 0b:00.0, device e20b
Kernel driver | xe (replacing i915 for Battlemage)
Kernel | 6.18.1-061801-generic
CPU | AMD Ryzen 5 3600X (6c/12t)
RAM | 62GB
Existing xe fix | options xe enable_dc=0 in modprobe.d (display C-state fix, unrelated)
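The driver binding and the existing modprobe option can be re-checked when this table is updated. This is a minimal sketch; `has_xe_option` is a hypothetical helper, and the `/etc/modprobe.d/*.conf` location is an assumption, since the page records only "modprobe.d".

```shell
# has_xe_option NAME VALUE: reads modprobe.d-style text on stdin and
# succeeds if an uncommented "options xe ..." line sets NAME=VALUE.
has_xe_option() {
    grep -Eq "^[[:space:]]*options[[:space:]]+xe[[:space:]]+(.*[[:space:]])?$1=$2([[:space:]]|\$)"
}

# Usage on the host:
#   cat /etc/modprobe.d/*.conf | has_xe_option enable_dc 0 && echo "enable_dc=0 is set"
#   lspci -k -s 0b:00.0    # the "Kernel driver in use" line should name xe
```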

Options for Restoring Hardware Transcoding

Option A: Replace GPU with NVIDIA RTX 3060/4060

  • NVENC/NVDEC drivers are rock-solid on Linux
  • RTX 3060: AV1 decode only (no AV1 encode)
  • RTX 4060: AV1 encode + decode, ~$300
  • Drop-in PCIe x16, works with AM4/X570 platform
  • No driver instability issues for media workloads

Option B: Replace GPU with Intel Arc A770 (Alchemist gen)

  • Uses i915 driver (mature, not xe)
  • Full AV1 encode + decode
  • ~$200–250
  • Lower risk than Battlemage on xe

Option C: Wait for xe driver to mature / upgrade kernel

  • xe driver improves rapidly; a future kernel may handle this gracefully
  • No cost, but timeline is unknown
  • Risky — hangs are full system halts, not graceful crashes

Option D: Keep B580, disable AV1 encode in Jellyfin only

  • Keep hardware acceleration (QSV/VAAPI) enabled but force the output codec to H.264 or H.265
  • H.264/H.265 VAAPI paths on xe are more stable (but not guaranteed crash-free)
  • Still some risk of xe driver hang on other codec paths
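If Option D is chosen, the AV1 setting could be checked in Jellyfin's encoding configuration before hardware acceleration is re-enabled. The sketch below rests on two assumptions that should be verified against the installed Jellyfin version: that the configuration lives at `/etc/jellyfin/encoding.xml`, and that AV1 output is controlled by an `AllowAv1Encoding` element. `av1_encoding_disabled` is a hypothetical helper name.

```shell
# Reads encoding.xml content on stdin; succeeds only if AV1 encoding is
# explicitly set to false. The element name is an assumption (see lead-in).
av1_encoding_disabled() {
    grep -q '<AllowAv1Encoding>false</AllowAv1Encoding>'
}

# Usage on the host (path is an assumption for Debian/Ubuntu packages):
#   av1_encoding_disabled < /etc/jellyfin/encoding.xml \
#       && echo "AV1 encode disabled" \
#       || echo "AV1 encode may still be allowed"
```

A check like this only confirms the setting. It does not rule out hangs on the H.264/H.265 paths noted above.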

Recommendation: Option A (RTX 4060) if budget allows. It removes the xe driver from the media path entirely and provides AV1 encode and decode through NVENC/NVDEC. Option B (Arc A770) is the budget pick for staying on Intel.


Follow-up Actions

Action | Priority | Status
Disable HW transcoding in Jellyfin | Critical | Done
Decide on GPU replacement | Medium | Pending
Add stronghold hardware specs to INFRASTRUCTURE_INVENTORY.md | Low | Pending