Analyzing Deepseek's "undefined" NVIDIA PTX optimizations (with benchmarks!)

LaurieWired 106,176 views 2 months ago

Two days ago, Deepseek surprised everyone with an "undefined-behavior" PTX optimization that speeds up particular ML workloads in a kernel for NVIDIA Hopper GPUs.

Let's reverse engineer the hack, implement it ourselves, and benchmark the speedup on an H100.

--

Link to my test code:
https://github.com/LaurieWired/BenchmarkCustomPTX

--

Timestamps

00:00 CUDA vs PTX vs SASS
02:12 Global Memory Target
03:27 Custom PTX Walkthrough
06:40 NVIDIA ISA Reference
07:42 Example Implementation
10:38 H100 Benchmark
11:46 SASS (Machine) Code

---

Follow LaurieWired on Social Media:
►https://linktr.ee/lauriewired

---