macOS mp_kdp_enter() system crash

From Wikistix

TL;DR: In my case, this was likely faulty hardware, possibly a marginal CPU, and because I could, I had the laptop replaced.

Over the last day (2021-01-15), my 2020 Apple MacBook Pro 13" (running macOS, aka Mac OS X, 10.15.7 Catalina) has had three system crashes (panics, in UNIX/Linux speak) with the following signature:

Machine-check capabilities: 0x0000000000000c0b
 family: 6 model: 126 stepping: 5 microcode: 160
 signature: 0x706e5
 Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
 11 error-reporting banks
Processor 0: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
Processor 1: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
Processor 2: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
Processor 3: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
Processor 4: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
Processor 5: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
Processor 6: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
Processor 7: IA32_MCG_STATUS: 0x0000000000000005
 IA32_MC7_STATUS(0x41d): 0xfe20000000081152
 IA32_MC7_ADDR(0x41e):   0x000000000ab725c0
 IA32_MC7_MISC(0x41f):   0x0000007020008086
mp_kdp_enter() timed-out on cpu 4, NMI-ing
mp_kdp_enter() NMI pending on cpus: 0 1 2 3 5 6 7
mp_kdp_enter() timed-out during locked wait after NMI;expected 8 acks but received 1 after 2084268 loops in 998400000 ticks
panic(cpu 4 caller 0xffffff800ac4623c): "Machine Check at …
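
The register values in the dump can be unpacked by hand. The following is a minimal Python sketch, assuming the architectural bit layouts of IA32_MCi_STATUS, IA32_MCG_STATUS and the CPUID signature as documented in the Intel SDM. The function names are mine and purely illustrative, and the model-specific fields are left undecoded.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register holds valid error information
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected
    (60, "EN"),     # reporting was enabled for this error
    (59, "MISCV"),  # IA32_MCi_MISC is valid
    (58, "ADDRV"),  # IA32_MCi_ADDR is valid
    (57, "PCC"),    # processor context corrupt
]

# Low bits of IA32_MCG_STATUS.
MCG_STATUS_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]


def decode_mci_status(value):
    """Split an IA32_MCi_STATUS value into flags and error codes."""
    flags = [name for bit, name in MCI_STATUS_FLAGS if (value >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (value >> 16) & 0xFFFF,
        "mca_error_code": value & 0xFFFF,
    }


def decode_mcg_status(value):
    """Return the names of the flags set in IA32_MCG_STATUS."""
    return [name for bit, name in MCG_STATUS_FLAGS if (value >> bit) & 1]


def decode_cpuid_signature(sig):
    """Return (family, model, stepping) from a CPUID signature."""
    family = (sig >> 8) & 0xF
    model = (sig >> 4) & 0xF
    if family in (0x6, 0xF):
        model |= ((sig >> 16) & 0xF) << 4   # extended model bits
    if family == 0xF:
        family += (sig >> 20) & 0xFF        # extended family bits
    return family, model, sig & 0xF


if __name__ == "__main__":
    print(decode_mci_status(0xfe20000000081152))
    print(decode_mcg_status(0x5))
    print(decode_cpuid_signature(0x706e5))
```

For the values in the dump, this reports VAL, OVER, UC, EN, MISCV, ADDRV and PCC all set with MCA error code 0x1152, i.e. an uncorrected error with the processor context marked corrupt, and family 6, model 126, stepping 5, which matches the header lines. If I read the SDM's compound error code table correctly, 0x1152 fits the cache-hierarchy pattern, but treat that as an unverified reading.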

Searching around, there seems to be little real information about this particular crash signature. Most recommendations are the usual ones (re-install, remove hardware, remove drivers, unplug USB devices, etc.), which seem largely unhelpful.

Reading this text against the kernel source, this is a secondary panic: the kernel has panicked (initial panic cause lost to the bit bucket, but I'm wondering if it was a machine check?), and tried to invoke the kernel debugger, which at a very early stage attempts to halt all other processors. If the initial inter-processor interrupt (IPI) is not acknowledged in time, it then tries a non-maskable interrupt (NMI) IPI, which here also timed out: only 1 of the 8 expected acknowledgements arrived. That this operation timed out likely indicates a CPU configuration, firmware or hardware issue.
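
The two-stage escalation described above can be modelled in a few lines. This is a toy Python sketch of the logic as I understand it, not the kernel's actual code: the function name, its parameters and the message wording are all hypothetical, and real timeouts are replaced by sets of CPUs that would or would not respond.

```python
def halt_other_cpus(ncpus, caller, responds_to_ipi, responds_to_nmi):
    """Model the halt sequence: ordinary IPI first, then NMI for stragglers.

    responds_to_ipi / responds_to_nmi are sets of CPU numbers that would
    acknowledge each kind of interrupt (hypothetical inputs).
    Returns (set of CPUs that acknowledged, list of log messages).
    """
    log = []
    acked = {caller}                          # the calling CPU counts as one ack
    others = set(range(ncpus)) - {caller}

    # Stage 1: ordinary IPI, which a CPU can miss if interrupts are masked.
    acked |= others & responds_to_ipi
    pending = others - acked

    # Stage 2: NMI to whoever has not checked in yet.
    if pending:
        log.append(f"timed out on cpu {caller}, NMI-ing")
        log.append("NMI pending on cpus: " + " ".join(str(c) for c in sorted(pending)))
        acked |= pending & responds_to_nmi

    # If some CPUs still have not acknowledged, report the shortfall.
    if len(acked) < ncpus:
        log.append(
            f"timed out after NMI; expected {ncpus} acks but received {len(acked)}"
        )
    return acked, log


if __name__ == "__main__":
    # Reproduce the shape of my crash: cpu 4 calls, nobody else answers.
    acked, log = halt_other_cpus(8, 4, set(), set())
    for line in log:
        print(line)
```

With 8 CPUs, cpu 4 as the caller and no other CPU responding to either interrupt, the model ends with a single acknowledgement out of 8, the same shape as the messages in my panic log.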

So far, I have reset the SMC, which delayed the next batch of crashes by one week. I believe one of the things the SMC is responsible for is CPU power management, together with sleep states, thermal management, etc. It may be that one core on my machine is marginal, and runs into timing issues if not configured correctly. Another hunch is a thermal issue, as the crashes appear to have occurred only under light system load (<5%), with high ambient temperatures (>28°C), high case temperatures and very low fan speed.

Update 2021-01-21: I started getting crashes immediately after logging in. Resetting the SMC and NVRAM had no effect. I managed to boot into recovery mode and attempted to run Disk First Aid, and the machine crashed again, twice in two attempts. Clearly the marginal hardware (CPU?) is getting worse. Time to replace the hardware.