Hyper-threading and CPU time

When is a CPU second not a CPU second? When you are running with hyper-threading (aka HT, HTT, or Simultaneous Multi-Threading (SMT)) enabled. Here's a simple demonstration.
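
Before digging in, it is worth confirming that hyper-threading is actually enabled on the machine under test. A minimal check on Linux (a hedged sketch; the smt sysfs node only exists on reasonably recent kernels):

bash$ lscpu | grep -i 'thread(s) per core'
bash$ cat /sys/devices/system/cpu/smt/active

Two threads per core (or a 1 in the smt/active file) means hyper-threading is on; on NetBSD, the HTT feature flag and SMT ID lines in the cpuctl/dmesg output quoted below tell the same story.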

NetBSD 4.0 on a Pentium 4

The system here has an "Intel(R) Pentium(R) 4 CPU 2.80GHz", a single core (one "physical" CPU) with hyper-threading enabled (giving two "logical" CPUs), running NetBSD 4.0 with an SMP kernel. We run a deterministic unit of work on an idle system:

ksh$ time gzip -9 < zz > /dev/null
   10.28s real    10.05s user     0.24s system
ksh$ time gzip -9 < zz > /dev/null
   10.26s real    10.05s user     0.20s system
ksh$ time gzip -9 < zz > /dev/null
   10.31s real    10.08s user     0.23s system

The times are fairly consistent and, roughly, real = user + sys. Next we add an arbitrary load to the system. We assume the kernel will now schedule each runnable thread on its own logical CPU, and it is then up to the CPU's hyper-threading logic how the two instruction streams are interleaved on the single core.

ksh$ perl -e 'while(1){}' &
[1] 9382
ksh$ time gzip -9 < zz > /dev/null
   15.36s real    14.96s user     0.36s system
ksh$ time gzip -9 < zz > /dev/null
   15.49s real    14.97s user     0.34s system
ksh$ time gzip -9 < zz > /dev/null
   15.41s real    14.95s user     0.37s system

OK, so what has happened here? The real time has increased by about 50%, but so has the user time. On the same system with hyper-threading disabled, we would expect the user time to remain about the same and the real time to approximately double. Here, because both threads are really sharing the same core and its execution resources, they compete and slow each other down. However, as the real time has not doubled, the overall throughput of the system has still increased over the uni-processor (hyper-threading disabled) case.

Also, adding more load only increases the real time: only two threads can ever execute in parallel, and the extra processes simply time-slice, so gzip's user time stays about the same.

ksh$ perl -e 'while(1){}' &
[2] 12480
ksh$ perl -e 'while(1){}' &        
[3] 29686
ksh$ perl -e 'while(1){}' &        
[4] 12019
ksh$ time gzip -9 < zz > /dev/null
   38.14s real    15.12s user     0.33s system
ksh$ time gzip -9 < zz > /dev/null
   34.45s real    15.11s user     0.25s system
ksh$ time gzip -9 < zz > /dev/null
   37.96s real    15.04s user     0.34s system

This is where a common metric comes in: Cycles Per Instruction (CPI), also often quoted as its inverse, Instructions Per Cycle (IPC). In the hyper-threading example above, when the second hyper-thread was busy, the CPU took more cycles to execute the same set of instructions. If we measured the CPI of gzip, we would see an increase roughly proportional to the increase in user CPU time.
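
On Linux (as used in the later tests below), perf(1) can measure this directly; a minimal sketch, assuming perf is installed and the same zz input file:

bash$ perf stat -e cycles,instructions gzip -9 < zz > /dev/null

perf stat prints the raw cycle and instruction counts (and, in recent versions, an "insn per cycle" figure); repeating the run with the sibling hyper-thread busy should show essentially the same instruction count but noticeably fewer instructions per cycle.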

For reference, the CPU tested was:

cpu0: Intel Pentium 4 (686-class), 2798.79 MHz, id 0xf25
cpu0: features 0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 0xbfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features 0xbfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 0x4400<CID,xTPR>
cpu0: "Intel(R) Pentium(R) 4 CPU 2.80GHz"
cpu0: I-cache 12K uOp cache 8-way, D-cache 8KB 64B/line 4-way
cpu0: L2 cache 512KB 64B/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: Initial APIC ID 1
cpu0: Cluster/Package ID 0
cpu0: SMT ID 1
cpu0: family 0f model 02 extfamily 00 extmodel 00

Linux 2.6 on a Xeon X5650

Second test, on Linux 2.6.38 on a 6-physical-core Xeon (Intel(R) Xeon(R) CPU X5650 @ 2.67GHz). We use taskset to select which logical CPUs these processes run on:

bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
11.27user 0.07system 0:11.34elapsed 99%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
11.18user 0.01system 0:11.19elapsed 99%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+230minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
11.21user 0.05system 0:11.26elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
0inputs+0outputs (0major+228minor)pagefaults 0swaps
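
Here, logical CPUs 5 and 11 are the two hyper-threads of one physical core. A quick way to confirm the sibling mapping on Linux (a sketch; paths and option support vary with kernel and util-linux version):

bash$ cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
bash$ lscpu -e=CPU,CORE,SOCKET

The first command lists the logical CPUs sharing cpu5's core, and the second prints the full CPU-to-core mapping.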

Start a CPU-burning thread on the other hyper-thread of that core, and retest:

bash$ taskset -c 11 perl -e 'while(1){}' &
[1] 4391
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.90user 0.09system 0:17.00elapsed 99%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.80user 0.03system 0:16.84elapsed 99%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+230minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.71user 0.07system 0:16.79elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps

And just to complete our set of tests:

bash$ taskset -c 11 perl -e 'while(1){}' &
[2] 4730
bash$ taskset -c 11 perl -e 'while(1){}' &
[3] 4731
bash$ taskset -c 11 perl -e 'while(1){}' &
[4] 4734
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.66user 0.06system 0:16.73elapsed 99%CPU (0avgtext+0avgdata 2928maxresident)k
0inputs+0outputs (0major+228minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.60user 0.07system 0:16.68elapsed 99%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.71user 0.08system 0:16.80elapsed 99%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps

Whoa, what happened here? Since we are explicitly selecting which logical CPU each process runs on, the second hyper-thread (CPU 11) now has all four perl processes time-slicing on it, while the first (CPU 5) still has gzip to itself, so gzip's times barely change. For a test matching the NetBSD case, we could instead let everything float across both hyper-threads of the core:

bash$ taskset -c 5,11 perl -e 'while(1){}' &
[1] 4966
bash$ taskset -c 5,11 perl -e 'while(1){}' &
[2] 4969
bash$ taskset -c 5,11 perl -e 'while(1){}' &
[3] 4970
bash$ taskset -c 5,11 perl -e 'while(1){}' &
[4] 4972
bash$ taskset -c 5,11 time gzip -9 < /tmp/zz > /dev/null
16.63user 0.04system 0:42.45elapsed 39%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
bash$ taskset -c 5,11 time gzip -9 < /tmp/zz > /dev/null
16.72user 0.11system 0:42.89elapsed 39%CPU (0avgtext+0avgdata 2944maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
bash$ taskset -c 5,11 time gzip -9 < /tmp/zz > /dev/null
16.83user 0.08system 0:43.64elapsed 38%CPU (0avgtext+0avgdata 2928maxresident)k
0inputs+0outputs (0major+228minor)pagefaults 0swaps
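
If there is any doubt about where things ended up, the affinity and last-run CPU of each process can be checked (a sketch, using a PID from the transcript above):

bash$ taskset -cp 4972
bash$ ps -o pid,psr,comm -p 4972

taskset -p reports the process's current affinity list (here it should be 5,11), and the psr column shows which logical CPU the process last ran on.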

NetBSD 7.0 on Intel Core i7 (Sandy Bridge)

And a more modern example on NetBSD, on an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz. First a baseline, with schedctl binding the processes to logical CPUs 3 and 7, the two hyper-threads of one physical core:

ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       10.37 real        10.06 user         0.30 sys
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       10.37 real        10.17 user         0.18 sys
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       10.40 real        10.08 user         0.28 sys

With a single spinning process:

ksh$ sudo schedctl -A 3,7 perl -e 'while(1){}' &
[1] 20565
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       14.63 real        13.69 user         0.21 sys
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       14.46 real        14.24 user         0.22 sys
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       14.46 real        14.26 user         0.20 sys

And now with 3 more spinning processes:

ksh$ sudo schedctl -A 3,7 perl -e 'while(1){}' &            
[2] 19974
ksh$ sudo schedctl -A 3,7 perl -e 'while(1){}' &
[3] 25182
ksh$ sudo schedctl -A 3,7 perl -e 'while(1){}' &
[4] 27197
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       32.05 real        14.22 user         0.29 sys
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       28.45 real        14.22 user         0.27 sys
ksh$ sudo schedctl -A 3,7 time gzip -9 < /tmp/zz > /dev/null
       38.47 real        14.28 user         0.21 sys

All pretty much as expected. Single-thread latency increases by about 36%, for a multi-threaded instruction throughput increase of around 47%.
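
(Working from the user times of the first runs above: [math]\displaystyle{ 13.69 / 10.06 \approx 1.36 }[/math], and the corresponding throughput gain is [math]\displaystyle{ 2 / 1.36 \approx 1.47 }[/math].)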

For another test, we compute the SHA1 of a 4GiB file cached in RAM, use the same command as the load keeping the other hyper-thread busy, and bind each process to a single logical CPU:

ksh$ sudo schedctl -A 3 time openssl sha1 < /tmp/zz > /dev/null
       10.52 real         6.58 user         3.90 sys
ksh$ sudo schedctl -A 3 time openssl sha1 < /tmp/zz > /dev/null
       10.39 real         6.56 user         3.81 sys
ksh$ sudo schedctl -A 3 time openssl sha1 < /tmp/zz > /dev/null
       10.35 real         6.41 user         3.90 sys
ksh$ sudo schedctl -A 7 sh -c 'while :; do openssl sha1 < zz > /dev/null; done' &
[1] 2406
ksh$ sudo schedctl -A 3 time openssl sha1 < /tmp/zz > /dev/null                       
       16.40 real        12.56 user         3.82 sys
ksh$ sudo schedctl -A 3 time openssl sha1 < /tmp/zz > /dev/null
       16.33 real        12.50 user         3.82 sys
ksh$ sudo schedctl -A 3 time openssl sha1 < /tmp/zz > /dev/null
       16.44 real        12.44 user         3.98 sys

For reference, the CPU is:

ksh$ sudo cpuctl identify 3
cpu3: highest basic info 0000000d
cpu3: highest extended info 80000008
cpu3: "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz"
cpu3: Intel Xeon E3-12xx, 2nd gen i7, i5, i3 2xxx (686-class), 3392.45 MHz
cpu3: family 0x6 model 0x2a stepping 0x7 (id 0x206a7)
cpu3: features 0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE>
cpu3: features 0xbfebfbff<MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2>
cpu3: features 0xbfebfbff<SS,HTT,TM,SBF>
cpu3: features1 0x1fbae3ff<SSE3,PCLMULQDQ,DTES64,MONITOR,DS-CPL,VMX,SMX,EST>
cpu3: features1 0x1fbae3ff<TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE41,SSE42,X2APIC>
cpu3: features1 0x1fbae3ff<POPCNT,DEADLINE,AES,XSAVE,OSXSAVE,AVX>
cpu3: features2 0x28100800<SYSCALL/SYSRET,XD,RDTSCP,EM64T>
cpu3: features3 0x1<LAHF>
cpu3: xsave features 0x7<x87,SSE,AVX>
cpu3: xsave instructions 0x1<XSAVEOPT>
cpu3: xsave area size: current 832, maximum 832, xgetbv enabled
cpu3: enabled xsave 0x7<x87,SSE,AVX>
cpu3: I-cache 32KB 64B/line 8-way, D-cache 32KB 64B/line 8-way
cpu3: L2 cache 256KB 64B/line 8-way
cpu3: L3 cache 8MB 64B/line 16-way
cpu3: 64B prefetching
cpu3: ITLB 64 4KB entries 4-way, 2M/4M: 8 entries
cpu3: DTLB 64 4KB entries 4-way, 2M/4M: 32 entries (L0)
cpu3: L2 STLB 512 4KB entries 4-way
cpu3: Initial APIC ID 6
cpu3: Cluster/Package ID 0
cpu3: Core ID 3
cpu3: SMT ID 0
cpu3: DSPM-eax 0x77<DTS,IDA,ARAT,PLN,ECMD,PTM>
cpu3: DSPM-ecx 0x9<HWF,EPB>
cpu3: SEF highest subleaf 00000000
cpu3: microcode version 0x23, platform ID 1

Linux 3.13 on Xeon E5-1650

A slightly more modern CPU, running the same gzip test bound to logical CPU 5:

bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
12.06user 0.08system 0:12.16elapsed 99%CPU (0avgtext+0avgdata 812maxresident)k
0inputs+0outputs (0major+253minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
12.03user 0.06system 0:12.11elapsed 99%CPU (0avgtext+0avgdata 812maxresident)k
0inputs+0outputs (0major+253minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
12.23user 0.06system 0:12.31elapsed 99%CPU (0avgtext+0avgdata 812maxresident)k
0inputs+0outputs (0major+253minor)pagefaults 0swaps

Busying the other hyper-thread of that core:

bash$ taskset -c 11 perl -e 'while(1){}' &
[1] 15995
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
17.02user 0.07system 0:17.12elapsed 99%CPU (0avgtext+0avgdata 812maxresident)k
0inputs+0outputs (0major+253minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.92user 0.09system 0:17.04elapsed 99%CPU (0avgtext+0avgdata 808maxresident)k
0inputs+0outputs (0major+253minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
16.82user 0.09system 0:16.94elapsed 99%CPU (0avgtext+0avgdata 808maxresident)k
0inputs+0outputs (0major+253minor)pagefaults 0swaps

So, in this very primitive test, about a 40% increase in CPU time (equating to single-thread latency), which also means an approximately 43% increase in overall throughput ([math]\displaystyle{ 2/1.4 \approx 1.43 }[/math]) from enabling hyper-threading (overall instruction throughput across multiple threads).
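
More generally, if a job takes [math]\displaystyle{ T }[/math] seconds of CPU alone and [math]\displaystyle{ fT }[/math] seconds against a busy sibling hyper-thread (with [math]\displaystyle{ 1 \le f \le 2 }[/math]), then two such jobs run in parallel finish in [math]\displaystyle{ fT }[/math] seconds versus [math]\displaystyle{ 2T }[/math] back-to-back, for a throughput gain of [math]\displaystyle{ 2/f - 1 }[/math]; [math]\displaystyle{ f = 1.4 }[/math] gives the 43% figure above. (A rough model, ignoring system time and scheduling overhead.)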

CPU for this test was:

Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz.

Linux 6.5.13 on AMD Ryzen Threadripper PRO 3995WX

bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
10.61user 0.06system 0:10.67elapsed 99%CPU (0avgtext+0avgdata 1024maxresident)k
0inputs+0outputs (0major+210minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
10.56user 0.05system 0:10.62elapsed 99%CPU (0avgtext+0avgdata 1024maxresident)k
0inputs+0outputs (0major+211minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
10.51user 0.07system 0:10.59elapsed 99%CPU (0avgtext+0avgdata 1024maxresident)k
0inputs+0outputs (0major+210minor)pagefaults 0swaps

And now with the other hyper-thread of that core kept busy:

bash$ taskset -c 69 perl -e 'while(1){}' &
[1] 2971374
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
13.26user 0.04system 0:13.31elapsed 99%CPU (0avgtext+0avgdata 1024maxresident)k
0inputs+0outputs (0major+207minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
13.23user 0.06system 0:13.30elapsed 99%CPU (0avgtext+0avgdata 1024maxresident)k
0inputs+0outputs (0major+211minor)pagefaults 0swaps
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null
13.20user 0.05system 0:13.26elapsed 99%CPU (0avgtext+0avgdata 1024maxresident)k
0inputs+0outputs (0major+209minor)pagefaults 0swaps

On this CPU we see only a 25.6% increase in CPU time, equating to almost a 60% throughput increase: a marked improvement over the older CPUs, at least for this very specific workload.

Additional

In truth, similar effects can be seen with other shared resources, just not as easily. Examples include shared L2/L3 caches and memory bandwidth; contention for either can increase the CPU time required for a given unit of work.
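
As a crude illustration in the same style as the tests above (a sketch only; logical CPU 4 is assumed here to be a different physical core that shares a last-level cache with CPU 5, which depends on the machine's topology, and whether the effect is measurable depends heavily on cache sizes and the memory system), one could pin a cache-thrashing loop next to the gzip test:

bash$ taskset -c 4 perl -e '$x = "x" x (64*1024*1024); while (1) { $y = reverse $x }' &
bash$ taskset -c 5 time gzip -9 < /tmp/zz > /dev/null

The perl loop repeatedly copies and reverses a 64 MiB string, so it continually streams through the shared cache hierarchy and memory rather than spinning in registers like the earlier while(1) loops.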
