Exploring the Effects of
Hyper-Threading
on Scientific Applications
ABSTRACT:
A 3.7 teraflops Cray-Dell Linux cluster based on
Intel Xeon processors will be installed at the Texas Advanced Computing Center
early this summer. It will represent a transition from the T3E line of massively
parallel processing systems that served researchers at The University of
Texas. Code migration to an IA-32
microarchitecture will benefit from the adoption of new performance-enhancing
architectural features within existing codes and algorithms.
Intel's Hyper-Threading (HT) Technology is one of the significant
performance improvements within the latest IA-32 microarchitecture. HT
provides low-level, simultaneous multithreading parallelism directly within the microarchitectural framework of the processor’s core. It is most effective when it can supply additional work via the simultaneous execution of instructions from two threads per processor. The net effect is to offset stalls due to memory and I/O
latencies. This paper will discuss the performance characteristics of HT for
scientific applications.
1. INTRODUCTION

In the last decade, the instruction-level
parallelism (ILP) in microprocessor architectures has increased substantially,
particularly in the form of pipelined and superscalar designs (multiple
execution units that can operate simultaneously). Much deeper pipelines1 in the IA-32 processors
have been engineered to allow the processor speeds to reach 3GHz and beyond.
While these architectural features boost performance, each new generation of
processors must cope with any lag in relative memory speed and increased
latency.
The new Intel Xeon and Pentium 4 processors
have a hardware feature called Hyper-Threading Technology2,3 (HT)
that makes a single processor appear as two logical processors to the operating
system and to user applications. This
technology allows two processes or two threads (e.g., MPI tasks or OpenMP
threads, respectively) to run simultaneously on a single processor; hence it is
a form of simultaneous multi-threading (SMT) implemented in hardware.
This means that instructions from each thread or task can execute concurrently in the core, provided there are enough execution resources to satisfy both; otherwise, one thread waits for resources to be freed.
Here, the term “threading” refers to the process of streaming instructions,
and does not imply a restriction to execution of programs coded with threads (OpenMP,
Multitasking, Pthreads) in the programming sense.
Multiple streams of instructions from two different processes, MPI tasks,
or OpenMP/PThreads can execute concurrently.
The motivation for incorporating HT in the
chip architecture comes primarily from the business server market.
In web and database servers, it is common for processors to spend 30
percent or more of their time idle, waiting for resources (especially memory and
I/O). The impact of idle time can
be offset by overlapping the execution of multiple threads.
One hardware approach is to build two (or more) cores per chip to execute
different threads, as in the IBM Power4 and future Intel Itanium architectures.
HT is a different approach, designed to execute two independent
instruction streams concurrently in a single core. In both dual-core and HT
simultaneous-multithreading approaches, the memory components and paths from the
L2 cache to the memory are shared. In
dual-core chips, the L1 caches, the registers, buffers, and the instruction
pipelines are duplicated. In HT,
some registers and buffers are replicated (state registers, queues, etc.), but
the instruction pipeline and the execution units are shared.
While this might seem prone to contention and congestion, there is a
performance benefit when instructions from two different threads can execute
simultaneously. This performance
benefit is maximized when threads are using different instruction units, e.g., when one thread is performing only integer arithmetic and the
other only floating point operations.
For HPC systems that use the new IA-32
processors such as the TACC Cray-Dell cluster, HT capability is built into the
processors and can be turned on or off. In
the commodity server market a large variety of processes stall on random memory
and I/O accesses; therefore, multiple simultaneous threads on a single processor
can provide large performance gains.
While
standard benchmarks (SPECfp, Linpack, STREAM, etc.) are a reasonable metric for
evaluating performance, benchmarking production applications should be part of the strategy for platform evaluation. In
addition, application developers should understand how technologies such as hyper-threading and multithreading operate at the instruction level in order to formulate efficient algorithms.
Section 2 presents an outline of the
architectural features of Hyper-Threading and an operational description of the
technology. Section 3 describes
computational kernels that can be used to characterize a memory subsystem, as
well as measure memory characteristics that are important to Hyper-Threading
(latencies and bandwidth sharing). The HT results of matrix-multiply operations
and two simple scientific codes are reported in Section 4.
In Section 5 we summarize the results, provide conclusions, and offer
insights to application developers.
2. HYPER-THREADING ARCHITECTURE

Microprocessor performance has improved over
the years with branch prediction, out-of-order execution and superscalar
(multiple and concurrently operated) execution units.
Because the higher clock speeds of the newer processors require deeper
pipelines, there are more instructions “in flight,” and cache misses, branch mispredictions, and interrupts become more costly. The simultaneous
multi-threading of Hyper-Threading allows two threads to execute concurrently in
the processor without context switching (moving one instruction stream aside for
the other). This is accomplished
with a minimal amount of additional circuitry.
Hyper-Threading makes a single processor
appear as two logical processors by having two separate architectural states,
even though the same physical execution resources are shared.
At the application level there are two processors that can be assigned to
processes or threads. Figure
1 illustrates the state/core relationship.
The general-purpose registers, control
registers, and the Advanced Programmable Interrupt Controller (APIC), as well as
some machine state registers have been duplicated to form the two architectural
states. The logical processors
share the branch prediction logic, execution units, trace cache, control logic, buses, and caches. The circuitry and
logic “along” the instruction pipelines were designed to make sure one
thread can progress forward even though another may be stalled.
This “independent forward progress” is guaranteed by making sure that neither thread can consume an inhibiting share of the buffering queue entries.
Another design goal was to make sure that a single thread on an HT
processor can run as fast as a thread on a processor without this capability.
The execution trace cache (TC) is where the decoded instructions (micro-operations, or µops) are stored and retrieved; it replaces the L1 instruction cache found in many other microarchitectures. Its
operation provides a good example of how the processor pipeline arbitrates
between the threads. Two sets of
independent pointers are used to point to the “next” instruction of each
software thread. Every clock period
the thread choice is arbitrated, and any simultaneous access is granted to one
thread and followed by the other. If
one logical processor is stalled, 100% accessibility is given to the other
thread. The TC is 8-way set
associative, uses a least recently used (LRU) replacement policy, and is shared
by both threads. Each entry is
tagged with a processor ID.
The front end of the pipeline consists of the ITLB (instruction translation lookaside buffer), the prediction logic, streaming buffers, TC, and microcode ROM. The duplicated and shared components are illustrated in Figure 2 below. An instruction stream from each thread is guaranteed by reserving one request slot for each thread. The ITLBs and their pointers are replicated; the global history array is shared, with each entry tagged with the processor ID. The streaming buffers, as well as the queue before the TC and the µop queue after the TC, have been duplicated.
The out-of-order execution engine consists of the allocation, register-renaming, and scheduling pipe sections and the execution units. The partitioned (shared) and separate components are illustrated in Figure 3. The µops from each thread are fetched alternately from the µop queue, and a thread stalls when one of its resource limits has been reached (e.g., store buffer entries). A single thread has allocation requests occurring every cycle. By imposing a limit (generally half) on key buffers (126 reorder buffer entries, 128/128 integer and floating-point physical registers, and 48/24 load and store buffer entries) for each thread, fairness is preserved and deadlocks are prevented. There are two Register Alias Tables, since each thread must keep track of its own state. After allocation and renaming, the µops are queued into two duplicated buffers (one for memory loads and stores, the other for general operations). Next, five schedulers distribute the µops to the execution units. Up to six µops can be dispatched in one clock period (CP). The schedulers determine when the µops are ready to be executed, that is, when their dependent input register operands are available. Each scheduler has a buffer queue, and no distinction is made between thread ownership of the µops. Deadlock is avoided by limiting the maximum number of queue entries for each logical processor. The execution units are likewise unaware of logical-processor ownership.
The BIOS must be aware of HT and provide the kernel with information about the additional (logical) processor count. Many vendors are shipping platforms with HT enabled by default, but it can be disabled easily at boot time (e.g., using the boot argument noht). HT-enabled processors can be detected by observing the number of CPUs the kernel sees (e.g., using “cat /proc/cpuinfo” or running top on cluster systems).
3. MEMORY CHARACTERIZATION

Many scientific and engineering HPC
applications are particularly sensitive to the memory subsystem. While this was
less true years ago because traditional Cray supercomputers (Y-MP, C90, T90,
etc.) provided extraordinary memory bandwidth to keep functional units busy, the
gap between commodity processor speeds and memory speeds is large—and growing.
Therefore, applications are increasingly sensitive to memory latency and
bandwidth performance. In this section, we will explore the impact of HT on
memory latency and bandwidth. While these are early results on a new technology,
they provide insights on how HT will impact the performance of HPC applications
(Section 5).
Since HT was designed to allow one process to
use the processor while another process is stalled (waiting for resources), in
particular while waiting on memory or cache data, it is important to know the
relative times of memory and cache latencies.
The worst case of memory latency impacting code performance is when
memory references are random. In an array (IA) containing N array indices
(8-byte integers) randomly distributed, the timings of the following loop can be
used to report the number of clock periods for single, uncorrelated cache and
memory accesses:
      I1 = IA(1)
      DO I = 2, N
         I2 = IA(I1)
         I1 = I2
      END DO
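A minimal, self-contained harness for this measurement might look like the sketch below. The array size, the single-cycle random permutation setup, the SYSTEM_CLOCK timing, and the assumed 2.4 GHz clock rate are our illustrative choices, not the authors' actual benchmark code:

      PROGRAM chase_latency
      ! Hypothetical harness for the pointer-chasing loop above: IA holds
      ! a random single-cycle permutation of 1..N, so every load depends
      ! on the previous one and touches an uncorrelated address.
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 262144           ! 2 MB of 8-byte integers
      REAL(8), PARAMETER :: CLOCK_HZ = 2.4D9     ! assumed 2.4 GHz Xeon
      INTEGER(8), ALLOCATABLE :: IA(:), P(:)
      INTEGER(8) :: I1, I2, TMP
      INTEGER :: I, J, T0, T1, RATE
      REAL(8) :: R, SECONDS

      ALLOCATE(IA(N), P(N))
      DO I = 1, N                                ! random visiting order
         P(I) = I
      END DO
      DO I = N, 2, -1                            ! Fisher-Yates shuffle
         CALL RANDOM_NUMBER(R)
         J = 1 + INT(R*I)
         TMP = P(I); P(I) = P(J); P(J) = TMP
      END DO
      DO I = 1, N - 1                            ! link the order into one cycle
         IA(P(I)) = P(I+1)
      END DO
      IA(P(N)) = P(1)

      CALL SYSTEM_CLOCK(T0, RATE)
      I1 = IA(1)
      DO I = 2, N
         I2 = IA(I1)
         I1 = I2
      END DO
      CALL SYSTEM_CLOCK(T1)

      SECONDS = REAL(T1 - T0, 8) / REAL(RATE, 8)
      PRINT *, 'CP per access:', SECONDS*CLOCK_HZ/REAL(N-1, 8), I1
      END PROGRAM chase_latency

Printing I1 keeps the dependent load chain from being optimized away; dividing the elapsed cycles by N-1 gives the average number of clock periods per uncorrelated access.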
Figure 4 shows the measured latencies for
array sizes between 1KB and 2MB on a Dell 2650 dual-processor server with 2.4
GHz Intel Xeon processors using dual channels to 200 MHz DDR memory.
(This system is used for all measurements in this paper; the Cray system
to be installed in July 2003 will have faster processors and memory.)
Reads from L1 (the “Single CPU” data in the figure) take approximately 3 clock periods (CP) for sizes up to 8KB, while reads from L2, for sizes between 8KB and 512KB, take approximately 18 CP, as shown in the figure insert. These single-processor
latencies are characteristics of the processor since the L1 and L2 caches are on
the chip die, and scale with the processor speed.
The latencies for memory reads depend mainly on the components of the memory subsystem off the chip (the northbridge bus speed, as well as the DIMM memory speed and its CAS latency rating). Approximately 450 CP are required to fetch a data element from memory; at 2.4 GHz this corresponds to roughly 190 ns per uncorrelated reference.
High
sustainable memory bandwidth is a desirable capability of an HPC system.
There are many different benchmarks, such as STREAM4, that can
characterize the memory performance. Here
we use a simple memory-read loop that accesses two different floating point
arrays. The memory-read kernel for our measurements consists of a single loop
with the potential to support two streams through two independent arrays:
      DO I = 1, N
         S = S + A(I)
         T = T + B(I)
      END DO
A minimal but reasonable set of compiler
options and loop structures were explored to derive an optimal bandwidth across
the caches and memory. (High
bandwidths are possible with more elaborate coding, but we wanted to use memory
access structures and compiler options that would be found in common codes and
makefiles, respectively.) In
Figure 5 the memory-read bandwidth is plotted for transfers between 4KB and 2MB.
The bandwidth from the L2 cache is constant at about 13 GB/sec (~0.7 Word/CP) over about 70 percent of its range (size), and decays to memory speed beyond the “high end” of the cache size. The bandwidth to memory is fairly constant beyond 1MB, with an average value of 2.3 GB/sec (~0.1 Word/CP).
The greatest opportunity for multiple
processes to overlap work in the execution units is when processes are stalling
on memory requests (or I/O). The
same memory-read measurements described above were used with 2, 3 and 4 threads
on separate arrays (distributed model) to determine how Hyper-Threading behaves
with a simple loop that has the highest degree of memory stalling.
Figure 4 shows latencies for a single thread on each processor (Dual CPU
curve). As expected, two threads
accessing memory randomly exhibit no contention on separate CPUs in the cache
region, and only 2% higher wait time from memory.
When three processes are executing on the
system, two of the processes share one CPU (using HT) while the third is
executed on the second processor. We
will refer to these processes as LP0 and LP1 (on the first processor, using HT)
and LP2 (on the second processor). Figure 6 shows latency curves for these
‘logical processors.’ The LP2 curve in the cache region is identical in form
and values to those in Figure 4; in the memory region there is a 26% increase in
the latency. For the processor executing both LP0 and LP1 using HT, the latency
in the memory region is about 33% larger than in Figure 4.
The HT mode has extremely large latencies in the L1 region (see the insert in Figure 6); in the L2 region the latencies are also higher, averaging about 25 CP compared to 18 CP for single-process mode. However, this cost comes with the benefit of two operations completing instead of one. In addition, the effective size of the L2 cache has been reduced significantly, as shown by the vertical lines at 128KB and 384KB. Even though the cache is 512KB, the transition between pure L2 and memory access is not sharp. In HT mode additional data, as well as additional code, must be retained in the L2 cache.
Execution of four tasks (processes) on a dual-processor node, one per logical processor, is the most natural way to use Hyper-Threading with MPI applications.
Figure 7 shows HT memory latency for 4 threads on independent arrays.
(The same behavior is expected in 4 MPI tasks executing the same loops.)
The four threads display a common behavior. Latencies to memory are about 700 CP; L2 latencies are about
25 CP, and L1 behavior is sporadic, ranging between values seen for HT and non-HT
(one thread per processor) modes.
The asymptotic values of the memory accesses in Figures 4 and 7 show that increasing the process count from two to four, to double the “work” of random accesses, increases the time by only a factor of 1.5 (460 CP per process for two processes compared to 700 CP per logical process for four processes). Also, the work time on L2 cache elements only increases by a factor of 1.4 (18 CP compared to ~25 CP for 4 threads).
Random memory loads and stores provide a
large opportunity for multiple logical processors to overlap additional memory
accesses and operations in the execution units. On the other end of the memory bandwidth spectrum, highly
optimized bandwidth-limited applications that stream data into the core usually
leave no opportunity for any additional memory accesses.
In the case of the (vector) sum used in the bandwidth measurement above, just a single operation (+) is employed on each double-precision word fetched from memory. On a 2.4 GHz system streaming an array from memory at 2.4 GB/sec, one 8-byte word arrives every 8 clock periods, leaving about 7 cycles per word available for other arithmetic or logical operations (discounting any pipeline latencies).
Therefore, even in bandwidth-limited applications, simple loops may not
be using the core units at full capacity. When
two processes execute simultaneously on two processors in a bus-based dual-processor system, they must share the bandwidth to memory. [SMP systems such as the AMD Opteron (HyperTransport), SGI Altix (NUMA), and IBM Power4 (Distributed Switch) don’t follow this rule.]
For multiple threads, separate arrays were
used in the memory-read loop for each thread to mimic a distributed memory
application. Figure 8 shows the effect of bandwidth sharing for two threads
(tasks) executing on two processors. The cache region bandwidths are identical
to those of the single-task executions in Figure 5. This is expected since each processor has its own cache. The
memory access region shows a drop in performance of 50% (from approximately 2.2
to 1.1 GB/sec) per processor when two processors read simultaneously. This is
also expected since the bus is shared; the total sustained memory bandwidth is
essentially ‘conserved.’
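For reference, the distributed-model measurement can be pictured as each thread streaming through its own data, along the lines of the following hedged sketch (the array layout, sizes, and OpenMP structure are our illustrative assumptions, and OMP_NUM_THREADS would be set to 2, 3, or 4 to mimic the cases discussed here):

      PROGRAM read_bw_distributed
      ! Sketch of the distributed-model memory-read test: each OpenMP
      ! thread streams through its own column of A and B, so no data
      ! are shared and each logical processor generates its own traffic.
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 2**18, MAXTHR = 4
      REAL(8), ALLOCATABLE :: A(:,:), B(:,:)
      REAL(8) :: S, T, T0, T1, BW
      INTEGER :: I, TID

      ALLOCATE(A(N,MAXTHR), B(N,MAXTHR))
      A = 1.0D0
      B = 2.0D0
!$OMP PARALLEL PRIVATE(S, T, T0, T1, BW, I, TID)
      TID = 1 + OMP_GET_THREAD_NUM()   ! assumes at most MAXTHR threads
      S = 0.0D0
      T = 0.0D0
      T0 = OMP_GET_WTIME()
      DO I = 1, N
         S = S + A(I,TID)
         T = T + B(I,TID)
      END DO
      T1 = OMP_GET_WTIME()
      ! 16 bytes (one element of A and one of B) are read per iteration
      BW = 16.0D0*REAL(N,8) / (T1 - T0) / 1.0D9
      PRINT *, 'thread', TID, 'read bandwidth (GB/s):', BW, S + T
!$OMP END PARALLEL
      END PROGRAM read_bw_distributed

Because each thread reads only its own columns of A and B, no data are shared between logical processors, which is the situation in which the bus bandwidth must simply be divided among the readers.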
For 3 and 4 threads executing on the two physical processors (using HT), the aggregate memory-access bandwidth drops by 10% (2.2 to 2.0 GB/sec) and 23% (2.2 to 1.7 GB/sec), as shown in Figure 9 and Figure 10, compared to only two threads executing on the two processors (Figure 8). Both figures also reveal that the cache-access bandwidths per process have dropped, but the aggregate cache bandwidth of the four logical processes has increased to 14.8 GB/sec from 12.8 GB/sec for non-HT (one process per processor) execution, a 16% performance boost.
A comparison of cache regions in Figures 5
and 10 reveals that dual-process executions effectively reduce the cache
available to each process. The
onset of pure cache access has shifted from 240KB in non-HT mode to 160KB in HT mode.
If globally visible arrays are used in the
memory-read loop (i.e., shared memory
model), the memory performance is quite different from the distributed model
with separate arrays described above. The
performance of this shared-memory sum in HT mode, on two processors, is shown in
Figure 11. The bandwidth per
process at the largest array size is 910 MB/sec. At first, this appears incorrect: the sum of the four
measured bandwidths is 3.6 GB/s, which is greater than the theoretical peak
bandwidth (3.2 GB/s). However, in a
shared memory model executing in HT mode, a memory request that is satisfied by
one task is available to the other task “free”; the data lines come at a
cost of “two for one” if the tasks are synchronized.
The synchronization must keep the accesses close enough, in time, so that
the second process fulfils its request from cache.
At 1.1 GB/sec, it only takes about 500 microseconds to fill up a 512KB
cache. In our timing routines, a
barrier at the beginning of each timed loop forces the loops to begin within 50
(often 10) microseconds and maintain “cache synchronization” throughout the
execution of the loop. The insert
in Figure 11 reveals that the effective cache size has been decreased, relative
to non-HT execution, as expected. However, the effective cache size in HT mode
execution of the shared-memory model is larger than for the distributed-memory
model (Figure 10). The decrease in
the effective per-process cache size can be attributed to the imperfect synchronization and the extra code (text) used in the second task.
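The shared-memory variant differs only in that all threads read the same globally visible arrays, with a barrier keeping them in step so that the second thread’s requests can be satisfied from cache. A hedged sketch, again with illustrative names and sizes, is:

      PROGRAM read_bw_shared
      ! Sketch of the shared-model memory-read test: all OpenMP threads
      ! stream through the SAME arrays, so a cache line fetched for one
      ! thread can be reused by the other if they stay in step.
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 2**18
      REAL(8), ALLOCATABLE :: A(:), B(:)
      REAL(8) :: S, T, T0, T1, BW
      INTEGER :: I

      ALLOCATE(A(N), B(N))
      A = 1.0D0
      B = 2.0D0
!$OMP PARALLEL PRIVATE(S, T, T0, T1, BW, I) SHARED(A, B)
      S = 0.0D0
      T = 0.0D0
!$OMP BARRIER                        ! start the timed loops close together
      T0 = OMP_GET_WTIME()
      DO I = 1, N                    ! every thread reads all of A and B
         S = S + A(I)
         T = T + B(I)
      END DO
      T1 = OMP_GET_WTIME()
      BW = 16.0D0*REAL(N,8) / (T1 - T0) / 1.0D9
      PRINT *, 'thread', OMP_GET_THREAD_NUM(), 'read bandwidth (GB/s):', BW, S + T
!$OMP END PARALLEL
      END PROGRAM read_bw_shared

Run with four threads on the two HT processors, this is the configuration whose apparent aggregate bandwidth can exceed the theoretical bus peak, as in Figure 11.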
4. APPLICATION RESULTS

Hyper-Threading was developed by Intel to
help improve the performance of applications. However, the primary impetus was
for increased performance in commercial server applications, which usually
execute serially and have very different characteristics compared to HPC
applications. The results in the previous section can be used to develop
insights into the performance of HPC applications, but we also present a few
early results from simple scientific applications to bolster these insights.
The BLAS library routines, instead of
hand-coded loops, are often used in compute-intensive applications to perform
basic linear algebra operations. The
matrix-matrix multiply routine, DGEMM, is common in many matrix transformations.
It is therefore important to measure the benefits of HT on this routine.
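A call of the kind benchmarked here might look like the following sketch; the matrix order, the initialization, and the use of the OpenMP wall-clock timer are illustrative assumptions rather than the actual test code, and the number of threads used by the multithreaded MKL DGEMM would be selected through the environment (e.g., OMP_NUM_THREADS):

      PROGRAM dgemm_test
      ! Hedged sketch: times C = alpha*A*B + beta*C with the BLAS DGEMM
      ! routine for square matrices of order M, using however many
      ! threads the threaded BLAS library has been told to use.
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: M = 1000
      REAL(8), ALLOCATABLE :: A(:,:), B(:,:), C(:,:)
      REAL(8) :: T0, T1, GFLOPS
      EXTERNAL DGEMM

      ALLOCATE(A(M,M), B(M,M), C(M,M))
      A = 1.0D0
      B = 2.0D0
      C = 0.0D0

      T0 = OMP_GET_WTIME()
      CALL DGEMM('N', 'N', M, M, M, 1.0D0, A, M, B, M, 0.0D0, C, M)
      T1 = OMP_GET_WTIME()

      ! a matrix-matrix multiply performs about 2*M**3 floating-point ops
      GFLOPS = 2.0D0*REAL(M,8)**3 / (T1 - T0) / 1.0D9
      PRINT *, 'order', M, ':', GFLOPS, 'Gflops', C(1,1)
      END PROGRAM dgemm_test

The 2*M**3 operation count is the standard convention for reporting matrix-multiply Gflops.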
Some applications work on a single matrix
operation among all the processors, using a shared-memory parallel version of DGEMM. Figure 12 shows the DGEMM
scaling for the Intel MKL 5.1 DGEMM routine throughout a range of matrix sizes,
for one to four threads on our dual-processor node. For matrices of order 10 to 1000 the scaling is consistent.
All the curves approach an asymptotic limit for large matrix order.
The floating point performances for one and two processes are 3.2 Gflops
and 5.7 Gflops, respectively, for order 1000. However, the three and four
process curves only reach 4.7 Gflops. The
scaling from one to two processes is 1.75; the scaling from one process to three
or four processes is only 1.5.
Molecular
dynamics calculations are often used to model the evolution of proteins,
DNA, and other macromolecular structures in a variety of chemical environments.
To evaluate the effects of HT on this genre of compute-intensive
calculations, the dynamics of a 256-particle argon lattice was calculated for
one picosecond, using a Verlet algorithm to advance the atomic positions and
velocities. The code was
parallelized using OpenMP. Table 1 shows the times for 1-4 threads on our
dual-processor system. Even with excellent scaling to two threads, HT provides
an additional 12% benefit in scaled performance when four threads are deployed.
Threads:    |  1     |  2     |  3 (HT)  |  4 (HT)
Time (sec)  |  7.95  |  4.06  |  3.89    |  3.63
Scaling     |  1     |  1.96  |  2.04    |  2.19
Table 1: MD performance for serial,
dual-processor, and HT execution.
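To make the parallelization concrete, the core of such a calculation might resemble the hedged sketch below; a trivial harmonic tether stands in for the real argon potential, and the parameters and loop structure are our assumptions rather than the authors' code:

      PROGRAM verlet_openmp
      ! Hedged sketch of an OpenMP-parallel velocity-Verlet integrator.
      ! A simple harmonic tether replaces the real argon interaction; the
      ! structure (per-atom loops under OMP PARALLEL DO) is what matters.
      IMPLICIT NONE
      INTEGER, PARAMETER :: NATOM = 256, NSTEP = 1000
      REAL(8), PARAMETER :: DT = 1.0D-3, MASS = 1.0D0, K = 1.0D0
      REAL(8) :: X(3,NATOM), V(3,NATOM), F(3,NATOM), FOLD(3,NATOM)
      INTEGER :: I, STEP

      CALL RANDOM_NUMBER(X)
      V = 0.0D0
      F = -K*X
      DO STEP = 1, NSTEP
!$OMP PARALLEL DO PRIVATE(I)
         DO I = 1, NATOM
            ! advance positions with the current forces
            X(:,I) = X(:,I) + DT*V(:,I) + 0.5D0*DT*DT*F(:,I)/MASS
            FOLD(:,I) = F(:,I)
            F(:,I) = -K*X(:,I)               ! new force (placeholder potential)
            ! advance velocities with the average of old and new forces
            V(:,I) = V(:,I) + 0.5D0*DT*(FOLD(:,I) + F(:,I))/MASS
         END DO
!$OMP END PARALLEL DO
      END DO
      PRINT *, 'final position of atom 1:', X(:,1)
      END PROGRAM verlet_openmp

The per-atom updates inside the parallel loop are the kind of work over which the one- to four-thread timings in Table 1 would scale.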
Lithographic modeling simulates the behavior
of various stages of the lithographic process, such as the photoresist
development, using a set of physical process parameters and a set of descriptive
models. A modeling program5
developed at The University of Texas uses Monte Carlo methods and is
parallelized with OpenMP. After 10^7 Monte Carlo iterations, each thread outputs the lattice configuration and
various Monte Carlo statistics to disk. HT improvements are reported in Table 2.
Threads:    |  1     |  2     |  3 (HT)  |  4 (HT)
Time (sec)  |  19.9  |  15.7  |  13.1    |  11.5
Scaling     |  1     |  1.27  |  1.52    |  1.73
Table 2:
Monte Carlo lithography simulation times and scaling for serial,
dual-processor, and HT execution.
The scaling from one to two processors is not
as large as for the MD code. However, significant performance enhancement is
achieved using HT—more than for the MD code.
The incremental gains with each additional thread are nearly linear.
Adding two HT threads provides a performance boost of 36% over the non-HT
execution of two threads.
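The pattern being exercised, independent Monte Carlo streams that mix computation with per-thread output to disk, can be sketched as follows; the trivial random-number kernel, the file naming, and the per-thread iteration count are placeholders for the actual lithography model:

      PROGRAM mc_openmp
      ! Hedged sketch of OpenMP Monte Carlo with per-thread output:
      ! each thread runs its own stream of trials and then writes its
      ! statistics to its own file, mixing computation with I/O.
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER, PARAMETER :: NITER = 10000000       ! 10^7 trials per thread
      INTEGER :: TID, UNITNO, I
      REAL(8) :: R, ACCUM
      CHARACTER(LEN=32) :: FNAME

!$OMP PARALLEL PRIVATE(TID, UNITNO, I, R, ACCUM, FNAME)
      TID = OMP_GET_THREAD_NUM()
      ACCUM = 0.0D0
      DO I = 1, NITER
         CALL RANDOM_NUMBER(R)                     ! placeholder for the MC move
         ACCUM = ACCUM + R
      END DO
      ! each thread writes its own statistics file
      WRITE(FNAME, '(A,I0,A)') 'mc_stats_', TID, '.out'
      UNITNO = 10 + TID
      OPEN(UNIT=UNITNO, FILE=FNAME, STATUS='REPLACE')
      WRITE(UNITNO, *) 'thread', TID, 'mean =', ACCUM/REAL(NITER,8)
      CLOSE(UNITNO)
!$OMP END PARALLEL
      END PROGRAM mc_openmp

The per-thread OPEN/WRITE at the end is the sort of I/O whose stalls HT can help hide, as discussed in Section 5.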
5. SUMMARY AND CONCLUSIONS

This work represents very early results,
since the final system has not been installed by Cray and even the test and
evaluation nodes have not been available and operational for long. However, the
simple latency and bandwidth measurements conducted can reveal the parameter
space for evaluating the performance enhancement of HT on HPC codes. As memory
latencies increase, the potential benefit of HT increases as well.
In Section 3 we see that applications characterized by random memory
access (e.g., those that employ gather/scatter
operations) can overlap their latencies and have the potential for up to 40%
increase in performance with HT if execution units are not oversubscribed. This
is not unexpected, when considering that HT was designed to benefit server
workloads of web or database applications consisting of many random memory
references. This technology can also have a positive impact in HPC codes that
solve sparse linear algebra problems.
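As a concrete picture of the access pattern referred to here, consider a generic compressed-sparse-row (CSR) matrix-vector product; the routine below is a textbook sketch with OpenMP work-sharing added, not code taken from any of the applications discussed:

      SUBROUTINE csr_matvec(n, rowptr, colidx, val, x, y)
      ! Generic compressed-sparse-row matrix-vector product y = A*x.
      ! The load x(colidx(j)) is an indirect (gather) access, so its
      ! latency resembles the random-access loop of Section 3; a second
      ! HT thread can overlap useful work while such loads are pending.
      IMPLICIT NONE
      INTEGER, INTENT(IN)  :: n, rowptr(n+1), colidx(*)
      REAL(8), INTENT(IN)  :: val(*), x(n)
      REAL(8), INTENT(OUT) :: y(n)
      INTEGER :: i, j
      REAL(8) :: s

!$OMP PARALLEL DO PRIVATE(i, j, s)
      DO i = 1, n
         s = 0.0D0
         DO j = rowptr(i), rowptr(i+1) - 1
            s = s + val(j)*x(colidx(j))      ! gather through colidx
         END DO
         y(i) = s
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE csr_matvec

The indirect load x(colidx(j)) behaves much like the pointer-chasing loop of Section 3, which is why such kernels are candidates to benefit from HT.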
However, HPC applications that use a
distributed memory model and also stream data from memory can experience a
significant degradation in performance with HT. The results of Section 3 show a
degradation up to approximately 25% in memory bandwidth according to our
distributed memory bandwidth measurements.
This is disturbing, since many HPC applications stream data from memory
and most HPC applications—especially those likely to run on Xeon
processors—are coded using MPI (or some other distributed programming
technique).
The results in Section 3 relevant to a shared
memory model imply positive impact, however. If the memory accesses can be
shared in cache by using shared-memory programming (e.g.,
OpenMP), there can be a benefit of as much as 60% in memory read performance,
especially for applications that need to share a sizeable fraction of the data
while also accessing disparate sections of memory. One should keep in mind that the additional threads made possible by HT must share the cache, so the effective cache size per thread is
decreased. Nevertheless,
memory-bound algorithms that use shared data can benefit from the additional
threads of HT, provided the sharing is synchronized to access the data by the
alternate thread once it is in cache and the code is not heavily cache-optimized
already.
The preliminary application results
demonstrate a range of possible performance impacts. The highly optimized DGEMM
matrix multiply routine degrades by almost 20% for large matrices when one or
two additional HT threads are applied. This routine is coded for shared memory,
but has been extensively cache-optimized. It is likely that the reduction in
effective cache offsets the benefit of using shared data.
The MD code, which was parallelized with OpenMP using fine-grained loop-level parallelism, sees about a 12% performance improvement due to HT. This is most likely due to sharing of data again, but here the code is not heavily cache-optimized. Finally,
the lithography code sees significant performance enhancement—73%—with HT.
This code benefits quite well from HT because it has many random memory
references and performs a
significant amount of I/O. I/O stalls can be more severe than memory stalls;
though we did not explore the impact of HT on I/O intensive codes, it is not
surprising that the 60% potential improvement for random memory accesses would
be exceeded if there were also significant I/O operations.
In general, compute- or memory-bound
applications may not benefit from HT; however, applications that have memory or
I/O stalls, and are not extreme in the use of cache/execution units or in memory
streaming, can benefit from the Hyper-Threading Technology.
We look forward to exploring this further when the large Cray-Dell
cluster is installed in July 2003. Future work will include more detailed
analysis of HT impact on scientific algorithms and the benefits to I/O-bound
applications.
ACKNOWLEDGMENTS

The authors would like to thank their
colleagues at TACC and in the HPCC Group at Dell for sharing their insights,
expertise, and wisdom6.
REFERENCES

1. Eric Sprangle and Doug Carmean, "Increasing Processor Performance by Implementing Deeper Pipelines," White Paper, Pentium Processor Architecture Group, Intel Corporation (IEEE 2002).

2. http://www.intel.com/technology/hyperthread/index.htm

3. Xinmin Tian, Aart Bik, Milind Girkar, Paul Grey, Hideki Saito, and Ernesto Su, "Intel OpenMP C++/Fortran Compiler for Hyper-Threading Technology: Implementation and Performance," White Paper, http://developer.intel.com/

4. STREAM benchmarks, www.cs.virginia.edu/stream

5. http://willson.cm.utexas.edu/Research/Sub_Files/Resist_Modeling/collaborations.htm

6. Tau Long, Rizwan Ali, Jenwei Hsieh, and Christopher Stanton, "A Study of Hyper-Threading in High-Performance Computing Clusters," www.dell.com/powersolutions
ABOUT THE AUTHORS

Drs. Kent Milfeld, Chona Guiang, and Avijit
Purkayastha are staff members in the HPC Group at TACC. Dr. John (“Jay”) R.
Boisseau is the manager of the HPC group and the director of TACC. The authors
have many years of experience in HPC, in particular using Cray supercomputing
systems. All of the authors can be reached at: Texas Advanced Computing Center,
CMS 1.154 / PRC (R8700), 10100 Burnet Rd., Austin TX 78758-4497 U.S.A.; or U.S.
phone number (512) 475-9411. Dr. Milfeld can be reached by e-mail at milfeld@tacc.utexas.edu.