numa_intro(3)                                                  numa_intro(3)

NAME
       numa_intro - Introduction to NUMA support
DESCRIPTION
NUMA, or Non-Uniform Memory Access, is a hardware architecture found
in modern multiprocessor platforms. It addresses the growing disparity
between the speed and bandwidth requirements of processors and the
bandwidth that memory systems, including the interconnect between
processors and memory, can deliver. NUMA systems group resources
(processors, I/O buses, and memory) into building blocks that balance
an appropriate number of processors and I/O buses against a local
memory system that delivers the necessary bandwidth. These building
blocks are combined into a larger system by means of a system-level
interconnect with a platform-specific topology. The processors and I/O
components on a particular building block can access their own “local”
memory with the lowest latency that the system design allows. A
building block can also access the resources (processors, I/O, and
memory) of remote building blocks, at the cost of increased access
latency and decreased global access bandwidth. The term “Non-Uniform
Memory Access” refers to this difference in latency between “local”
and “remote” memory accesses.
Overall system throughput and individual application performance are
optimized on a NUMA platform by maximizing the ratio of local resource
accesses to remote accesses. This is achieved by recognizing and pre‐
serving the “affinity” that processes have for the various resources on
the system building blocks. For this reason, the building blocks are
called “Resource Affinity Domains” or RADs.
RADs are supported only on a class of platforms known as Cache Coherent
NUMA, or CC NUMA, where all memory is accessible and cache coherent
with respect to all processors and I/O buses. The Tru64 UNIX operating
system includes enhancements to optimize system throughput and applica‐
tion performance on CC NUMA platforms for legacy applications as well
as those that use NUMA-aware APIs. System enhancements to support NUMA
are discussed in the following subsections. Along with system perfor‐
mance monitoring and tuning facilities, these enhancements allow the
operating system to make a “best effort” to optimize the performance of
any given collection of applications or application components on a CC
NUMA platform.
NUMA Enhancements to Basic UNIX Algorithms and Default Behaviors
For NUMA, modifications to basic UNIX algorithms (scheduling, memory
allocation, and so forth) and to default behaviors maximize local
accesses transparently to applications. These modifications, which
include the following, directly benefit legacy and non-NUMA-aware
applications that were designed for uniprocessors or Uniform Memory
Access Symmetric Multiprocessors but run on CC NUMA platforms:

Topology-aware placement of data
The operating system attempts to allocate memory for application
(and kernel) data on the RAD closest to where the data will be
accessed; or, for data that is globally accessed, the operating
system may allocate memory across the available RADs. When there
is insufficient free memory on optimal RADs, the memory alloca‐
tions for data may “overflow” onto nearby RADs.

Replication of read-only code and data
The operating system will attempt to make a local copy of read-
only text, such as shared library and program code. Kernel code
and kernel read-only data are replicated on all RADs at boot
time. If insufficient free local memory is available, the oper‐
ating system may choose to utilize a remote copy rather than
wait for free local memory.

Memory affinity-aware scheduling
The operating system scheduler takes “cache affinity” into
account when choosing a processor to run a process thread on
multiprocessor platforms. Cache affinity assumes that a process
thread builds a “memory footprint” in a particular processor's
cache. On CC NUMA platforms, the scheduler also takes into
account the fact that processes will have memory allocated on
particular RADs, and will attempt to keep processes running on
processors that are in the same RAD as their memory footprints.
Load balancing
To minimize the requirement for remote memory allocation (over‐
flow), the scheduler will take into account memory availability
on a RAD as well as the processor load average for the RAD.
Although these two factors may at times conflict with one
another, the scheduler will attempt to balance the load so that
processes run where there are memory pages as well as processor
cycles available. This balancing involves both the initial
selection of a RAD at process creation and migration of pro‐
cesses or individual pages in response to changing loads as pro‐
cesses come and go or their resource requirements or access pat‐
terns change.
NUMA Enhancements to Application Programming Interfaces
Application programmers can use new or modified library routines to
further increase local accesses on CC NUMA platforms. Using these APIs,
programmers can write new applications or modify old ones to provide
additional information to the operating system or to take explicit
control over the placement of processes, threads, memory objects, or
any combination of these.
Following are tables that list the NUMA library routines that deal with
RADs and RAD sets, processes and threads, memory management, CPUs and
CPU sets, and NUMA Scheduling Groups. Routines are listed alphabeti‐
cally in each table, and some routines are listed in more than one ta‐
ble.
For information about NUMA types, structures, and symbolic values, see
numa_types(4). For information about NUMA Scheduling Groups, see
numa_scheduling_groups(4).
RADs and RAD Sets
───────────────────────────────────────────────────────────────────────
Function Purpose Library Reference Page
───────────────────────────────────────────────────────────────────────
nloc() Returns the RAD libnuma nloc(3)
set that is a
specified distance
from a resource.
rad_attach_pid() Attaches a process libnuma rad_attach_pid(3)
to a RAD (assigns
a home RAD but
allows execution
on other RADs).
rad_bind_pid() Binds a process to libnuma rad_attach_pid(3)
a RAD (assigns a
home RAD and
restricts execu‐
tion to the home
RAD).
rad_foreach() Scans a RAD set libnuma rad_foreach(3)
for members and
returns the first
member found.
rad_get_cur‐       Returns the call‐  libnuma rad_get_cur‐
rent_home()        er's home RAD.             rent_home(3)
rad_get_cpus()     Returns the set of libnuma rad_get_num(3)
                   CPUs that are in a
                   RAD.
rad_get_freemem() Returns a snapshot libnuma rad_get_num(3)
of the free memory
pages that are in
a RAD.
rad_get_info() Returns informa‐ libnuma rad_get_num(3)
tion about a RAD,
including its
state (online or
offline) and the
number of CPUs and
memory pages it
contains.
rad_get_max() Returns the number libnuma rad_get_num(3)
of RADs in the
system. **
rad_get_num() Returns the number libnuma rad_get_num(3)
of RADs in the
caller's parti‐
tion. **
rad_get_physmem() Returns the number libnuma rad_get_num(3)
of memory pages
assigned to a RAD.
rad_get_state() Reserved for libnuma rad_get_num(3)
future use. (Cur‐
rently, RAD state
is always set to
RAD_ONLINE.)
radaddset() Adds a RAD to a libnuma radsetops(3)
RAD set.
radandset() Performs a logical libnuma radsetops(3)
AND operation on
two RAD sets,
storing the result
in a RAD set.
radcopyset() Copies the con‐ libnuma radsetops(3)
tents of one RAD
set to another RAD
set.
radcountset()      Returns the number libnuma radsetops(3)
                   of members in a
                   RAD set.
raddelset() Removes a RAD from libnuma radsetops(3)
a RAD set.
raddiffset() Finds the logical libnuma radsetops(3)
difference between
two RAD sets,
storing the result
in another RAD
set.
rademptyset() Initializes a RAD libnuma radsetops(3)
set such that no
RADs are included.
radfillset() Initializes a RAD libnuma radsetops(3)
set such that it
includes all RADs.
radisemptyset() Tests whether a libnuma radsetops(3)
RAD set is empty.
radismember() Tests whether a libnuma radsetops(3)
RAD belongs to a
given RAD set.
radorset() Performs a logical libnuma radsetops(3)
OR operation on
two RAD sets,
storing the result
in another RAD
set.
radsetcreate() Allocates a RAD libnuma radsetops(3)
set and sets it to
empty.
radsetdestroy() Releases the mem‐ libnuma radsetops(3)
ory allocated for
a RAD set.
radxorset() Performs a logical libnuma radsetops(3)
XOR operation on
two RAD sets,
storing the result
in another RAD
set.
───────────────────────────────────────────────────────────────────────
** On a system that is not partitioned, the system and the partition
are equivalent. On a partitioned system, the operating system returns
information only for the partition in which it is installed.
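As a sketch of how the RAD-set routines above combine in practice, the
following fragment builds a one-member RAD set and assigns it as the
caller's home RAD. The prototypes and the 0 flags value shown here are
assumptions; consult radsetops(3) and rad_attach_pid(3) for the
authoritative declarations, and link with -lnuma on Tru64 UNIX.

```c
/* Sketch: give the calling process a home RAD of 0 while still
 * allowing it to run elsewhere.  Prototypes and the 0 flags value
 * are assumptions; see radsetops(3) and rad_attach_pid(3). */
#include <numa.h>      /* radset_t, radid_t, libnuma prototypes */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    radset_t radset;

    if (radsetcreate(&radset) < 0) {   /* allocate an empty RAD set */
        perror("radsetcreate");
        return 1;
    }
    rademptyset(radset);               /* no members yet */
    radaddset(radset, 0);              /* add RAD 0 to the set */

    /* Assign RAD 0 as this process's home RAD; execution on other
     * RADs is still permitted (rad_bind_pid() would restrict it). */
    if (rad_attach_pid(getpid(), radset, 0) < 0)
        perror("rad_attach_pid");

    radsetdestroy(&radset);            /* free the set's storage */
    return 0;
}
```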
Processes and Threads
──────────────────────────────────────────────────────────────────────────────────
Function Purpose Library Reference Page
──────────────────────────────────────────────────────────────────────────────────
nfork() Creates a child libnuma nfork(3)
process that is an
exact copy of its
parent process. See
also the table entry
for rad_fork().
nmadvise() Tells the system what libnuma nmadvise(3)
behavior to expect
from a process with
respect to referenc‐
ing mapped files and
shared memory
regions.
nsg_attach_pid() Attaches a process to libnuma nsg_attach_pid(3)
a NUMA scheduling
group.
nsg_detach_pid() Detaches a process libnuma nsg_attach_pid(3)
from a NUMA schedul‐
ing group.
pthread_nsg_attach() Attaches a thread to libpthread pthread_nsg_attach(3)
a NUMA scheduling
group.
pthread_nsg_detach() Detaches a thread libpthread pthread_nsg_detach(3)
from a NUMA schedul‐
ing group.
pthread_rad_attach() Attaches a thread to libpthread pthread_rad_attach(3)
a RAD set.
pthread_rad_bind() Attaches a thread to libpthread pthread_rad_attach(3)
a RAD set and
restricts its execu‐
tion to the home RAD.
pthread_rad_detach() Detaches a thread libpthread pthread_rad_detach(3)
from a RAD set.
rad_attach_pid() Attaches a process to libnuma rad_attach_pid(3)
a RAD (assigns a home
RAD but allows execu‐
tion on other RADs).
rad_bind_pid() Binds a process to a libnuma rad_attach_pid(3)
RAD (assigns a home
RAD and restricts
execution to the home
RAD).
rad_fork() Creates a child libnuma rad_fork(3)
process on a RAD that
optionally does not
inherit the RAD
assignment of its
parent. See also the
table entry for
nfork().
──────────────────────────────────────────────────────────────────────────────────
Memory Management
──────────────────────────────────────────────────────────────────────
Function Purpose Library Reference Page
──────────────────────────────────────────────────────────────────────
memalloc_attr()    Returns the memory    libnuma memal‐
                   allocation policy in          loc_attr(3)
                   effect for a speci‐
                   fied virtual
                   address.
nacreate() Sets up an arena for libc amalloc(3)
memory allocation for
use with the amal‐
loc() function. An
arena is used in mul‐
tithreaded programs
when there is a need
for thread-specific
heap memory alloca‐
tion.
nmadvise() Tells the system what libnuma nmadvise(3)
behavior to expect
from a process with
respect to referenc‐
ing mapped files and
shared memory
regions.
nmmap() Maps an open file (or libnuma nmmap(3)
anonymous memory)
onto the address
space for a process
by using a specified
memory allocation
policy.
nshmget() Returns or creates libnuma nshmget(3)
the ID for a shared
memory region.
──────────────────────────────────────────────────────────────────────
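The memory-management routines above largely mirror their standard
counterparts with an extra allocation-policy argument. The fragment
below is a sketch of nmmap() used like mmap(); the trailing attribute
argument, and passing NULL there for a default policy, are
assumptions, so see nmmap(3) and numa_types(4) for the real interface.

```c
/* Sketch: NUMA-aware mapping of anonymous memory.  The trailing
 * allocation-attribute argument and its NULL-for-default behavior
 * are assumptions; see nmmap(3) and numa_types(4). */
#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 1024 * 1024;

    /* Like mmap(), but with an extra argument that selects a memory
     * allocation policy (for example, placement on a given RAD). */
    void *p = nmmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0,
                    NULL /* assumed: default allocation policy */);
    if (p == MAP_FAILED) {
        perror("nmmap");
        return 1;
    }
    printf("mapped %lu bytes at %p\n", (unsigned long)len, p);
    return 0;
}
```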
CPUs and CPU Sets
───────────────────────────────────────────────────────────────────────
Function Purpose Library Reference Page
───────────────────────────────────────────────────────────────────────
cpu_foreach() Enumerates the members libc cpu_foreach(3)
of a CPU set.
cpu_get_current() Returns the identifier libc cpu_get_cur‐
of the current CPU on rent(3)
which the calling
process is running.
cpu_get_info() Returns CPU informa‐ libc cpu_get_info(3)
tion for the system.
**
cpu_get_max() Returns the number of libc cpu_get_info(3)
CPU slots available in
the caller's parti‐
tion. **
cpu_get_num() Returns the number of libc cpu_get_info(3)
available CPUs.
cpu_get_rad() Returns the RAD iden‐ libnuma cpu_get_rad(3)
tifier for a CPU.
cpuaddset() Adds a CPU to a CPU libc cpusetops(3)
set.
cpuandset() Performs a logical AND libc cpusetops(3)
operation on the con‐
tents of two CPU sets,
storing the result in
a third CPU set.
cpucopyset() Copies the contents of libc cpusetops(3)
one CPU set to another
CPU set.
cpucountset() Returns the number of libc cpusetops(3)
CPUs in a CPU set.
cpudelset() Deletes a CPU from a libnuma cpusetops(3)
CPU set.
cpudiffset() Finds the logical dif‐ libnuma cpusetops(3)
ference between two
CPU sets, storing the
result in a third CPU
set.
cpuemptyset() Initializes a CPU set libnuma cpusetops(3)
such that it includes
no CPUs.
cpufillset() Initializes a CPU set libnuma cpusetops(3)
such that it includes
all CPUs.
cpuisemptyset() Tests whether a CPU libnuma cpusetops(3)
set is empty.
cpuismember() Tests whether a CPU is libnuma cpusetops(3)
a member of a particu‐
lar CPU set.
cpuorset() Performs a logical OR libnuma cpusetops(3)
operation on the con‐
tents of two CPU sets,
storing the result in
a third CPU set.
cpusetcreate() Allocates a CPU set libnuma cpusetops(3)
and sets it to empty.
cpusetdestroy() Releases the memory libnuma cpusetops(3)
allocated to a CPU
set.
cpuxorset() Performs a logical XOR libnuma cpusetops(3)
operation on the con‐
tents of two CPU sets,
storing the result in
a third CPU set.
───────────────────────────────────────────────────────────────────────
** On a system that is not partitioned, the system and the partition
are equivalent. On a partitioned system, the operating system returns
information only for the partition in which it is installed.
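A common use of these routines is to survey the partition's topology.
The sketch below pairs rad_get_num() with rad_get_cpus(),
cpucountset(), and rad_get_freemem(); the prototypes and return types
are approximations, so consult rad_get_num(3) and cpusetops(3) for the
exact declarations.

```c
/* Sketch: report the CPU count and free memory of each RAD in the
 * caller's partition.  Prototypes are approximations; see
 * rad_get_num(3) and cpusetops(3) for the real declarations. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    cpuset_t cpus;
    int nrads, rad;

    if (cpusetcreate(&cpus) < 0) {     /* allocate an empty CPU set */
        perror("cpusetcreate");
        return 1;
    }

    nrads = rad_get_num();             /* RADs in this partition */
    for (rad = 0; rad < nrads; rad++) {
        if (rad_get_cpus(rad, cpus) == 0)   /* fill with the RAD's CPUs */
            printf("RAD %d: %d CPUs, %ld free pages\n",
                   rad, cpucountset(cpus),
                   (long)rad_get_freemem(rad));
    }

    cpusetdestroy(&cpus);              /* release the set's memory */
    return 0;
}
```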
NUMA Scheduling Groups
─────────────────────────────────────────────────────────────────────────────────
Function Purpose Library Reference Page
─────────────────────────────────────────────────────────────────────────────────
nsg_attach_pid() Attaches a process libnuma nsg_attach_pid(3)
to a NUMA scheduling
group.
nsg_destroy() Removes a NUMA libnuma nsg_destroy(3)
scheduling group and
deallocates its
structures.
nsg_detach_pid()     Detaches a process   libnuma    nsg_attach_pid(3)
                     from a NUMA schedul‐
                     ing group.
nsg_get()            Returns the status   libnuma    nsg_get(3)
                     of a NUMA scheduling
                     group.
nsg_get_nsgs()       Returns a list of    libnuma    nsg_get_nsgs(3)
                     NUMA scheduling
                     groups that are
                     active.
nsg_get_pids()       Returns a list of    libnuma    nsg_get_pids(3)
                     processes attached
                     to a NUMA scheduling
                     group.
nsg_init()           Looks up (and possi‐ libnuma    nsg_init(3)
                     bly creates) a NUMA
                     scheduling group.
nsg_set()            Sets group ID, user  libnuma    nsg_set(3)
                     ID, and permissions
                     for a NUMA schedul‐
                     ing group.
pthread_nsg_attach() Attaches a thread to libpthread pthread_nsg_attach(3)
                     a NUMA scheduling
                     group.
pthread_nsg_detach() Detaches a thread    libpthread pthread_nsg_detach(3)
                     from a NUMA schedul‐
                     ing group.
pthread_nsg_get()    Returns a list of    libpthread pthread_nsg_get(3)
                     threads attached to
                     a NUMA scheduling
                     group.
─────────────────────────────────────────────────────────────────────────────────
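The scheduling-group interface follows a look-up-or-create pattern
similar to System V IPC. The sketch below is hypothetical in its
details: the key value, the NSG_CREATE flag name, and the argument
order of nsg_attach_pid() are all assumptions, so see nsg_init(3) and
nsg_attach_pid(3) for the real interface.

```c
/* Sketch: place the calling process in a NUMA scheduling group so
 * the scheduler keeps cooperating processes near one another.  The
 * key, the NSG_CREATE flag name, and the argument order are
 * assumptions; see nsg_init(3) and nsg_attach_pid(3). */
#include <numa.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    nsgid_t nsg;

    /* Look up (or create) the scheduling group for a well-known key. */
    nsg = nsg_init((key_t)0x4e5347, NSG_CREATE);  /* flag name assumed */
    if (nsg < 0) {
        perror("nsg_init");
        return 1;
    }

    if (nsg_attach_pid(nsg, getpid()) < 0)   /* join the group */
        perror("nsg_attach_pid");
    return 0;
}
```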
NUMA Enhancements to System Utilities and Daemons
A number of system commands display RAD-specific information or
perform RAD-specific operations. The following list briefly describes
the NUMA options supported by system utilities and daemons:

The runon -r command executes an application on a specific RAD.

The vmstat -r command displays virtual memory statistics for a
specific RAD.

The netstat -R command displays network routing tables for each RAD.

The ps -o RAD command includes RAD binding in the information
displayed about processes running on the system.

The hwmgr -view hier command displays the RAD location of CPUs and
devices. In place of a RAD identifier, the command identifies the
hardware construct that corresponds to a RAD. When run on a GS80,
GS160, or GS320 AlphaServer platform, the command shows the hierarchy
of CPUs and devices within QBBs. When run on an ES80 or GS1280
AlphaServer platform, the command shows the hierarchy of CPUs and
devices within PIDs (processing unit IDs).

The sched_stat -R command also displays the RAD location of system
CPUs. In addition, this command shows the relative distance (number
of hops) between CPUs.

The -t and -u options on the nfsd command allow customization of the
number of TCP and UDP server threads, respectively, that are spawned
per RAD. This feature allows the NFS server to scale the number of
TCP and UDP server threads automatically according to the size of the
system.

The -r option on the inetd command allows customization of the RAD
locations on which to start Internet server child daemons. By
default, one child daemon is started on each RAD.

The route -R command of the kdbx kernel debugger displays network
route tables for all RADs.
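A typical interactive session might combine the utilities described
above as follows. This is a sketch: ./myapp is a hypothetical program,
the RAD numbers are examples, and the exact argument forms should be
checked against each command's reference page.

```shell
# Hedged sketch of the NUMA-aware utilities described above.
# "./myapp" is a hypothetical application; RAD numbers are examples.
runon -r 2 ./myapp &     # execute myapp on RAD 2
vmstat -r 2              # virtual memory statistics for RAD 2
ps -o RAD,pid,comm       # include each process's RAD binding
sched_stat -R            # RAD locations of CPUs and hop counts
```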
SEE ALSO
NUMA Overview
The NUMA Overview is a web-only document that includes a complete NUMA
programming example. Starting with Tru64 UNIX Version 5.1, this web-
only document can be accessed through the version-specific web pages
for Tru64 UNIX documentation. Links to documentation sets for different
product versions are available at the following URL:
http://www.Tru64UNIX.compaq.com/docs/pub_page/doc_list.html