NUMA(3)                   Linux Programmer's Manual                   NUMA(3)

NAME
numa - NUMA policy library
SYNOPSIS
#include <numa.h>
cc ... -lnuma
int numa_available(void);
int numa_max_node(void);
int numa_preferred(void);
long numa_node_size(int node, long *freep);
long long numa_node_size64(int node, long long *freep);
nodemask_t numa_all_nodes;
nodemask_t numa_no_nodes;
int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
void nodemask_zero(nodemask_t *mask);
void nodemask_set(nodemask_t *mask, int node);
void nodemask_clr(nodemask_t *mask, int node);
int nodemask_isset(const nodemask_t *mask, int node);
int nodemask_equal(const nodemask_t *a, const nodemask_t *b);
void numa_set_interleave_mask(nodemask_t *nodemask);
nodemask_t numa_get_interleave_mask(void);
void numa_bind(nodemask_t *nodemask);
void numa_set_preferred(int node);
void numa_set_localalloc(void);
void numa_set_membind(nodemask_t *nodemask);
nodemask_t numa_get_membind(void);
void *numa_alloc_interleaved_subset(size_t size, nodemask_t *nodemask);
void *numa_alloc_interleaved(size_t size);
void *numa_alloc_onnode(size_t size, int node);
void *numa_alloc_local(size_t size);
void *numa_alloc(size_t size);
void numa_free(void *start, size_t size);
int numa_run_on_node_mask(nodemask_t *nodemask);
int numa_run_on_node(int node);
nodemask_t numa_get_run_node_mask(void);
void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
void numa_tonode_memory(void *start, size_t size, int node);
void numa_tonodemask_memory(void *start, size_t size, nodemask_t *nodemask);
void numa_setlocal_memory(void *start, size_t size);
void numa_police_memory(void *start, size_t size);
int numa_distance(int node1, int node2);
void numa_set_bind_policy(int strict);
void numa_set_strict(int strict);
void numa_error(char *where);
void numa_warn(int number, char *where, ...);
extern int numa_exit_on_error;
DESCRIPTION
The libnuma library offers a simple programming interface to the NUMA
(Non Uniform Memory Access) policy supported by the Linux kernel. On a
NUMA architecture some memory areas have different latency or bandwidth
than others.
Available policies are page interleaving (i.e., allocate in a round-
robin fashion from all, or a subset, of the nodes on the system), pre‐
ferred node allocation (i.e., preferably allocate on a particular
node), local allocation (i.e., allocate on the node on which the thread
is currently executing), or allocation only on specific nodes (i.e.,
allocate on some subset of the available nodes). It is also possible
to bind threads to specific nodes.
NUMA memory allocation policy is a per-thread attribute, but is inherited
by children.
To set a specific policy globally for all memory allocations in a process
and its children, it is easiest to start the process with the numactl(8)
utility. For finer-grained policy inside an application, this library can
be used.
NUMA memory allocation policies take effect only when a page is actually
faulted into the address space of a process by accessing it. The
numa_alloc_* functions take care of this automatically.
A node is defined as an area where all memory has the same speed as
seen from a particular CPU. A node can contain multiple CPUs. Caches
are ignored for this definition.
This library is only concerned with nodes and their memory and does not
deal with individual CPUs inside these nodes (except for
numa_node_to_cpus()).
Before any other calls in this library can be used, numa_available() must
be called. If it returns -1, all other functions in this library are
undefined.
numa_max_node() returns the highest node number available on the cur‐
rent system. If a node number or a node mask with a bit set above the
value returned by this function is passed to a libnuma function, the
result is undefined.
numa_node_size() returns the memory size of a node. If the argument
freep is not NULL, it is used to return the amount of free memory on the
node. On error it returns -1. numa_node_size64() works the same as
numa_node_size() except that it returns values as long long instead of
long. This is useful on 32-bit architectures with large nodes.
Some of these functions accept or return a nodemask. A nodemask has
type nodemask_t. It is an abstract bitmap type containing a bit set of
nodes. The maximum node number depends on the architecture, but is not
larger than numa_max_node(). What happens in libnuma calls when bits
above numa_max_node() are passed is undefined. A nodemask_t should
only be manipulated with the nodemask_zero(), nodemask_clr(), node‐
mask_isset(), and nodemask_set() functions. nodemask_zero() clears a
nodemask_t. nodemask_isset() returns true if node is set in the passed
nodemask. nodemask_clr() clears node in nodemask. nodemask_set() sets
node in nodemask. The predefined variable numa_all_nodes has all
available nodes set; numa_no_nodes is the empty set. nodemask_equal()
returns non-zero if its two nodeset arguments are equal.
numa_preferred() returns the preferred node of the current thread.
This is the node on which the kernel preferably allocates memory,
unless some other policy overrides this.
numa_set_interleave_mask() sets the memory interleave mask for the cur‐
rent thread to nodemask. All new memory allocations are page inter‐
leaved over all nodes in the interleave mask. Interleaving can be
turned off again by passing an empty mask (numa_no_nodes). The page
interleaving only occurs on the actual page fault that puts a new page
into the current address space. It is also only a hint: the kernel will
fall back to other nodes if no memory is available on the interleave
target. This is a low level function, it may be more convenient to use
the higher level functions like numa_alloc_interleaved() or
numa_alloc_interleaved_subset().
numa_get_interleave_mask() returns the current interleave mask.
numa_bind() binds the current thread and its children to the nodes
specified in nodemask. They will only run on the CPUs of the specified
nodes and only be able to allocate memory from them. This function is
equivalent to calling numa_run_on_node_mask(nodemask) followed by
numa_set_membind(nodemask). If threads should be bound to individual
CPUs inside nodes consider using numa_node_to_cpus and the
sched_setaffinity(2) syscall.
numa_set_preferred() sets the preferred node for the current thread to
node. The preferred node is the node on which memory is preferably
allocated before falling back to other nodes. The default is to use
the node on which the process is currently running (local policy).
Passing a -1 argument is equivalent to numa_set_localalloc().
numa_set_localalloc() sets a local memory allocation policy for the
calling thread. Memory is preferably allocated on the node on which
the thread is currently running.
numa_set_membind() sets the memory allocation mask. The thread will
only allocate memory from the nodes set in nodemask. Passing an argu‐
ment of numa_no_nodes or numa_all_nodes turns off memory binding to
specific nodes.
numa_get_membind() returns the mask of nodes from which memory can cur‐
rently be allocated. If the returned mask is equal to numa_no_nodes or
numa_all_nodes, then all nodes are available for memory allocation.
numa_alloc_interleaved() allocates size bytes of memory page inter‐
leaved on all nodes. This function is relatively slow and should only
be used for large areas consisting of multiple pages. The interleaving
works at page level and will only show an effect when the area is
large. The allocated memory must be freed with numa_free(). On error,
NULL is returned.
numa_alloc_interleaved_subset() is like numa_alloc_interleaved() except
that it also accepts a mask of the nodes to interleave on. On error,
NULL is returned.
numa_alloc_onnode() allocates memory on a specific node. This function
is relatively slow and allocations are rounded up to the system page
size. The memory must be freed with numa_free(). On errors NULL is
returned.
numa_alloc_local() allocates size bytes of memory on the local node.
This function is relatively slow and allocations are rounded up to the
system page size. The memory must be freed with numa_free(). On
errors NULL is returned.
numa_alloc() allocates size bytes of memory with the current NUMA pol‐
icy. This function is relatively slow and allocations are rounded up
to the system page size. The memory must be freed with numa_free().
On errors NULL is returned.
numa_free() frees size bytes of memory starting at start, allocated by
the numa_alloc_* functions above.
numa_run_on_node() runs the current thread and its children on a spe‐
cific node. They will not migrate to CPUs of other nodes until the node
affinity is reset with a new call to numa_run_on_node_mask(). Passing
-1 permits the kernel to schedule on all nodes again. On success, 0 is
returned; on error -1 is returned, and errno is set to indicate the
error.
numa_run_on_node_mask() runs the current thread and its children only
on nodes specified in nodemask. They will not migrate to CPUs of other
nodes until the node affinity is reset with a new call to
numa_run_on_node_mask(). Passing numa_all_nodes permits the kernel to
schedule on all nodes again. On success, 0 is returned; on error -1 is
returned, and errno is set to indicate the error.
numa_get_run_node_mask() returns the mask of nodes that the current
thread is allowed to run on.
numa_interleave_memory() interleaves size bytes of memory page by page
from start on nodes nodemask. This is a lower level function to inter‐
leave not yet faulted in but allocated memory. Not yet faulted in
means the memory is allocated using mmap(2) or shmat(2), but has not
been accessed by the current process yet. The memory is page inter‐
leaved to all nodes specified in nodemask. Normally numa_alloc_inter‐
leaved() should be used for private memory instead, but this function
is useful to handle shared memory areas. To be useful the memory area
should be at least several megabytes (or tens of megabytes for hugetlbfs
mappings). If the numa_set_strict() flag is true, then the operation will
cause a numa_error if there were already pages in the mapping that do
not follow the policy.
numa_tonode_memory() puts memory on a specific node. The constraints
described for numa_interleave_memory() apply here too.
numa_tonodemask_memory() puts memory on a specific set of nodes. The
constraints described for numa_interleave_memory() apply here too.
numa_setlocal_memory() locates memory on the current node. The con‐
straints described for numa_interleave_memory() apply here too.
numa_police_memory() locates memory with the current NUMA policy. The
constraints described for numa_interleave_memory() apply here too.
numa_node_to_cpus() converts a node number to a bitmask of CPUs. The
user must pass a long enough buffer. If the buffer is not long enough
errno will be set to ERANGE and -1 returned. On success 0 is returned.
numa_set_bind_policy() specifies whether calls that bind memory to a
specific node should use the preferred policy or a strict policy. The
preferred policy allows the kernel to allocate memory on other nodes
when there isn't enough free on the target node; strict will fail the
allocation in that case. An argument of 1 specifies strict, 0
preferred. Note that binding to more than one node non-strict may only
use the first node in some kernel versions.
numa_set_strict() sets a flag that says whether the functions allocat‐
ing on specific nodes should use a strict policy. Strict means the
allocation will fail if the memory cannot be allocated on the target
node. The default operation is to fall back to other nodes. This
doesn't apply to the interleave and default policies.
numa_distance() reports the distance in the machine topology between
two nodes. The distances are multiples of 10; a node has distance 10
to itself. It returns 0 when the distance cannot be determined.
Reporting the distance requires a Linux kernel version of 2.6.10 or
newer.
numa_error() is a weak internal libnuma function that can be overridden
by the user program. This function is called with a char * argument
when a libnuma function fails. Overriding the weak library definition
makes it possible to specify a different error handling strategy when a
libnuma function fails. It does not affect numa_available().
The numa_error() function defined in libnuma prints an error on stderr
and terminates the program if numa_exit_on_error is set to a non-zero
value. The default value of numa_exit_on_error is zero.
numa_warn() is a weak internal libnuma function that can also be over‐
ridden by the user program. It is called to warn the user when a lib‐
numa function encounters a non-fatal error. The default implementation
prints a warning to stderr.
The first argument is a unique number identifying each warning. After
that there is a printf(3)-style format string and a variable number of
arguments.
THREAD SAFETY
numa_set_bind_policy and numa_exit_on_error are process global. The
other calls are thread safe.
Memory policy set for memory areas is shared by all threads of the
process. Memory policy is also shared by other processes mapping the
same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not
shared for disk backed file mappings right now although that may change
in the future.
COPYRIGHT
Copyright 2002, 2004, Andi Kleen, SuSE Labs. libnuma is under the GNU
Lesser General Public License, v2.1.
SEE ALSO
get_mempolicy(2), getpagesize(2), mbind(2), mmap(2), set_mempolicy(2),
shmat(2), numactl(8), sched_setaffinity(2)

SuSE Labs                        May 2004                           NUMA(3)