Unicode(5)
NAME
Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, UTF-16, UTF-32,
iso10646 - Support for the Unicode and ISO/IEC 10646 standards
DESCRIPTION
The operating system provides locales and codeset converters that
support the following standards:
The Unicode Standard, Version 3.0, Unicode, Inc., 2000
The Unicode Standard, Version 3.1, Unicode, Inc., 2001
Information Technology-Universal Multiple-Octet Coded Character Set,
  ISO/IEC 10646:2001
  The Basic Multilingual Plane defined by this standard is identical
  with the main body of Unicode character encoding.
These standards define generalized character encoding rules that can be
applied to characters in most native language scripts. The Unicode
standard specifies a universal character set (UCS). Version 3.0 of the
Unicode standard contains definitions for 49,194 characters and also
includes a Private Use Area for vendor- or user-defined characters.
Version 3.1 of the Unicode standard adds 44,946 new character defini‐
tions, incorporates UTF-32 (32-bit encoding) into the standard, and
adds three new planes beyond the 16-bit codespace of Plane 0 (Basic
Multilingual Plane). Plane 1 (Supplementary Multilingual Plane) con‐
tains code positions U+10000 to U+1FFFF; Plane 2 (Supplementary Ideo‐
graphic Plane) contains code positions U+20000 to U+2FFFF; Plane 14
(Supplementary Special-Purpose Plane) contains code positions U+E0000
to U+EFFFF.
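The plane layout described above can be sketched in a few lines of
Python. This is only an illustration: the function name plane_of and
the PLANE_NAMES table are hypothetical and are not part of the
operating system. The sketch relies on the fact that each plane holds
0x10000 code positions, so the plane number is the code position
shifted right by 16 bits.

```python
# Plane names as given in the text above; other planes are left unnamed.
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-Purpose Plane",
}


def plane_of(code_point):
    """Return the number of the plane that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code position outside U+0000..U+10FFFF")
    return code_point >> 16  # each plane holds 0x10000 code positions


for cp in (0x0041, 0x10000, 0x2FFFF, 0xE0001):
    plane = plane_of(cp)
    print("U+%04X is in plane %d %s" % (cp, plane, PLANE_NAMES.get(plane, "")))
```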
See the Unicode web site at http://www.unicode.org/ for more informa‐
tion on the Unicode standard. See the Unicode ReadMe document in
/usr/share/unidata/, which describes the Unicode standard version cur‐
rently supported on the operating system.
The following list summarizes the main features of the Unicode
character set:
Characters have properties, such as base, numeric, spacing,
  combination, and directionality. The Unicode standard provides rules
  for ordering characters with different properties so that parsing of
  character sequences is unambiguous.
The relationship between Unicode characters and the glyphs in the
  native language script that users see, type, or print is not
  necessarily one-to-one. A glyph may be mapped to a single abstract
  character or to a composed character. Conversely, more than one
  glyph can be mapped to a character.
Certain sequences of Unicode characters in a text stream are
  transformed into other characters, called composed characters.
The ISO 8859-1 character set occupies the first 256 code positions
  (and the ASCII character set the first 128 positions) of the UCS.
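Two of the features above can be illustrated with Python's standard
library. The sketch assumes that Unicode NFC normalization, as
implemented by Python's unicodedata module, is an acceptable stand-in
for the composition step described above; it does not show how the
operating system itself performs composition.

```python
import unicodedata

# A base letter followed by a combining mark: two abstract characters.
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# NFC normalization replaces the sequence with one composed character.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
print("U+%04X" % ord(composed))

# Every ISO 8859-1 byte value equals the UCS code position it maps to.
latin1_text = bytes(range(256)).decode("latin-1")
print(all(ord(ch) == i for i, ch in enumerate(latin1_text)))

# ASCII covers the first 128 of those positions.
ascii_text = bytes(range(128)).decode("ascii")
print(ascii_text == latin1_text[:128])
```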
The Unicode and ISO/IEC 10646 standards specify a universal repertoire
of characters that can be used by all major languages and that allow
character units to be processed for all languages under the same set of
rules. Therefore, system support for the universal character set does
not need to include multiple algorithms (one or more per language) for
converting between file code and internal process code. However, the
two different character sizes (16-bit or 32-bit) that the standards
support require different parsing schemes for data input and output.
Universal character encoding that an implementation parses in 16-bit
units (2 octets) is known as UCS-2. Universal character encoding that
an implementation parses in 32-bit units (4 octets) is known as UCS-4.
This is the canonical ISO/IEC 10646 encoding that is in use on systems
that can support the larger data unit size.
Because UCS-2 is a subset of UTF-16, the operating system supports
UCS-2 with UTF-16 codeset converters. The operating system supports
UCS-4 with both codeset converters and locales. (Keep in mind that
UCS-2 cannot be used to encode characters outside of the Basic Multi‐
lingual Plane.)
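The difference in unit size, and the Basic Multilingual Plane limit on
UCS-2, can be sketched as follows. Python has no codec named UCS-2 or
UCS-4, so this sketch assumes that the utf-16-be and utf-32-be codecs
are adequate stand-ins for text that stays within the ranges those
encodings share. The helper names are hypothetical.

```python
def fits_in_ucs2(text):
    """Return True if every character lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def to_ucs2_be(text):
    """Encode BMP-only text as big-endian 16-bit units (hypothetical helper)."""
    if not fits_in_ucs2(text):
        raise ValueError("UCS-2 cannot encode characters outside the BMP")
    # For BMP text, UTF-16 produces the same bytes as UCS-2.
    return text.encode("utf-16-be")


def to_ucs4_be(text):
    """Encode text as big-endian 32-bit units (hypothetical helper)."""
    return text.encode("utf-32-be")


sample = "A\u20ac"  # LATIN CAPITAL LETTER A, EURO SIGN
print(to_ucs2_be(sample).hex())     # 004120ac
print(to_ucs4_be(sample).hex())     # 00000041000020ac
print(fits_in_ucs2("\U00010000"))   # False
```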
In terms of locales, the operating system supports both Unicode and
dense code. The two types of locales differ in their manner of wide
character encoding support. See l10n_intro(5) for information comparing
the two locale types and for information on switching between Unicode
and dense code locales.
The Unicode and ISO/IEC 10646 standards define a number of
transformation formats for the universal character set (UTF-8 and
UTF-32 are the preferred transformation formats for the operating
system):
UTF-8, the standard method for transforming UCS-4 process encoding into
a sequence of 8-bit bytes and ensuring interchange transparency for
characters in C0 code positions (0 to 31), the SPACE (32) character,
and the DEL (127) character
The operating system supports UTF-8 with both codeset converters
and locales.
UTF-7, an obsolete interchange format for environments that strip the
eighth bit from each byte
The operating system does not support UTF-7.
UTF-1, an obsolete interchange format that is similar to UTF-8 but also
ensures interchange transparency of characters in C1 code positions
(128 to 159)
The operating system does not support UTF-1.
UTF-16, which uses the surrogate character extension technique defined
by Version 2.0 and later of the Unicode standard and represents
characters in 16-bit units
UTF-16 is a superset of UCS-2. As with UCS-2, UTF-16 encodes
characters in the range U+0000 to U+FFFF as single 16-bit units.
For characters in the range U+10000 to U+10FFFF, UTF-16 trans‐
forms them into a surrogate pair. The result of this transforma‐
tion is that the high surrogate (the first of the pair) is in
the range U+D800 to U+DBFF, while the low surrogate (the second
part of the pair) is in the range U+DC00 to U+DFFF. These two
16-bit values represent a single character.
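The surrogate transformation can be worked through in Python. This is
a sketch of the arithmetic only, with hypothetical function names: the
code point is reduced by 0x10000 to a 20-bit value, the upper 10 bits
select the high surrogate, and the lower 10 bits select the low
surrogate.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)   # MUSICAL SYMBOL G CLEF
print("%04X %04X" % (high, low))         # D834 DD1E
```

The result agrees with Python's own utf-16-be codec, which encodes
U+1D11E as the bytes D8 34 DD 1E.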
Although UTF-16 does not support representation of the entire
UCS-4 code space (including private-use ranges for character
values above U+10FFFF), it does support all characters that
are currently defined for the languages covered by both
standards.
Byte orientation in file code can differ and, depending on the
platform on which the file was generated, can be little-endian
(LE) or big-endian (BE). UTF-16 uses a byte order mark (BOM),
which is not part of the file text data, to indicate byte orien‐
tation. The code point of the BOM is U+FEFF. The Unicode stan‐
dard also defines UTF-16LE and UTF-16BE, which are specific to
the little-endian and big-endian orientations, respectively, and
do not include a byte order mark.
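The effect of the byte order mark can be sketched with Python's codecs
module. The helper names split_bom and decode_utf16 are hypothetical,
and the fallback to big-endian when no BOM is present is a choice made
for this sketch; it says nothing about how the operating system's
converters are implemented.

```python
import codecs

# The LE and BE forms carry no byte order mark.
print("A".encode("utf-16-be").hex())   # 0041
print("A".encode("utf-16-le").hex())   # 4100

# U+FEFF serialized in each byte order.
print(codecs.BOM_UTF16_BE.hex())       # feff
print(codecs.BOM_UTF16_LE.hex())       # fffe


def split_bom(data, default="be"):
    """Return (byte_order, payload), stripping a leading BOM if present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "le", data[2:]
    return default, data               # no BOM: fall back to the default


def decode_utf16(data):
    """Decode UTF-16 file data, honoring a BOM when one is present."""
    order, payload = split_bom(data)
    return payload.decode("utf-16-" + order)


print(decode_utf16(b"\xff\xfeA\x00"))  # little-endian with BOM
print(decode_utf16(b"\xfe\xff\x00A"))  # big-endian with BOM
print(decode_utf16(b"\x00A"))          # no BOM, treated as big-endian
```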
The operating system supports UTF-16, UTF-16LE, and UTF-16BE
through codeset converters. The codeset converter name UCS-2 is
recognized as an alias for UTF-16*, but with a restricted
repertoire of characters.
Note
By default, the operating system uses UTF-16 rather than
UTF-16LE or UTF-16BE.
In an input file, the software first looks for a BOM. If a BOM
is not found, the converter assumes UTF-16BE. This means that
you must explicitly specify UTF-16LE to the converter (convert
files manually) when UTF-16LE applies to an input file.
For an output file, the converter automatically inserts a BOM.
This means that you must explicitly specify UTF-16LE or UTF-16BE
(convert files manually) when you want conversion output to be
UTF-16LE or UTF-16BE rather than UTF-16.
UTF-32 allows character representation in 4-byte encoding units
UTF-32 is a restricted subset of UCS-4. UTF-32 is restricted in
values to the range U+0000 to U+10FFFF, which precisely matches
the range of character values defined by UTF-16. Like UTF-16,
UTF-32 does not support private-use ranges for character values
above U+10FFFF.
UTF-32 uses a BOM to indicate little-endian or big-endian byte
orientation. The Unicode standard also defines UTF-32LE and
UTF-32BE, which are specific to the little-endian and big-endian
orientations, respectively, and do not include a BOM. As with
UTF-16, big-endian is the default byte order when a BOM is not
generated.
UTF-32 is almost the same as UCS-4, so you can use UCS-4 codeset
converters to process UTF-32. UCS-4 converter software includes
support for UTF-32, UTF-32LE, and UTF-32BE.
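The transformation formats described above can be compared side by
side using Python's built-in codecs. This is a sketch with hypothetical
helper names; it illustrates the encodings themselves, not the
operating system's codeset converters.

```python
def encoded_lengths(ch):
    """Return the byte length of ch in (UTF-8, UTF-16, UTF-32)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


def is_utf8_transparent(value):
    """True if the code position keeps its single-byte value in UTF-8."""
    return chr(value).encode("utf-8") == bytes([value])


def in_utf32_range(value):
    """True if value lies in the UTF-32 range U+0000..U+10FFFF."""
    return 0 <= value <= 0x10FFFF


for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
    print("U+%06X" % ord(ch), encoded_lengths(ch))

# C0 code positions (0 to 31), SPACE (32), and DEL (127) pass through.
print(all(is_utf8_transparent(v) for v in list(range(33)) + [127]))

# C1 code positions (128 to 159) do not; each becomes two bytes in UTF-8.
print(any(is_utf8_transparent(v) for v in range(128, 160)))
```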
Codeset Conversion
Codeset converters are available to convert data in all the major
encoding formats that the operating system supports to and from UCS-2,
UTF-16, UCS-4, and UTF-8. If the worldwide support subsets are
installed on your system, you can enter the following commands to find
the names of these converters:
% cd /usr/lib/nls/loc/iconv
% ls | grep UTF
% ls | grep UCS
Among the converters listed, you will find some that handle conversion
of data in the code-page format used on PC systems. See code_page(5)
for more information about converting between codeset and code-page
formats. You can use all codeset converters with the iconv command and
associated library functions.
Note
The mapping of Korean Hangul characters changed between Version 1.1 and
Version 2.0 of the Unicode standard. By default, UTF-16, UCS-4, and
UTF-8 conversion assumes Version 2.0 character mapping for Hangul char‐
acters. Therefore, if data is in Version 1.1 format, you must first
convert the data to Version 2.0 format before converting from UTF-16,
UCS-4, or UTF-8 to an entirely different format.
The format of a codeset converter name is from-codeset_to-codeset. In
converter names, the Version 1.1 codeset formats for UCS-2, UCS-4, and
UTF-8 are represented by UNICODE-1-1, UNICODE-1-1-UCS-4, and UNI‐
CODE-1-1-UTF-8, respectively. The Version 2.0 codeset names are repre‐
sented by UTF-16, UCS-4, and UTF-8.
For example, if Korean data is currently in UCS-4 Version 1.1 format,
the data must first be processed by the UNICODE-1-1-UCS-4_UCS-4 con‐
verter before being processed by the UCS-4_deckorean converter.
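The naming convention and the two-step Korean example can be sketched
as follows. The functions converter_name and conversion_chain are
hypothetical helpers written for this illustration; they only build
names in the from-codeset_to-codeset format and do not check whether a
converter with that name is installed.

```python
def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset format."""
    return "%s_%s" % (from_codeset, to_codeset)


def conversion_chain(codesets):
    """Return the converter names needed to step through codesets in order."""
    return [converter_name(a, b) for a, b in zip(codesets, codesets[1:])]


# Version 1.1 UCS-4 data is first brought to Version 2.0, then converted.
chain = conversion_chain(["UNICODE-1-1-UCS-4", "UCS-4", "deckorean"])
print(chain)   # ['UNICODE-1-1-UCS-4_UCS-4', 'UCS-4_deckorean']
```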
See iconv_intro(5) for general information on codeset conversion.
Locales
The following locales use UTF-32 as internal processing code:
universal.UTF-8
This locale is used by applications. It converts data in UTF-8
file format to UCS-4 process code and can be used to test any
UCS-4 character to determine if it is included in one of the
following classes defined for the LC_CTYPE category: alnum,
alpha, blank, cntrl, digit, graph, lower, print, punct, space,
upper, or xdigit.
In the universal.UTF-8 locale, the LC_MESSAGES, LC_MONETARY,
LC_NUMERIC, and LC_TIME category definitions match those for the
POSIX (C) locale.
language_territory.UTF-8
These locales limit classification information to the characters
in a particular native language, make country-specific data
available to the application, and assume file data follows UTF-8
encoding rules. The operating system locales that support the
euro monetary symbol use either the UTF-8 or ISO8859-15 codeset.
See euro(5) for more information.
Note
The X locale database file used by applications running in the
universal.UTF-8, en_US.UTF-8, or Asian locales (Chinese, Japa‐
nese, or Korean) contains font definitions that include all the
fonts used with the operating system. This enables applications
under en_US.UTF-8 to display all the font characters installed
with Worldwide Language Software (WLS). Applications under the
Asian locales display all the font characters installed with
WLS, except for ISO8859-2, -4, -5, -7, -8, -9, -15, and TACTIS.
native_locale_name
These locales are installed in the default Unicode path,
/usr/i18n/lib/nls/ucsloc/, and use UTF-32 as internal processing
code. However, they differ in the following ways:
The file code is specified by the codeset portion (for example,
  ISO8859-1) of native_locale_name.
Classification information is not provided for the full set of
  UTF-32 characters, but only for those in a particular native
  language (for example, French). Country-specific data is also
  available to the application.
The LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME
  category definitions match those defined in native_locale_name.
native_locale_name@ucs4
These locales are installed in /usr/i18n/lib/nls/loc/ and are
the same as the native_locale_name locales installed in
/usr/i18n/lib/nls/ucsloc/ except that they are not a complete
set of locales and will not be enhanced in future versions of
the operating system. They are provided for compatibility with
existing applications. You cannot select @ucs4 locales from the
CDE login menu; you must specify the locale name in the LANG
environment variable.
CDE desktop users can select locales by choosing names followed by
(Unicode) from the CDE language menu at session startup. In this case,
the locale setting applies by default to all applications run during
the CDE session.
Unicode Character Database
For the convenience of programmers, the source file for the Unicode
character database is available online. This source file is the one
used to build the locales provided in optional software subsets
included with the operating system product. When the locales are
installed on your system, both the Unicode character database and an
associated ReadMe file are also installed in the /usr/share/unidata
directory. The ReadMe file discusses the character properties sup‐
ported by Unicode.
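Character properties of the kind recorded in the Unicode character
database can be inspected from Python. Note the assumption: Python's
unicodedata module reads the database version bundled with the Python
interpreter, not the source file installed in /usr/share/unidata, so
this sketch only illustrates what such properties look like. The
describe function is a hypothetical helper.

```python
import unicodedata


def describe(ch):
    """Collect a few character properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),            # general category
        "bidirectional": unicodedata.bidirectional(ch),  # directionality
        "combining": unicodedata.combining(ch),          # combining class
        "numeric": unicodedata.numeric(ch, None),        # numeric value
    }


# A base letter, a digit, a combining mark, and a right-to-left letter.
for ch in ("A", "5", "\u0301", "\u05d0"):
    print("U+%04X" % ord(ch), describe(ch))
```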
Font Support
The operating system provides the following types of bitmap fonts for
UCS characters:
Public domain Unicode fonts:
  -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
  -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
  -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1
Composite fonts that the libfr_FGC font renderer creates by combining
  fonts available for other codesets
Two sets of monospaced fonts (a 16x18 pixel set and a 24x24 pixel set)
  for UTF-8 locales with the following CDE font aliases (where -n is
  -1, -2, -3, -4, -5, -7, -8, -9, or -15):
  -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso8859-n@mono
  -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso10646-1@mono
These fonts currently cover only a subset of the characters in UCS.
Each of the ETL public domain fonts supports about 1000 characters, but
does not include any characters for Chinese, Japanese, or Korean. The
composite fonts created by the font renderer are generated only from
fonts available for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9)
codesets.
See iso8859-1(5) and iso8859-15(5) for the names of fonts available for
Latin-1 and Latin-9 characters. The Latin-9 fonts, which include glyphs
for the euro character, provide the best support for the language_ter‐
ritory.UTF-8 locales, which also support this character.
See i18n_printing(5) and wwpsof(8) for information on printer support
and converting bitmap font encoding to PostScript.
SEE ALSO
Commands: locale(1), wwpsof(8)
Others: ascii(5), code_page(5), iso8859-1(5), iso8859-15(5),
i18n_intro(5), i18n_printing(5), iconv_intro(5), l10n_intro(5)
Using International Software