iconv_intro(5)
NAME
iconv_intro, iconv - Introduction to codeset conversion
DESCRIPTION
Conversion of character encoding from one coded character set (codeset)
to another is an operation that often has to be performed by the oper‐
ating system and some applications. For example, the man command sup‐
ports codeset conversion to allow one set of reference page files to
meet the needs of locales that support the same language and territory
but different codesets (see man(1)).
The following commands and library interfaces give users and application
developers direct access to codeset conversion operations:
  - The iconv command converts characters in a data file from one
    codeset to another (see iconv(1)).
  - The iconv(), iconv_open(), and iconv_close() functions convert a
    string of characters from one codeset to another (see iconv(3),
    iconv_open(3), and iconv_close(3)). The iconv command uses these
    interfaces to convert characters.
There are two types of codeset converters: algorithmic and table. Algo‐
rithmic converters, which reside in the /usr/lib/nls/loc/iconv direc‐
tory, are shared libraries with a predefined entry point for invocation
by functions in the libiconv.so library. Algorithmic converters are
needed for the conversion of multibyte codesets, in part because table
converters cannot handle the required number of character values and
also because some of these codesets require complex handling (see
NOTES). Algorithmic converters are supplied as part of the operating
system product; the internal interfaces that they require are not pub‐
lished for external use.
Table converters, which reside in the /usr/lib/nls/loc/iconvTable
directory, can be created by using the genxlt command (see genxlt(1)).
These converters can support single-byte codesets and up to 256 encoded
character values.
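A single-byte table converter can be pictured as a lookup indexed by the
input byte, which is one way to see where the 256-value ceiling comes
from. The following is a minimal sketch of that idea only. It does not
reproduce the genxlt source format or the on-disk table layout, it uses
Python's built-in codecs as stand-ins for real codesets, and the names
build_table and fallback are hypothetical.

```python
def build_table(from_codec, to_codec, fallback=b"?"):
    """Build a 256-entry byte-to-byte table between two single-byte codecs."""
    table = bytearray(256)
    for value in range(256):
        try:
            char = bytes([value]).decode(from_codec)
            out = char.encode(to_codec)
        except UnicodeError:
            # No counterpart in the target codec (or undefined in the source).
            out = fallback
        # A table entry holds exactly one output byte.
        table[value] = out[0] if len(out) == 1 else fallback[0]
    return bytes(table)


# ISO 8859-1 to code page 437; both are single-byte codecs in Python.
latin1_to_cp437 = build_table("iso8859-1", "cp437")

# Conversion is then a plain per-byte lookup.
converted = b"caf\xe9".translate(latin1_to_cp437)
print(converted)
```

Because each entry holds one output byte, this scheme cannot represent
multibyte codesets, which is consistent with the need for algorithmic
converters described earlier.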
Names of codeset converters are in the following form:
from-codeset_to-codeset
For example, the following converter converts values from Super DEC
Kanji to Japanese Extended UNIX Code:
sdeckanji_eucJP
The codeset converters produce an invalid character error in response
to characters that cannot be converted from the source codeset to the
destination codeset. This error is always produced for character codes
that are invalid in the source codeset. However, if the error results
from characters that are valid in the source codeset but have no coun‐
terparts in the destination codeset, you can eliminate the error by
defining the ICONV_DEFSTR environment variable to specify a substitute
output string. See the ENVIRONMENT VARIABLES section for more informa‐
tion about using the ICONV_DEFSTR variable.
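The two error cases described above can be sketched as follows. This is
an illustration in Python rather than the behavior of the Tru64
converters: Python codecs stand in for codeset converters, the convert
function and its substitute parameter are hypothetical, and the
per-character loop is only suitable for stateless target codecs such as
ISO 8859-1.

```python
def convert(data, from_codec, to_codec, substitute=None):
    """Convert bytes between codecs, optionally substituting unmappable characters."""
    # Bytes that are invalid in the source codec always raise here;
    # the substitute string has no effect on this error.
    text = data.decode(from_codec)
    out = bytearray()
    for char in text:
        try:
            out += char.encode(to_codec)
        except UnicodeEncodeError:
            # Valid in the source codec, but no counterpart in the target.
            if substitute is None:
                raise
            out += substitute.encode(to_codec)
    return bytes(out)


# The euro sign is valid UTF-8 but has no counterpart in ISO 8859-1.
euro = "price: 5\u20ac".encode("utf-8")
print(convert(euro, "utf-8", "iso8859-1", substitute="?"))
```

Without a substitute, the same input raises UnicodeEncodeError, and an
invalid input byte such as 0xFF raises UnicodeDecodeError whether or not
a substitute is given.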
It is possible to convert data directly between two codesets or by way
of an intermediate codeset, such as UTF-16, UCS-4, or UTF-8. For con‐
version of Chinese characters, be aware that the results of converting
a Traditional Chinese codeset directly to a Simplified Chinese codeset
may not be the same as the results of converting Traditional Chinese
first to UTF-16, UCS-4, or UTF-8 and then to Simplified Chinese.
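The intermediate route can be sketched in Python, whose codecs always
pass through Unicode. The sketch shows only the two-step route; it does
not model a direct Traditional-to-Simplified converter, and the
convert_via_unicode name is hypothetical. The big5 and gb2312 codecs are
Python's own and are assumed here to be reasonable stand-ins for a
Traditional Chinese and a Simplified Chinese codeset.

```python
def convert_via_unicode(data, from_codec, to_codec):
    """Convert by decoding to Unicode first, then encoding to the target."""
    text = data.decode(from_codec)   # source codeset -> Unicode
    return text.encode(to_codec)     # Unicode -> destination codeset


# Both characters exist in Big5 and in GB 2312, so the pivot succeeds.
big5_bytes = "中文".encode("big5")
gb_bytes = convert_via_unicode(big5_bytes, "big5", "gb2312")
print(gb_bytes.decode("gb2312"))
```

A pivot of this kind maps each character to the same Unicode code point
in the destination codeset. It performs no Traditional-to-Simplified
substitution, which may be one reason the two routes can give different
results.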
ENVIRONMENT VARIABLES
Some codeset converters require more complex algorithms than can be
provided through tables. The following environment variables provide
control over conversion behavior for different kinds of codeset con‐
verters:
ICONV_ACTION
Controls the behavior for the many-to-one value conversions for
conversion of Traditional Chinese (except for Traditional Chinese
encoded in Telecode) to Simplified Chinese. The valid settings for this
environment variable are as follows:
  batch
    Specifies that the preferred mapping value (the first one in the
    one-to-many mapping list) is always taken. The batch setting is the
    ICONV_ACTION default.
  conv_all
    Specifies that all the possible values are printed to the standard
    output, enclosed by braces ({ }), so that the user can later
    manually edit the converted file and select the one to use.
  conv_all_nosym
    Specifies that all the possible values are printed to the standard
    output except for punctuation symbols, for which only the preferred
    mapping value is printed. As is true for conv_all, the
    conv_all_nosym setting prints value choices enclosed by braces so
    that the converted file can later be edited.
ICONV_BYTEORDER
Sets byte ordering for UTF-16 or UCS-4 (UTF-32) converters only. Valid
values are little-endian or big-endian.
If ICONV_NOBOM is set to a non-null value, the default byte
ordering is big-endian. If ICONV_NOBOM is not set, the default
byte ordering is little-endian. Setting the ICONV_BYTEORDER and
ICONV_NOBOM environment variables may be necessary when produc‐
ing UTF-16 or UCS-4 output that will be processed by codeset
converters on platforms other than Tru64 UNIX.
ICONV_DEFSTR
Defines the default string to be substituted in output for valid input
characters that cannot be converted from the source codeset to the
destination codeset. The variable value can be an arbitrary string or a
code number. If the value is a code number (for example, 10, 07, 0x10,
or, for Unicode converters, U+1234), the corresponding character in the
output codeset (to-codeset) is printed.
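The code-number forms accepted for ICONV_DEFSTR can be told apart from
arbitrary strings along the following lines. This is a sketch under
assumptions: the reference page does not say how the converters
distinguish a code number from a string, and reading a leading zero as
octal (as in C integer constants) is a guess. The parse_defstr name is
hypothetical.

```python
import re

# (pattern, numeric base) pairs, tried in order. The octal rule is an
# assumption modeled on C integer constants.
_CODE_PATTERNS = [
    (re.compile(r"[Uu]\+([0-9A-Fa-f]+)"), 16),  # U+1234
    (re.compile(r"0[xX]([0-9A-Fa-f]+)"), 16),   # 0x10
    (re.compile(r"0([0-7]+)"), 8),              # 07
    (re.compile(r"([0-9]+)"), 10),              # 10
]


def parse_defstr(value):
    """Return ("code", number) for a code number, else ("string", value)."""
    for pattern, base in _CODE_PATTERNS:
        match = pattern.fullmatch(value)
        if match:
            return ("code", int(match.group(1), base))
    return ("string", value)


for sample in ("10", "07", "0x10", "U+1234", "?"):
    print(sample, parse_defstr(sample))
```

A code number names a character in the output codeset, so a caller
would still need to look that value up in the destination codeset rather
than treat it as a Unicode code point in every case.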
For a given type of codeset conversion, a matching ICONV_DEF‐
STR_from-codeset_to-codeset variable has precedence over the
ICONV_DEFSTR variable without the from-codeset_to-codeset suf‐
fix. When defining the variable with the suffix, replace from-
codeset_to-codeset with the name of the codeset converter to
which the variable applies. The ICONV_DEFSTR variable (defined
without the suffix) is used by a converter when no ICONV_DEF‐
STR_from-codeset_to-codeset variable has been defined specifi‐
cally for the type of conversion being done.
If these variables are not defined or are set to the null
string, the characters that cannot be converted are skipped and
have no representation in converted output.
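The precedence between the suffixed and unsuffixed ICONV_DEFSTR
variables can be sketched as a two-step lookup. The lookup_defstr name
is hypothetical, the codeset names big5 and eucTW are placeholders, and
one point is an assumption: a suffixed variable that is defined but set
to the null string is treated here as taking precedence, so the
unsuffixed variable is not consulted.

```python
def lookup_defstr(env, from_codeset, to_codeset):
    """Return the substitute string for a conversion, or None to skip characters."""
    specific = f"ICONV_DEFSTR_{from_codeset}_{to_codeset}"
    if specific in env:
        # A converter-specific variable overrides the generic one.
        value = env[specific]
    else:
        value = env.get("ICONV_DEFSTR")
    # Undefined or null: unconvertible characters are skipped.
    return value or None


env = {"ICONV_DEFSTR": "?", "ICONV_DEFSTR_big5_eucTW": "#"}
print(lookup_defstr(env, "big5", "eucTW"))   # converter-specific value
print(lookup_defstr(env, "eucTW", "big5"))   # falls back to ICONV_DEFSTR
```

In a real program, env would typically be os.environ; a plain dictionary
is used here so that the example does not depend on the caller's
environment.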
The following converter-specific restrictions apply to ICONV_DEFSTR*
variables:
  - ICONV_DEFSTR* environment variables do not work for converters that
    convert between Japanese codesets or between Korean codesets.
  - For converters that handle UTF-16, UCS-4, or UTF-8 format, the only
    valid variable value is a code number (such as U+1234 or 0x10) or a
    string whose value is a single ASCII character (such as ?). For
    these converters, any string value other than a single ASCII
    character is ignored and any characters that cannot be converted
    have no representation in output.
  - For converters that handle output in UTF-16, UCS-4, or UTF-8 format,
    characters that cannot be converted and for which no valid
    ICONV_DEFSTR* value has been defined produce an error condition that
    aborts the conversion process.
ICONV_NOBOM
Disables generation of the byte-order mark (BOM) at the beginning of
UTF-16 or UCS-4 (UTF-32) output. A valid setting is any value other than
a null string. If ICONV_NOBOM is set, big-endian is established as the
default byte ordering and BOM generation is disabled. If ICONV_NOBOM is
not set, little-endian is established as the default byte ordering and
BOM generation is enabled.
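The interaction between ICONV_NOBOM and ICONV_BYTEORDER can be modeled
as shown below for UTF-16 output. This is a sketch of the rules as
stated in this reference page, written with Python's UTF-16 codecs; it
is not the converters' implementation. The function names are
hypothetical, and letting an explicit ICONV_BYTEORDER value override the
default is an assumption.

```python
import codecs


def utf16_settings(env):
    """Return (byte order, emit BOM) from ICONV_NOBOM and ICONV_BYTEORDER."""
    nobom = bool(env.get("ICONV_NOBOM"))          # any non-null value counts
    default_order = "big-endian" if nobom else "little-endian"
    order = env.get("ICONV_BYTEORDER") or default_order
    if order not in ("little-endian", "big-endian"):
        raise ValueError(f"invalid ICONV_BYTEORDER value: {order!r}")
    return order, not nobom


def encode_utf16(text, env):
    """Encode text as UTF-16 using the settings derived from env."""
    order, with_bom = utf16_settings(env)
    if order == "big-endian":
        codec, bom = "utf-16-be", codecs.BOM_UTF16_BE
    else:
        codec, bom = "utf-16-le", codecs.BOM_UTF16_LE
    body = text.encode(codec)
    return bom + body if with_bom else body


print(encode_utf16("A", {}))                      # little-endian with BOM
print(encode_utf16("A", {"ICONV_NOBOM": "1"}))    # big-endian, no BOM
```

As with the other sketches, env is a plain dictionary standing in for
the process environment.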
Codeset converters that process UTF-16 or UCS-4 data on plat‐
forms other than Tru64 UNIX usually require the byte-order mark.
The ICONV_NOBOM and ICONV_BYTEORDER environment variables pro‐
vide you with the means to control the generation of a byte-
order mark and byte ordering. Thus, you can establish codeset
conversion that is appropriate to the requirements of other
platforms or is compatible with output produced by codeset con‐
verters that were included in versions of Tru64 UNIX prior to
Version 4.0D.
ICONV_PHRCONV
Activates phrase conversion for converters that convert from a
Traditional Chinese codeset (except for Traditional Chinese encoded in
Telecode) to a Simplified Chinese codeset or the reverse. When phrase
conversion is activated, a whole phrase in Traditional Chinese is
converted to a different phrase in Simplified Chinese or the reverse.
If ICONV_PHRCONV is set to mark, the converted phrases are bracketed by
[ and ] to highlight the conversion result for visual checking.
The phrase conversion databases in the /usr/share/phrdb direc‐
tory are normal text files with the same file names as those of
the algorithmic converters in /usr/lib/nls/loc/iconv/*. These
phrase conversion databases contain entries for phrase conver‐
sion pairs.
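Phrase conversion with optional marking can be sketched as a
longest-match replacement over a table of phrase pairs. The sketch makes
several assumptions: the format of the phrase conversion databases is
not described here, so a Python dictionary stands in for one; the
single phrase pair is illustrative only; and characters outside any
listed phrase are passed through unchanged, whereas a real converter
would also convert them individually. The convert_phrases name is
hypothetical.

```python
def convert_phrases(text, phrases, mark=False):
    """Replace known phrases in text; bracket replacements when mark is true."""
    # Try longer phrases first so a short entry cannot pre-empt a longer one.
    keys = sorted((key for key in phrases if key), key=len, reverse=True)
    out = []
    pos = 0
    while pos < len(text):
        for key in keys:
            if text.startswith(key, pos):
                replacement = phrases[key]
                out.append(f"[{replacement}]" if mark else replacement)
                pos += len(key)
                break
        else:
            # No phrase starts here; copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


# Illustrative phrase pair (Traditional -> Simplified), not taken from
# any shipped phrase database.
phrases = {"軟體": "软件"}
print(convert_phrases("新軟體", phrases, mark=True))
```

The mark argument mirrors the effect described for setting
ICONV_PHRCONV to mark, making converted phrases easy to find when
reviewing the output.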
FILES
/usr/lib/nls/loc/iconv/*
Algorithmic converters
/usr/lib/nls/loc/iconvTable
Table converters
/usr/share/phrdb
Phrase conversion databases
SEE ALSO
Commands: genxlt(1), iconv(1), phrase(1)
Functions: iconv(3), iconv_close(3), iconv_open(3)
Others: i18n_intro(5), l10n_intro(5)