Tchar I18N Text Abstraction



This documentation describes how to write C software that supports international charsets and compiles and runs on different platforms such as Linux, Microsoft Windows, BSD, or Mac OS X with little or no modification. In short, macros and typedefs are used to abstract the character type and all functions that operate on it. This permits the software to be compiled using plain 8 bit, multi-byte, or wide character encodings. Very little extra work is necessary to benefit from this technique, although there are pitfalls that will be described in detail.

Like most modules in Libmba, the ideas and code can be extracted oradapted to meet the needs of your application and do not bind you to aparticular "environment" or library. Although the text.h typedefs and macros are used throughout the libmba package, users can simply choose to ignore them and pass char * to functions that accept tchar * (or wchar_t * if libmba has been compiled with -DUSE_WCHAR).

Unicode, Charsets, and Character Encodings

To use this technique successfully it is essential to understand that each non-ASCII character may occupy a variable number of bytes in memory. Examples will be given below that illustrate why this is important, but first some background information about Unicode, charsets, and character encodings might be useful.

Consider the Russian character called CYRILLIC CAPITAL LETTER GHE, which looks like an upside-down 'L' and has the Unicode value U+0413. This character's value will be different depending on which charset is being discussed, but Unicode is the international standard superset of virtually all charsets, so unless otherwise specified Unicode will be used to describe characters throughout this documentation. The number of bytes that the U+0413 character occupies depends on the character encoding being used. Notice that a charset and a character encoding are different things.

Charsets A charset (or character set) is a map that defines which numeric value represents a particular character. In the Unicode charset CYRILLIC CAPITAL LETTER GHE is the numeric value 0x0413 (written U+0413 in the Unicode standard convention). Some example charsets are ISO-8859-1, CP1251, and GB2312.

Character Encodings A character encoding defines how the numeric values representing characters are serialized into a sequence of bytes so that they can be operated on in memory, stored on a disk, or transmitted over a network. For example, at least two bytes of memory would be required to hold the value 0x0413. If this value were simply stored as the byte 0x04 followed by the byte 0x13, this encoding would be UCS-2BE, meaning Unicode Character Set, 2 bytes, big-endian. The following are examples of some other Unicode character encodings:

  • UCS-4 simply serializes each Unicode value into a sequence of 4 bytes. The largest possible Unicode value can be encoded in 32 bits. This is not a variable width encoding. If the string "hello", which is 6 characters if the '\0' terminator is included, was encoded in UCS-4 it would occupy 6 * 4 or 24 bytes of memory. Glibc encodes wide characters (wchar_t) using this encoding.
  • UCS-2 is like UCS-4 but only 2 bytes are used to encode each character. This is a much more reasonable use of memory but it cannot encode every value of the Unicode charset. This is not a variable width encoding. If the string "hello" including a '\0' terminator was encoded in UCS-2 it would occupy 6 * 2 or 12 bytes of memory.
  • UTF-16 is like UCS-2 but it uses certain bits to indicate that an additional pair of bytes follows, which permits values of more than 16 bits to be encoded. So most characters are encoded in 2 bytes but 4 bytes may be used if necessary. Therefore, UTF-16 is a variable width encoding. The Microsoft Windows platform uses UTF-16LE (the LE means little-endian) almost exclusively to represent Unicode strings. For a description of UTF-16 read the description at czyborra.com.
  • UTF-8 is like UTF-16 because it uses certain bits (actually just the highest bit) to indicate that additional bytes follow, which permits values of more than 8 bits to be encoded. The number of additional bytes varies; a complete sequence, called a "multi-byte sequence", may be up to 4 bytes long (6 in the original definition). Therefore UTF-8 is also a variable width encoding. It cannot be assumed that non-ASCII characters will occupy one byte of memory, and it is not correct to calculate the number of characters in a string by subtracting a pointer to the beginning of the string from a pointer to the end.

    For example the sequence of bytes representing the Unicode character U+0413 in UTF-8 are 0xD0 followed by 0x93. This multi-byte sequence was determined by using a hexedit program to create a file called ucs2.bin containing 2 bytes; 0x04 followed by 0x13. As described before, this encoding is UCS-2BE. To convert the file from UCS-2BE to UTF-8 the command $ iconv -f UCS-2BE -t UTF-8 ucs2.bin > utf8.bin was used followed by hexedit again to view the results.

    UTF-8 is the premier character encoding used on Unix and Unix-like platforms. For a complete description of UTF-8 read the UTF-8 and Unicode FAQ for Unix/Linux.

I18N Text Handling

There are primarily three techniques for managing I18N strings in a C program.
  • Plain C Strings Traditionally C strings used the plain char type with an 8 bit charset, which does not require a variable width encoding. Character sets other than ASCII or ISO-8859-1 (a.k.a. Latin1) can be used if a different codepage is set, but the program is then largely committed to that one charset. The Microsoft Windows codepage for Russian is called CP1251 or WinCyrillic. The hex code for CYRILLIC CAPITAL LETTER GHE in CP1251 is 0xC3.
  • Wide Character Strings To permit multiple international charsets to be used within the same program, wide character strings were introduced along with the larger wchar_t type. This character type does not define the charset or encoding used by the host C library; however, all of the platforms mentioned previously use the Unicode charset.
  • Multi-Byte Strings Because wchar_t is larger than one byte, many programs would require significant modification to be converted to use wide characters. The UTF-8 encoding was devised specifically to permit the traditional char pointer strings to be used which reduces the complexity of internationalizing existing programs and kernel structures.

The Tchar Text Abstraction

Writing a program that will compile and run without modification on Linux, Windows, and a variety of other platforms is a matter of abstracting the techniques listed above as used by each platform. Linux uses both multi-byte and wide character encodings. Windows uses wide characters; however, it is important to note that Windows does not support a UTF-8 locale, so if Unicode is desired wide character strings are the only option. Most other Unix and Unix-like systems support multi-byte strings as well as possibly wide character strings to different degrees. Programs written using the technique described here will still permit runtime defined codepages using the standard setlocale(3) mechanism.

The Tchar Type

The idea behind this technique is to use a typedef for the character type that resolves to either plain char or wchar_t. In this way the character type identifier does not change in the source code.
  #ifdef USE_WCHAR
  typedef wchar_t tchar;
  #else
  typedef unsigned char tchar;
  #endif
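String literals need the same treatment, since "hello" and L"hello" have different types. A companion macro in the style of the Windows TEXT()/_T() macro handles this (the TEXT name and the cast are illustrative for this sketch; consult text.h for what libmba actually provides):

```c
#include <wchar.h>

#ifdef USE_WCHAR
typedef wchar_t tchar;
#define TEXT(s) L##s
#else
typedef unsigned char tchar;
#define TEXT(s) s
#endif

/* The same source line yields a wide or narrow literal; the cast quiets
 * the signed/unsigned mismatch discussed under Potential Problems below. */
const tchar *greeting = (const tchar *)TEXT("hello");
```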

Abstracting String Functions

In addition to the character type, all functions that operate on it will need to be abstracted with macros that reference tchar rather than char or wchar_t. Consider the strncpy function. It uses the plain char type. Fortunately the major string functions have a wide character equivalent that usually has the same signature but accepts the wchar_t type.
  char *strncpy(char *dest, const char *src, size_t n);
  wchar_t *wcsncpy(wchar_t *dest, const wchar_t *src, size_t n);
From the above signatures it can be seen that the only difference is the character type. The number, order, and meaning of the parameters are the same. This permits the function to be abstracted with macros as follows:
  #ifdef USE_WCHAR
  #define tcsncpy wcsncpy
  #else
  #define tcsncpy strncpy
  #endif
To use this function is now a matter of substituting all instances of strncpy or wcsncpy with tcsncpy. Depending on how the program is compiled, code that uses these functions will support wide character or multi-byte strings (but not both at the same time). See the Text Module API Documentation for a complete list of macros in text.h.

There are of course many other functions that operate on strings. Fortunately most standard C library functions have wide character versions that are reasonably consistent about identifier names. An identifier that begins with str will likely have a wide character version that begins with wcs. Other functions like vswprintf are not so obvious, and depending on the system being used there will certainly be omissions or incompatibilities (e.g. the wide character counterpart of vsnprintf is vswprintf, without the n, even though it accepts an n parameter). If a function does not have a man page or if the compiler issues a warning, it does not necessarily mean the function does not exist on your system. For example, with the GNU C library it may be necessary to specify C99 or define _XOPEN_SOURCE=500 to indicate a UNIX98 environment is desired. Check your C library documentation (e.g. /usr/include/features.h) and the POSIX documentation on the Open Group website. On my RedHat Linux 7.3 system wcstol and several other conversion functions are not documented; it is necessary to specify -std=c99 or define -D_ISOC99_SOURCE with gcc to get that symbol exported.
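To make the pattern concrete, a small subset of such mappings, together with a function that compiles unchanged either way, might look like the following. This is a sketch, not the actual contents of text.h; plain char is used instead of unsigned char here to sidestep the signedness warnings discussed later, and the tcopylen helper is invented for this example:

```c
#include <string.h>
#include <wchar.h>

#ifdef USE_WCHAR
typedef wchar_t tchar;
#define TEXT(s) L##s
#define tcslen  wcslen
#define tcsncpy wcsncpy
#define tcscmp  wcscmp
#else
typedef char tchar;
#define TEXT(s) s
#define tcslen  strlen
#define tcsncpy strncpy
#define tcscmp  strcmp
#endif

/* Copy src into dst (at most n elements, always terminated) and return
 * the resulting length; compiles as either the str* or wcs* version. */
size_t tcopylen(tchar *dst, const tchar *src, size_t n)
{
    tcsncpy(dst, src, n);
    dst[n - 1] = TEXT('\0');
    return tcslen(dst);
}
```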

Variable Width Encodings

Unicode on Unix and Unix-like systems is supported using UTF-8. On Microsoft Windows UTF-16LE is used. As explained previously these are variable width encodings. Each character can occupy a variable number of bytes in memory. The question is: when does this require special processing in your code?

A good example of when UTF-8 strings require special handling is when each character needs to be examined individually. Consider the example of caseless comparison of two strings. They cannot simply be compared element by element. Each character must be decoded to its wide character value and converted to upper or lowercase for the comparison to be valid. Below is just such a function:

  /* Case insensitive comparison of two UTF-8 strings
   */
  int
  utf8casecmp(const unsigned char *str1, const unsigned char *str1lim,
          const unsigned char *str2, const unsigned char *str2lim)
  {
      int n1, n2;
      wchar_t ucs1, ucs2;
      int ch1, ch2;
      mbstate_t ps1, ps2;

      memset(&ps1, 0, sizeof(ps1));
      memset(&ps2, 0, sizeof(ps2));
      while (str1 < str1lim && str2 < str2lim) {
          if ((*str1 & 0x80) && (*str2 & 0x80)) { /* both multibyte */
              if ((n1 = mbrtowc(&ucs1, (const char *)str1, str1lim - str1, &ps1)) < 0 ||
                      (n2 = mbrtowc(&ucs2, (const char *)str2, str2lim - str2, &ps2)) < 0) {
                  PMNO(errno);
                  return -1;
              }
              if (ucs1 != ucs2 && (ucs1 = towupper(ucs1)) != (ucs2 = towupper(ucs2))) {
                  return ucs1 < ucs2 ? -1 : 1;
              }
              str1 += n1;
              str2 += n2;
          } else { /* neither or one multibyte */
              ch1 = *str1;
              ch2 = *str2;

              if (ch1 != ch2 && (ch1 = toupper(ch1)) != (ch2 = toupper(ch2))) {
                  return ch1 < ch2 ? -1 : 1;
              } else if (ch1 == '\0') {
                  return 0;
              }
              str1++;
              str2++;
          }
      }

      return 0;
  }
This is a fairly pathological example. In practice this is probably as difficult as it gets. For example, if the objective is to search for a certain ASCII character such as a space or '\0' terminator, it is not necessary to decode a Unicode value at all. It might even be reasonable to use isspace and similar functions (but probably not ispunct for example). This will require some experimenting and research.

Another example of when a variable width encoding requires special handling in your code is when calculating the number of bytes from a string required to occupy at most a certain number of display positions in a terminal window. In this case it is necessary to convert each character to its Unicode value and then use the wcwidth(3) function. When the total of the values returned by wcwidth(3) equals or exceeds the desired number of columns, the number of bytes traversed in the substring is known.

Potential Problems

This technique is not perfect. The wide character functions were not designed with this technique in mind. The prototypes are largely the same only for the sake of consistency. It is important to understand where problems can occur and how to correctly fix or avoid them.
  • Wide Character I/O It is not possible to mix wide character I/O functions like wprintf, fgetwc, and fputws with regular I/O functions. If a wide character I/O function is used the associated stream will switch into wide mode. Attempting to use both will result in erroneous behaviour (e.g. ESPIPE Illegal seek). All I/O could be performed in wide mode, but on non-Microsoft platforms it can be awkward to perform all I/O entirely in wide character mode. Note however that this restriction only applies to functions that cause I/O on a stream; for example, the swprintf function is fine alongside non-wide I/O. On non-Microsoft platforms it is recommended that the wide character I/O functions simply be avoided. They ultimately just convert wide characters to multi-byte sequences, and if an unexpected encoding is encountered it will be more difficult to detect and perform corrective action.

    Ultimately, if the target code is reading and writing plain text to sockets or files on the filesystem, the text will probably need to be converted to and from a well defined encoding like the locale dependent encoding with wcsrtombs(3) and mbsrtowcs(3). Currently the libmba text module does not define macros for the wide character I/O functions but that may change in the future. See src/cfg.c for a good example of converting between wide character strings and the multi-byte encoding in files and the environment.

  • The File System Another form of wide character I/O pertains to the handling of file and directory names. The encoding used to read and write path names depends on the operating system and filesystem. On Linux, for example, wide character strings cannot be used as filenames; Linux requires that the multi-byte encoding be used. This means that any wide character pathname must be converted to the locale dependent encoding using wcsrtombs(3), and any pathname read from the operating system may need to be converted to the wide character encoding using mbsrtowcs(3).

    The source below illustrates how a wide character pathname could be converted to the multi-byte encoding for passing to fopen(3).
    /* Open a file using a wide character path name.
     */
    FILE *
    wcsfopen(const wchar_t *path, const char *mode)
    {
        char dst[PATH_MAX + 1];
        size_t n;
        mbstate_t mb;

        memset(&mb, 0, sizeof(mbstate_t));
        if ((n = wcsrtombs(dst, &path, PATH_MAX, &mb)) == (size_t)-1) {
            return NULL;
        }
        if (n >= PATH_MAX) {
            errno = E2BIG;
            return NULL;
        }

        return fopen(dst, mode);
    }

    Currently, libmba modules that support the tchar abstraction do not accept wide character pathnames but that may change in the future.

    Note that Unicode pathnames are supported by Unix and Unix-like systems that support the UTF-8 multi-byte encoding. Just call setlocale(LC_CTYPE, "en_US.UTF-8") first, or export LC_CTYPE=en_US.UTF-8 in the environment and call setlocale(LC_CTYPE, ""). To test such a program it will be necessary to see I18N text printed somewhere. The following is a worthwhile exercise:
    $ wget http://www.columbia.edu/kermit/utf8.html
    $ xterm -u8 -fn '-misc-fixed-*-*-*-*-20-*-*-*-*-*-iso10646-*'
    $ LANG=en_US.UTF-8 cat utf8.html

    This downloads a file with a wide range of UTF-8 encoded text in it, launches an xterm in UTF-8 mode with a Unicode font, and runs cat in the UTF-8 locale to print the contents of utf8.html to the terminal window. Some newer Linux systems use the UTF-8 locale by default now so the above setup may not be necessary.

  • Format Specifiers To format a string with snprintf the format specifier %s is used. For this text abstraction to work completely, the equivalent wide character function swprintf would have to use %s for wide character strings as well. It does not. Both snprintf and swprintf use %s to specify regular strings and %ls to specify wide character strings. This means that even though stprintf resolves to either snprintf or swprintf, the format specifiers need to be different depending on the arguments to stprintf. This will require some conditional preprocessing such as the following example.
    #if defined(USE_WCHAR)
    if ((n = swprintf(path, sizeof path / sizeof *path, L"/var/spool/mail/%ls", username)) == -1) {
    #else
    if ((n = snprintf(path, sizeof path, "/var/spool/mail/%s", username)) == -1) {
    #endif
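    An alternative to conditional preprocessing at every call site is to abstract the conversion specifier itself, in the style of the PRId64 macros from <inttypes.h>. The T(), PRItstr, and mailpath names below are inventions for this sketch, not part of text.h:

```c
#include <stdio.h>
#include <wchar.h>

#ifdef USE_WCHAR
#define T_(s)    L##s
#define T(s)     T_(s)          /* extra level so macro arguments expand */
#define PRItstr  "ls"
#define stprintf swprintf
typedef wchar_t tchar;
#else
#define T(s)     s
#define PRItstr  "s"
#define stprintf snprintf
typedef char tchar;
#endif

/* Adjacent string literals concatenate, so one call site works both ways:
 * the format becomes "...%s" or L"...%ls" depending on USE_WCHAR. */
int mailpath(tchar *path, size_t n, const tchar *username)
{
    return stprintf(path, n, T("/var/spool/mail/%") T(PRItstr), username);
}
```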

  • Prototype Mismatch It is not uncommon for prototype mismatches to occur. Some examples are:
    • If non-wide characters are used, tchar is unsigned char, which mismatches with functions that take plain char * or char ** arguments. With gcc these generate warnings like:
          tests/TcharAll.c:100: warning: comparison of distinct pointer types lacks a cast
      tests/TcharAll.c:161: warning: passing arg 2 of `strtod' from incompatible pointer type

    • The constant TEOF is defined as either EOF which is signed or WEOF which is unsigned. This can provoke the compiler to emit type mismatch warnings.
    • Some wide character functions exhibit behavior different from that of their counterpart. For example swprintf will return -1 if the n parameter is not large enough to accommodate the result. However the snprintf function will return the length of the result regardless of whether or not the n parameter was large enough (although I believe the latter behavior was introduced with C99, which is quite bizarre).
  • Simple Errors Be diligent when manipulating text directly now that characters can occupy more than 1 byte. Frequently this just means multiplying some value by the size of an element, such as when calculating the number of bytes occupied by a run of text (or use tmemcpy):
        siz = (src - start + 1) * sizeof *src;

There are most certainly other problems and incompatibilities that I have omitted here. If you encounter any such example, please drop me a mail.

TCHAR in Microsoft Windows

For programmers who have used the variety of string handling functions on the Microsoft Windows platform this character abstraction technique should look familiar. It is indeed the same. The abstract character type in the Win32 environment is named TCHAR in uppercase rather than lower, and the string functions are prefixed with _tcs, like _tcsncpy rather than tcsncpy, but after macro processing the resulting code is the same. The identifier names were chosen to be the same as those found on Windows (minus a few Windows coding conventions that clash with Unix/Linux conventions) simply because the Windows platform is very popular and there was no practical reason to use different names. The exception is that USE_WCHAR is used to signal that wide characters should be used rather than _UNICODE, because on Unix and Unix-like systems multi-byte strings support Unicode in the UTF-8 locale, which would make the _UNICODE macro somewhat inaccurate.