Bytes vs. Characters - the "Character-based" Functions

by Rick Fillman
May 7, 2003

String parsing is nearly always intended to parse characters, rather than bytes. While this distinction is unimportant when the string data in question consists of single-byte characters (as in English and European alphabets), the distinction becomes paramount when considering Asian alphabets such as those used in China and Japan, where characters are usually represented by a two-byte combination.

So, can your application handle double-byte characters? Due to the architecture of the 32-bit versions of dBASE software, your dBASE PLUS application is generally prepared for handling Japanese or Chinese data without special modification. (The same is true for dB2K and Visual dBASE 7.x applications.) But there could be one or more glitches in the processing of double-byte character strings (DBCS), depending on your code. Will string parsing of DBCS be a problem for your application?

The potential culprits are in the original xBASE string parsing functions such as:

Len()
substr()
at()
rat()
left()
right()
stuff()
like()

These functions are byte-based: they process data byte by byte. While this is fine for single-byte alphabets, applying them to DBCS strings can break characters in half or produce incorrect length values. Consider a string that is assigned two Kanji characters followed by the letter ‘b’. The len() function returns 5. Although this is the correct byte length, it is unlikely to be the desired result.
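
The byte count versus character count described above can be reproduced outside dBASE. What follows is a minimal sketch in Python, not dBASE code. It assumes the string is stored in Shift-JIS (a common double-byte encoding for Japanese), and the two Kanji are arbitrary stand-ins chosen for illustration.

```python
# Illustrative only: Python, not dBASE. Assumes Shift-JIS storage;
# the two Kanji below are arbitrary stand-ins for the article's example.
text = "日本b"                     # two Kanji followed by the letter 'b'
raw = text.encode("shift_jis")     # the byte stream a byte-based function sees

byte_length = len(raw)             # what a byte-based length reports
char_length = len(text)            # what a character-based length reports
print(byte_length, char_length)    # 5 3

# Cutting the byte stream after 3 bytes splits the second Kanji in half,
# so the fragment is no longer valid text.
try:
    raw[:3].decode("shift_jis")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)                    # False
```

The failed decode at the end is the "character broken in half" problem: a byte-based cut can land between the two bytes of one character.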

Early versions of dBASE were re-engineered for the Japanese market. A parallel set of character-based functions was added to deal with DBCS. These functions are character-sensitive. As such, they correctly count and parse characters (whether single- or double-byte) in the 'multi-byte' stream. The “character” versions of the functions were given distinct names, with an added “C”, as follows:

cLen()
substrC()
atC()
ratC()
leftC()
rightC()
stuffC()
likeC()

Until the appearance of the 32-bit Windows version of dBASE, the “C” versions of these functions were of concern primarily to dBASE programmers in Japan. American and European versions of dBASE didn’t include these functions. However, beginning with Visual dBASE 7, the special Asian builds of the dBASE product were eliminated in favor of a single set of binary executables that could be used anywhere in the world. In today’s dBASE software, you can find both versions of each function.

Let’s return to the example of the two Kanji plus one English letter.

Now, using the cLen() function instead of the old Len() function, the desired character count of 3 is returned. This is the correct number of characters in the string.
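
To show what "character-based" parsing means at the byte level, here is a hedged sketch in Python (not dBASE, and not a description of how dBASE implements its functions). It assumes Shift-JIS, where a byte in the range 0x81-0x9F or 0xE0-0xFC starts a two-byte character. The names `is_lead_byte`, `split_chars`, `clen` and `substr_c` are hypothetical, chosen only to echo the function names above.

```python
# Illustrative only: Python, not dBASE, and not dBASE's actual implementation.
# Assumes Shift-JIS: a lead byte in 0x81-0x9F or 0xE0-0xFC starts a
# two-byte character; every other byte is treated as a single-byte character.

def is_lead_byte(b: int) -> bool:
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def split_chars(raw: bytes) -> list:
    """Split a byte stream into one bytes object per character."""
    chars = []
    i = 0
    while i < len(raw):
        width = 2 if is_lead_byte(raw[i]) else 1
        chars.append(raw[i:i + width])
        i += width
    return chars

def clen(raw: bytes) -> int:
    """Character count, as opposed to len(raw), which is the byte count."""
    return len(split_chars(raw))

def substr_c(raw: bytes, start: int, count: int) -> bytes:
    """Character-based substring; start is 1-based, as in xBASE substr()."""
    return b"".join(split_chars(raw)[start - 1:start - 1 + count])

raw = "日本b".encode("shift_jis")   # arbitrary example: two Kanji plus 'b'
print(len(raw), clen(raw))          # 5 3
print(substr_c(raw, 2, 2) == "本b".encode("shift_jis"))  # True
```

Because the walk advances two bytes whenever it meets a lead byte, it never cuts a double-byte character in half, which is the property the “C” functions provide.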

Everyone should use the character versions of these functions and abandon the byte-based ones (except in the rare cases where byte values are actually desired). Why? Because they offer “superset” functionality: they work exactly as you expect with single-byte character strings today, and they also work properly with DBCS. Thus, if you code using the character-based functions, your applications are prepared to handle Japanese or Chinese data, should the need arise.

There is an alternative way to coax character-based evaluation of strings. A set command can be used as follows:

set dbcsstr on

After setting dbcsstr to on, functions substr(), at(), rat(), left(), right(), and stuff() will perform character-based rather than byte-based parsing. The Len() function, however, does not respect the dbcsstr flag. Thus, to ensure proper function for world alphabets, cLen() is the only alternative. Or is it?

Do you use String object methods? String objects were surfaced with the release of 32-bit dBASE. Their length and parsing methods are character-based, so no special consideration is needed for world alphabets. Perhaps you think that using String objects involves several lines of code. Consider the following:

? myString.length // a property
? myString.substring( 0,2 )

A String object is implied, and thus all its properties and methods are available. You don't even need to assign the string to a variable: a property or method can be applied directly to a string value. Take a look at all thirty-seven methods that are available with Strings.

Finally, there is one additional function to consider - isdbcs(). When handling characters, if you need to know whether any one character is single or double-byte, this function may be used. For example:

isdbcs( <string value> )
<String>.isDBCS( ) // as a method of String object

This returns true when the first character of the string is a double-byte character. Look for dBASE, Inc. to add descriptions of these functions to the product documentation in the future. For the trivia lovers, there are some functional equivalents of isdbcs() that may be of interest (and which will probably remain undocumented). These are iskanji(), ishangul(), and ischina(); each is equivalent to the more generic isdbcs().
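
The test that isdbcs() performs can be approximated outside dBASE. The sketch below is Python, not dBASE, and only mirrors the behavior described above. The helper name `is_dbcs` and its `encoding` parameter are hypothetical, and Shift-JIS is assumed as the default encoding.

```python
# Illustrative only: Python, not dBASE. The helper name and the default
# encoding are assumptions made for this sketch.

def is_dbcs(text: str, encoding: str = "shift_jis") -> bool:
    """Return True when the first character of text encodes to two bytes."""
    if not text:
        return False
    return len(text[0].encode(encoding)) == 2

print(is_dbcs("日本b"))   # True: the string starts with a Kanji
print(is_dbcs("b日本"))   # False: the string starts with a single-byte letter
```

As in the description above, only the first character is examined, so a string that merely contains double-byte characters later on still yields false.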