How it used to be: ASCII, EBCDIC, ISO, and Unicode (Part 2)

2011-10-18 16:47 · Category: Consumer Electronics

EBCDIC Code(s)

The ASCII code discussed above was quickly adopted by the majority of American computer manufacturers, and it was eventually turned into an international standard (see also the discussions on ISO and Unicode later in this paper). However, IBM already had its own 6-bit code called BCDIC (Binary Coded Decimal Interchange Code). Thus, IBM decided to go its own way, and it developed a proprietary 8-bit code called the Extended Binary Coded Decimal Interchange Code (EBCDIC).

 

Pronounced "eb-sea-dick" by some and "eb-sid-ick" by others, EBCDIC was first used on the IBM 360 computer, which was presented to the market in 1964. As was noted in our earlier discussions, one of the really nice things about ASCII is that all of the alpha characters are numbered sequentially. In turn, this means that we can perform programming tricks like saying "char = 'A' + 23" and have a reasonable expectation of ending up with the letter 'X'. To cut a long story short, if you were thinking of doing this with EBCDIC ... don't. The reason we say this is apparent from the table shown in Figure 3.

 

 

 

[Image: ebcdic-01.gif]

Figure 3: EBCDIC character codes.

 

A brief glance at this illustration shows just why EBCDIC can be such a pain to use – the alphabetic characters don't have sequential codes. That is, the letters 'A' through 'I' occupy codes $C1 to $C9, 'J' through 'R' occupy codes $D1 to $D9, and 'S' through 'Z' occupy codes $E2 to $E9 (and similarly for the lowercase letters). Thus, performing programming tricks such as using the expression ('A' + 23) is somewhat annoying with EBCDIC. Another nuisance is that EBCDIC doesn't contain all of the ASCII codes, which makes transferring text files between the two representations somewhat problematic.

 

Once again, in addition to the standard alphanumeric characters ('a'...'z', 'A'...'Z' and '0'...'9'), punctuation characters (comma, period, semi-colon, ...), and special characters ('!', '#', '%', ...), EBCDIC includes a lot of strange mnemonics, such as ACK, NAK, and BEL, which were designed for communications purposes. Some of these codes are still used today, while others are, generally speaking, of historical interest only. A slightly more detailed breakdown of these codes is presented in Figure 4 for your edification and delight.

 

 

[Image: ebcdic-02.gif]

Figure 4: EBCDIC control codes.

As one final point of interest, different countries have different character requirements, such as the á, ê, and ü characters. Because IBM sold its computer systems around the world, it had to create multiple versions of EBCDIC. In fact, 57 different national variants eventually ended up wending their way across the planet. (A "standard" with 57 variants! You can only imagine how much fun everybody had when transferring files from one country to another.)

 

ISO and Unicode

Upon its introduction, ASCII quickly became a de facto standard around the world. However, the original ASCII didn't include all of the special characters (such as á, ê, and ü) that are required by the various languages that employ the Latin alphabet. Thus, the International Organization for Standardization (ISO) in Geneva, Switzerland, undertook to adapt the ASCII code to accommodate other languages.

 

In 1967, the organization released its recommendation ISO 646. Essentially, this left the original 7-bit ASCII code "as was," except that ten character positions were left open to be used to code for so-called "national variants."

 

ISO 646 was a step along the way toward internationalization. However, it didn't satisfy everyone's requirements; in particular (as far as the ISO was concerned), it wasn't capable of handling all of the scripts in use in and around Europe, such as the Arabic, Cyrillic, Greek, and Hebrew alphabets. Thus, the ISO created its standard ISO 2022, which described the ways in which 7-bit and 8-bit character codes were to be structured and extended.

 

The principles laid down in ISO 2022 were subsequently used to create the ISO 8859-1 standard. Unofficially known as "Latin-1," ISO 8859-1 is widely used for passing information around the Internet in Western Europe to this day. Full of enthusiasm, the ISO then set about defining an "all-singing, all-dancing" 32-bit code called the Universal Coded Character Set (UCS). Known as ISO/IEC DIS 10646 Version 1, this code was intended to employ escape sequences to switch between different character sets. The result would have been able to support up to 4,294,967,296 characters, which would have been more than sufficient to address the world's (possibly the universe's) character coding needs for the foreseeable future.

 

However, starting in the early 1980s, American computer companies began to consider their own solutions to the problem of supporting multilingual character sets and codes. This work eventually became known as the Unification Code, or Unicode for short. Many people preferred Unicode to the ISO 10646 offering on the basis that Unicode was simpler. After a lot of wrangling, the proponents of Unicode persuaded the ISO to drop 10646 Version 1 and to replace it with a Unicode-based scheme, which ended up being called ISO/IEC 10646 Version 2.

 

 
