
 

Bangla the International Mother Language Day

   
 

Where Is My Character?

If you are trying to find a specific character in Unicode, the first place to look is the code charts. These are printed in The Unicode Standard 4.0 (Addison-Wesley Pub Co; ISBN 0-321-18578-1). You can also find characters in the online code charts. The code charts are organized into blocks, which are groupings of related characters.

For each character defined in Unicode you will find an assigned code point: a hexadecimal number that is used to represent that character in computer data.
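As a small illustration of what a code point is, the following Python sketch converts between a character and its hexadecimal code point. The helper names code_point and from_code_point are made up for this example; they are not part of any Unicode tool.

```python
def code_point(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def from_code_point(hex_value: str) -> str:
    """Return the character for a code point written in hexadecimal."""
    return chr(int(hex_value, 16))


print(code_point("A"))           # U+0041
print(from_code_point("203B"))   # the REFERENCE MARK character
```

The same hexadecimal numbers are the ones printed beside each character in the code charts.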

Location

You may not find the character in what you think is the obvious spot. While the characters in Unicode are grouped into blocks, this is only a rough grouping because characters can be categorized many different ways. In particular, punctuation and symbols are applicable across a very wide range of usages and scripts (writing systems). Even the notion of a script itself is not well-defined; text in a given language may make use of characters from multiple scripts. For example, the digits 0-9 are in widespread use; the Devanagari danda is used across many Indic scripts.

Thus you may need to look in several locations to find your character. You may find the Character Index in The Unicode Standard 4.0 helpful for this. You can also do a text search in the online Unicode NamesList. For example, suppose you were searching for a Japanese kome. By opening up the NamesList in your browser, and searching for "Japanese kome", you would find it under the entry:

203B REFERENCE MARK
= Japanese kome
= Urdu paragraph separator
x (tibetan ku ru kha bzhi mig can - 0FBF)
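A similar search can be sketched in Python with the standard unicodedata module. This is only an approximation of a NamesList search: to our knowledge the module exposes formal character names but not informal aliases such as "Japanese kome", and the names it knows depend on the Unicode version bundled with your Python. The helpers describe and find_by_name are hypothetical names for this example.

```python
import unicodedata


def describe(hex_value: str) -> str:
    """Return the formal Unicode name for a code point given in hex."""
    return unicodedata.name(chr(int(hex_value, 16)), "<no name>")


def find_by_name(fragment: str) -> list[str]:
    """List code points (as hex) whose formal name contains the fragment."""
    fragment = fragment.upper()
    hits = []
    for cp in range(0x110000):
        # Unnamed code points fall back to "" and never match.
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            hits.append(f"{cp:04X}")
    return hits


print(describe("203B"))                                   # REFERENCE MARK
print(unicodedata.lookup("REFERENCE MARK") == "\u203b")   # True
print(find_by_name("reference mark"))
```

Searching by alias still requires the NamesList file itself, since that is where the "=" alias lines and "x" cross-references live.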

For Han characters (Chinese, Japanese, and Korean), you can use the Han Radical-Stroke Index in Chapter 17 of The Unicode Standard 4.0. You may also find it useful to do a text search in the Unihan file in the Unicode Character Database, which contains a lot of cross-reference material on Han characters.

Some of the technical reports contain characters organized in different ways. You may sometimes find that useful in finding your character. For example, see Collation charts, Script charts, Case Mapping charts, or Normalization charts. If you know what legacy character encoding your character is in, you might be able to find it in the draft ICU character mapping tables.

Variant Shapes

You may not find a character simply because the charts do not specify the exact shape; they only provide a representative shape for identification. For example, a lowercase Cyrillic p could appear with any of the following character shapes (also called glyphs). The second is customary for italic in Russia, and the third is customary for italic in Serbia:

[Figure: three glyphs for the lowercase Cyrillic p, labeled Cyrillic p, Russian Italic, and Serbian Italic]

Characters may also take on different shapes in different contexts. So, for example, the Arabic character hah may have four different basic shapes.

[Figure: representative shape of hah in the code chart, and its possible shapes in context]

The character you are looking for may be represented as a sequence of code points in Unicode. Here are examples of such characters, and their representation as a sequence of code points.

Character   Code Points      Linguistic Usage
ch          0063 0068        Slovak, traditional Spanish
tʰ          0074 02B0        Native American languages
x̣           0078 0323
ƛ̓           019B 0313
ą́           00E1 0328        Lithuanian
i̇́           0069 0307 0301
ト゚          30C8 309A        Ainu in kana transcription
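To see that each of these "characters" really is several code points, the sequences in the table can be assembled in Python. This is a minimal sketch; from_sequence is a hypothetical helper, and whether the result displays as a single shape depends on your font and rendering software.

```python
def from_sequence(seq: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in seq.split())


sequences = [
    "0063 0068",
    "0074 02B0",
    "0078 0323",
    "019B 0313",
    "00E1 0328",
    "0069 0307 0301",
    "30C8 309A",
]

for seq in sequences:
    text = from_sequence(seq)
    # len() counts code points, not the shapes a reader perceives.
    print(f"{seq:<16} {text}  ({len(text)} code points)")
```

Note that len() reports two or three for each entry, even where a reader would see one character.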

Similarly, you won't find the Indic half-forms in the code charts, since they are formed with a consonant + halant (virama). For example:

[Figure: representative shapes in the code chart, and the resulting display appearance]

Other Devanagari ligatures, such as ksha, are coded with sequences, as shown in Table 9-2 on page 229 of Section 9.1, Devanagari. For example:

[Figure: representative shapes in the code chart, and the resulting display appearance]

In addition, the joining control characters can be used to request specific appearances, as in Table 9-1. For example:

[Figure: representative shapes in the code chart, and the resulting display appearance]
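The Devanagari cases above can be spelled out as code point sequences. The sketch below is illustrative only: the particular sequences (ka + virama, ka + virama + ssa for ksha, and ka + virama followed by a joining control) are our own choice of examples, not a copy of Tables 9-1 and 9-2, and the actual display depends on the font and rendering engine.

```python
import unicodedata

KA = "\u0915"       # DEVANAGARI LETTER KA
VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA (halant)
SSA = "\u0937"      # DEVANAGARI LETTER SSA
ZWJ = "\u200d"      # ZERO WIDTH JOINER
ZWNJ = "\u200c"     # ZERO WIDTH NON-JOINER

examples = {
    "consonant + halant": KA + VIRAMA,
    "ksha ligature": KA + VIRAMA + SSA,
    "with joiner": KA + VIRAMA + ZWJ,
    "with non-joiner": KA + VIRAMA + ZWNJ,
}


def spell_out(text: str) -> list[str]:
    """Return 'XXXX NAME' for every code point in the text."""
    return [f"{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, text in examples.items():
    print(label, "->", text)
    for line in spell_out(text):
        print("   ", line)
```

Spelling a string out this way is a quick check on which sequence a given piece of text actually contains.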

If you have questions about such sequences that are not answered in the relevant block description of the Unicode Standard, you may be able to get a clarification if you ask on the general Unicode public e-mail list.

Duplicates

In some rare instances, you will find apparently identical characters. In most cases, if not all, this is to maintain compatibility with the original source standards for Unicode: vendor, national, and international character standards in wide usage in 1990. For example, there are duplicate encodings in the following case:

Capital letter A with ring (00C5)
Angstrom sign (212B)

There are also particular shapes of characters that are given separate code points in Unicode, such as the shapes of the Arabic character hah listed above. These were also added to Unicode because of pre-existing standards.

For compatibility with pre-existing standards, some characters can be represented equivalently either as a sequence of code points or as a single code point, called a composite character. For example, the i with two dots in naïve could be represented either as i + diaeresis (0069 0308) or as the composite character i with diaeresis (00EF).

There are other cases where the order of two combining characters does not matter. For example, the pair of combining characters acute and dot-below can occur with either one first; both orders are equivalent. The rules for when order is significant are precisely spelled out by the Unicode Standard.

Because of the requirements for uniqueness, especially on the Internet, Unicode provides a unique format called Form C. This format always picks one of the equivalent code points (or sequences of code points) and not the other. It also picks a specific order where there are alternatives. For more information, see UTR #15: Unicode Normalization Forms.
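The equivalences described in this section can be checked with Python's standard unicodedata module, whose normalize function implements Form C under the name "NFC". This is a minimal sketch; the helper nfc and the sample strings are chosen for this example.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text to Form C."""
    return unicodedata.normalize("NFC", text)


# Duplicate encodings: the Angstrom sign normalizes to A with ring.
print(nfc("\u212b") == "\u00c5")          # True

# Sequence versus composite character: both spellings of "naïve" agree.
print(nfc("nai\u0308ve") == "na\u00efve")  # True

# Order of combining marks: acute (0301) and dot-below (0323) on "a".
acute_first = "a\u0301\u0323"
dot_first = "a\u0323\u0301"
print(acute_first == dot_first)            # False: raw strings differ
print(nfc(acute_first) == nfc(dot_first))  # True: equivalent after Form C
```

Comparing strings only after normalizing both sides is the usual way to treat such equivalent spellings as equal.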

In a very few cases, Unicode separates glyphs as distinct characters on the basis of whether they are treated as letters or not. For example, the following characters are distinguished on this basis, even though the range of possible shapes is the same.

Modifier letter prime (02B9). Treated as a letter. Used to transcribe the "soft" sign in Cyrillic.
Prime (2032). Treated as a punctuation mark or symbol. Used in mathematics, and as a symbol for minutes (fractions of degrees).

In those rare cases where this occurs, to decide which character to use you should consult the text of the Unicode Standard.
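The letter versus non-letter distinction is recorded in each character's General Category, which Python's unicodedata module reports. The sketch below compares the two primes; it is a quick check, not a substitute for the text of the standard, and the helper is_letter is a hypothetical name.

```python
import unicodedata

MODIFIER_LETTER_PRIME = "\u02b9"
PRIME = "\u2032"


def is_letter(ch: str) -> bool:
    """True if the character's General Category starts with L (letter)."""
    return unicodedata.category(ch).startswith("L")


for ch in (MODIFIER_LETTER_PRIME, PRIME):
    print(f"{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), is_letter(ch))
```

Here "Lm" means a modifier letter and "Po" means other punctuation, which matches the distinction drawn above.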

Submissions

Finally, your character may not yet be encoded in Unicode. There is a well-defined submission process for new characters or scripts. This process verifies that the proposed character is in fact a candidate for encoding. In some cases, this process may not be straightforward: for example, Egyptian hieroglyphs have not yet been encoded because there is not yet general agreement on the exact repertoire of characters.

Because the Unicode Standard and ISO 10646 are synchronized in character codes, both organizations need to agree to the encoding of new characters. This process can require some time before a new character is accepted into the standard, and some time beyond that before it is fully supported in products.

   
 
