Latest Version
Unicode 4.0.0
Unicode 4.0.0 is a
major version of the
Unicode Standard. The text of the standard has been extensively rewritten to
improve its structure and clarity.
The Unicode Standard, Version
4.0, together with the online Unicode Standard Annexes and the Unicode
Character Database, defines Version 4.0 of the Unicode Standard. The book
gives the general principles, requirements for conformance, and guidelines
for implementers, followed by character code charts and names. This book can
be ordered online.
A complete specification of the
contributory files for Unicode 4.0.0 is found on the
Enumerated Versions page. Version 4.0.0 of the Unicode Standard should be
referenced as:
The Unicode Consortium. The
Unicode Standard, Version 4.0.0, defined by: The Unicode Standard,
Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1)
Online Edition
The text of The Unicode
Standard, Version 4.0, as well as the final character code charts, is
available online via the navigation links on this page. These files may not
be printed. The
Unicode
4.0 Web Bookmarks page has links to all sections of the online text.
Major additions to Version 4.0
since Version 3.0 include:
- major changes to the
introductory and conformance chapters, and extensive revisions to the
discussion of punctuation, symbols, and format characters
- extensive additions of CJK
characters to cover dictionaries and historic usage
- many new symbols for
mathematical and technical publication
- many individual characters,
including currency symbols and additions to other scripts such as Indic,
Khmer, Latin, Greek, Arabic, and Syriac
- substantially improved
specification of conformance requirements, incorporating the character
encoding model
- encoding of supplementary
characters
- formalized policies for
stability of the standard
- clarification of semantics of
special characters, including the byte order mark
- major expansion of Unicode
Character Database properties and of specifications for text boundaries and
casing
- more minority scripts,
including Limbu, Tai Le, Osmanya, and Philippine scripts
- more historic scripts,
including Linear B, Cypriot, and Ugaritic
- tightened definition of
encoding terms, including UTF-32
- substantial improvements to
the script descriptions, particularly for Indic scripts and Khmer.
New Characters
1,226 new character assignments
were made to the Unicode Standard, Version 4.0 (over and above what was in
Unicode 3.2). These additions include currency symbols, additional Latin and
Cyrillic characters, the Limbu and Tai Le scripts, Yijing Hexagram symbols,
Khmer symbols, Linear B syllables and ideograms, Cypriot, Ugaritic, and a new
block of variation selectors (especially for future CJK variants). Double
diacritic characters were added for dictionary use.
These new characters extend the
set of modern currency symbols and provide greater coverage of minority
and historical scripts. The following table shows the allocation of code
points in Unicode 4.0.0. For more information on the specific characters, see
the file
DerivedAge.txt in the Unicode
Character Database.
| Graphic | 96,248 |
| Format | 134 |
| Control | 65 |
| Private Use | 137,468 |
| Surrogate | 2,048 |
| Noncharacter | 66 |
| Reserved | 878,083 |
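As a quick sanity check on the table above, the seven counts should add up to the full Unicode codespace of 17 planes of 65,536 code points each. The sketch below verifies that, and also rederives the four categories whose sizes follow directly from fixed code point ranges. The dictionary name `ALLOCATION` and the variable names are illustrative choices, not part of the standard.

```python
# Allocation of code points in Unicode 4.0.0, copied from the table above.
ALLOCATION = {
    "Graphic": 96_248,
    "Format": 134,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 878_083,
}

# The codespace is U+0000..U+10FFFF: 17 planes of 0x10000 code points.
CODESPACE = 17 * 0x10000
total = sum(ALLOCATION.values())

# Categories that are fixed ranges and can be derived arithmetically.
control = len(range(0x00, 0x20)) + len(range(0x7F, 0xA0))   # C0, DEL, and C1
surrogate = 0xDFFF - 0xD800 + 1                             # U+D800..U+DFFF
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0xFFFD + 1)      # BMP PUA + planes 15 and 16
noncharacter = (0xFDEF - 0xFDD0 + 1) + 2 * 17               # U+FDD0..U+FDEF + last two of each plane

print(total, CODESPACE)
print(control, surrogate, private_use, noncharacter)
```

The Graphic, Format, and Reserved counts depend on the actual character assignments, so they cannot be derived this way and are taken from the table as given.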
The character repertoire
corresponds to ISO/IEC 10646:2003. For more details of character counts, see
Appendix D,
Changes from Unicode Version 3.0.
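The DerivedAge.txt file mentioned above lists, for each code point or range, the version in which it was first assigned. A minimal sketch of counting the code points for one version might look like the following. The helper names `parse_age_lines` and `count_for_age` are hypothetical, and `SAMPLE` is a small illustrative excerpt written in the file's "range ; age # comment" layout (with simplified comments), not the real file, so its total is not the 1,226 figure quoted above.

```python
# Illustrative sample only: three ranges first assigned in Version 4.0,
# written in the layout used by DerivedAge.txt. Comments are simplified.
SAMPLE = """\
# code point or range ; age # comment
0600..0603    ; 4.0 # Arabic prefix format control characters
4DC0..4DFF    ; 4.0 # Yijing Hexagram Symbols
E0100..E01EF  ; 4.0 # Variation Selectors Supplement
"""


def parse_age_lines(text):
    """Yield (first, last, age) for each data line, skipping comments and blanks."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        span, age = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        # A single code point has no "..", so it is its own range end.
        yield int(first, 16), int(last or first, 16), age


def count_for_age(text, age):
    """Count the code points whose Age value equals `age`."""
    return sum(last - first + 1
               for first, last, a in parse_age_lines(text)
               if a == age)


print(count_for_age(SAMPLE, "4.0"))  # 4 + 64 + 240 = 308 for this sample
```

Run against the complete DerivedAge.txt from the Unicode Character Database, the same function would give the per-version totals.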
Unicode Character Database
Unicode Version 4.0.0 introduced
the concept of provisional properties, clarified the relationships between
properties, and provided precisely defined fallback properties for characters
not explicitly defined in the data files. The documentation was coalesced
into
UCD.html, with a combined list of
Properties.
Other property changes include:
- Prefix Format Control.
U+06DD arabic end of ayah and
U+070F syriac abbreviation mark
were reclassified and have significantly different behavior as prefix
format control characters. The new characters U+0600..U+0603 were given
this behavior as well.
- New Properties. The
Hangul Syllable Type property and the identifier property Other_ID_Start
were added. The Unicode Radical Stroke property was classified as
informative; all other Unihan properties were classified as provisional.
PropertyValueAliases.txt now also lists block names.
- Numeric Properties.
CJK numeric values were added; the General Category Decimal Number (Nd) and
the Numeric Type decimal digit were aligned in value.
- Default Ignorables.
Added Hangul Filler characters, U+00AD
soft hyphen, CGJ, and ZWS.
- Soft Hyphen. U+00AD
soft hyphen was also changed
to General Category Cf. Its semantics were clarified: it marks a position
for hyphenation, rather than being itself a hyphen character. (The Hyphen
property itself was stabilized, and thus not changed to reflect this.)
- Modifier Letters. The
General Category of U+02B9..U+02BA, U+02C6..U+02CF changed to General
Category Lm.
- Grapheme_Extend. The
halfwidth katakana marks, and most combining marks (except as needed
for canonical equivalence) were removed.
- Mongolian Vowel Separator.
U+180E mongolian vowel separator
was changed to General Category Zs.
- Deprecated Characters.
Two Khmer characters, U+17A3 khmer
independent vowel qaq and U+17D3
khmer sign bathamasat, were
deprecated. Four others are strongly discouraged.
- Enclosing combining marks.
The scope has been defined more clearly.
- ZWJ. The semantics
for cursive scripts have been revised.
- Normalization Corrections.
There were corrections for the characters U+2F868, U+2F874, U+2F91F,
U+2F95F, and U+2F9BF.
For more information, see the
file
UCD.html in the Unicode Character
Database.
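Several of the property changes listed above can be inspected with Python's standard `unicodedata` module. This is only a sketch: the module reflects whichever Unicode version ships with the interpreter, not Version 4.0.0, so properties that changed again later (for example the General Category of U+180E) may differ from what this page describes. The checks below are limited to values that are expected to match in current versions.

```python
import sys
import unicodedata

# General Category of characters reclassified in Unicode 4.0.0.
for cp in (0x00AD, 0x02B9, 0x06DD, 0x070F):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")

# CJK numeric values: ideographic numerals carry a numeric value.
five = unicodedata.numeric("\u4e94")  # U+4E94, the ideograph for "five"
print(five)

# Alignment of Decimal Number (Nd) with the decimal digit numeric type:
# collect any Nd character that lacks a decimal digit value.
nd_without_decimal = [
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
    and unicodedata.decimal(chr(cp), None) is None
]
print(len(nd_without_decimal))
```

`unicodedata.unidata_version` reports which version of the Unicode Character Database the interpreter is using.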
Conformance
Chapter 3 was substantially
improved by incorporating the Unicode Character Encoding Model, resulting in
fully specified definitions and conformance requirements of UTF-8, UTF-16,
and UTF-32. As part of this, the related concept of a Unicode string is
defined: a sequence of code units used for internal processing, which is
not necessarily a valid sequence in a Unicode encoding form.
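To illustrate the three encoding forms, the sketch below encodes one supplementary character, U+10000, and shows the resulting bytes. It also shows an ill-formed case. Note the assumption involved: a Python `str` holds code points rather than code units, so a lone surrogate code point is used here as a stand-in for an internal string that is not valid in any encoding form. The helper name `is_well_formed` is hypothetical.

```python
# U+10000 is the first supplementary code point (a Linear B syllable).
ch = "\U00010000"

utf8 = ch.encode("utf-8")       # four 8-bit code units
utf16 = ch.encode("utf-16-be")  # a surrogate pair: two 16-bit code units
utf32 = ch.encode("utf-32-be")  # one 32-bit code unit

print(utf8.hex(), utf16.hex(), utf32.hex())


def is_well_formed(s):
    """Return True if `s` can be encoded as well-formed UTF-8."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A lone surrogate can sit in a string used for internal processing,
# but it cannot be serialized in a conformant encoding form.
lone_surrogate = "\ud800"
print(is_well_formed(ch), is_well_formed(lone_surrogate))
```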
Clearer terminology was
introduced for code point assignments, including the seven main categories
given in the above table. The conformance status of UAXes, UTSes and UTRs was
also clarified. In addition:
- Identifiers. A
structure for ensuring backwards-compatible programming language
identifiers was introduced using the new property Other_ID_Start. There is
also an alternate definition for complete stability of identifiers.
- Bidi. The bidi
algorithm was updated and moved to UAX #9 (see below).
- Line Breaking and
Boundaries. U+00AD soft hyphen was reclassified. Text boundaries were
clarified.
- Case Folding. The text
from UAX #21, “Case Mappings,” was incorporated and updated for case
folding and other new properties. The definition of titlecase uses word
boundaries, and there is a clearer definition of string functions:
- isUpper(), isLower(),
isTitle(), isFold()
- toUpper(), toLower(),
toTitle(), toFold()
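The string functions named above have rough counterparts in Python's built-in `str` methods, which makes for a convenient sketch. Two assumptions are worth stating: `str.title()` uses its own simple notion of word starts rather than the word boundaries the standard refers to, and `is_fold` below is defined simply as "folding leaves the string unchanged", ignoring the normalization step a full definition would involve. The wrapper names are hypothetical.

```python
def to_upper(s):
    return s.upper()


def to_lower(s):
    return s.lower()


def to_title(s):
    # Python's own word detection, not the standard's word boundaries.
    return s.title()


def to_fold(s):
    return s.casefold()


def is_fold(s):
    # Simplified test: a string is case-folded if folding is a no-op.
    return to_fold(s) == s


word = "Stra\u00dfe"            # "Straße"
print(to_upper(word))           # full case mapping expands the sharp s
print(to_lower(word))
print(to_fold(word))
print(to_title("\u01c6"))       # the dz-with-caron digraph has a distinct titlecase form
print(is_fold(word), is_fold(to_fold(word)))
```

Case folding differs from lowercasing: it is intended for caseless comparison, which is why the sharp s folds to "ss" while lowercasing leaves it alone.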
Unicode Standard Annexes
The following Unicode Standard
Annex was added:
- UAX #29: Text Boundaries
- Now contains information on
text boundary conditions formerly published in Chapter 5 of The Unicode
Standard, Version 3.0.
- Provides default
definitions for grapheme cluster ('user character'), word, and sentence
boundaries
The following Unicode Standard
Annexes were updated:
- UAX #9: The Bidirectional
Algorithm
- Now contains information on
the bidirectional algorithm formerly published in Chapter 3 of The
Unicode Standard, Version 3.0.
- Canonical equivalence is
now preserved (a data change, not an algorithm change)
- Shaping is done after
reordering, but not across directional boundaries
- There were clarifications
of: ZWJ, ZWNJ, and intermediate level processing
- UAX #14: Line Breaking
Properties
- Negative numbers and dates
with hyphens will not break across lines
- Word-Joiner will link any
characters (except hard line breaks)
- The behavior of soft hyphen
is clarified (it marks an opportunity for breaking, not specific graphic
appearance)
- The rules for GL are
relaxed: SP and ZW override GL
- There are new property
values: NL, WJ
- UAX #15: Unicode
Normalization Forms
- There is a description of
Stable Code Points, and the notation NFC(x) and isNFC(x)
- Annex 12: Corrigenda was
rewritten for clarity, and to describe the use of Normalization
Corrections.
- Annex 13: Canonical
Equivalence was added
- UAX #11: East Asian Width
- Extended the range for the
default property value to 30000–3FFFD.
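The NFC(x) and isNFC(x) notation described for UAX #15 above maps naturally onto Python's `unicodedata` module. The sketch below assumes Python 3.8 or later for `unicodedata.is_normalized`; the wrapper names `nfc` and `is_nfc` are illustrative only.

```python
import unicodedata


def nfc(x):
    """NFC(x): the Normalization Form C of string x."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x):
    """isNFC(x): True when x is already in Normalization Form C."""
    return unicodedata.is_normalized("NFC", x)


decomposed = "e\u0301"      # "e" followed by a combining acute accent
composed = nfc(decomposed)  # composes to the single code point U+00E9

print(len(decomposed), len(composed))
print(is_nfc(decomposed), is_nfc(composed))
```

Normalizing is idempotent, so `nfc(nfc(x)) == nfc(x)` holds for any string, and `is_nfc(x)` is equivalent to `nfc(x) == x`.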
The following Unicode Technical
Report was upgraded in status to a Unicode Standard Annex:
- UAX #24: Script Names
- Added notes on the
stability of Q names, the usage of Mn, Me characters, and scripts with
regard to spoofing.
- Added Braille.
The following Unicode Standard Annexes
were superseded as a result of their incorporation into the text of this
book:
- UAX #13: Unicode Newline
Guidelines
- UAX #19: UTF-32
- UAX #21: Case Mappings
- UAX #27: Unicode 3.1
- UAX #28: Unicode 3.2
Errata
Errata incorporated into Unicode
4.0 are listed by date in a
separate table. For corrigenda and errata after the release of Unicode
4.0, see the list of current Updates
and Errata.