Unicode

Goal

The original goal of Unicode was «to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard.»

Standard

The Unicode Standard aims to be

Universal
Efficient
Unambiguous

On 2021-09-14, Version 14.0 of the Unicode® Standard was announced.

This version consists of

The core specification	Available as a single PDF document.
The code charts	Representative glyphs for all Unicode characters, available online.
The Unicode Standard Annexes	Normative information about particular aspects of the standard.
The Unicode Character Database (UCD)	Normative and informative data for implementers of the standard.

The Unicode standard Version 14.0 contains 144697 characters.

The standard does not only deal with characters but also covers aspects of text manipulation such as

Dividing words
Breaking lines
Formatting of numbers, dates, times etc. according to a locale
Writing left-to-right and right-to-left
Specialities of written Asian languages
Security issues such as similar-looking (confusable) characters

Although the standard comes with representative glyphs, it does not define glyph images. The standard concentrates on interpreting characters, not drawing them.

The latest version of the Unicode Standard is located at http://www.unicode.org/versions/latest/.

Properties of Unicode characters

Unicode not only defines the characters, but also assigns properties to them. Some of these properties depend on the region (locale) where the characters are used.

Such properties include:

How characters are sorted
What a lowercase character's uppercase version is, and vice versa

The following general category properties are either set (think: true) or absent (false) for each character.

Cased Letter (set if any of Uppercase, Lowercase or Titlecase Letter is set)
Uppercase Letter
Lowercase Letter
Titlecase Letter
Modifier Letter
Other Letter
Mark
Nonspacing Mark
Spacing Mark
Enclosing Mark
Number
Decimal Number (also Digit)
Letter Number
Other Number
Punctuation (also Punct)
Connector Punctuation
Dash Punctuation
Open Punctuation
Close Punctuation
Initial Punctuation (Behaves either as Open Punctuation or Close Punctuation, depending on usage)
Final Punctuation (Behaves either as Open Punctuation or Close Punctuation, depending on usage)
Other Punctuation
Symbol
Math Symbol - not all math symbols are visible.
Currency Symbol
Modifier Symbol
Other Symbol
Separator
Space Separator
Line Separator
Paragraph Separator
Other
Control (also Cntrl)
Format
Surrogate
Private Use
Unassigned
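
The categories listed above can be inspected with Python's standard unicodedata module, which returns the two-letter abbreviation of a character's general category. The following is only a small sketch; the sample characters were picked for illustration and are not part of the original list.

```python
import unicodedata

# Sample characters paired with the general category expected for them.
samples = {
    "A": "Lu",       # Uppercase Letter
    "a": "Ll",       # Lowercase Letter
    "\u01c5": "Lt",  # Titlecase Letter (ǅ)
    "\u0301": "Mn",  # Nonspacing Mark (combining acute accent)
    "5": "Nd",       # Decimal Number
    "\u2167": "Nl",  # Letter Number (Ⅷ)
    "\u00bd": "No",  # Other Number (½)
    "_": "Pc",       # Connector Punctuation
    "\u00ab": "Pi",  # Initial Punctuation («)
    "+": "Sm",       # Math Symbol
    "$": "Sc",       # Currency Symbol
    " ": "Zs",       # Space Separator
    "\n": "Cc",      # Control
    "\u200d": "Cf",  # Format (zero width joiner)
    "\ue000": "Co",  # Private Use
}

# Look up the category that the Unicode Character Database assigns.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {cat}")
```

The first letter of each abbreviation names the group (L for Letter, N for Number and so on), the second letter the subcategory.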

Some regular expression implementations support matching characters that have a given property with \p{…}, for example: \p{Block=Greek}, \p{Script=Armenian}, \p{General_Category=Uppercase_Letter}, \p{White_Space}, \p{Alphabetic}, \p{Math}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, \p{Numeric_Value=10}, \p{Hangul_Syllable_Type=Leading_Jamo}, \p{Sentence_Break=SContinue}. (These examples were copied from a comment on a Stack Overflow question.)
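
Where \p{…} is not available, matching on the general category can be approximated with a plain filter. The sketch below uses a hypothetical helper, find_by_category, that is not part of any library; it covers only General_Category, not properties such as Script or Block.

```python
import unicodedata

def find_by_category(text: str, prefix: str) -> list[str]:
    """Return the characters of text whose general category starts with prefix.

    "Lu" selects uppercase letters only, "N" selects every kind of number.
    """
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]

text = "Hello W\u00f6rld \u03a9 42!"   # "Hello Wörld Ω 42!"

uppercase = find_by_category(text, "Lu")  # roughly \p{General_Category=Uppercase_Letter}
numbers = find_by_category(text, "N")     # roughly \p{Number}

print(uppercase, numbers)
```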

Extended Grapheme Clusters (Logical Characters)

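Unicode can represent a character that carries a diacritical mark (a ring, a grave accent etc., as in é) as two characters: the base character (e) followed by a combining mark. Together, they form one extended grapheme cluster, which is what a user perceives as a single character.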
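
The base-plus-mark view can be made visible with Unicode normalization. The following sketch assumes Python's unicodedata module: the NFD form splits é into two code points, the NFC form composes them back into one.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFD decomposes into base character and mark, NFC composes them again.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)

print(len(precomposed), len(nfd))   # number of code points in each form
print(precomposed == decomposed)    # the raw strings differ
print(nfc == precomposed)           # they compare equal after normalization
```

Both strings are rendered the same way, which is why comparing text without normalizing it first can give surprising results.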

Cases

There are three cases: upper-, lower- and titlecase.

Changing the case in a string might change the string's length.
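
A few characters whose case mapping expands to more than one code point, sketched with Python's built-in string methods (which apply the locale-independent Unicode mappings):

```python
eszett = "\u00df"        # ß, LATIN SMALL LETTER SHARP S
fi_ligature = "\ufb01"   # ﬁ, LATIN SMALL LIGATURE FI
dotted_i = "\u0130"      # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE

upper_eszett = eszett.upper()        # two code points
upper_fi = fi_ligature.upper()       # two code points
lower_dotted_i = dotted_i.lower()    # i followed by COMBINING DOT ABOVE

for before, after in [(eszett, upper_eszett),
                      (fi_ligature, upper_fi),
                      (dotted_i, lower_dotted_i)]:
    print(len(before), "->", len(after))
```

Code that allocates a buffer of the original length before converting the case therefore cannot be correct in general.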

ª has no uppercase version.

ᵃ (U+1D43) and ᴬ (U+1D2C) are letters and are lowercase, but they are not lowercase letters: their general category is Modifier Letter.
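
The distinction between being lowercase and being a Lowercase Letter shows up in the general category. The sketch below takes the modifier letters U+1D43 and U+1D2C as examples (an assumption about which characters are meant here), alongside ª and a plain a for comparison.

```python
import unicodedata

chars = ["\u00aa", "\u1d43", "\u1d2c", "a"]

# Code point, official name and general category of each character.
info = [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
        for c in chars]

for code_point, name, category in info:
    print(code_point, category, name)

# ª has no uppercase mapping, so upper() returns it unchanged.
ordinal_upper = "\u00aa".upper()
```

Only the plain a reports the category Ll; the other three are letters of a different category.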

The case might be locale dependent.

Interesting stuff

Some interesting particularities, imho, include

In Dutch, ij is considered to be one vowel, and a capitalized word starting with these two letters capitalizes both: het IJsselmeer or IJs smelt bij 0 graden Celsius.
Turkic languages (Turkish (tr), Azerbaijani (az), Crimean Tatar (crh), Volga Tatar (tt) and Bashkir (ba)) have an i with and without dot: lowercase i and ı, and uppercase İ and I (in English, the uppercase I has no dot while the lowercase i has one).
In German, the uppercase of the ß (Eszett) is SS (Straße → STRASSE)
In Greek, vowels lose their accent in uppercase (ά → Α), except for the disjunctive eta (ή → Ή)
In Greek, Σ is the uppercase version of both σ and ς.
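
Some of these particularities can be observed with Python's string methods. This is only a sketch: Python applies the default, locale-independent Unicode case mappings, so the Turkic and Dutch rules above are not applied and need a locale-aware library.

```python
small_sigma = "\u03c3"     # σ
final_sigma = "\u03c2"     # ς
capital_sigma = "\u03a3"   # Σ

# Both lowercase sigmas map to the same uppercase letter.
upper_of_small = small_sigma.upper()
upper_of_final = final_sigma.upper()

# Lowercasing a word picks the final form at the end of the word.
word = "\u039f\u0394\u039f\u03a3"   # ΟΔΟΣ
lower_word = word.lower()           # οδος, ending in ς

# German: ß expands to SS.
strasse = "Stra\u00dfe".upper()

# Default mapping: I and i are paired, regardless of any Turkic locale.
lower_i = "I".lower()

print(upper_of_small, upper_of_final, lower_word, strasse, lower_i)
```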

Scripts

Latin, Cyrillic, Greek, Hiragana, Katakana …

Interesting Characters

ZERO WIDTH SPACE	0x200b, sometimes abbreviated as ZWSP. HTML: `&#8203;` or `&ZeroWidthSpace;`
ZERO WIDTH NON-JOINER	0x200c, sometimes abbreviated as ZWNJ. HTML: `&#8204;` or `&zwnj;`
ZERO WIDTH JOINER	0x200d, sometimes abbreviated as ZWJ. HTML: `&#8205;` or `&zwj;`
REPLACEMENT CHARACTER	� (Hex: fffd, Dec: 65533): used to replace an incoming character whose value is unknown or unrepresentable in Unicode.
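
The characters in this list are easy to miss because most of them are invisible. A minimal sketch, assuming Python's unicodedata module, that looks up their names and shows one way the replacement character typically ends up in a string:

```python
import unicodedata

# Map each code point to its official Unicode name.
names = {f"U+{ord(c):04X}": unicodedata.name(c)
         for c in "\u200b\u200c\u200d\ufffd"}

for code_point, name in names.items():
    print(code_point, name)

# A zero width space is invisible but still counts as a character.
with_zwsp = "a\u200bb"
print(len(with_zwsp))

# Decoding an invalid byte with errors="replace" yields U+FFFD.
decoded = b"caf\xff".decode("utf-8", errors="replace")
print(ord(decoded[-1]))
```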

Miscellaneous Symbols and Pictographs

Letterlike symbols

HTML Entities

A Unicode code point can be embedded into an HTML document with the hexadecimal numeric character reference notation &#x…;.

<!DOCTYPE html>
<html>
<head>
  <title>Greek Small Letter Alpha</title>
  <style>
     * { font-size: 72px }
  </style>
</head>
<body>

   &#x03B1;

</body>
</html>
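
The reference used in the page above can be produced and resolved programmatically. The following sketch uses Python's standard html module; to_hex_reference is a hypothetical helper written for this example.

```python
import html

def to_hex_reference(ch: str) -> str:
    """Format a single character as a hexadecimal character reference."""
    return f"&#x{ord(ch):04X};"

# Character -> reference
reference = to_hex_reference("\u03b1")   # Greek small letter alpha

# Reference -> character (numeric and named forms)
alpha = html.unescape("&#x03B1;")
zwnj = html.unescape("&zwnj;")

print(reference, alpha, f"U+{ord(zwnj):04X}")
```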

Entering Unicode characters

In Windows, the registry value EnableHexNumpad under HKEY_CURRENT_USER\Control Panel\Input Method can be set to 1, which allows Unicode characters to be entered by holding the Alt key, pressing + on the numeric keypad and typing the hexadecimal code point.

Links

The Unicode Character Database consists of a number of data files listing Unicode character properties and related data. It also includes data files containing test data for conformance to several important Unicode algorithms.

Unicode Character Code Charts