
Unicode

At its heart, Unicode defines a character set that (hopefully) contains every character someone can possibly think of. Its goal is thus to UNIfy all enCODings (= UNICODE) of the world.
Its code space consists of 1114112 code points (U+0000 to U+10FFFF), most of which are not (yet) assigned to a character.
The same character set is also defined by ISO 10646.
With Unicode, there is no longer a one-to-one relationship between characters and bytes.
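
The size of the code space and the varying number of bytes per character can be checked with a few lines of Python. This is only a sketch: the choice of Python and of the sample characters is an assumption made for illustration.

```python
import sys

# The Unicode code space runs from U+0000 to U+10FFFF.
code_space_size = sys.maxunicode + 1

# One character (code point) needs between one and four bytes in UTF-8.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]   # a, é, €, 😀
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

print(code_space_size)
for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

Other encodings give other byte counts for the same characters, which is why a string's length in characters says little about its size in bytes.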

Goal

The original goal of Unicode was «to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard.»

Standard

The Unicode standard aims at being
On 2021-09-14, Version 14.0 of the Unicode® Standard was announced.
This version consists of
  • The core specification: available as a single PDF document.
  • The code charts: representative glyphs for all Unicode characters, available online.
  • The Unicode Standard Annexes: normative information about particular aspects of the standard.
  • The Unicode Character Database (UCD): normative and informative data for implementers of the standard.
The Unicode standard Version 14.0 contains 144697 characters.
The standard does not only deal with characters but also covers aspects of text manipulation such as
Although the standard comes with representative glyphs, it does not define glyph images. The standard concentrates on interpreting characters, not drawing them.
The latest version of the Unicode Standard is located at http://www.unicode.org/versions/latest/.

Properties of Unicode characters

Unicode does not only define characters, it also assigns properties to them, for example the script a character belongs to or its general category.
Such properties include:
The following properties can be set (think: true) or absent (false) in each character.
Some regular expression implementations support matching characters that have a given property with \p{…}, for example: \p{Block=Greek}, \p{Script=Armenian}, \p{General_Category=Uppercase_Letter}, \p{White_Space}, \p{Alphabetic}, \p{Math}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, \p{Numeric_Value=10}, \p{Hangul_Syllable_Type=Leading_Jamo}, \p{Sentence_Break=SContinue}. (These examples were copied from a comment on a Stackoverflow question.)
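
As far as I know, Python's built-in re module does not understand \p{…}. A few properties can still be queried through the standard unicodedata module, as in the sketch below. The helper name chars_with_category and the sample string are made up for illustration, and the filter only approximates \p{General_Category=…}.

```python
import unicodedata

def chars_with_category(text, category):
    """Hypothetical helper: keep the characters whose general category matches."""
    return [ch for ch in text if unicodedata.category(ch) == category]

sample = "\u00dcber \u03a9 10 \u0663"        # Über Ω 10 ٣

# Roughly what \p{General_Category=Uppercase_Letter} would match
uppercase_letters = chars_with_category(sample, "Lu")

# Roughly what \p{General_Category=Decimal_Number} would match
decimal_digits = chars_with_category(sample, "Nd")

# The numeric value is a property as well
arabic_three = unicodedata.numeric("\u0663")

print(uppercase_letters, decimal_digits, arabic_three)
```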

Extended Grapheme Clusters (Logical Characters)

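Unicode can represent a character with a diacritic (acute, grave, ring etc., such as é) as a sequence of two code points: the base character (e) and a combining mark (here U+0301 COMBINING ACUTE ACCENT). Together they form one extended grapheme cluster, that is, what a user perceives as a single character. Some of these combinations, é among them, also exist as a single precomposed code point (U+00E9).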
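
The two representations of é can be converted into each other with normalization. A minimal Python sketch using the standard unicodedata module (the variable names are illustrative only):

```python
import unicodedata

precomposed = "\u00e9"                                  # é as one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # e + combining acute accent

names = [unicodedata.name(ch) for ch in decomposed]
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(precomposed), len(decomposed))   # number of code points in each form
print(names)
print(recomposed == precomposed)
```

Both strings render as the same logical character, but they compare as unequal unless they are normalized to the same form first.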

Cases

Three cases: upper-, lower- and titlecase.
Changing the case in a string might change the string's length.
ª has no uppercase version.
ª (U+00AA) and ᵃ (U+1D43) are letters and have the Lowercase property, but their general category is not Lowercase_Letter (Ll).
Case mappings might be locale dependent.
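
Some of these points can be observed with Python's built-in string methods, which apply the locale-independent default case mappings. The sketch below is illustrative only; the sample strings are chosen for this note.

```python
street = "Stra\u00dfe"                 # Straße
street_upper = street.upper()          # ß has a two-letter uppercase mapping

ordinal = "\u00aa"                     # ª
ordinal_upper = ordinal.upper()        # no uppercase mapping: stays unchanged

dz_lower = "\u01c6"                    # ǆ, a digraph with three distinct cases
dz_upper = dz_lower.upper()            # Ǆ (U+01C4)
dz_title = dz_lower.title()            # ǅ (U+01C5)

print(len(street), len(street_upper))
print(ordinal_upper == ordinal)
print(f"U+{ord(dz_upper):04X} U+{ord(dz_title):04X}")
```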

Interesting stuff

Some interesting particularities, imho, include
  • In Dutch, ij is considered to be one vowel, and in a capitalized word that starts with these two letters both are capitalized: het IJsselmeer or IJs smelt bij 0 graden Celsius.
  • Turkic languages (Turkish (tr), Azerbaijani (az), Crimean Tatar (crh), Volga Tatar (tt) and Bashkir (ba)) have an i with and without dot: lowercase i and ı, and uppercase İ and I (in English, the uppercase I has no dot while the lowercase i has one).
  • In German, the uppercase of the ß (Eszett) is SS (Straße → STRASSE)
  • In Greek, vowels lose their accent in uppercase (ά - Α), except for the disjunctive eta (ή - Ή)
  • In Greek, Σ is the uppercase version of both σ and ς.
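
The sigma and dotted/dotless i cases can be tried out in Python. Python's str methods are not locale aware, so the Turkic rule has to be applied by hand; the helper upper_tr below is a hypothetical name and a deliberately simplistic sketch, not a complete locale-aware implementation.

```python
SIGMA_UPPER = "\u03a3"    # Σ
SIGMA_SMALL = "\u03c3"    # σ
SIGMA_FINAL = "\u03c2"    # ς

# Both lowercase sigmas share one uppercase form.
both_map_to_upper = SIGMA_SMALL.upper() == SIGMA_FINAL.upper() == SIGMA_UPPER

# Lowercasing picks the final form at the end of a word.
road_lower = "\u039f\u0394\u039f\u03a3".lower()     # ΟΔΟΣ

def upper_tr(text):
    """Hypothetical helper: uppercase with the Turkic rules i -> İ and ı -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

default_upper = "istanbul".upper()      # default mapping: dotless capital I
turkish_upper = upper_tr("istanbul")    # Turkic mapping: dotted capital İ

# Without a locale, İ lowercases to i followed by U+0307 COMBINING DOT ABOVE.
dotted_lower = "\u0130".lower()

print(both_map_to_upper, road_lower, default_upper, turkish_upper, len(dotted_lower))
```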

Scripts

Latin, Cyrillic, Greek, Hiragana, Katakana …

Interesting Characters

ZERO WIDTH SPACE 0x200b, sometimes abbreviated with ZWSP. HTML: &#8203; or &ZeroWidthSpace;
ZERO WIDTH NON-JOINER 0x200c, sometimes abbreviated with ZWNJ. HTML: &#8204; or &zwnj;
ZERO WIDTH JOINER 0x200d, sometimes abbreviated with ZWJ. HTML: &#8205; or &zwj;
REPLACEMENT CHARACTER � (Hex: fffd, Dec: 65533): used to replace an incoming character whose value is unknown or unrepresentable in Unicode.
Miscellaneous Symbols and Pictographs
Letterlike symbols
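
A short Python sketch shows where the replacement character typically comes from and why zero width characters are easy to overlook. The byte string is an invented example of Latin-1 data that is wrongly decoded as UTF-8.

```python
import unicodedata

# 0xE9 is é in Latin-1 but not a valid UTF-8 sequence on its own.
raw = b"caf\xe9"
decoded = raw.decode("utf-8", errors="replace")

# A zero width space is invisible when printed but still counts as a character.
with_zwsp = "a\u200bb"
zwsp_name = unicodedata.name("\u200b")
zwsp_category = unicodedata.category("\u200b")    # Cf = format character

print(decoded, f"U+{ord(decoded[-1]):04X}")
print(len(with_zwsp), zwsp_name, zwsp_category)
```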

HTML Entities

A Unicode code point can be embedded into an HTML document with the hexadecimal numeric character reference notation &#x…;.
<!DOCTYPE html>
<html>
<head>
  <title>Greek Small Letter Alpha</title>
  <style>
     * { font-size: 72px }
  </style>
</head>
<body>

   &#x03B1;

</body>
</html>
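
Such references can also be produced and resolved programmatically. The following Python sketch relies on the standard html module; to_hex_reference is a hypothetical helper name chosen for this note.

```python
import html

def to_hex_reference(ch):
    """Hypothetical helper: hexadecimal character reference for one character."""
    return f"&#x{ord(ch):04X};"

alpha_reference = to_hex_reference("\u03b1")     # reference for α
alpha = html.unescape("&#x03B1;")                # hexadecimal reference to character
zwj = html.unescape("&zwj;")                     # named references are resolved too

print(alpha_reference, alpha, f"U+{ord(zwj):04X}")
```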

Entering Unicode characters

In Windows, the registry value of EnableHexNumpad under HKEY_CURRENT_USER\Control Panel\Input Method can be set to 1, which allows entering Unicode characters by holding the Alt key and typing + followed by the hexadecimal code point on the numeric keypad.

See also

Java class java.lang.String - getBytes
PerlModule: Namespace Unicode
The Perl function ord returns the code point of a given character (on most systems, that is); its inverse function is chr.
The JavaScript method String.fromCharCode creates a string from one or more UTF-16 code units (String.fromCodePoint accepts code points).
Unicode related JavaScript code snippets.
ISO 10646
Find Unicode by visual appearance
Superscript letters
WinAPI: Definition of TCHAR, TEXT etc. depending on UNICODE
collation
The Excel worksheet function unichar()
charmap.exe offers some limited functionality to search for a Unicode character by its name.
Python:
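
Python has built-in counterparts to Perl's ord and chr, and its standard unicodedata module can look up a character by its name, similar to the name search in charmap.exe. A minimal sketch:

```python
import unicodedata

code_point = ord("\u03b1")                       # character to code point
character = chr(0x03B1)                          # code point to character
name = unicodedata.name(character)               # character to name
found = unicodedata.lookup("GREEK SMALL LETTER ALPHA")   # name to character

print(f"U+{code_point:04X}", character, name, found == character)
```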

Links

The Unicode Character Database consists of a number of data files listing Unicode character properties and related data. It also includes data files containing test data for conformance to several important Unicode algorithms.
Unicode Character Code Charts
