At its heart, Unicode defines a character set that (hopefully) contains every character anyone can think of. Its goal is thus to UNIfy all enCODings (hence UNICODE) of the world.
The same character set is also defined by ISO 10646.
With Unicode, there is no longer a one-to-one relationship between characters and bytes: depending on the encoding, a single character may occupy one or several bytes.
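The loss of the one-to-one relationship can be illustrated with a small Python sketch. It assumes UTF-8 as the encoding (other encodings give different byte counts), and the helper name utf8_length is made up for this example.

```python
def utf8_length(text: str) -> int:
    """Return the number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))

# One code point each, but a different number of UTF-8 bytes:
samples = [
    "a",           # U+0061 LATIN SMALL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # U+20AC EURO SIGN
    "\U0001F600",  # U+1F600 GRINNING FACE
]

for s in samples:
    # ascii() keeps the output printable on any terminal
    print(ascii(s), "code points:", len(s), "UTF-8 bytes:", utf8_length(s))
```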
Goal
The original goal of Unicode was «to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard.»
Standard
The Unicode standard aims at being
Universal
Efficient
Unambiguous
On 2021-09-14, Version 14.0 of the Unicode® Standard was announced.
The standard is accompanied by normative and informative data for implementers.
The Unicode standard Version 14.0 contains 144697 characters.
The standard does not only deal with characters but also covers aspects of text manipulation such as
Dividing words
Breaking lines
Formatting of numbers, dates, times etc. according to a locale
Writing left-to-right and right-to-left
Special features of written Asian languages
Security issues such as similar-looking characters
Although the standard comes with representative glyphs, it does not define glyph images. The standard concentrates on interpreting characters, not drawing them.
Unicode is not only the definition of the characters, but also assigns properties to characters that are dependent on the region where these characters are used.
Such properties include:
How characters are sorted
What a lowercase character's uppercase version is, and vice versa
The following general category properties are either set (think: true) or absent (false) for each character.
Cased Letter (set if any of Uppercase, Lowercase or Titlecase Letter is set)
Uppercase Letter
Lowercase Letter
Titlecase Letter
Modifier Letter
Other Letter
Mark
Nonspacing Mark
Spacing Mark
Enclosing Mark
Number
Decimal Number (also Digit)
Letter Number
Other Number
Punctuation (also Punct)
Connector Punctuation
Dash Punctuation
Open Punctuation
Close Punctuation
Initial Punctuation (Behaves either as Open Punctuation or Close Punctuation, depending on usage)
Final Punctuation (Behaves either as Open Punctuation or Close Punctuation, depending on usage)
Other Punctuation
Symbol
Math Symbol - not all math symbols are visible.
Currency Symbol
Modifier Symbol
Other Symbol
Separator
Space Separator
Line Separator
Paragraph Separator
Other
Control (also Cntrl)
Format
Surrogate
Private Use
Unassigned
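As a rough illustration, Python's standard unicodedata module reports the general category of a character using the two-letter short aliases of the names listed above (for example Lu for Uppercase Letter, Zs for Space Separator). The helper general_categories is a name invented for this sketch.

```python
import unicodedata

def general_categories(text: str) -> list[str]:
    """Return the two-letter general category of every code point in text."""
    return [unicodedata.category(ch) for ch in text]

# A (Lu), a (Ll), U+01C5 (Lt), 5 (Nd), U+0301 combining acute (Mn),
# space (Zs), $ (Sc), U+200D zero width joiner (Cf)
sample = "Aa\u01c55\u0301 $\u200d"

for ch, cat in zip(sample, general_categories(sample)):
    print(f"U+{ord(ch):04X}", cat)
```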
Some regular expression implementations support matching characters that have a given property with \p{…}, for example: \p{Block=Greek}, \p{Script=Armenian}, \p{General_Category=Uppercase_Letter}, \p{White_Space}, \p{Alphabetic}, \p{Math}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, \p{Numeric_Value=10}, \p{Hangul_Syllable_Type=Leading_Jamo}, \p{Sentence_Break=SContinue}. (These examples were copied from a comment on a Stack Overflow question.)
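Where \p{…} is not available (to my knowledge, Python's built-in re module does not accept it, while the third-party regex package does), a property test can be approximated by filtering on the general category. The following is only a sketch of that idea, with a made-up helper name, and it covers general categories only, not blocks, scripts or the other properties shown above.

```python
import unicodedata

def chars_in_category(text: str, category: str) -> list[str]:
    """Return the characters of text whose general category starts with category.

    "Lu" selects uppercase letters only; a one-letter prefix such as "N"
    selects a whole major class (Nd, Nl and No).
    """
    return [ch for ch in text if unicodedata.category(ch).startswith(category)]

text = "\u00dcn\u00efcode \u03a9 42"   # "Ünïcode Ω 42"

# Roughly what \p{General_Category=Uppercase_Letter} would match
print([ascii(ch) for ch in chars_in_category(text, "Lu")])
# Roughly what \p{Number} would match
print(chars_in_category(text, "N"))
```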
Extended Grapheme Clusters (Logical Characters)
Unicode allows a character with a diacritic (ring, grave accent etc., such as é) to be represented as two characters: the base character (e) followed by a combining mark for the diacritic. Together they form one logical character.
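A minimal sketch of this in Python, assuming the standard unicodedata module: the same visible é exists as one precomposed code point and as a base letter plus a combining mark, and normalization converts between the two forms.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both render as the same logical character, but they differ as strings.
print(len(composed), len(decomposed))
print(composed == decomposed)

# NFD splits into base character plus combining mark, NFC recombines them.
as_nfd = unicodedata.normalize("NFD", composed)
as_nfc = unicodedata.normalize("NFC", decomposed)
print(as_nfd == decomposed, as_nfc == composed)
```

Comparing strings only after normalizing both to the same form avoids treating the two representations as different text.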
Cases
There are three cases: upper-, lower- and titlecase.
Changing the case in a string might change the string's length.
ª has no uppercase version.
ᵃ (U+1D43) and ᴬ (U+1D2C) are letters and are lowercase, but they are not lowercase letters: their general category is Modifier Letter.
In Dutch, ij is considered to be one vowel, and a capitalized word starting with these two letters capitalizes both: het IJsselmeer or IJs smelt bij 0 graden Celsius.
Turkic languages (Turkish (tr), Azerbaijani (az), Crimean Tatar (crh), Volga Tatar (tt) and Bashkir (ba)) have an i with and without a dot: lowercase i and ı, and uppercase İ and I (in English, the uppercase I has no dot while the lowercase i has one).
In German, the uppercase of the ß (Eszett) is SS (Straße becomes STRASSE)
In Greek, vowels lose their accent in uppercase (ά - Α), except for the disjunctive eta (ή - Ή)
In Greek, Σ is the uppercase version of both σ and ς.
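Some of these cases can be observed with Python's built-in string methods, which apply the default, locale-independent Unicode case mappings. This is only a sketch: the language-specific rules above (Dutch IJ, Turkic dotted and dotless i, Greek accent removal) are not applied by these methods.

```python
street = "Stra\u00dfe"                 # Straße
upper_street = street.upper()          # ß becomes SS, so the string grows
print(upper_street, len(street), len(upper_street))

# Both lowercase sigmas share one uppercase form.
sigma_upper = "\u03c3".upper()         # σ
final_sigma_upper = "\u03c2".upper()   # ς
print(sigma_upper == final_sigma_upper == "\u03a3")

# ª (U+00AA) has no uppercase version and is returned unchanged.
ordinal_upper = "\u00aa".upper()
print(ordinal_upper == "\u00aa")

# İ (U+0130) lowercases to i plus COMBINING DOT ABOVE: one code point becomes two.
dotted_lower = "\u0130".lower()
print(ascii(dotted_lower), len(dotted_lower))

# The digraph ǆ (U+01C6) has distinct upper- and titlecase forms.
dz = "\u01c6"
print(ascii(dz.upper()), ascii(dz.title()))
```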
Scripts
Latin, Cyrillic, Greek, Hiragana, Katakana …
Interesting Characters
ZERO WIDTH SPACE
0x200b, sometimes abbreviated as ZWS. HTML: &#8203; or &ZeroWidthSpace;
ZERO WIDTH NON-JOINER
0x200c, sometimes abbreviated as ZWNJ. HTML: &#8204; or &zwnj;
ZERO WIDTH JOINER
0x200d, sometimes abbreviated as ZWJ. HTML: &#8205; or &zwj;
REPLACEMENT CHARACTER
� (Hex: fffd, Dec: 65533): used to replace an incoming character whose value is unknown or unrepresentable in Unicode.
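The characters in this section are invisible or easy to confuse, so it can help to look them up by code point. The sketch below uses Python's unicodedata module for the names and shows one common way the replacement character appears in practice: decoding invalid bytes with errors="replace". The byte string is an arbitrary example.

```python
import unicodedata

# Official names of the code points discussed above
for ch in "\u200b\u200c\u200d\ufffd":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# 0xFF can never occur in valid UTF-8, so the decoder substitutes U+FFFD for it.
decoded = b"caf\xff".decode("utf-8", errors="replace")
print(ascii(decoded))
print(ord(decoded[-1]))
```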
The Unicode Character Database consists of a number of data files listing Unicode character properties and related data. It also includes data files containing test data for conformance to several important Unicode algorithms.
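Many programming languages ship a subset of this database. As one example, Python's standard unicodedata module exposes some of the character properties. The sketch below queries a few of them; the Unicode version reported depends on the Python release in use, so no particular value is assumed. The helper ucd_summary is a name invented for this example.

```python
import unicodedata

def ucd_summary(ch: str) -> dict:
    """Collect a few Unicode Character Database properties of one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
        "numeric_value": unicodedata.numeric(ch, None),  # None if not numeric
    }

print("UCD version bundled with this Python:", unicodedata.unidata_version)
print(ucd_summary("\u00bd"))   # ½ VULGAR FRACTION ONE HALF
print(ucd_summary("\u05d0"))   # א HEBREW LETTER ALEF
print(ucd_summary("\u0301"))   # COMBINING ACUTE ACCENT
```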