Essential characteristics of UTF-8
One of the main characteristics of UTF-8 is that it preserves the full range of US-ASCII
characters: each ASCII character is encoded as the same single byte. Thus, when UTF-8 was
introduced, it was already compatible with software and data that processed ASCII characters
exclusively. (See also Adoption of ISO 10646).
UTF-8 uses a varying number of bytes (one to four) to encode the individual characters of the Universal Character Set (UCS).
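The variable length can be observed with a few sample characters. The following is a minimal Python sketch; the helper name `utf8_length` and the chosen sample characters are illustrative assumptions, not part of the original text.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# Sample characters written as escapes so the source file itself stays ASCII.
samples = {
    "A": "LATIN CAPITAL LETTER A (U+0041)",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE (U+00E9)",
    "\u20ac": "EURO SIGN (U+20AC)",
    "\U0001F600": "GRINNING FACE (U+1F600)",
}

for ch, name in samples.items():
    print(f"{name}: {utf8_length(ch)} byte(s)")
```

The ASCII letter needs one byte, the other three characters need two, three and four bytes respectively.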
The byte values C0, C1, and F5 through FF never occur in UTF-8 encoded text.
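A quick plausibility check for such bytes might look like the following Python sketch. It assumes the set of forbidden bytes defined by RFC 3629 (C0, C1, and the whole range F5 through FF); the names `FORBIDDEN_BYTES` and `contains_forbidden_byte` are hypothetical. Note that the absence of these bytes does not by itself prove that data is valid UTF-8.

```python
# Bytes that can never appear in well-formed UTF-8 (assumption: RFC 3629 rules).
FORBIDDEN_BYTES = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))

def contains_forbidden_byte(data: bytes) -> bool:
    """Return True if data contains a byte that UTF-8 never produces."""
    return any(b in FORBIDDEN_BYTES for b in data)

encoded = "h\u00e9llo \u20ac".encode("utf-8")
print(contains_forbidden_byte(encoded))      # well-formed UTF-8
print(contains_forbidden_byte(b"\xc0\x80"))  # overlong encoding of NUL
```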
The boundaries between characters are easily found from any byte in a UTF-8 stream, because continuation bytes always have the bit pattern 10xxxxxx.
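Finding a boundary can be sketched in Python as follows. The sketch relies on the fact that continuation bytes have their two top bits set to 10; the function names `is_continuation` and `char_start` are hypothetical, and the input is assumed to be well-formed UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (0x80 through 0xBF)."""
    return (byte & 0xC0) == 0x80

def char_start(data: bytes, index: int) -> int:
    """Step backwards from index to the first byte of the character containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index

data = "a\u20acb".encode("utf-8")   # bytes: 61 E2 82 AC 62
print(char_start(data, 3))          # index 3 lies inside the euro sign
```

Because a character occupies at most four bytes, the loop steps back at most three times on well-formed input.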
The Boyer-Moore fast search algorithm can be applied directly to the bytes of UTF-8 encoded text.
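As an illustration of byte-level searching, the Python sketch below implements the Boyer-Moore-Horspool simplification (not the full Boyer-Moore algorithm) on raw UTF-8 bytes. The function name `horspool_find` and the sample strings are assumptions made for this example. A match on well-formed UTF-8 always starts at a character boundary, since a lead byte can never be mistaken for a continuation byte.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match of needle, or -1 if absent."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Bad-character table: how far the window may slide when the byte under
    # the last window position is b. Later occurrences overwrite earlier ones,
    # so each byte ends up with its smallest safe shift.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift.get(haystack[pos + m - 1], m)
    return -1

haystack = "na\u00efve caf\u00e9 search".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")
offset = horspool_find(haystack, needle)
print(offset, haystack[offset:offset + len(needle)].decode("utf-8"))
```

The returned value is a byte offset, not a character index; decoding the bytes before the match gives the character position if it is needed.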
BOM (Byte Order Mark)
The BOM for UTF-8 is the byte sequence EF BB BF; however, the Unicode standard does not recommend it.
Even though the BOM is not recommended for UTF-8,
PowerShell scripts are checked for such a BOM, and, especially in conjunction with
COM, it is often very beneficial to use the BOM.
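Detecting, removing and writing the UTF-8 BOM can be sketched in Python as follows. The helper names `has_utf8_bom` and `strip_utf8_bom` are hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data

# The "utf-8-sig" codec writes the BOM when encoding and skips it when decoding.
with_bom = "Write-Host 'hi'".encode("utf-8-sig")
print(has_utf8_bom(with_bom))
print(strip_utf8_bom(with_bom).decode("utf-8"))
print(with_bom.decode("utf-8-sig"))
```

When a file is written for a consumer that expects the BOM, opening it with `encoding="utf-8-sig"` adds the three bytes automatically.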