
UTF-8

UTF-8 is a Unicode encoding that uses 1 to 4 bytes to represent a Unicode character.
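As a quick illustration of the 1 to 4 byte range, the following sketch encodes one sample character from each length class with Python's built-in `str.encode`. The four sample characters are chosen here for illustration only and are not part of the original notes.

```python
# Sample characters picked for illustration: one per UTF-8 length class.
samples = ["A", "é", "€", "😀"]

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {n} byte(s): {encoded.hex(' ')}")
```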

Essential characteristics of UTF-8

One of the main characteristics of UTF-8 is that it preserves the full range of the US-ASCII characters. Thus, when UTF-8 was introduced, it was already compatible with software and data that processed ASCII characters exclusively. (See also Adoption of ISO 10646.)
UTF-8 uses a varying number of bytes to encode the individual characters of the Universal Character Set (UCS).
The byte values C0, C1 and F5 through FF never occur in UTF-8 encoded text.
The boundaries between characters are easily found from any byte in a UTF-8 stream: continuation bytes always have the form 10xxxxxx, so stepping back over them leads to the first byte of the character.
The Boyer-Moore fast search algorithm can be applied directly to UTF-8 encoded bytes.
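The last three points can be tried out with a few lines of Python. This is a minimal sketch, not a reference implementation: the helper names `is_continuation` and `char_start` and the sample text are made up for this example. It uses `bytes.find` as a stand-in for any byte-wise search (Boyer-Moore works on the same bytes), relying on the fact that a well-formed UTF-8 needle can only match at a character boundary of well-formed UTF-8 text.

```python
# Byte values that never occur in well-formed UTF-8 (see RFC 3629):
# C0, C1 and F5 through FF.
NEVER_VALID = {0xC0, 0xC1} | set(range(0xF5, 0x100))


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character containing it."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


# Hypothetical sample text mixing 1-, 2- and 3-byte characters.
data = "naïve café, 5 €".encode("utf-8")

# Position 3 is the second byte of "ï"; its character starts at position 2.
print(char_start(data, 3))

# A plain byte-wise search finds "é" on a character boundary.
hit = data.find("é".encode("utf-8"))
print(hit, char_start(data, hit) == hit)

# None of the forbidden byte values appears in the encoded text.
print(NEVER_VALID.isdisjoint(data))
```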

Byte format types

from      to        byte sequence
0x00      0x7f      0xxxxxxx
0x80      0x7ff     110xxxxx 10xxxxxx
0x800     0xffff    1110xxxx 10xxxxxx 10xxxxxx
0x10000   0x10ffff  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
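The table can be turned into a small hand-written encoder. The sketch below is illustrative only (Python's own `str.encode("utf-8")` is what one would normally use); the function name `encode_code_point` is an assumption of this example. It also rejects the surrogate range D800 to DFFF, which the table does not show but which RFC 3629 excludes from UTF-8.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the byte format table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are not encodable in UTF-8 (RFC 3629).
        raise ValueError(f"surrogate code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare the hand-written encoder with the built-in codec.
for ch in "Aé€😀":
    print(ch, encode_code_point(ord(ch)) == ch.encode("utf-8"))
```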

BOM (Byte Order Mark)

The BOM for UTF-8 is the byte sequence ef bb bf. The Unicode standard neither requires nor recommends it.
Even though the BOM is not recommended for UTF-8, PowerShell checks scripts for such a BOM, and especially in conjunction with COM, it is often very beneficial to use one.
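When a BOM is wanted, for example for a script file that PowerShell will read, Python's `utf-8-sig` codec writes and strips it. This is a minimal sketch; the script text in `payload` is a made-up example.

```python
import codecs

# Hypothetical script text, used only for illustration.
payload = "Write-Output 'hello'\n"

# The utf-8-sig codec prepends the BOM (ef bb bf) when encoding.
with_bom = payload.encode("utf-8-sig")
print(with_bom[:3].hex(" "))

# Decoding with utf-8-sig strips the BOM again ...
print(with_bom.decode("utf-8-sig") == payload)

# ... whereas plain utf-8 keeps it as the character U+FEFF.
print(with_bom.decode("utf-8").startswith("\ufeff"))

# utf-8-sig also accepts input that has no BOM at all.
print(payload.encode("utf-8").decode("utf-8-sig") == payload)
```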

Misc

UTF-8 was developed by Ken Thompson, together with Rob Pike.

See also

RFC 3629: UTF-8, a transformation format of ISO 10646
Python PEP 686 makes UTF-8 the default encoding for files, stdio and pipes.
iconv
Perl and utf8
SAS utf-8 sessions
The Windows code page number for UTF-8 is 65001 (see for example the cmd.exe command chcp).
