There's No Such Thing As Plain Text • Dylan Beattie • YOW! 2023

Explore the evolution of character representation from ASCII to Unicode, and how Unicode's code points, language-agnostic design, and various encoding forms and normalization rules shape the digital landscape.

Key takeaways
  • ASCII was designed as a 7-bit encoding; the eighth bit was later used for incompatible, vendor-specific "extended ASCII" code pages
  • The first version of the Unicode standard was released in 1991 and contained over 7,000 characters
  • Every Unicode character is identified by a unique number, known as a code point, conventionally written U+XXXX (see the first sketch after this list)
  • The Unicode Consortium was established in 1991 to maintain and expand the Unicode standard
  • ASCII and Unicode take different approaches to character representation: ASCII maps each character directly to a single byte value, while Unicode separates abstract code points from the byte encodings used to store them
  • Unicode is designed to be language-agnostic, meaning that it can be used to represent characters from any language
  • Unicode defines four normalization forms, NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition), which determine how characters are represented and combined (see the normalization sketch after this list)
  • Unicode also defines several encoding forms, including UTF-8, UTF-16, and UTF-32, which determine how code points are serialized into bytes (see the encoding sketch after this list)
  • UTF-8 is a variable-length encoding that uses 1, 2, 3, or 4 bytes per code point, and is backward-compatible with ASCII
  • UTF-16 is also variable-length, using 2 bytes for most characters and 4 bytes (a surrogate pair) for code points above U+FFFF; only UTF-32 is fixed-length, at 4 bytes per code point
  • Every Unicode character has a general category, such as letter, digit, punctuation, or symbol, which determines how it is treated in different contexts
  • Characters also carry properties, including bidirectional class, line-break behavior, and grapheme cluster boundaries, which determine how they are displayed and interacted with (see the final sketch after this list)
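
A code point is just a number, conventionally written U+ followed by four or more hex digits. As a minimal sketch (the sample characters are my own, not from the talk), Python's built-in ord() and chr() convert between characters and code points:

    # Every character maps to a unique code point, written U+XXXX.
    for ch in "Aé€🎸":
        print(f"{ch!r} -> U+{ord(ch):04X}")   # 'A' -> U+0041 ... '🎸' -> U+1F3B8

    # And back again: a code point identifies exactly one character.
    print(chr(0x1F3B8))  # 🎸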
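
Normalization determines whether "é" is stored as one composed code point or as "e" plus a combining accent. A minimal sketch using Python's standard unicodedata module (the sample strings are my own choice):

    import unicodedata

    composed   = "é"        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"  # U+0065 + U+0301 COMBINING ACUTE ACCENT

    # The two strings render identically but compare unequal...
    print(composed == decomposed)                    # False
    # ...until both are normalized to the same form.
    nfc = unicodedata.normalize("NFC", decomposed)   # canonical composition
    nfd = unicodedata.normalize("NFD", composed)     # canonical decomposition
    print(nfc == composed, nfd == decomposed)        # True True

    # The K (compatibility) forms also fold formatting variants:
    print(unicodedata.normalize("NFKC", "ﬁ"))        # "fi": the ligature is split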
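
The encoding forms trade simplicity for space. This sketch (my own examples) serializes the same characters under each encoding; explicit endianness is used so no byte-order mark inflates the counts:

    # The same characters, serialized three ways.
    for text in ("A", "é", "€", "🎸"):
        utf8  = text.encode("utf-8")
        utf16 = text.encode("utf-16-be")
        utf32 = text.encode("utf-32-be")
        print(f"U+{ord(text):04X}: utf-8={len(utf8)}, "
              f"utf-16={len(utf16)}, utf-32={len(utf32)} bytes")

    # U+0041: utf-8=1, utf-16=2, utf-32=4 bytes
    # U+00E9: utf-8=2, utf-16=2, utf-32=4 bytes
    # U+20AC: utf-8=3, utf-16=2, utf-32=4 bytes
    # U+1F3B8: utf-8=4, utf-16=4 (a surrogate pair), utf-32=4 bytes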
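
Finally, a sketch of general categories, bidirectional classes, and grapheme clusters, again via unicodedata (sample characters mine). Note that Python's standard library offers no grapheme segmentation, so the last lines only show why counting code points can mislead:

    import unicodedata

    for ch in ("A", "7", "!", "€", "א"):
        print(ch,
              unicodedata.category(ch),       # general category: Lu, Nd, Po, Sc, Lo
              unicodedata.bidirectional(ch))  # bidi class: L, EN, ON, ET, R

    # A grapheme cluster (what a user perceives as one character) can span
    # many code points; len() counts code points, not graphemes.
    family = "👩\u200d👩\u200d👧\u200d👦"  # one emoji family, seven code points
    print(len(family))  # 7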