There's No Such Thing As Plain Text • Dylan Beattie • YOW! 2023

Explore the evolution of character representation from ASCII to Unicode, and how Unicode's code points, language-agnostic design, and various encoding forms and normalization rules shape the digital landscape.

Key takeaways
  • ASCII was designed as a 7-bit encoding; the eighth bit was later used for incompatible, vendor-specific "extended ASCII" code pages
  • The first version of the Unicode standard was released in 1991 and contained over 7,000 characters
  • Every Unicode character is identified by a unique number, known as a code point, conventionally written U+XXXX (see the first sketch after this list)
  • The Unicode Consortium was established in 1991 to maintain and expand the Unicode standard
  • ASCII and Unicode take different approaches to character representation: ASCII maps each character directly to a single byte value, while Unicode separates abstract code points from the byte encodings used to store them
  • Unicode is designed to be language-agnostic, meaning that it can be used to represent characters from any language
  • Unicode defines four normalization forms, NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition), which determine how characters are represented and combined (see the normalization sketch after this list)
  • Unicode also defines several encoding forms, including UTF-8, UTF-16, and UTF-32, which determine how code points are serialized into bytes (see the encoding sketch after this list)
  • UTF-8 is a variable-length encoding that uses 1, 2, 3, or 4 bytes per code point, and is backward-compatible with ASCII
  • UTF-16 is also variable-length, using 2 bytes for most characters and 4 bytes (a surrogate pair) for code points above U+FFFF; only UTF-32 is fixed-length, at 4 bytes per code point
  • Every Unicode character has a general category, such as letter, digit, punctuation, or symbol, which determines how it is treated in different contexts
  • Characters also carry properties, including bidirectional class, line-break behavior, and grapheme cluster boundaries, which determine how they are displayed and interacted with (see the final sketch after this list)
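
A code point is just a number, conventionally written U+ followed by four or more hex digits. As a minimal sketch (the sample characters are my own, not from the talk), Python's built-in ord() and chr() convert between characters and code points:

    # Every character maps to a unique code point, written U+XXXX.
    for ch in "Aé€🎸":
        print(f"{ch!r} -> U+{ord(ch):04X}")   # 'A' -> U+0041 ... '🎸' -> U+1F3B8

    # And back again: a code point identifies exactly one character.
    print(chr(0x1F3B8))  # 🎸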
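
Normalization determines whether "é" is stored as one composed code point or as "e" plus a combining accent. A minimal sketch using Python's standard unicodedata module (the sample strings are my own choice):

    import unicodedata

    composed   = "é"        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"  # U+0065 + U+0301 COMBINING ACUTE ACCENT

    # The two strings render identically but compare unequal...
    print(composed == decomposed)                    # False
    # ...until both are normalized to the same form.
    nfc = unicodedata.normalize("NFC", decomposed)   # canonical composition
    nfd = unicodedata.normalize("NFD", composed)     # canonical decomposition
    print(nfc == composed, nfd == decomposed)        # True True

    # The K (compatibility) forms also fold formatting variants:
    print(unicodedata.normalize("NFKC", "ﬁ"))        # "fi": the ligature is split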
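
The encoding forms trade simplicity for space. This sketch (my own examples) serializes the same characters under each encoding; explicit endianness is used so no byte-order mark inflates the counts:

    # The same characters, serialized three ways.
    for text in ("A", "é", "€", "🎸"):
        utf8  = text.encode("utf-8")
        utf16 = text.encode("utf-16-be")
        utf32 = text.encode("utf-32-be")
        print(f"U+{ord(text):04X}: utf-8={len(utf8)}, "
              f"utf-16={len(utf16)}, utf-32={len(utf32)} bytes")

    # U+0041: utf-8=1, utf-16=2, utf-32=4 bytes
    # U+00E9: utf-8=2, utf-16=2, utf-32=4 bytes
    # U+20AC: utf-8=3, utf-16=2, utf-32=4 bytes
    # U+1F3B8: utf-8=4, utf-16=4 (a surrogate pair), utf-32=4 bytes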
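
Finally, a sketch of general categories, bidirectional classes, and grapheme clusters, again via unicodedata (sample characters mine). Note that Python's standard library offers no grapheme segmentation, so the last lines only show why counting code points can mislead:

    import unicodedata

    for ch in ("A", "7", "!", "€", "א"):
        print(ch,
              unicodedata.category(ch),       # general category: Lu, Nd, Po, Sc, Lo
              unicodedata.bidirectional(ch))  # bidi class: L, EN, ON, ET, R

    # A grapheme cluster (what a user perceives as one character) can span
    # many code points; len() counts code points, not graphemes.
    family = "👩\u200d👩\u200d👧\u200d👦"  # one emoji family, seven code points
    print(len(family))  # 7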