Saturday, 15 January 2022

Unicode Trivia U+03C2

Block: U+0370..03FF "Greek and Coptic"

The modern lowercase (minuscule) Greek alphabet is encoded in the Unicode range U+03B1 to U+03C9 in the "Greek and Coptic" block:

α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω

The uppercase versions of these letters are in the range U+0391 to U+03A9:

Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ ΢ Σ Τ Υ Φ Χ Ψ Ω

Tofu alert! Something nasty happens between the capitals rho "Ρ" and sigma "Σ". Here are those ranges rendered in a table:

The grey square is U+03A2: a "reserved" codepoint. This character is also reserved in earlier character sets (e.g. ISO/IEC 8859-7 of 1987) so it's not an anomaly of Unicode. It's the gap where "GREEK CAPITAL LETTER FINAL SIGMA" would sit if it actually existed.
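If you'd rather see the gap without relying on fonts, here is a minimal JavaScript sketch that builds both ranges and asks the regex engine which code points are unassigned (General_Category Cn). The helper names codePointRange and unassignedIn are my own, and the answer depends on the Unicode version bundled with your JavaScript engine.

```javascript
// Build a string containing every code point from first to last (inclusive).
function codePointRange(first, last) {
  let text = "";
  for (let cp = first; cp <= last; cp++) {
    text += String.fromCodePoint(cp);
  }
  return text;
}

// List the code points in a range whose General_Category is Cn (unassigned).
function unassignedIn(first, last) {
  const gaps = [];
  for (let cp = first; cp <= last; cp++) {
    if (/^\p{Cn}$/u.test(String.fromCodePoint(cp))) {
      gaps.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return gaps;
}

const lowerGreek = codePointRange(0x03b1, 0x03c9);
const upperGreek = codePointRange(0x0391, 0x03a9);

console.log(lowerGreek);
console.log(upperGreek); // how U+03A2 renders here depends on your font
console.log(unassignedIn(0x03b1, 0x03c9));
console.log(unassignedIn(0x0391, 0x03a9));
```

The lowercase range should report no gaps, while the uppercase range should report the single reserved code point between rho and sigma.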

Consider the titlecase Greek word for "Stasis":


If we convert this to uppercase by using the following in the Chrome browser's console


we get

ΣΤΑΣΙΣ

as expected. All three sigmas (initial "Σ", medial "σ" and final "ς") get mapped to capital sigma "Σ":
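A minimal sketch of that check for the console, assuming the standard String.prototype.toUpperCase() and assuming the word is typed without its accent (which is what the uppercase form "ΣΤΑΣΙΣ" suggests). The variable names are mine.

```javascript
// Assumption: the word is written without the tonos accent.
const titlecaseWord = "Στασις";
const uppercaseWord = titlecaseWord.toUpperCase();
console.log(uppercaseWord);

// Uppercase each sigma form on its own and report the resulting code point.
const sigmaForms = ["\u03A3", "\u03C3", "\u03C2"]; // Σ, σ, ς
const upperTargets = sigmaForms.map(
  (s) => "U+" + s.toUpperCase().codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(upperTargets);
```

All three entries of upperTargets should be the same code point, U+03A3.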


The UCD lowercase mapping of U+03A3 "GREEK CAPITAL LETTER SIGMA" only mentions U+03C3 "GREEK SMALL LETTER SIGMA". So one would think (as I naively did) that converting "ΣΤΑΣΙΣ" to lowercase would produce

στασισ

but if we use the browser console again


we actually get

στασις

(with a final sigma) which is correct but pleasantly unexpected.
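The difference between the naive expectation and the real result is easy to reproduce. This is only a sketch, assuming the standard String.prototype.toLowerCase(); lowercasing one character at a time strips away the context, so the engine cannot tell which sigma is final.

```javascript
const capitals = "ΣΤΑΣΙΣ";

// Naive approach: lowercase each character in isolation (no word context).
const perCharacter = Array.from(capitals, (ch) => ch.toLowerCase()).join("");

// Whole-string approach: the engine can see where the word ends.
const wholeString = capitals.toLowerCase();

console.log(perCharacter); // ends in medial sigma U+03C3
console.log(wholeString);  // ends in final sigma U+03C2
```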

The official reason the string mapping is correct is that final sigmas are "special" according to the Unicode standard. There's a file in the UCD named SpecialCasing.txt. Below is the relevant snippet from that text file:

# Special case for final form of sigma
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA

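This rule kicks in whenever the "Final_Sigma" condition is true, regardless of language. The fields are the code point followed by its lowercase, titlecase and uppercase mappings, so a capital sigma in final position lowercases to U+03C2 instead of the usual U+03C3.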
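As I read the Unicode standard, Final_Sigma holds when the sigma is preceded by a cased letter (optionally followed by case-ignorable characters) and is not followed by a cased letter (optionally preceded by case-ignorable characters). Here is a rough JavaScript sketch of that condition using the Cased and Case_Ignorable regex properties. The function names isFinalSigma and lowercaseSigmas are hypothetical, and the code only touches capital sigmas, so it's an illustration and not a full lowercase mapping.

```javascript
// Hypothetical helper: does the capital sigma at UTF-16 index i satisfy Final_Sigma?
function isFinalSigma(text, i) {
  const before = text.slice(0, i);
  const after = text.slice(i + 1);
  // Preceded by a cased letter, then zero or more case-ignorable characters.
  const precededByCased = /\p{Cased}\p{Case_Ignorable}*$/u.test(before);
  // Followed by zero or more case-ignorable characters, then a cased letter.
  const followedByCased = /^\p{Case_Ignorable}*\p{Cased}/u.test(after);
  return precededByCased && !followedByCased;
}

// Replace every capital sigma with the contextually correct small sigma.
// All other characters are copied through unchanged.
function lowercaseSigmas(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    if (text[i] === "\u03A3") {
      out += isFinalSigma(text, i) ? "\u03C2" : "\u03C3";
    } else {
      out += text[i];
    }
  }
  return out;
}

console.log(lowercaseSigmas("ΣΤΑΣΙΣ"));
```

Only the last sigma of "ΣΤΑΣΙΣ" should come out as "ς". A lone "Σ" has no cased letter before it, so it takes the ordinary mapping to "σ".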

In reality, every implementation of a lowercase mapping function must have special logic to handle final sigmas. Seriously.

As an example, here's the relevant functionality from the Chrome browser source code:

// Really special case 1: upper case sigma.  This letter
// converts to two different lower case sigmas depending on
// whether or not it occurs at the end of a word.
if (next != 0 && Letter::Is(next)) {
  result[0] = 0x03C3;  // followed by a letter: medial sigma
} else {
  result[0] = 0x03C2;  // nothing or a non-letter follows: final sigma
}

This is a gotcha that's obviously bitten people more than once. Even the Unicode Consortium acknowledges it's a can of wormς
