Monday 14 February 2022

Unicode Trivia U+0E74

Codepoint: U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI"
Block: U+0E00..0E7F "Thai"

Huh? According to Unicode's own lookup utility, U+0E74 is an unassigned codepoint. But that wasn't always the case. Back in Unicode 1.0.0 (October 1991) it was U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI":

[source]

Alas, codepoints U+0E70 to U+0E74 only lasted until Unicode 1.0.1 (June 1992) when they were deleted. This was the only time a non-zero patch version (i.e. "major.minor.patch" where patch ≠ 0) of Unicode was officially released. The stability policy means that another patch release is highly unlikely and the removal of codepoints impossible:

Encoding Stability (since Unicode 2.0)

Once a character is encoded, it will not be moved or removed. 

This policy ensures that implementers can always depend on each version of the Unicode Standard being a superset of the previous version. The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.

So why was U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI" and its siblings removed? According to the Notice, it was to bring the Unicode and ISO 10646 standards back in line; U+0E74 was never added to ISO 10646. According to the minutes of a May 1992 meeting of the Unicode Technical Committee:

The UTC has noticed the requirement to remove 5 THAI characters (U+0E70 - U+0E74) and 5 LAO characters (U+0EF0 - U+0EF4). In the interest of the merger between ISO 10646 and Unicode the UTC authorizes its representatives attending the SC2/WG2 meeting in Korea to be flexible on this subject.

The juxtaposition of "authorizes" and "flexible" made me smile.

It appears that Thai Phonetic Order Vowel Signs were redundant and could cause ambiguity:

Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is only encoded <U+0E44 THAI CHARACTER SARA AI MAIMALAI, U+0E15 THAI CHARACTER TO TAO, U+0E23 THAI CHARACTER RO RUA>, and the character U+0E3A is always visible when used; for most routine purposes it is little different to U+0E38 THAI CHARACTER SARA U.  However, in Unicode 1.0[.0], while <U+0E44, U+0E15, U+0E23> was rendered as at present, the same visible string could also be encoded as <U+0E15, U+0E3A, U+0E23, U+0E74 THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI> - no glyph would be rendered for U+0E3A.

I think that's implying that the sequence <... U+0E74> could just as easily be encoded as <U+0E44 ...>. The original glyph charts suggest that too:

[source]

Of course, if someone legitimately used U+0E74 in a document between October 1991 and June 1992, their document would become officially invalid or corrupt after June 1992.

No comments:

Post a Comment