chilliant: Unicode Trivia U+0DA5

Codepoint: U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA"
Block: U+0D80..0DFF "Sinhala"

As Richard Gillam says in "Unicode Demystified" (2003), page 330:

The Unicode Sinhala block runs from U+0D80 to U+0DFF. It does not follow the ISCII order, partly because the ISCII standard doesn't include a code page for Sinhala and partly because Sinhala includes a lot of sounds (and, thus, letters) that aren't present in any of the Indian scripts. The basic setup of the block is the same: anusvara and visarga first, followed by independent vowels, consonants, dependent vowels, and punctuation. Unlike in the ISCII-derived blocks, the al-lakuna (virama) precedes the dependent vowels, rather than following them.

The order of codepoints (or of text made up of codepoints) can be thought of in at least three ways:

The order of codepoints within the character set, e.g. Unicode ("codepoint order")
The order of letters in an 'alphabet', e.g. Sinhala abugida ("alphabet order")
The typical order of words in a language's dictionary ("collation order")

As an example, we'll consider the letters (and only the standalone letters) from the Sinhala block (U+0D80..0DFF).

In codepoint order, these are:

18 independent vowels (U+0D85..0D96)
41 consonants (U+0D9A..0DC6)
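
These counts can be checked against the Unicode Character Database. The following is a minimal sketch in Python, assuming the standard-library `unicodedata` module (whose data version depends on the Python build) and assuming that general category "Lo" is a fair proxy for "standalone letter". The helper name `count_letters` is made up for illustration.

```python
import unicodedata


def count_letters(first, last):
    """Count codepoints with general category Lo in an inclusive range."""
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) == "Lo"
    )


vowels = count_letters(0x0D85, 0x0D96)
consonants = count_letters(0x0D9A, 0x0DC6)
print(vowels, consonants, vowels + consonants)
```

The consonant range spans 45 codepoints but holds only 41 letters, because U+0DB2, U+0DBC, U+0DBE and U+0DBF are unassigned.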

The alphabet order (according to sites such as Omniglot) is the same as the codepoint order. This was presumably a factor in the ordering of the codepoints when the block was added to Unicode 3.0 in 1999.

However, in "collation order" these 59 letters (along with their Sinhalese and Romanized phonetic names) are:

U+0D85 = "අ" = AYANNA = vowel a
U+0D86 = "ආ" = AAYANNA = vowel aa
U+0D87 = "ඇ" = AEYANNA = vowel ae
U+0D88 = "ඈ" = AEEYANNA = vowel aae
U+0D89 = "ඉ" = IYANNA = vowel i
U+0D8A = "ඊ" = IIYANNA = vowel ii
U+0D8B = "උ" = UYANNA = vowel u
U+0D8C = "ඌ" = UUYANNA = vowel uu
U+0D8D = "ඍ" = IRUYANNA = vowel vocalic r
U+0D8E = "ඎ" = IRUUYANNA = vowel vocalic rr
U+0D8F = "ඏ" = ILUYANNA = vowel vocalic l
U+0D90 = "ඐ" = ILUUYANNA = vowel vocalic ll
U+0D91 = "එ" = EYANNA = vowel e
U+0D92 = "ඒ" = EEYANNA = vowel ee
U+0D93 = "ඓ" = AIYANNA = vowel ai
U+0D94 = "ඔ" = OYANNA = vowel o
U+0D95 = "ඕ" = OOYANNA = vowel oo
U+0D96 = "ඖ" = AUYANNA = vowel au
U+0D9A = "ක" = ALPAPRAANA KAYANNA = consonant ka
U+0D9B = "ඛ" = MAHAAPRAANA KAYANNA = consonant kha
U+0D9C = "ග" = ALPAPRAANA GAYANNA = consonant ga
U+0D9D = "ඝ" = MAHAAPRAANA GAYANNA = consonant gha
U+0D9E = "ඞ" = KANTAJA NAASIKYAYA = consonant nga
U+0D9F = "ඟ" = SANYAKA GAYANNA = consonant nnga
U+0DA0 = "ච" = ALPAPRAANA CAYANNA = consonant ca
U+0DA1 = "ඡ" = MAHAAPRAANA CAYANNA = consonant cha
U+0DA2 = "ජ" = ALPAPRAANA JAYANNA = consonant ja
U+0DA5 = "ඥ" = TAALUJA SANYOOGA NAAKSIKYAYA = consonant jnya
U+0DA3 = "ඣ" = MAHAAPRAANA JAYANNA = consonant jha
U+0DA4 = "ඤ" = TAALUJA NAASIKYAYA = consonant nya
U+0DA6 = "ඦ" = SANYAKA JAYANNA = consonant nyja
U+0DA7 = "ට" = ALPAPRAANA TTAYANNA = consonant tta
U+0DA8 = "ඨ" = MAHAAPRAANA TTAYANNA = consonant ttha
U+0DA9 = "ඩ" = ALPAPRAANA DDAYANNA = consonant dda
U+0DAA = "ඪ" = MAHAAPRAANA DDAYANNA = consonant ddha
U+0DAB = "ණ" = MUURDHAJA NAYANNA = consonant nna
U+0DAC = "ඬ" = SANYAKA DDAYANNA = consonant nndda
U+0DAD = "ත" = ALPAPRAANA TAYANNA = consonant ta
U+0DAE = "ථ" = MAHAAPRAANA TAYANNA = consonant tha
U+0DAF = "ද" = ALPAPRAANA DAYANNA = consonant da
U+0DB0 = "ධ" = MAHAAPRAANA DAYANNA = consonant dha
U+0DB1 = "න" = DANTAJA NAYANNA = consonant na
U+0DB3 = "ඳ" = SANYAKA DAYANNA = consonant nda
U+0DB4 = "ප" = ALPAPRAANA PAYANNA = consonant pa
U+0DB5 = "ඵ" = MAHAAPRAANA PAYANNA = consonant pha
U+0DB6 = "බ" = ALPAPRAANA BAYANNA = consonant ba
U+0DB7 = "භ" = MAHAAPRAANA BAYANNA = consonant bha
U+0DB8 = "ම" = MAYANNA = consonant ma
U+0DB9 = "ඹ" = AMBA BAYANNA = consonant mba
U+0DBA = "ය" = YAYANNA = consonant ya
U+0DBB = "ර" = RAYANNA = consonant ra
U+0DBD = "ල" = DANTAJA LAYANNA = consonant la
U+0DC0 = "ව" = VAYANNA = consonant va
U+0DC1 = "ශ" = TAALUJA SAYANNA = consonant sha
U+0DC2 = "ෂ" = MUURDHAJA SAYANNA = consonant ssa
U+0DC3 = "ස" = DANTAJA SAYANNA = consonant sa
U+0DC4 = "හ" = HAYANNA = consonant ha
U+0DC5 = "ළ" = MUURDHAJA LAYANNA = consonant lla
U+0DC6 = "ෆ" = FAYANNA = consonant fa

Spot the anomaly? Well, U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is out of order.
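
If you'd rather not spot it by eye, the check is easy to automate. This is only a sketch: it rebuilds the collation order from the list above (codepoint order with U+0DA5 moved to just after U+0DA2) rather than reading it from CLDR, and the function names are hypothetical.

```python
# Codepoints in the consonant range that have no character assigned.
UNASSIGNED = {0x0DB2, 0x0DBC, 0x0DBE, 0x0DBF}


def letters_in_codepoint_order():
    """The 18 independent vowels followed by the 41 consonants."""
    vowels = list(range(0x0D85, 0x0D96 + 1))
    consonants = [cp for cp in range(0x0D9A, 0x0DC6 + 1) if cp not in UNASSIGNED]
    return vowels + consonants


def letters_in_collation_order():
    """Codepoint order, except U+0DA5 sorts immediately after U+0DA2."""
    letters = letters_in_codepoint_order()
    letters.remove(0x0DA5)
    letters.insert(letters.index(0x0DA2) + 1, 0x0DA5)
    return letters


def descents(sequence):
    """Adjacent pairs (a, b) where a comes first but has the larger value."""
    return [(a, b) for a, b in zip(sequence, sequence[1:]) if a > b]


collation = letters_in_collation_order()
for a, b in descents(collation):
    print(f"U+{a:04X} sorts before U+{b:04X}")
```

The only pair reported is U+0DA5 ahead of U+0DA3; every other adjacent pair in the list is in ascending codepoint order.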

U+0DA5 from r12a

As an English speaker, I'm used to the codepoint order, alphabet order and collation order of the letters "A" to "Z" being identical, so subtle anomalies like this feel jarring. So jarring, in fact, that I checked it against three different sources (Unicode CLDR, MySQL and dictionary.gov.lk) to make sure I hadn't made a transcription error.

It's a bit like having the English alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ" but listing words in an English dictionary in a different order, such as "ABCDEFGHIJKLPMNOQRSTUVWXYZ".
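
To make the analogy concrete, here is a small Python sketch that sorts the same words twice: once by codepoint and once using that made-up dictionary alphabet as a collation key. The word list and the names `DICTIONARY_ALPHABET` and `collation_key` are purely illustrative, and the key function assumes words contain only the letters A to Z.

```python
# Hypothetical dictionary alphabet: "P" sorts between "L" and "M".
DICTIONARY_ALPHABET = "ABCDEFGHIJKLPMNOQRSTUVWXYZ"
RANK = {letter: index for index, letter in enumerate(DICTIONARY_ALPHABET)}


def collation_key(word):
    """Map each letter to its position in the dictionary alphabet."""
    return [RANK[ch] for ch in word.upper()]


words = ["MOON", "PEAR", "LAMP", "NOON", "OPAL"]

print(sorted(words))                     # codepoint order
print(sorted(words, key=collation_key))  # "dictionary" order
```

In codepoint order "PEAR" comes last; with the tailored key it jumps ahead of "MOON", which is the same kind of displacement U+0DA5 undergoes in Sinhala collation.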

You only really need to nail down the order of the letters of a writing system when you start creating reference dictionaries. However, as the Sinhala Dictionary Compilation Institute says, this didn't happen until British colonial rule of what became Sri Lanka. It's impossible to imagine that the British compilers didn't impose some of their preconceptions on the process and therefore muddy the ordering waters.

As Richard Gillam pointed out, Sinhala has a large number of letters and U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is one of those that doesn't fit into the canonical Brahmic consonant ordering utilised by ISCII.

A survey by Weerasinghe, Herath and Gamage (2006) supplies many definitions of Sinhalese "dictionary order" in current use. Indeed, even if Unicode CLDR collation is adopted as a single de facto standard, the collation tailoring metadata is considered "live", and therefore liable to change anyway.

Friday, 11 February 2022
