Friday, 11 February 2022

Unicode Trivia U+0DA5

Codepoint: U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA"
Block: U+0D80..0DFF "Sinhala"

As  Richard Gillam says in "Unicode Demystified" (2003), page 330:

The Unicode Sinhala block runs from U+0D80 to U+0DFF. It does not follow the ISCII order, partly because the ISCII standard doesn't include a code page for Sinhala and partly because Sinhala includes a lot of sounds (and, thus, letters) that aren't present in any of the Indian scripts. The basic setup of the block is the same: anusvara and visarga first, followed by independent vowels, consonants, dependent vowels, and punctuation. Unlike in the ISCII-derived blocks, the al-lakuna (virama) precedes the dependent vowels, rather than following them.

The order of codepoints (or of text made up of codepoints) can be thought of in at least three ways:

  1. The order of codepoints within the character set, e.g. Unicode ("codepoint order")
  2. The order of letters in an 'alphabet', e.g. Sinhala abugida ("alphabet order")
  3. The typical order of words in a language's dictionary ("collation order")

As an example, we'll consider the letters (and only the standalone letters) from the Sinhala block (U+0D80..0DFF).

In codepoint order, these are:

  • 18 independent vowels (U+0D85..0D96)
  • 41 consonants (U+0D9A..0DC6)

The alphabet order (according to sites such as Omniglot) is the same as the codepoint order. This was presumably a factor in the ordering of the codepoints when the block was added to Unicode 3.0 in 1999.

However, in "collation order" these 59 letters (along with their Sinhalese and Romanized phonetic names) are:

  1. U+0D85 = "අ" = AYANNA = vowel a
  2. U+0D86 = "ආ" = AAYANNA = vowel aa
  3. U+0D87 = "ඇ" = AEYANNA = vowel ae
  4. U+0D88 = "ඈ" = AEEYANNA = vowel aae
  5. U+0D89 = "ඉ" = IYANNA = vowel i
  6. U+0D8A = "ඊ" = IIYANNA = vowel ii
  7. U+0D8B = "උ" = UYANNA = vowel u
  8. U+0D8C = "ඌ" = UUYANNA = vowel uu
  9. U+0D8D = "ඍ" = IRUYANNA = vowel vocalic r
  10. U+0D8E = "ඎ" = IRUUYANNA = vowel vocalic rr
  11. U+0D8F = "ඏ" = ILUYANNA = vowel vocalic l
  12. U+0D90 = "ඐ" = ILUUYANNA = vowel vocalic ll
  13. U+0D91 = "එ" = EYANNA = vowel e
  14. U+0D92 = "ඒ" = EEYANNA = vowel ee
  15. U+0D93 = "ඓ" = AIYANNA = vowel ai
  16. U+0D94 = "ඔ" = OYANNA = vowel o
  17. U+0D95 = "ඕ" = OOYANNA = vowel oo
  18. U+0D96 = "ඖ" = AUYANNA = vowel au
  19. U+0D9A = "ක" = ALPAPRAANA KAYANNA = consonant ka
  20. U+0D9B = "ඛ" = MAHAAPRAANA KAYANNA = consonant kha
  21. U+0D9C = "ග" = ALPAPRAANA GAYANNA = consonant ga
  22. U+0D9D = "ඝ" = MAHAAPRAANA GAYANNA = consonant gha
  23. U+0D9E = "ඞ" = KANTAJA NAASIKYAYA = consonant nga
  24. U+0D9F = "ඟ" = SANYAKA GAYANNA = consonant nnga
  25. U+0DA0 = "ච" = ALPAPRAANA CAYANNA = consonant ca
  26. U+0DA1 = "ඡ" = MAHAAPRAANA CAYANNA = consonant cha
  27. U+0DA2 = "ජ" = ALPAPRAANA JAYANNA = consonant ja
  28. U+0DA5 = "ඥ" = TAALUJA SANYOOGA NAAKSIKYAYA = consonant jnya
  29. U+0DA3 = "ඣ" = MAHAAPRAANA JAYANNA = consonant jha
  30. U+0DA4 = "ඤ" = TAALUJA NAASIKYAYA = consonant nya
  31. U+0DA6 = "ඦ" = SANYAKA JAYANNA = consonant nyja
  32. U+0DA7 = "ට" = ALPAPRAANA TTAYANNA = consonant tta
  33. U+0DA8 = "ඨ" = MAHAAPRAANA TTAYANNA = consonant ttha
  34. U+0DA9 = "ඩ" = ALPAPRAANA DDAYANNA = consonant dda
  35. U+0DAA = "ඪ" = MAHAAPRAANA DDAYANNA = consonant ddha
  36. U+0DAB = "ණ" = MUURDHAJA NAYANNA = consonant nna
  37. U+0DAC = "ඬ" = SANYAKA DDAYANNA = consonant nndda
  38. U+0DAD = "ත" = ALPAPRAANA TAYANNA = consonant ta
  39. U+0DAE = "ථ" = MAHAAPRAANA TAYANNA = consonant tha
  40. U+0DAF = "ද" = ALPAPRAANA DAYANNA = consonant da
  41. U+0DB0 = "ධ" = MAHAAPRAANA DAYANNA = consonant dha
  42. U+0DB1 = "න" = DANTAJA NAYANNA = consonant na
  43. U+0DB3 = "ඳ" = SANYAKA DAYANNA = consonant nda
  44. U+0DB4 = "ප" = ALPAPRAANA PAYANNA = consonant pa
  45. U+0DB5 = "ඵ" = MAHAAPRAANA PAYANNA = consonant pha
  46. U+0DB6 = "බ" = ALPAPRAANA BAYANNA = consonant ba
  47. U+0DB7 = "භ" = MAHAAPRAANA BAYANNA = consonant bha
  48. U+0DB8 = "ම" = MAYANNA = consonant ma
  49. U+0DB9 = "ඹ" = AMBA BAYANNA = consonant mba
  50. U+0DBA = "ය" = YAYANNA = consonant ya
  51. U+0DBB = "ර" = RAYANNA = consonant ra
  52. U+0DBD = "ල" = DANTAJA LAYANNA = consonant la
  53. U+0DC0 = "ව" = VAYANNA = consonant va
  54. U+0DC1 = "ශ" = TAALUJA SAYANNA = consonant sha
  55. U+0DC2 = "ෂ" = MUURDHAJA SAYANNA = consonant ssa
  56. U+0DC3 = "ස" = DANTAJA SAYANNA = consonant sa
  57. U+0DC4 = "හ" = HAYANNA = consonant ha
  58. U+0DC5 = "ළ" = MUURDHAJA LAYANNA = consonant lla
  59. U+0DC6 = "ෆ" = FAYANNA = consonant fa
Spot the anomaly? Well, U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is out of order.
U+0DA5 from r12a

As an English speaker, the codepoint order, alphabet order and collation order of the letters "A" to "Z" are identical; so having subtle anomalies like this feels jarring. So jarring, in fact, that I checked it against three different sources (Unicode CLDR, MySQL and dictionary.gov.lk) to make sure I hadn't made a transcription error.

It's a bit like having the English alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ" but listing words in an English dictionary in a different order, such as "ABCDEFGHIJKLPMNOQRSTUVWXYZ".

You only really need to nail down the order of letters of an writing system when you start creating reference dictionaries. However, as the Sinhala Dictionary Compilation Institute says, this didn't happen until British colonial rule of what became Sri Lanka. It's impossible to imagine that the British compilers didn't impose some of their preconceptions on the process and therefore muddied the ordering waters.

As Richard Gillam pointed out, Sinhala has a large number of letters and U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is one of those that doesn't fit into the canonical Brahmic consonant ordering utilised by ISCII.

survey by Weerasinghe, Herath and Gamage (2006) supplies many definitions of Sinhalese "dictionary order" in current use. Indeed, even if Unicode CLDR collation is adopted as a single de facto standard, the collation tailoring metadata is considered "live", and therefore liable to change anyway.


No comments:

Post a Comment