Friday 15 April 2022

Unicode Trivia U+10FB

Codepoint: U+10FB "GEORGIAN PARAGRAPH SEPARATOR"
Block: U+10A0..10FF "Georgian"

The Georgian script is encoded as four letter forms spread across three Unicode blocks.

The four forms (rows 1 to 4) are:

  1. Asomtavruli is the oldest form, dating from the fifth century CE
  2. Nuskhuri dates from the ninth century CE
  3. Mkhedruli is the current Georgian script
  4. Mtavruli is the uppercase version of Mkhedruli

In the original "Georgian" block, the codepoints U+10A0..10C5 encode the uppercase of the old ecclesiastical alphabet, Asomtavruli (row 1). The codepoints U+10D0..10F0 encode the lowercase of the modern secular alphabet, Mkhedruli (row 3). The latter is used for almost all text, including at the beginning of sentences and names.

However, don't be tempted to mash together uppercase Asomtavruli with lowercase Mkhedruli to get a bicameral script. That problem wasn't "solved" until the addition of the later "Georgian Extended" and "Georgian Supplement" blocks. More on that in later posts. For modern Georgians, this isn't really a problem at all; writing uses only one case.
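As a minimal sketch of why the mash-up doesn't work, the snippet below uses Python's standard unicodedata module (my choice of tool; the constant names are just for illustration). It prints the names of three codepoints from the "Georgian" block and then asks Python for the lowercase of an Asomtavruli capital. The exact results depend on the Unicode version bundled with your Python.

```python
import unicodedata

ASOMTAVRULI_AN = "\u10A0"       # first letter of the Asomtavruli range
MKHEDRULI_AN = "\u10D0"         # first letter of the Mkhedruli range
PARAGRAPH_SEPARATOR = "\u10FB"  # the subject of this post


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in (ASOMTAVRULI_AN, MKHEDRULI_AN, PARAGRAPH_SEPARATOR):
    print(describe(ch))

# The lowercase partner of the Asomtavruli capital is NOT the Mkhedruli
# letter; it lives in the "Georgian Supplement" block instead.
lowered = ASOMTAVRULI_AN.lower()
print(describe(lowered))
```

In other words, the case mapping of the old capitals bypasses Mkhedruli altogether.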

In old texts, the "჻" symbol (U+10FB GEORGIAN PARAGRAPH SEPARATOR) was used at the end of the last line of a paragraph. Its use was presumably similar to that of the pilcrow "¶" but at the end of the paragraph, not at the beginning. Alas, the Georgian script didn't get its own full stop; it must share it with the Armenian one, "։" (U+0589 ARMENIAN FULL STOP).

ISO 10586:1996 encodes 42 characters of the Georgian script in a 7-bit character set, including the paragraph mark at 0x4F.

There is an interesting annex in the standard, part of which I'll include below:

Annex A: Development of the Georgian script

Armenian and Georgian, two of the multitudinous tongues spoken in the Caucasian Region, are vehicles of millennial civilizations. Both languages present peculiar phonetic resemblances in spite of their completely different origins. Georgian, or Grusinian, is a member of the Kartvelian language family. Armenian is a member of the Indo-European language family. Each language has its own alphabet, which resemble one another, since the alphabets developed from the same source.

According to one tradition, these two alphabets were invented circa A.D. 406 by the Armenian monk, missionary and theologian Mesrop Mast’oc’ (ca. A.D. 360 to A.D. 439), who also invented an alphabet for the now extinct language Albani (or Caucasian Albanian). According to another tradition, the Georgian script was invented circa A.D. 300 by the Georgian king, Parnavaz. Some scholars allege that it was invented many centuries earlier. The origin of, and the relations between, the three forms of the script are also still in dispute.

More likely, the Georgian script was derived, as was the Armenian script, from a Semitic alphabet, the Pahlavi script, used in Persia in the 4th century. It was developed under a strong Greek influence (by Mast’oc’ or perhaps one of his disciples) into an alphabet enabling the Georgian people to spell their language, with its wealth of sounds in a simple and phonemic way. Owing to phonetic evolution, a few letters became superfluous. In former times, the Georgian alphabet was also used in writing Ossetic and Abkhaz. The oldest inscription in Georgian dates back to the 5th century. The oldest manuscripts date from the 8th century. The period from A.D. 980 to A.D. 1220 is considered the golden age of Georgian literature.

Wednesday 13 April 2022

Unicode Trivia U+1090

Codepoint: U+1090 "MYANMAR SHAN DIGIT ZERO"
Block: U+1000..109F "Myanmar"

The "Myanmar" Unicode block contains glyphs used in various regional writing systems including the Burmese and Shan scripts. In this post, I'm going to play fast and loose with the script names and just call them "Burmese" and "Shan". Unicode Technical Note 11 describes some of the intricacies involved with the various scripts, weighing in at a healthy 67 pages.

The "Myanmar" block contains two sets of digits: one for Burmese and one for Shan.

If you have appropriate fonts installed, these are the Burmese digits (U+1040..1049):

၀၁၂၃၄၅၆၇၈၉

and Shan digits (U+1090..1099):

႐႑႒႓႔႕႖႗႘႙
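Both sets are ordinary decimal digits as far as Unicode is concerned, so they can be generated and parsed mechanically. Here is a small sketch in Python; the helper name to_script_digits is my own invention, and the only facts it relies on are the two starting codepoints given above.

```python
BURMESE_ZERO = 0x1040  # MYANMAR DIGIT ZERO
SHAN_ZERO = 0x1090     # MYANMAR SHAN DIGIT ZERO


def to_script_digits(n, zero):
    """Render a non-negative integer using the ten digits starting at `zero`."""
    return "".join(chr(zero + int(d)) for d in str(n))


burmese = to_script_digits(2022, BURMESE_ZERO)
shan = to_script_digits(2022, SHAN_ZERO)
print(burmese, shan)

# Python's int() accepts any Unicode decimal digit, so the round trip works
# without a lookup table.
print(int(burmese), int(shan))
```

Whether the output is legible still depends on having appropriate fonts installed.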

Burmese digits have the advantage (over Hindu-Arabic and Shan digits) of having ascenders and descenders which help to differentiate them. They are very similar to "Tai Tham Hora" digits (U+1A80..1A89).

The Shan script supposedly evolved from the Burmese, but their digits are markedly different. To my eyes, they appear to resemble the Hindu-Arabic digits, but the "8" and "9" are inexplicably similar.

The Burmese language has words for very large numbers: powers of ten up to 10^7 and then increasing multiplicatively by factors of 10^7 up to 10^140 ("athinche", fittingly this is a synonym for "countless number"). I could find no reason why the names progress in multiples of 10^7. Most other languages use 10^3 (e.g. English "thousand", "million", "billion", etc.) or sometimes 10^2 (e.g. Indian "lakh", "crore", "arab", etc.).
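To make the progression concrete, here is a rough sketch that lists the exponents involved, reading the figures above as powers of ten (10^7 and 10^140). It assumes one name for each power from 10^1 to 10^7 and then one for every further factor of 10^7; that is my reading of the description, not a checked list of Burmese number words.

```python
STEP = 7       # names progress by factors of 10^7 ...
LARGEST = 140  # ... up to 10^140

# Exponents of the "small" named powers: 10^1 .. 10^7
small = list(range(1, STEP + 1))

# Exponents of the "large" named powers: 10^14, 10^21, ... 10^140
large = list(range(2 * STEP, LARGEST + 1, STEP))

exponents = small + large
print(exponents)
print(len(large), "steps of 10^7 beyond 10^7 itself")
```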

Another curiosity is that the tonal pronunciation of digits changes depending on the denary position of the digit within the number. The tone generally changes from "low" to "creaky" (no, really!) for digits in the 10^1, 10^2 and 10^3 places.

Monday 11 April 2022

Unicode Trivia U+0F33

Codepoint: U+0F33 "TIBETAN DIGIT HALF ZERO"
Block: U+0F00..0FFF "Tibetan"

The Unicode Character Database has a field named "Numeric_Value" (abbreviated to "nv"). For the vast majority of the 144,697 used codepoints in Unicode 14.0.0 (in fact, precisely 142,890) this field holds the value "NaN" meaning that the codepoint does not represent a numeric value.
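If you want to poke at this field yourself, Python's standard unicodedata module exposes it through unicodedata.numeric(). The sketch below tallies the values over every codepoint. The function name numeric_value_counts is my own, and the totals will only match the figures in this post if your Python happens to bundle the same Unicode version, so treat the output as indicative.

```python
import sys
import unicodedata
from collections import Counter
from fractions import Fraction


def numeric_value_counts():
    """Count codepoints per numeric value, keyed by an exact Fraction."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        # The second argument is returned for codepoints with no numeric value.
        value = unicodedata.numeric(chr(cp), None)
        if value is not None:
            # unicodedata returns floats; recover small fractions such as 1/3.
            counts[Fraction(value).limit_denominator(1000)] += 1
    return counts


counts = numeric_value_counts()
print("Unicode version:", unicodedata.unidata_version)
for value, n in counts.most_common(10):
    print(f'"{value}" ({n})')
```

Using Fraction rather than the raw float keeps values such as "1/3" and "1/320" readable in the output.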

Other values for "nv", with the number of codepoints having that value in parentheses, are shown below, in approximate order of frequency.

First, the denary digits. The distribution is not flat because of the irregularity of CJK ideographs representing small numbers and the lack of a "zero" digit in some writing systems:

  • "1" (141)
  • "2" (140)
  • "3" (141)
  • "4" (132)
  • "5" (130)
  • "6" (114)
  • "7" (113)
  • "8" (109)
  • "9" (113)
  • "0" (84)

Next, multiples of ten:

  • "10" (62)
  • "20" (36)
  • "30" (19)
  • "40" (18)
  • "50" (29)
  • "60" (13)
  • "70" (13)
  • "80" (12)
  • "90" (12)

Next, powers of ten. Characters for trillions are used in Japan and Taiwan (U+5146) and in the Pahawh Hmong script (U+16B61):

  • "100" (35)
  • "1000" (22)
  • "10000" (13)
  • "100000" (5)
  • "1000000" (1)
  • "10000000" (1)
  • "100000000" (3)
  • "10000000000" (1)
  • "1000000000000" (2)

Next, sequential values up to twenty:

  • "11" (8)
  • "12" (8)
  • "13" (6)
  • "14" (6)
  • "15" (6)
  • "16" (7)
  • "17" (7)
  • "18" (7)
  • "19" (7)

Next, blocks of circled numbers:

  • "21" (1)
  • "22" (1)
  • "23" (1)
  • "24" (1)
  • "25" (1)
  • "26" (1)
  • "27" (1)
  • "28" (1)
  • "29" (1)

  • "31" (1)
  • "32" (1)
  • "33" (1)
  • "34" (1)
  • "35" (1)
  • "36" (1)
  • "37" (1)
  • "38" (1)
  • "39" (1)

  • "41" (1)
  • "42" (1)
  • "43" (1)
  • "44" (1)
  • "45" (1)
  • "46" (1)
  • "47" (1)
  • "48" (1)
  • "49" (1)

Next, multiples of 100. We can see the importance of 500 in ancient counting systems (e.g. "D" in Roman numerals):

  • "200" (6)
  • "300" (7)
  • "400" (7)
  • "500" (16)
  • "600" (7)
  • "700" (6)
  • "800" (6)
  • "900" (7)

Next, multiples of 1000:

  • "2000" (5)
  • "3000" (4)
  • "4000" (4)
  • "5000" (8)
  • "6000" (4)
  • "7000" (4)
  • "8000" (4)
  • "9000" (4)

Next, multiples of 10,000:

  • "20000" (4)
  • "30000" (4)
  • "40000" (4)
  • "50000" (7)
  • "60000" (4)
  • "70000" (4)
  • "80000" (4)
  • "90000" (4)

Next, multiples of 100,000 (e.g. "lakh"):

  • "200000" (2)
  • "300000" (1)
  • "400000" (1)
  • "500000" (1)
  • "600000" (1)
  • "700000" (1)
  • "800000" (1)
  • "900000" (1)

Next, multiples of 10,000,000 (e.g. "crore"):

  • "20000000" (1)

Next are two large numbers from cuneiform (base 60):

  • "216000" (1)
  • "432000" (1)

Next, we start the rational fractions (e.g. "half"):

  • "1/2" (18)

Next, the quarters:

  • "1/4" (13)
  • "3/4" (8)

Next, the eighths:

  • "1/8" (7)
  • "3/8" (1)
  • "5/8" (1)
  • "7/8" (1)

Next, the sixteenths:

  • "1/16" (6)
  • "3/16" (5)

Next, the thirty-seconds:

  • "1/32" (1)

Next, the sixty-fourths:

  • "1/64" (1)
  • "3/64" (1)

Next, the thirds (strangely, there's an Ancient Greek "⅔" U+10177, but not for "⅓"):

  • "1/3" (5)
  • "2/3" (6)

Next, the fifths:

  • "1/5" (3)
  • "2/5" (1)
  • "3/5" (1)
  • "4/5" (1)

Next, the sixths:

  • "1/6" (3)
  • "5/6" (2)

Next, a seventh:

  • "1/7" (1)

Next, a ninth:

  • "1/9" (1)

Next, the twelfths (Meroitic cursive fractions, not reduced):

  • "1/12" (1)
  • "2/12" (1)
  • "3/12" (1)
  • "4/12" (1)
  • "5/12" (1)
  • "6/12" (1)
  • "7/12" (1)
  • "8/12" (1)
  • "9/12" (1)
  • "10/12" (1)
  • "11/12" (1)

Next, a collection of (mostly Tamil and Malayalam) fractions we've seen already:

  • "1/320" (2)
  • "1/160" (2)
  • "1/80" (1)
  • "1/40" (2)
  • "3/80" (2)
  • "1/20" (2)
  • "1/10" (3)
  • "3/20" (2)

Finally, a collection of what can only be described as "strange halves":

  • "3/2" (1)
  • "5/2" (1)
  • "7/2" (1)
  • "9/2" (1)
  • "11/2" (1)
  • "13/2" (1)
  • "15/2" (1)
  • "17/2" (1)
  • "-1/2" (1)

These last nine all belong to the "Tibetan / Digits minus Half" group of codepoints (U+0F2A to U+0F33), including the wonderfully perplexing U+0F33 "TIBETAN DIGIT HALF ZERO".

This character supposedly has a numeric value of "-1/2" or "-0.5", and is the only codepoint (so far) with a negative "nv".
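This is easy to check with Python's standard unicodedata module. The sketch below looks up U+0F33 and then the whole U+0F2A to U+0F33 group; the variable names are mine.

```python
import unicodedata

HALF_ZERO = "\u0F33"

print(unicodedata.name(HALF_ZERO))
print(unicodedata.numeric(HALF_ZERO))

# Numeric values for the whole "digits minus half" group.
tibetan_halves = {
    f"U+{cp:04X}": unicodedata.numeric(chr(cp))
    for cp in range(0x0F2A, 0x0F34)
}
for codepoint, value in tibetan_halves.items():
    print(codepoint, value)
```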

As Andrew West points out, there is much confusion (and little evidence) surrounding the numeric values of these codepoints. The glyphs seem to appear on postage stamps, but if the Royal Mail were in the habit of issuing stamps with a denomination of minus ½p, they would quickly go out of business. If you went into a Post Office and asked for one million "-½p" stamps, the teller would be obliged to give you a huge tome of stamps and £5,000.