tag:blogger.com,1999:blog-51817468710865415752024-03-22T11:07:54.376+00:00chilliantThe personal blog of Ian TaylorIan Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.comBlogger228125tag:blogger.com,1999:blog-5181746871086541575.post-48893299761202492362024-03-22T11:07:00.000+00:002024-03-22T11:07:22.722+00:00Unicode Numeral Systems 2<p><a href="https://cs.lmu.edu/~ray/">Ray Toal</a> recently enlightened me on the existence of two interesting number systems: <a href="https://en.wikipedia.org/wiki/Kaktovik_numerals">Kaktovik numerals </a>and <a href="https://en.wikipedia.org/wiki/Cistercian_numerals">Cistercian numerals</a>.</p><p>I haven't updated my <a href="https://chilliant.com/numerals.html">list of numeral systems</a> since Unicode 14.0.0 (see original <a href="https://chilliant.blogspot.com/2021/09/unicode-numeral-systems.html">blog post</a>), so I thought I'd revisit the whole "<a href="https://chilliant.com/universe.html">Universe</a>" project and update it to Unicode 15.1.0.</p><p>The vigesimal Kaktovik numerals <i>are</i> supported by Unicode 15.0.0 (<a href="https://chilliant.com/universe/universe.html#C+1D2C0">U+1D2C0 to U+1D2D3</a>), but, at the time of writing, Google Noto font support is still shuffling along the pipe, so they are difficult to display:</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjwm8Vt1zC6-9ACo_PratB0RD6Ep2fWLxXJqSXUXWm4yA3rvEnPnCPKqPO3wwfG39aSC3XFBzVBMAuMYR_zo6qt4c9EHLA68N1T-8gyb-2d_t5-O0UWJv7DLjMjP1ry3gUMrVI4i0jDHONdzxMSANYiRKdB7IKlrdTwxNuWsE24vb8NMNXButoNlc_mYr0" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="480" data-original-width="366" height="240" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEjwm8Vt1zC6-9ACo_PratB0RD6Ep2fWLxXJqSXUXWm4yA3rvEnPnCPKqPO3wwfG39aSC3XFBzVBMAuMYR_zo6qt4c9EHLA68N1T-8gyb-2d_t5-O0UWJv7DLjMjP1ry3gUMrVI4i0jDHONdzxMSANYiRKdB7IKlrdTwxNuWsE24vb8NMNXButoNlc_mYr0" width="183" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://commons.wikimedia.org/wiki/File:Kaktovik_digit_table.svg">source</a>]</td></tr></tbody></table><br />Cistercian numerals are interesting as they are (sort of) base-10000, but they have not been allocated a Unicode range, although the <a href="https://www.unicode.org/L2/L2020/20290-cistercian-digits.pdf">proposal</a> dates back to 2020.</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg9z8CnvK-7_XJrG0timAtk8O34yD3aRqVBAvJZAUn3W7yr34n2YnRlWhGbt1ac2_COkFGwgf3qbPtOSoX25TrQP5sVBOS_cZmec6sR57ZdppqY22DUSkpJaBnDxpRTob-Jc08hxtIZ57vtZipf0M9DB9VzEgwDmGxhqX8JqqwFrbm005IVYqB2Q9KK2bs" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="480" data-original-width="604" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEg9z8CnvK-7_XJrG0timAtk8O34yD3aRqVBAvJZAUn3W7yr34n2YnRlWhGbt1ac2_COkFGwgf3qbPtOSoX25TrQP5sVBOS_cZmec6sR57ZdppqY22DUSkpJaBnDxpRTob-Jc08hxtIZ57vtZipf0M9DB9VzEgwDmGxhqX8JqqwFrbm005IVYqB2Q9KK2bs" width="302" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://commons.wikimedia.org/wiki/File:Cistercian_digits_(vertical).svg">source</a>]</td></tr></tbody></table><br />The <a href="https://www.hsablonniere.com/a-clock-based-on-cistercian-numerals--hptit8/">Cistercian numeral clock</a> tickled my fancy.</p><p><br /></p>Ian 
Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-12544179491420007222022-04-15T09:27:00.003+00:002022-04-15T09:27:46.442+00:00Unicode Trivia U+10FB<p><b>Codepoint:</b> U+10FB "GEORGIAN PARAGRAPH SEPARATOR"<br /><b>Block:</b> U+10A0..10FF "Georgian"</p><p><span style="font-family: inherit;">The Georgian scripts are encoded in four letter forms in three Unicode blocks</span>:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig9aI5jCcUwmxyeLUBQXBVVadWqRS67ZaYnEwvLns_U-1RrJQckc0H5JnWzdMoy3X4V2u3Cpr9nKatwLc2Tz99TVFZt7unm5MrbSQam18HD_LR0a7oWrrMmukdaapc0dxTgWcrdNmIngTl63hcEgEPhHLTvOEVAaLH3BzsdDLZu4_SBFtqgxDBYjJL/s1090/georgian.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="133" data-original-width="1090" height="78" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEig9aI5jCcUwmxyeLUBQXBVVadWqRS67ZaYnEwvLns_U-1RrJQckc0H5JnWzdMoy3X4V2u3Cpr9nKatwLc2Tz99TVFZt7unm5MrbSQam18HD_LR0a7oWrrMmukdaapc0dxTgWcrdNmIngTl63hcEgEPhHLTvOEVAaLH3BzsdDLZu4_SBFtqgxDBYjJL/w640-h78/georgian.png" width="640" /></a></div><p>The four rows are:</p><p></p><ol style="text-align: left;"><li>Asomtavruli is the oldest form, dating from the fifth century CE<br /></li><li>Nuskhuri dates from the ninth century CE</li><li>Mkhedruli is the current Georgian script</li><li>Mtavruli is the uppercase version of Mkhedruli</li></ol><p></p><p>In the original "Georgian" block, the codepoints U+10A0..10C5 encode the uppercase of the old ecclesiastical alphabet, Asomtavruli (row 1). The codepoints U+10D0..10F0 encode the lowercase of the modern secular alphabet, Mkhedruli (row 3). 
The latter is used for almost all text, including at the beginning of sentences and names.</p><p>However, don't be tempted to mash together uppercase Asomtavruli with lowercase Mkhedruli to get a <a href="https://en.wikipedia.org/wiki/Letter_case#Bicameral_script">bicameral script</a>. That problem wasn't "solved" until the addition of the later "Georgian Extended" and "Georgian Supplement" blocks. More on that in later posts. For modern Georgians, this isn't really a problem at all; <a href="https://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Geor">writing uses only one case</a>.</p><p>In old texts, the "჻" <a href="https://r12a.github.io/scripts/georgian/ka.html#index_punctuation">symbol</a> (U+10FB GEORGIAN PARAGRAPH SEPARATOR) was used at the end of the last line of a paragraph. Its use was presumably similar to that of the <a href="https://en.wikipedia.org/wiki/Pilcrow">pilcrow</a> "¶" but at the end of the paragraph, not at the beginning. Alas, the Georgian script didn't get its own full stop; <a href="https://www.unicode.org/L2/L2001/01040-georgian-joe.pdf">it must share it</a> with the Armenian one, "։" (U+0589 ARMENIAN FULL STOP).</p><p><a href="https://www.unicode.org/L2/L2001/01238-10586-1996.pdf">ISO 10586:1996</a> encodes 42 characters of the Georgian script in a 7-bit character set, including the paragraph mark at 0x4F.</p><p>There is an interesting annex in the standard, part of which I'll include below:</p><p></p><blockquote><p style="text-align: center;"><b>Annex A: Development of the Georgian script</b></p><p><i>Armenian and Georgian, two of the multitudinous tongues spoken in the Caucasian Region, are vehicles of millennial civilizations. Both languages present peculiar phonetic resemblances in spite of their completely different origins. Georgian, or Grusinian, is a member of the Kartvelian language family. Armenian is a member of the Indo-European language family. 
Each language has its own alphabet, which resemble one another, since the alphabets developed from the same source.</i></p><p><i>According to one tradition, these two alphabets were invented circa A.D. 406 by the Armenian monk, missionary and theologian Mesrop Mast’oc’ (ca. A.D. 360 to A.D. 439), who also invented an alphabet for the now extinct language Albani (or Caucasian Albanian). According to another tradition, the Georgian script was invented circa A.D. 300 by the Georgian king, Parnavaz. Some scholars allege that it was invented many centuries earlier. The origin of, and the relations between, the three forms of the script are also still in dispute.</i></p><p><i>More likely, the Georgian script was derived, as was the Armenian script, from a Semitic alphabet, the Pahlavi script, used in Persia in the 4th century. It was developed under a strong Greek influence (by Mast’oc’ or perhaps one of his disciples) into an alphabet enabling the Georgian people to spell their language, with its wealth of sounds in a simple and phonemic way. Owing to phonetic evolution, a few letters became superfluous. In former times, the Georgian alphabet was also used in writing Ossetic and Abkhaz. The oldest inscription in Georgian dates back to the 5th century. The oldest manuscripts date from the 8th century. The period from A.D. 980 to A.D. 1220 is considered the golden age of Georgian literature.</i></p></blockquote>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-72120906652715187762022-04-13T17:52:00.003+00:002022-04-13T17:55:26.693+00:00Unicode Trivia U+1090<p><b>Codepoint:</b> U+1090 "MYANMAR SHAN DIGIT ZERO"<br /><b>Block:</b> U+1000..109F "Myanmar"</p><p><span style="font-family: inherit;">The "Myanmar" Unicode block contains glyphs used in various regional writing systems including the Burmese and Shan scripts. 
In this post, I'm going to play fast and loose with the script names and just call them "Burmese" and "Shan". <a href="https://www.unicode.org/notes/tn11/UTN11_4.pdf">Unicode Technical Note 11</a> describes some of the intricacies involved with the various scripts, weighing in at a healthy 67 pages.</span></p><p><span style="font-family: inherit;">The "Myanmar" block contains two sets of digits: one for Burmese (second row below) and one for Shan (bottom row):</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2MjSONO5OnquUFoCW_dwxXlxUoh841f8ElYbXPxYN6pEgQrOPKHtbwDQ_lnbBeolt_axTPmq3bZbx2UIkaalpB7hCfTPFwfC9qVCTLskS6GWOr0G64D0ANde2MCO4LG81y1e8t-RMvkwpg7XUohL1OFMTHWCv-trXdBCuR_YIsz1DgnsXlcEu-UDY/s331/myanmar.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="100" data-original-width="331" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2MjSONO5OnquUFoCW_dwxXlxUoh841f8ElYbXPxYN6pEgQrOPKHtbwDQ_lnbBeolt_axTPmq3bZbx2UIkaalpB7hCfTPFwfC9qVCTLskS6GWOr0G64D0ANde2MCO4LG81y1e8t-RMvkwpg7XUohL1OFMTHWCv-trXdBCuR_YIsz1DgnsXlcEu-UDY/s16000/myanmar.png" /></a></div><p style="clear: both; text-align: center;"></p><p style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"></p><p></p><p style="-webkit-text-stroke-width: 0px; color: black; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; 
text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-family: inherit;">If you have appropriate fonts installed, these are the Burmese digits (U+1040..1049):</span></p><p style="orphans: 2; text-align: center; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; widows: 2;"><span style="font-size: x-large;">၀၁၂၃၄၅၆၇၈၉</span></p><p style="-webkit-text-stroke-width: 0px; color: black; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-family: inherit;">and Shan digits (U+1090..1099):</span></p><p style="orphans: 2; text-align: center; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; widows: 2;"><span style="font-size: x-large;">႐႑႒႓႔႕႖႗႘႙</span></p><p style="-webkit-text-stroke-width: 0px; color: black; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-family: inherit;">Burmese digits have the advantage (over Hindu-Arabic and Shan digits) of having ascenders and descenders which help to differentiate them. They are very similar to "Tai Tham Hora" digits (U+1A80..1A89). 
See <a href="https://chilliant.com/numerals.html">here</a>.</span></p><p style="-webkit-text-stroke-width: 0px; color: black; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-family: inherit;">The Shan script supposedly evolved from the Burmese, but their digits are markedly different. To my eyes, they appear to resemble the Hindu-Arabic digits, but the "8" and "9" are inexplicably similar.</span></p><p style="-webkit-text-stroke-width: 0px; color: black; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-family: inherit;"><span style="font-style: normal;"><span>The Burmese language has words for very large numbers: powers of ten up t</span><span>o <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">7</sup> and then increasing multiplicatively by factors of</span> <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">7</sup> up to <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">140</sup> ("</span><i><a 
href="https://my.wiktionary.org/wiki/%E1%80%A1%E1%80%9E%E1%80%84%E1%80%BA%E1%80%B9%E1%80%81%E1%80%BB%E1%80%B1">athinche</a></i>", fittingly this is a synonym for "countless number"). I could find no reason why the names progress in multiples of <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">7</sup>. Most other languages use <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">3</sup> (e.g. <a href="https://en.wikipedia.org/wiki/Names_of_large_numbers">English</a> "thousand", "million", "billion", etc.) or sometimes <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">2</sup> (e.g. <a href="https://en.wikipedia.org/wiki/Indian_numbering_system">Indian</a> "lakh", "crore", "arab", etc.)</span></p><p style="-webkit-text-stroke-width: 0px; color: black; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-family: inherit;">Another curiosity is that the <a href="https://www.soas.ac.uk/bbe/">tonal pronunciation</a> of digits changes depending on the denary position of the digit within the number. The tone generally changes from "low" to "creaky" (no, really!) 
for digits in the <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">1</sup>, <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">2</sup> and <span style="background-color: #f8f9fa; color: #202122; text-align: right;">10</span><sup style="background-color: #f8f9fa; color: #202122; line-height: 1; text-align: right;">3</sup> places.</span></p><p></p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-37685063728926389162022-04-11T18:17:00.001+00:002022-04-13T16:05:27.296+00:00Unicode Trivia U+0F33<p><b>Codepoint:</b> U+0F33 "TIBETAN DIGIT HALF ZERO"<br /><b>Block:</b> U+0F00..0FFF "Tibetan"</p><p>The Unicode Character Database has a field named "<a href="https://unicode.org/reports/tr44/#Numeric_Value">Numeric_Value</a>" (abbreviated to "nv"). For the vast majority of the 144,697 used codepoints in Unicode 14.0.0 (in fact, precisely 142,890) this field holds the value "<a href="https://en.wikipedia.org/wiki/NaN">NaN</a>" meaning that the codepoint does not represent a numeric value.</p><p>Other values for "nv", with the number of codepoints having that value in parentheses, are shown below, in approximate order of frequency.</p><p>First, the denary digits. 
The distribution is not flat because of the irregularity of CJK ideographs representing small numbers and the lack of a "zero" digit in some writing systems:</p><p></p><ul style="text-align: left;"><li>"1" (141)</li><li>"2" (140)</li><li>"3" (141)</li><li>"4" (132)</li><li>"5" (130)</li><li>"6" (114)</li><li>"7" (113)</li><li>"8" (109)</li><li>"9" (113)</li><li>"0" (84)</li></ul><p style="text-align: left;">Next, multiples of ten:</p><p></p><p></p><ul style="text-align: left;"><li>"10" (62)</li><li>"20" (36)</li><li>"30" (19)</li><li>"40" (18)</li><li>"50" (29)</li><li>"60" (13)</li><li>"70" (13)</li><li>"80" (12)</li><li>"90" (12)</li></ul><p></p><p>Next, powers of ten. Characters for trillions are used in Japan and Taiwan (U+5146) and in the Pahawh Hmong script (U+16B61):</p><p></p><ul style="text-align: left;"><li>"100" (35)</li><li>"1000" (22)</li><li>"10000" (13)</li><li>"100000" (5)</li><li>"1000000" (1)</li><li>"10000000" (1)</li><li>"100000000" (3)</li><li>"10000000000" (1)</li><li>"1000000000000" (2)</li></ul><p></p><p>Next, sequential values up to twenty:</p><p></p><ul style="text-align: left;"><li>"11" (8)</li><li>"12" (8)</li><li>"13" (6)</li><li>"14" (6)</li><li>"15" (6)</li><li>"16" (7)</li><li>"17" (7)</li><li>"18" (7)</li><li>"19" (7)</li></ul><p></p><p>Next, blocks of circled numbers:</p><p></p><ul style="text-align: left;"><li>"21" (1)</li><li>"22" (1)</li><li>"23" (1)</li><li>"24" (1)</li><li>"25" (1)</li><li>"26" (1)</li><li>"27" (1)</li><li>"28" (1)</li><li>"29" (1)</li></ul><p></p><p></p><ul style="text-align: left;"><li>"31" (1)</li><li>"32" (1)</li><li>"33" (1)</li><li>"34" (1)</li><li>"35" (1)</li><li>"36" (1)</li><li>"37" (1)</li><li>"38" (1)</li><li>"39" (1)</li></ul><p></p><p></p><ul style="text-align: left;"><li>"41" (1)</li><li>"42" (1)</li><li>"43" (1)</li><li>"44" (1)</li><li>"45" (1)</li><li>"46" (1)</li><li>"47" (1)</li><li>"48" (1)</li><li>"49" (1)</li></ul><p></p><p>Next, multiples of 100. 
We can see the importance of 500 in ancient counting systems (e.g. "D" in Roman numerals)</p><p></p><ul style="text-align: left;"><li>"200" (6)</li><li>"300" (7)</li><li>"400" (7)</li><li>"500" (16)</li><li>"600" (7)</li><li>"700" (6)</li><li>"800" (6)</li><li>"900" (7)</li></ul><p></p><p>Next, multiples of 1000:</p><p></p><ul style="text-align: left;"><li>"2000" (5)</li><li>"3000" (4)</li><li>"4000" (4)</li><li>"5000" (8)</li><li>"6000" (4)</li><li>"7000" (4)</li><li>"8000" (4)</li><li>"9000" (4)</li></ul><p></p><p>Next, multiples of 10,000:</p><p></p><ul style="text-align: left;"><li>"20000" (4)</li><li>"30000" (4)</li><li>"40000" (4)</li><li>"50000" (7)</li><li>"60000" (4)</li><li>"70000" (4)</li><li>"80000" (4)</li><li>"90000" (4)</li></ul><p></p><p>Next, multiples of 100,000 (e.g. "<a href="https://en.wikipedia.org/wiki/Lakh">lakh</a>"):</p><p></p><ul style="text-align: left;"><li>"200000" (2)</li><li>"300000" (1)</li><li>"400000" (1)</li><li>"500000" (1)</li><li>"600000" (1)</li><li>"700000" (1)</li><li>"800000" (1)</li><li>"900000" (1)</li></ul><p></p><p>Next, multiples of 10,000,000 (e.g. "<a href="https://en.wikipedia.org/wiki/Crore">crore</a>"):</p><p></p><ul style="text-align: left;"><li>"20000000" (1)</li></ul><p></p><p>Next are two large numbers from cuneiform (<a href="https://en.wikipedia.org/wiki/Sexagesimal">base 60</a>):</p><p></p><ul style="text-align: left;"><li>"216000" (1)</li><li>"432000" (1)</li></ul><p></p><p>Next, we start the rational fractions (e.g. 
"half"):</p><p></p><ul style="text-align: left;"><li>"1/2" (18)</li></ul><p></p><p>Next, the quarters:</p><p></p><ul style="text-align: left;"><li>"1/4" (13)</li><li>"3/4" (8)</li></ul><p></p><p>Next, the eighths:</p><p></p><ul style="text-align: left;"><li>"1/8" (7)</li><li>"3/8" (1)</li><li>"5/8" (1)</li><li>"7/8" (1)</li></ul><p></p><p>Next, the sixteenths:</p><p></p><ul style="text-align: left;"><li>"1/16" (6)</li><li>"3/16" (5)</li></ul><p></p><p>Next, the thirty-seconds:</p><p></p><ul style="text-align: left;"><li>"1/32" (1)</li></ul><p></p><p>Next, the sixty-fourths:</p><p></p><ul style="text-align: left;"><li>"1/64" (1)</li><li>"3/64" (1)</li></ul><p></p><p>Next, the thirds (strangely, there's an Ancient Greek "⅔" U+10177, but not for "⅓"):</p><p></p><ul style="text-align: left;"><li>"1/3" (5)</li><li>"2/3" (6)</li></ul><p></p><p>Next, the fifths:</p><p></p><ul style="text-align: left;"><li>"1/5" (3)</li><li>"2/5" (1)</li><li>"3/5" (1)</li><li>"4/5" (1)</li></ul><p></p><p>Next, the sixths:</p><p></p><ul style="text-align: left;"><li>"1/6" (3)</li><li>"5/6" (2)</li></ul><p></p><p>Next, a seventh:</p><p></p><ul style="text-align: left;"><li>"1/7" (1)</li></ul><p></p><p>Next, a ninth:</p><p></p><ul style="text-align: left;"><li>"1/9" (1)</li></ul><p></p><p>Next, the twelfths (Meroitic cursive fractions, not reduced):</p><p></p><ul style="text-align: left;"><li>"1/12" (1)</li><li>"2/12" (1)</li><li>"3/12" (1)</li><li>"4/12" (1)</li><li>"5/12" (1)</li><li>"6/12" (1)</li><li>"7/12" (1)</li><li>"8/12" (1)</li><li>"9/12" (1)</li><li>"10/12" (1)</li><li>"11/12" (1)</li></ul><p></p><p>Next, a collection of (mostly Tamil and Malayalam) fractions <a href="http://chilliant.blogspot.com/2022/02/unicode-trivia-u0d5a.html">we've seen already</a>:</p><p></p><ul style="text-align: left;"><li>"1/320" (2)</li><li>"1/160" (2)</li><li>"1/80" (1)</li><li>"1/40" (2)</li><li>"3/80" (2)</li><li>"1/20" (2)</li><li>"1/10" (3)</li><li>"3/20" (2)</li></ul><p></p><p>Finally, a collection 
of what can only be described as "strange halves":</p><p></p><ul style="text-align: left;"><li>"3/2" (1)</li><li>"5/2" (1)</li><li>"7/2" (1)</li><li>"9/2" (1)</li><li>"11/2" (1)</li><li>"13/2" (1)</li><li>"15/2" (1)</li><li>"17/2" (1)</li><li>"-1/2" (1)</li></ul><p></p><p style="text-align: left;">These last nine all belong to the "Tibetan / Digits minus Half" group of codepoints (U+0F2A to U+0F33), including the wonderfully perplexing U+0F33 "TIBETAN DIGIT HALF ZERO".</p><p style="text-align: left;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://r12a.github.io/c/Tibetan/large/0F33.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="190" data-original-width="190" src="https://r12a.github.io/c/Tibetan/large/0F33.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://r12a.github.io/uniview/?char=0F33">source</a>]</td></tr></tbody></table></p><p style="text-align: left;">This character supposedly has a numeric value of "-1/2" or "-0.5", and is the only codepoint (so far) with a negative "nv".</p><p style="text-align: left;">As <a href="https://www.babelstone.co.uk/Blog/2007/04/numbers-that-dont-add-up-tibetan-half.html">Andrew West</a> points out, there is much confusion (and little evidence) surrounding the numeric values of these codepoints. The glyphs seem to appear on postage stamps, but if the Royal Mail was in the habit of issuing stamps with a denomination of <b>minus</b> ½p, they would quickly go out of business. 
If you went into a Post Office and asked for one million "-½p" stamps, the teller would be obliged to give you a huge tome of stamps <i>and</i> £5,000.</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com2tag:blogger.com,1999:blog-5181746871086541575.post-71868835550141647292022-02-20T18:19:00.001+00:002022-02-20T18:19:58.167+00:00Unicode Trivia U+0EA5<p><b>Codepoint:</b> U+0EA5 "LAO LETTER LO LOOT"<br /><b>Block:</b> U+0E80..0EFF "Lao"</p><p>The <a href="https://en.wikipedia.org/wiki/Lao_script">Lao script</a> (Akson Lao) is a sister script of the <a href="https://en.wikipedia.org/wiki/Thai_script">Thai script</a>; both derive from the <a href="https://en.wikipedia.org/wiki/Sukhothai_script">Sukhothai script</a> of the thirteenth century CE. As such, they have many similarities. For instance, both Lao and Thai consonants are given individual names. Here are the 27 Lao consonants with their typical names:</p><p></p><ol style="text-align: left;"><li>ກ = chicken (ໄກ່)</li><li>ຂ = egg (ໄຂ່)</li><li>ຄ = water buffalo (ຄວາຍ)</li><li>ງ = ox (ງົວ)</li><li>ຈ = glass (ຈອກ)</li><li>ສ = tiger (ເສືອ)</li><li>ຊ = elephant (ຊ້າງ)</li><li>ຍ = mosquito (ຍຸງ)</li><li>ດ = child (ເດັກ)</li><li>ຕ = eye (ຕາ)</li><li>ຖ = bag (ຖົງ)</li><li>ທ = flag (ທຸງ)</li><li>ນ = bird (ນົກ)</li><li>ບ = goat (ແບ້)</li><li>ປ = fish (ປາ)</li><li>ຜ = bee (ເຜິ້ງ)</li><li>ຝ = rain (ຝົນ)</li><li>ພ = mountain (ພູ)</li><li>ຟ = fire (ໄຟ)</li><li>ມ = cat (ແມວ)</li><li>ຢ = medicine (ຢາ)</li><li>ຣ = car (ຣົຖ)</li><li>ລ = monkey (ລີງ)</li><li>ວ = fan (ວີ)</li><li>ຫ = goose (ຫ່ານ)</li><li>ອ = bowl (ອື່ງ)</li><li>ຮ = house (ເຮືອນ)</li></ol><p></p><p style="text-align: left;">Each consonant's name begins with that consonant in a similar fashion to English alphabet mnemonics such as "A is for apple, B is for banana, etc.", known as <a href="https://en.wikipedia.org/wiki/Acrophony">acrophony</a>:</p><p style="text-align: left;"></p><table align="center" cellpadding="0" cellspacing="0" 
class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjy-5p6hduIzNzBCS8RJMlrTijVDdOg0WH2EjiPz5apQ4-UGrTcCZ0f1Gv5Ge57pVr6TdV0A4QhVxCDLzp3ODkIk2sXE6zE10l3qcOAzdKo_RoZGGyHNusa76FSJRJuGvcqR8JYIMVqpc_3VjhzKD1h4_S5_o9PkVM2EtkCbtZ1xD048oJoGC1DISOm=s780" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="537" data-original-width="780" height="440" src="https://blogger.googleusercontent.com/img/a/AVvXsEjy-5p6hduIzNzBCS8RJMlrTijVDdOg0WH2EjiPz5apQ4-UGrTcCZ0f1Gv5Ge57pVr6TdV0A4QhVxCDLzp3ODkIk2sXE6zE10l3qcOAzdKo_RoZGGyHNusa76FSJRJuGvcqR8JYIMVqpc_3VjhzKD1h4_S5_o9PkVM2EtkCbtZ1xD048oJoGC1DISOm=w640-h440" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://calisphere.org/item/ark:/13030/hb8h4nb5fp/">source</a>]</td></tr></tbody></table><p></p><p style="text-align: left;">Alas, the mapping of these consonants to the appropriate "column" of the Unicode Lao block is complicated by two factors:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>The Unicode encoding is based <i>loosely </i>on <a href="https://en.wikipedia.org/wiki/Thai_Industrial_Standard_620-2533">Thai Industrial Standard 620-2533</a> and has holes where unused characters are omitted.</li><li>The names of four of the consonants were incorrect when they were added to Unicode 1.0.</li></ol><p style="text-align: left;">These complications are discussed in Andrew West's <a href="https://www.unicode.org/wg2/docs/n3137.pdf">N3137</a> notes:</p><blockquote><p><i>The Unicode code charts note that the Lao block is "Based on TIS 620-2529". This statement is misleading as TIS 620-2529 is a Thai standard for representing the Thai script in an 8-bit code, and does not define names or code points for the Lao script. 
The Unicode Lao block is based on a mapping of Lao characters to the equivalent Thai characters in TIS 620, but is not actually based on this standard.</i></p></blockquote><p style="text-align: left;">And:</p><blockquote><p><i>The Unicode names for Lao consonants are based on the syllabic pronunciation of the character (i.e. consonant plus inherent vowel). All consonants belong to one of three tone classes: high, mid and low. Where two letters are only distinguished phonetically by their tone class, the modifiers SUNG "high" and TAM "low" are used to indicate the tone class of the letter (e.g. U+0E82 "LAO LETTER KHO SUNG" and U+0E84 "LAO LETTER KHO TAM"). However, the Unicode names for two of the consonants have the wrong tone class applied to them:</i></p></blockquote><blockquote><p><i>U+0E9D "LAO LETTER FO TAM" is a high tone class letter, and should have been named "LAO LETTER FO SUNG"</i></p></blockquote><blockquote><p><i>U+0E9F "LAO LETTER FO SUNG" is a low tone class letter, and should have been named "LAO LETTER FO TAM"</i></p></blockquote><blockquote><p><i>Whilst the Unicode names for 25 of the 27 consonants use this naming scheme, the names of two of the consonants use mnemonic names (presumably because they share the same vowel and tone class, and so could not otherwise be differentiated). Mnemonic names are how the consonants are normally identified in the Lao language, although there is no official list of standard mnemonic names for consonants, and different sources may use different mnemonic names for some letters.</i></p></blockquote><blockquote><p><i>The two letters whose Unicode names are based on mnemonic names are:</i></p></blockquote><blockquote><p><i>U+0EA3 "LAO LETTER LO LING"</i></p></blockquote><blockquote><p><i>U+0EA5 "LAO LETTER LO LOOT"</i></p></blockquote><blockquote><p><i>The mnemonic names for these two letters are the wrong way round. 
U+0EA5 is the normal letter [l] and is universally identified by the mnemonic name lo ling "lo as in ling [monkey]". On the other hand, U+0EA3 is a letter that is used to represent [r] in foreign words; however this letter has been officially deprecated by the Lao government since 1975, and is no longer in common use. The name element LO LOOT applied to U+0EA5 would seem to represent the mnemonic ro rot, "rot" meaning automobile, that should be applied to U+0EA3.</i></p></blockquote><p style="text-align: left;"></p><p style="text-align: left;">So U+0EA3 should be named "LAO LETTER RO ROT" (car) and U+0EA5 should be named "LAO LETTER LO LING" (monkey).</p><p style="text-align: left;">It is interesting that the Unicode standard has effectively "nailed down" the names of the consonants even though Andrew West says there is no official standard.</p><p style="text-align: left;">It has always troubled me that English does not have a satisfactory mechanism for naming its letters. These are the names typically used in British English:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>a</li><li>bee</li><li>cee</li><li>dee</li><li>e</li><li>eff</li><li>gee</li><li>aitch</li><li>i</li><li>jay</li><li>kay</li><li>el</li><li>em</li><li>en</li><li>o</li><li>pee</li><li>cue</li><li>ar</li><li>ess</li><li>tee</li><li>u</li><li>vee</li><li>double-u</li><li>ex</li><li>wye</li><li>zed</li></ol><p></p><p style="text-align: left;">If we ignore "double-u" (which we've met <a href="https://chilliant.blogspot.com/2022/01/unicode-trivia-u01bf.html">before</a>), the obvious elephant in the room is "cue" for "Q". 
Not only is it not acrophonic (only 15 of the 26 truly are), "Q" doesn't appear <i>anywhere</i> in its name.</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-8064163205155947832022-02-14T19:59:00.024+00:002022-02-14T19:59:00.146+00:00Unicode Trivia U+0E74<p><b>Codepoint:</b> U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI"<br /><b>Block:</b> U+0E00..0E7F "Thai"</p><p>Huh? According to Unicode's own <a href="https://util.unicode.org/UnicodeJsps/character.jsp?a=0E74">lookup utility</a>, U+0E74 is an unassigned codepoint. But that wasn't always the case. Back in Unicode 1.0.0 (October 1991) it was U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI":</p><p style="text-align: left;"></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiOemIMMHO-tc1Ec8OmHmHujO-gScsU7mbOW_vAvI7F_MMGHiyIAzyWqhpq_EBnLfib9SDdmZYnKHzPrZMspNKrmo-6I2RN-pksa9Pkaji6RMH-Ebm_UM36qx2bVbYK7MRVRt4BxuhwbCtD1J01z2OuW3AryJEi4zDM1-iHkLBvYBjCFWs_JhDhFeS-" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="803" data-original-width="624" height="640" src="https://blogger.googleusercontent.com/img/a/AVvXsEiOemIMMHO-tc1Ec8OmHmHujO-gScsU7mbOW_vAvI7F_MMGHiyIAzyWqhpq_EBnLfib9SDdmZYnKHzPrZMspNKrmo-6I2RN-pksa9Pkaji6RMH-Ebm_UM36qx2bVbYK7MRVRt4BxuhwbCtD1J01z2OuW3AryJEi4zDM1-iHkLBvYBjCFWs_JhDhFeS-=w499-h640" width="499" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2.pdf">source</a>]</td></tr></tbody></table><p></p><div><p style="text-align: left;">Alas, codepoints U+0E70 to U+0E74 only lasted until Unicode 1.0.1 (June 1992) when they were <a href="https://www.unicode.org/versions/Unicode1.0.0/Notice.pdf">deleted</a>. 
This was the first time a non-zero patch version (i.e. "<i>major.minor.patch</i>" where <i>patch</i> ≠ 0) of Unicode was officially released; a handful more followed, such as 3.0.1 and 4.0.1. The <a href="https://www.unicode.org/policies/stability_policy.html#Encoding">stability policy</a> means that, whatever the version number, the removal of codepoints is now impossible:</p></div><p></p><blockquote><p><b style="text-decoration-line: underline;">Encoding Stability</b><i> (since Unicode 2.0)</i></p></blockquote><div><blockquote><blockquote></blockquote><p><b>Once a character is encoded, it will not be moved or removed.</b> </p></blockquote><blockquote><p><i>This policy ensures that implementers can always depend on each version of the Unicode Standard being a superset of the previous version. The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.</i></p></blockquote></div><div><blockquote></blockquote><p style="text-align: left;">So why were U+0E74 "THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI" and its siblings removed? According to the Notice, it was to bring the Unicode and ISO 10646 standards back in line; U+0E74 was never added to ISO 10646. According to the <a href="https://www.unicode.org/L2/L1992-UTC/UTC52-draft.txt">minutes</a> of a May 1992 meeting of the Unicode Technical Committee:</p><blockquote><p><i>The UTC has noticed the requirement to remove 5 THAI characters (U+0E70 - U+0E74) and 5 LAO characters (U+0EF0 - U+0EF4). 
In the interest of the merger between ISO 10646 and Unicode the UTC authorizes its representatives attending the SC2/WG2 meeting in Korea to be flexible on this subject.</i></p></blockquote><p style="text-align: left;">The juxtaposition of "<i>authorizes</i>" and "<i>flexible</i>" made me smile.</p><p style="text-align: left;">It appears that Thai Phonetic Order Vowel Signs were <a href="https://www.mail-archive.com/unicode@unicode.org/msg40432.html">redundant and could cause ambiguity</a>:</p><blockquote><p><i>Nowadays, the Thai syllable ไตร, normatively pronounced /</i>trai<i>/, is only encoded <U+0E44 THAI CHARACTER SARA AI MAIMALAI, U+0E15 THAI CHARACTER TO TAO, U+0E23 THAI CHARACTER RO RUA>, and the character U+0E3A is always visible when used; for most routine purposes it is little different to U+0E38 THAI CHARACTER SARA U. However, in Unicode 1.0</i>[.0]<i>, while <U+0E44, U+0E15, U+0E23> was rendered as at present, the same visible string could also be encoded as <U+0E15, U+0E3A, U+0E23, U+0E74 THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI> - no glyph would be rendered for U+0E3A.</i></p></blockquote><p style="text-align: left;">I think that's implying that the sequence <... U+0E74> could just as easily be encoded as <U+0E44 ...>. 
The original glyph charts suggest that too:</p></div><div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhT1xQHTPEZKMZ_3uTsN5AlHI_iuR56E3Hw8CczbrjRNrio2-i-xDOV-hI3unwquwLrJoC21SoKrwi74JDRhb4TwK6au9V1fYQrnMzCba8yg_aTqHdGayGYmLBzvc0VrA7ZpJ_QYWcybYjKXHsRio6NcNTxh9X2BoviTO0fMsmckHker8N4Lrpis1Bb" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="343" data-original-width="622" height="352" src="https://blogger.googleusercontent.com/img/a/AVvXsEhT1xQHTPEZKMZ_3uTsN5AlHI_iuR56E3Hw8CczbrjRNrio2-i-xDOV-hI3unwquwLrJoC21SoKrwi74JDRhb4TwK6au9V1fYQrnMzCba8yg_aTqHdGayGYmLBzvc0VrA7ZpJ_QYWcybYjKXHsRio6NcNTxh9X2BoviTO0fMsmckHker8N4Lrpis1Bb=w640-h352" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2.pdf">source</a>]</td></tr></tbody></table><br /><p style="text-align: left;">Of course, if someone legitimately used U+0E74 in a document between October 1991 and June 1992, their document would become officially invalid or corrupt after June 1992.</p></div>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-77797746316163507032022-02-11T11:00:00.001+00:002022-02-11T11:00:04.367+00:00Unicode Trivia U+0DA5<p><b>Codepoint:</b> U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA"<br /><b>Block:</b> U+0D80..0DFF "Sinhala"</p><p>As Richard Gillam says in "<a href="https://www.amazon.co.uk/dp/0201700522">Unicode Demystified</a>" (2003), page 330:</p><blockquote><p><i>The Unicode Sinhala block runs from U+0D80 to U+0DFF. 
It does not follow the ISCII order, partly because the ISCII standard doesn't include a code page for Sinhala and partly because Sinhala includes a lot of sounds (and, thus, letters) that aren't present in any of the Indian scripts. The basic setup of the block is the same: anusvara and visarga first, followed by independent vowels, consonants, dependent vowels, and punctuation. Unlike in the ISCII-derived blocks, the al-lakuna (virama) precedes the dependent vowels, rather than following them.</i></p></blockquote><p>The order of codepoints (or of text made up of codepoints) can be thought of in at least three ways:</p><p></p><ol style="text-align: left;"><li>The order of codepoints within the character set, e.g. Unicode ("codepoint order")</li><li>The order of letters in an 'alphabet', e.g. Sinhala abugida ("alphabet order")</li><li>The typical order of words in a language's dictionary ("collation order")</li></ol><p style="text-align: left;">As an example, we'll consider the letters (and only the standalone letters) from the Sinhala block (U+0D80..0DFF).</p><p style="text-align: left;">In <a href="https://www.unicode.org/charts/PDF/U0D80.pdf">codepoint order</a>, these are:</p><p style="text-align: left;"></p><div></div><p></p><div><ul style="text-align: left;"><li>18 independent vowels (U+0D85..0D96)</li><li>41 consonants (U+0D9A..0DC6)</li></ul></div><p style="text-align: left;">The alphabet order (according to sites such as <a href="https://omniglot.com/writing/sinhala.htm">Omniglot</a>) is the same as the codepoint order. 
This was presumably a factor in the ordering of the codepoints when the block was added to Unicode 3.0 in 1999.</p><p style="text-align: left;">However, in "collation order" these 59 letters (along with their Sinhalese and Romanized phonetic names) are:</p><p style="text-align: left;"></p><div><ol style="text-align: left;"><li>U+0D85 = "අ" = AYANNA = vowel a</li><li>U+0D86 = "ආ" = AAYANNA = vowel aa</li><li>U+0D87 = "ඇ" = AEYANNA = vowel ae</li><li>U+0D88 = "ඈ" = AEEYANNA = vowel aae</li><li>U+0D89 = "ඉ" = IYANNA = vowel i</li><li>U+0D8A = "ඊ" = IIYANNA = vowel ii</li><li>U+0D8B = "උ" = UYANNA = vowel u</li><li>U+0D8C = "ඌ" = UUYANNA = vowel uu</li><li>U+0D8D = "ඍ" = IRUYANNA = vowel vocalic r</li><li>U+0D8E = "ඎ" = IRUUYANNA = vowel vocalic rr</li><li>U+0D8F = "ඏ" = ILUYANNA = vowel vocalic l</li><li>U+0D90 = "ඐ" = ILUUYANNA = vowel vocalic ll</li><li>U+0D91 = "එ" = EYANNA = vowel e</li><li>U+0D92 = "ඒ" = EEYANNA = vowel ee</li><li>U+0D93 = "ඓ" = AIYANNA = vowel ai</li><li>U+0D94 = "ඔ" = OYANNA = vowel o</li><li>U+0D95 = "ඕ" = OOYANNA = vowel oo</li><li>U+0D96 = "ඖ" = AUYANNA = vowel au</li><li>U+0D9A = "ක" = ALPAPRAANA KAYANNA = consonant ka</li><li>U+0D9B = "ඛ" = MAHAAPRAANA KAYANNA = consonant kha</li><li>U+0D9C = "ග" = ALPAPRAANA GAYANNA = consonant ga</li><li>U+0D9D = "ඝ" = MAHAAPRAANA GAYANNA = consonant gha</li><li>U+0D9E = "ඞ" = KANTAJA NAASIKYAYA = consonant nga</li><li>U+0D9F = "ඟ" = SANYAKA GAYANNA = consonant nnga</li><li>U+0DA0 = "ච" = ALPAPRAANA CAYANNA = consonant ca</li><li>U+0DA1 = "ඡ" = MAHAAPRAANA CAYANNA = consonant cha</li><li>U+0DA2 = "ජ" = ALPAPRAANA JAYANNA = consonant ja</li><li>U+0DA5 = "ඥ" = TAALUJA SANYOOGA NAAKSIKYAYA = consonant jnya</li><li>U+0DA3 = "ඣ" = MAHAAPRAANA JAYANNA = consonant jha</li><li>U+0DA4 = "ඤ" = TAALUJA NAASIKYAYA = consonant nya</li><li>U+0DA6 = "ඦ" = SANYAKA JAYANNA = consonant nyja</li><li>U+0DA7 = "ට" = ALPAPRAANA TTAYANNA = consonant tta</li><li>U+0DA8 = "ඨ" = MAHAAPRAANA TTAYANNA = consonant 
ttha</li><li>U+0DA9 = "ඩ" = ALPAPRAANA DDAYANNA = consonant dda</li><li>U+0DAA = "ඪ" = MAHAAPRAANA DDAYANNA = consonant ddha</li><li>U+0DAB = "ණ" = MUURDHAJA NAYANNA = consonant nna</li><li>U+0DAC = "ඬ" = SANYAKA DDAYANNA = consonant nndda</li><li>U+0DAD = "ත" = ALPAPRAANA TAYANNA = consonant ta</li><li>U+0DAE = "ථ" = MAHAAPRAANA TAYANNA = consonant tha</li><li>U+0DAF = "ද" = ALPAPRAANA DAYANNA = consonant da</li><li>U+0DB0 = "ධ" = MAHAAPRAANA DAYANNA = consonant dha</li><li>U+0DB1 = "න" = DANTAJA NAYANNA = consonant na</li><li>U+0DB3 = "ඳ" = SANYAKA DAYANNA = consonant nda</li><li>U+0DB4 = "ප" = ALPAPRAANA PAYANNA = consonant pa</li><li>U+0DB5 = "ඵ" = MAHAAPRAANA PAYANNA = consonant pha</li><li>U+0DB6 = "බ" = ALPAPRAANA BAYANNA = consonant ba</li><li>U+0DB7 = "භ" = MAHAAPRAANA BAYANNA = consonant bha</li><li>U+0DB8 = "ම" = MAYANNA = consonant ma</li><li>U+0DB9 = "ඹ" = AMBA BAYANNA = consonant mba</li><li>U+0DBA = "ය" = YAYANNA = consonant ya</li><li>U+0DBB = "ර" = RAYANNA = consonant ra</li><li>U+0DBD = "ල" = DANTAJA LAYANNA = consonant la</li><li>U+0DC0 = "ව" = VAYANNA = consonant va</li><li>U+0DC1 = "ශ" = TAALUJA SAYANNA = consonant sha</li><li>U+0DC2 = "ෂ" = MUURDHAJA SAYANNA = consonant ssa</li><li>U+0DC3 = "ස" = DANTAJA SAYANNA = consonant sa</li><li>U+0DC4 = "හ" = HAYANNA = consonant ha</li><li>U+0DC5 = "ළ" = MUURDHAJA LAYANNA = consonant lla</li><li>U+0DC6 = "ෆ" = FAYANNA = consonant fa</li></ol></div><div>Spot the anomaly? 
Well, U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is out of order.</div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://r12a.github.io/c/Sinhala/large/0DA5.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="190" data-original-width="190" height="190" src="https://r12a.github.io/c/Sinhala/large/0DA5.png" width="190" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">U+0DA5 from <a href="https://r12a.github.io/uniview/?char=0DA5">r12a</a></td></tr></tbody></table><p style="text-align: left;">As an English speaker, I find that the codepoint order, alphabet order and collation order of the letters "A" to "Z" are identical, so subtle anomalies like this feel jarring. So jarring, in fact, that I checked it against three different sources (<a href="https://www.unicode.org/cldr/cldr-aux/charts/37/collation/si.html">Unicode CLDR</a>, <a href="https://bugs.mysql.com/bug.php?id=26474">MySQL</a> and <a href="http://www.dictionary.gov.lk/index.php?option=com_content&view=article&id=7&Itemid=123&lang=en#what-is-the-%E2%80%98sinhala-alphabet%E2%80%99-followed-by-the-sinhala-dictionary">dictionary.gov.lk</a>) to make sure I hadn't made a transcription error.</p><p style="text-align: left;">It's a bit like having the English alphabet "ABCDEFGHIJKLMNOPQRSTUVWXYZ" but listing words in an English dictionary in a different order, such as "ABCDEFGHIJKL<b>P</b>MNOQRSTUVWXYZ".</p><p style="text-align: left;">You only really need to nail down the order of letters of a writing system when you start creating reference dictionaries. However, as the <a href="http://www.dictionary.gov.lk/index.php?option=com_content&view=article&id=2&Itemid=104&lang=en">Sinhala Dictionary Compilation Institute</a> says, this didn't happen until British colonial rule of what became Sri Lanka. 
It's impossible to imagine that the British compilers didn't impose some of their preconceptions on the process and therefore muddied the ordering waters.</p><p style="text-align: left;">As Richard Gillam pointed out, Sinhala has a large number of letters and U+0DA5 "SINHALA LETTER TAALUJA SANYOOGA NAAKSIKYAYA" is one of those that doesn't fit into the canonical Brahmic consonant ordering utilised by ISCII.</p><p style="text-align: left;">A <a href="https://docplayer.net/62010543-The-sinhala-collation-sequence-and-its-representation-in-unicode.html">survey</a> by Weerasinghe, Herath and Gamage (2006) supplies many definitions of Sinhalese "dictionary order" in current use. Indeed, even if Unicode CLDR collation is adopted as a single <i>de facto </i>standard, the collation tailoring metadata is considered "<a href="https://github.com/unicode-org/cldr/commits/main/common/collation/si.xml">live</a>", and therefore liable to change anyway.</p><p style="text-align: left;"><br /></p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-27295355907964244332022-02-07T15:48:00.263+00:002022-02-07T15:48:00.156+00:00Unicode Trivia U+0D5A<p><b>Codepoint:</b> U+0D5A "MALAYALAM FRACTION THREE EIGHTIETHS"<br /><b>Block:</b> U+0D00..0D7F "Malayalam"</p><p><a href="https://omniglot.com/writing/malayalam.htm">Malayalam script</a> is the main method of writing the Malayalam language of South West India, spoken by about forty million people. It is a Brahmic script, imported into Unicode 1.0 along with the other scripts covered by ISCII 1991.</p><p>Although "Malayalam" is a palindrome, I want to talk about their fractions. I like Unicode fractions. Have you noticed?</p><p>The <a href="https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2.pdf">original Unicode 1.0 block</a> didn't have any Malayalam-specific fraction codepoints. 
These were added in Unicode 6.0 (proposal <a href="https://www.unicode.org/wg2/docs/n2970.pdf">N2970</a> by V. S. Umamaheswaran 2005-08-23) and Unicode 9.0 (proposal <a href="https://www.unicode.org/wg2/docs/n4429.pdf">N4429</a> by Shriramana Sharma 2013-04-25). The latter proposal, N4429, gives details of the old system of Malayalam fractions used before decimalisation:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhQmCXiEzKk6UD87DZiPGyP47tbAgDXPM1-LBkvJz01ary-Id3GzF7LI12ZfaSWUwfrlHuD4w3SV1SclJ9uyr055-hgQskVMavOpoEf_kZiJRDRkdssMOBkKKcLRL8-C9HOLXzR4Fsi5zzDky1KvPcHs9xRS1gsHCHW0Lwl2A__wriC8PlK3r4lKmsv=s531" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="531" data-original-width="384" src="https://blogger.googleusercontent.com/img/a/AVvXsEhQmCXiEzKk6UD87DZiPGyP47tbAgDXPM1-LBkvJz01ary-Id3GzF7LI12ZfaSWUwfrlHuD4w3SV1SclJ9uyr055-hgQskVMavOpoEf_kZiJRDRkdssMOBkKKcLRL8-C9HOLXzR4Fsi5zzDky1KvPcHs9xRS1gsHCHW0Lwl2A__wriC8PlK3r4lKmsv=s16000" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Malayalam Fraction Multiplication Examples<br /><i>Prācīna Gaṇitaṃ Malayāḷattil, Prof C K Moosathu, Kerala State Institute of Language, 1980</i></td></tr></tbody></table><p style="text-align: left;">It appears to be similar to Tamil fractions, using 320 as a common denominator:</p><p></p><ul style="text-align: left;"><li>1/320 (one three-hundred-and-twentieth) = U+0D2A U+0D4D U+0D24 = "പ്ത" (muntiri)</li><li>2/320 (one one-hundred-and-sixtieth) = U+0D58 = "൘" (arakkāṇi)</li><li>4/320 (one eightieth) = U+0D2E = "മ" (kāṇi)</li><li>8/320 (one fortieth) = U+0D59 = "൙" (aramā)</li><li>12/320 (three eightieths) = U+0D5A = "൚" (mūnnukāṇi)</li><li>16/320 (one twentieth) = U+0D5B = "൛" (orumā)</li><li>20/320 (one sixteenth) = U+0D76 = "൶" 
(mākāṇi)</li><li>32/320 (one tenth) = U+0D5C = "൜" (raṇṭumā)</li><li>40/320 (one eighth) = U+0D77 = "൷" (arakkāl)</li><li>48/320 (three twentieths) = U+0D5D = "൝" (mūnnumā)</li><li>60/320 (three sixteenths) = U+0D78 = "൸" (muṇṭāṇi)</li><li>64/320 (one fifth) = U+0D5E = "൞" (nālŭmā)</li><li>80/320 (one quarter) = U+0D73 = "൳" (kāl)</li><li>160/320 (one half) = U+0D74 = "൴" (ara)</li><li>240/320 (three quarters) = U+0D75 = "൵" (mukkāl)</li></ul><p></p><p style="text-align: left;">Note that there's no single codepoint for "1/320" (the sequence U+0D2A U+0D4D U+0D24 achieves the required glyph) and "4/320" shares a glyph with U+0D2E "MALAYALAM LETTER MA". Other than these two exceptions, the names of the fraction codepoints are as expected:</p><ul style="text-align: left;"><li>U+0D58 "MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH"</li><li>U+0D59 "MALAYALAM FRACTION ONE FORTIETH"</li><li>U+0D5A "MALAYALAM FRACTION THREE EIGHTIETHS"</li><li>U+0D5B "MALAYALAM FRACTION ONE TWENTIETH"</li><li>U+0D5C "MALAYALAM FRACTION ONE TENTH"<br /></li><li>U+0D5D "MALAYALAM FRACTION THREE TWENTIETHS"</li><li>U+0D5E "MALAYALAM FRACTION ONE FIFTH"</li><li>U+0D73 "MALAYALAM FRACTION ONE QUARTER"</li><li>U+0D74 "MALAYALAM FRACTION ONE HALF"</li><li>U+0D75 "MALAYALAM FRACTION THREE QUARTERS"</li><li>U+0D76 "MALAYALAM FRACTION ONE SIXTEENTH"</li><li>U+0D77 "MALAYALAM FRACTION ONE EIGHTH"</li><li>U+0D78 "MALAYALAM FRACTION THREE SIXTEENTHS"</li></ul><p style="text-align: left;">All fractions with a denominator of 320 can easily be represented by adding together parts. 
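</p><p style="text-align: left;">As a rough illustration of "adding together parts", here is a minimal Python sketch. It assumes a greedy, largest-part-first rule, which is only my guess at how the parts were chosen, and the names <code>PARTS</code>, <code>decompose</code> and <code>render</code> are made up for this example. The codepoints are the ones listed above.</p>

```python
# Parts of the Malayalam fraction system, as numerators over 320,
# mapped to the codepoint sequences listed in this post.
PARTS = {
    240: "\u0D75",  # three quarters
    160: "\u0D74",  # one half
    80: "\u0D73",   # one quarter
    64: "\u0D5E",   # one fifth
    60: "\u0D78",   # three sixteenths
    48: "\u0D5D",   # three twentieths
    40: "\u0D77",   # one eighth
    32: "\u0D5C",   # one tenth
    20: "\u0D76",   # one sixteenth
    16: "\u0D5B",   # one twentieth
    12: "\u0D5A",   # three eightieths
    8: "\u0D59",    # one fortieth
    4: "\u0D2E",    # one eightieth (shares a glyph with MALAYALAM LETTER MA)
    2: "\u0D58",    # one one-hundred-and-sixtieth
    1: "\u0D2A\u0D4D\u0D24",  # one three-hundred-and-twentieth (a sequence)
}


def decompose(numerator):
    """Split numerator/320 into parts, largest first (an assumed greedy rule)."""
    if not 0 < numerator < 320:
        raise ValueError("expected a numerator between 1 and 319")
    chosen = []
    for part in sorted(PARTS, reverse=True):
        if part <= numerator:
            chosen.append(part)
            numerator -= part
    return chosen


def render(numerator):
    """Concatenate the codepoints for each chosen part."""
    return "".join(PARTS[part] for part in decompose(numerator))


print(decompose(60))  # [60]
print(decompose(15))  # [12, 2, 1]
print([f"U+{ord(c):04X}" for c in render(15)])
```

<p style="text-align: left;">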
In the image above, the third answer is:</p><p style="text-align: left;"></p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;">3/16 = 60/320 = "൸" (U+0D78)</p></blockquote><p>The sixth answer is:</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;">3/64 = (12 + 2 + 1)/320 = "൚൘പ്ത" (U+0D5A U+0D58 U+0D2A U+0D4D U+0D24)</p></blockquote><p></p><p style="text-align: left;">There's the possibility of ambiguity here because we could construct 3/64 using (12+2+1)/320 or (8+4+2+1)/320, but I assume you always pick the biggest part you can at every step.</p><p style="text-align: left;">Many Indic systems use base-4 for fractions, so the choice of 320 as the common denominator seems peculiar to me. If it <i>was</i> based on a currency or metric with subdivisions of 320, I haven't been able to find any references to that. I suppose base-320 has an advantage over <a href="https://en.wikipedia.org/wiki/Dyadic_rational">dyadic fractions</a> in that it is divisible by 5, 10, etc. But why not choose base-60 like the Sumerians, or even base-360?</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>Factors of 320: 1, 2, 4, 5, 8, 10, <b>16</b>, 20, <b>32</b>, 40, <b>64</b>, <b>80</b>, <b>160</b>, <b>320</b> (14)</li><li>Factors of 360: 1, 2, <b>3</b>, 4, 5, <b>6</b>, 8, <b>9</b>, 10, <b>12</b>, <b>15</b>, <b>18</b>, 20, <b>24</b>, <b>30</b>, <b>36</b>, 40, <b>45</b>, <b>60</b>, <b>72</b>, <b>90</b>, <b>120</b>, <b>180</b>, <b>360</b> (24)</li></ul><p style="text-align: left;">I assume that the Malayali preferred powers of two over the convenience of dividing into thirds and the like. 
But the choice of 320 means that the sequence breaks down if you double in size from the smallest:<br /></p><div style="text-align: left;"><ul><li>1/320 (one three-hundred-and-twentieth) = U+0D2A U+0D4D U+0D24 = "പ്ത" (muntiri)</li><li>2/320 (one one-hundred-and-sixtieth) = U+0D58 = "൘" (arakkāṇi)</li><li>4/320 (one eightieth) = U+0D2E = "മ" (kāṇi)</li><li>8/320 (one fortieth) = U+0D59 = "൙" (aramā)</li><li>16/320 (one twentieth) = U+0D5B = "൛" (orumā)</li><li>32/320 (one tenth) = U+0D5C = "൜" (raṇṭumā)</li><li>64/320 (one fifth) = U+0D5E = "൞" (nālŭmā)</li><li><i>What now?</i></li></ul><p style="text-align: left;">Or if you halve in size:</p><ul><li>160/320 (one half) = U+0D74 = "൴" (ara)</li><li>80/320 (one quarter) = U+0D73 = "൳" (kāl)</li><li>40/320 (one eighth) = U+0D77 = "൷" (arakkāl)</li><li>20/320 (one sixteenth) = U+0D76 = "൶" (mākāṇi)</li><li>10/320 (one thirty-second) = <i>missing</i></li><li>5/320 (one sixty-fourth) = <i>missing</i></li><li><i>What now?</i></li></ul><div>Or if you quarter in size (base-4 fractions):</div><ul><li>80/320 (one quarter) = U+0D73 = "൳" (kāl)</li><li>160/320 (one half) = U+0D74 = "൴" (ara)</li><li>240/320 (three quarters) = U+0D75 = "൵" (mukkāl)<br /><br /></li><li>20/320 (one sixteenth) = U+0D76 = "൶" (mākāṇi)</li><li>40/320 (one eighth) = U+0D77 = "൷" (arakkāl)</li><li>60/320 (three sixteenths) = U+0D78 = "൸" (muṇṭāṇi)<br /><br /></li><li>5/320 (one sixty-fourth) = <i>missing</i></li><li>10/320 (one thirty-second) = <i>missing</i></li><li>15/320 (three sixty-fourths) = <i>missing</i><br /><br /></li><li><i>What now?</i></li></ul><p style="text-align: left;">The remaining fractions (if you remove those found in the three schemes immediately above) are:</p><ul><li>12/320 (three eightieths) = U+0D5A = "൚" (mūnnukāṇi)</li><li>48/320 (three twentieths) = U+0D5D = "൝" (mūnnumā)</li></ul><p style="text-align: left;">These suggest division into fifths and then quartering thereafter:</p><ul><li>64/320 (one fifth) = U+0D5E = "൞" 
(nālŭmā)<br /><br /></li><li>16/320 (one twentieth) = U+0D5B = "൛" (orumā)</li><li>32/320 (one tenth) = U+0D5C = "൜" (raṇṭumā)</li><li>48/320 (three twentieths) = U+0D5D = "൝" (mūnnumā)<br /><br /></li><li>4/320 (one eightieth) = U+0D2E = "മ" (kāṇi)</li><li>8/320 (one fortieth) = U+0D59 = "൙" (aramā)</li><li>12/320 (three eightieths) = U+0D5A = "൚" (mūnnukāṇi)<br /><br /></li><li>1/320 (one three-hundred-and-twentieth) = U+0D2A U+0D4D U+0D24 = "പ്ത" (muntiri)</li><li>2/320 (one one-hundred-and-sixtieth) = U+0D58 = "൘" (arakkāṇi)</li><li>3/320 (three three-hundred-and-twentieths) = <i>missing</i></li></ul><div>That's the only way I can think that they'd bother to have a symbol that became U+0D5A "MALAYALAM FRACTION THREE EIGHTIETHS"</div></div>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-46900882712393285362022-02-06T10:45:00.187+00:002022-02-06T10:45:00.165+00:00Unicode Trivia U+0CDE<p><b>Codepoint:</b> U+0CDE "KANNADA LETTER FA"<br /><b>Block:</b> U+0C80..0CFF "Kannada"</p><p>The <a href="https://scriptsource.org/cms/scripts/page.php?item_id=script_detail&key=Knda">Kannada script</a> is one of those added in Unicode 1.0 as part of the importing of the <a href="https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange">ISCII</a> character sets in 1991. 
The <a href="http://varamozhi.sourceforge.net/iscii91.pdf">1991 ISCII Standard</a> encoded ten Indic character sets:</p><p></p><ol style="text-align: left;"><li>Devanagari (DEV/57002)</li><li>Bengali (BNG/57003)</li><li>Tamil (TML/57004)</li><li>Telugu (TLG/57005)</li><li>Assamese (ASM/57006)</li><li>Oriya (ORI/57007)</li><li>Kannada (KND/57008)</li><li>Malayalam (MLM/57009)</li><li>Gujarati (GJR/57010)</li><li>Punjabi (PNJ/57011)</li></ol><p style="text-align: left;">As part of the importation process:</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>"Bengali" and "Assamese" were folded into a single "Bengali/Assamese" script known in Unicode data tables simply as "Bengali"</li><li>"Punjabi" was renamed "Gurmukhi" (the former is a language, the latter is a script)</li><li>"Oriya" was <i>not</i> renamed "Odia" (as this didn't happen until <a href="https://en.wikipedia.org/wiki/The_Orissa_(Alteration_of_Name)_Act,_2011">November 2011</a>)</li></ul><p style="text-align: left;">The nine remaining scripts were mapped to the 128-codepoint blocks we see in Unicode today:</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>Devanagari [U+0900..097F]</li><li>Bengali [U+0980..09FF]</li><li>Gurmukhi [U+0A00..0A7F]</li><li>Gujarati [U+0A80..0AFF]</li><li>Oriya [U+0B00..0B7F]</li><li>Tamil [U+0B80..0BFF]</li><li>Telugu [U+0C00..0C7F]</li><li>Kannada [U+0C80..0CFF]</li><li>Malayalam [U+0D00..0D7F]</li></ul><div>Richard Ishida has an <a href="https://r12a.github.io/scripts/indic-overview/#implementation">excellent page</a> describing these scripts and the importation process; but here's a summary table I put together of the codepoints (with hexadecimal offsets within the blocks) that are <i>purposefully</i> aligned in each script:</div><p></p><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/a/AVvXsEh3pV3NKjdjV3css4JYCwCqvZfZlXAoMwKDwVhmlQrJpDafLUODzy320T9GcSeKuDVDbJNeexmaACVmADPkjFw3-Jc6J_4qXFjiSxo6bQnlAyzbO0udjU9s2U1xqIxN8xWF7lyZFzeb_h-UfA4cePQmoc9IXmAlGwtj90UByEe_tu7e1auWUs3RsuBT=s3150" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="3150" data-original-width="680" src="https://blogger.googleusercontent.com/img/a/AVvXsEh3pV3NKjdjV3css4JYCwCqvZfZlXAoMwKDwVhmlQrJpDafLUODzy320T9GcSeKuDVDbJNeexmaACVmADPkjFw3-Jc6J_4qXFjiSxo6bQnlAyzbO0udjU9s2U1xqIxN8xWF7lyZFzeb_h-UfA4cePQmoc9IXmAlGwtj90UByEe_tu7e1auWUs3RsuBT=s16000" /></a></div><p>The alignment was originally designed to facilitate trivial transcription, but this was never truly practical.</p><p>We can see that the Tamil column has quite a few missing (grey) codepoints; Tamil has fewer isolated letters in its "alphabet" than other Brahmic scripts. This is partly because it does not have distinct letters for aspirated consonants.</p><p>There are obviously gaps in the rows of the chart above, which give space for script-specific codepoints. 
So, for Kannada, there are extra codepoints:</p><p></p><ul style="text-align: left;"><li>U+0C80 "KANNADA SIGN SPACING CANDRABINDU" — a non-combining Candrabindu</li><li>U+0C84 "KANNADA SIGN SIDDHAM" — used at the beginning of texts as an invocation</li><li>U+0CBC "KANNADA SIGN NUKTA" — used to represent sounds not present in Kannada</li><li>U+0CD5 "KANNADA LENGTH MARK" — used to extend vowel sounds</li><li>U+0CD6 "KANNADA AI LENGTH MARK" — used to extend AI vowel sounds</li><li>U+0CDD "KANNADA LETTER NAKAARA POLLU" — a vowel-less form of NA</li><li>U+0CDE "KANNADA LETTER FA"</li></ul><p></p><p style="text-align: left;">U+0CDE "KANNADA LETTER FA" was added in Unicode 1.0:</p><p style="text-align: left;"></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgkmaX7bFAUAGaQ2qA-Ea9n9b96H5T17xZ0y7UjY4bsmfgLhfkG7bKv8vTU8pvCt_Bb8XhQyH_SsU2ywWRfmxCcdGx93YeynubZZ0YqPXkH4Q9jH3PBBnZEMkhU1FaShlLeH10jA49qJVjZ1co4oS5l3EbkjyW9uxhjyf_HnkTgt4tXNfffoRDbW9cW=s713" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="367" data-original-width="713" height="330" src="https://blogger.googleusercontent.com/img/a/AVvXsEgkmaX7bFAUAGaQ2qA-Ea9n9b96H5T17xZ0y7UjY4bsmfgLhfkG7bKv8vTU8pvCt_Bb8XhQyH_SsU2ywWRfmxCcdGx93YeynubZZ0YqPXkH4Q9jH3PBBnZEMkhU1FaShlLeH10jA49qJVjZ1co4oS5l3EbkjyW9uxhjyf_HnkTgt4tXNfffoRDbW9cW=w640-h330" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><a href="https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2.pdf">Unicode 1.0 Code Chart</a></p></td></tr></tbody></table><p></p><p style="text-align: left;">But there is no letter FA for Kannada mentioned in <a href="http://varamozhi.sourceforge.net/iscii91.pdf">ISCII 1991</a>. Indeed, there is no letter FA in Kannada <i>full stop</i>. 
As Richard Ishida explains:</p><blockquote><p style="text-align: left;"><i>The Kannada character U+0CDE KANNADA LETTER FA </i>"ೞ"<i> was incorrectly named. A more appropriate name would be LLLA, rather than FA. Because of the rules for Unicode naming, the current name cannot, however, be changed. Fortunately this letter has not been actively used in Kannada since the end of the 10th century.</i></p></blockquote><p style="text-align: left;">Fortunate, indeed!</p><p style="text-align: left;">The <a href="https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange#Code_points_for_all_language">table in Wikipedia</a> seems to want to perpetuate the error; although, as a record of the actual importation process, it's un-usefully accurate.</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-53698517244570246632022-02-05T10:49:00.002+00:002022-02-05T10:49:00.144+00:00Unicode Trivia U+0C4D<p><b>Codepoint:</b> U+0C4D "TELUGU SIGN VIRAMA"<br /><b>Block:</b> U+0C00..0C7F "Telugu"</p><p><a href="https://omniglot.com/writing/telugu.htm">Telugu</a> is a Dravidian language spoken by about 100 million people worldwide. The <a href="https://en.wikipedia.org/wiki/Telugu_(Unicode_block)">Telugu script</a> was added to Unicode 1.0 in 1991 as part of the migration of ISCII.</p><p>Telugu codepoints hit the <a href="https://www.bbc.co.uk/news/technology-43127179">headlines</a> in February 2018 due to <a href="https://nvd.nist.gov/vuln/detail/CVE-2018-4124">CVE-2018-4124</a>, also known as the "Telugu Bug". The actual bug was in Apple's text layout engine (named "<a href="https://developer.apple.com/documentation/coretext">Core Text</a>"), not in the Unicode specification. 
But that didn't stop some people pointing the finger and saying that Unicode composition was fundamentally flawed and hence, indirectly, the cause of the problem.</p><p><a href="https://serhack.me/articles/crash-iphone-telugu-character-en/">SerHack</a> and <a href="https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/">Manish Goregaokar</a> provide good, in-depth reports of the bug, but essentially "Core Text" mangles the heap when it sees codepoint sequences like the following:</p><p></p><ol style="text-align: left;"><li>U+0C1C "TELUGU LETTER JA" = "జ"</li><li>U+0C4D "TELUGU SIGN VIRAMA" = "్"</li><li>U+0C1E "TELUGU LETTER NYA" = "ఞ"</li><li>U+200C "ZERO WIDTH NON-JOINER" = ZWNJ</li><li>U+0C3E "TELUGU VOWEL SIGN AA" = "ా"</li></ol><p></p><p style="text-align: left;">That should be rendered as:</p><p style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhdwNXlHyvn1XLVC90BKc-NCHhQ3mYiKlKotNyaqM7FN4nKv3h-p45sVmBWH_OxrAXfo_VJ_idN-pw5U0rkC00E6DpZmtWGnnqiYC36Uehxxom9dULm-FuZqpzw9Kb8OBYzEt1PsGUuxglP08PZbljFqsUK7YLn2LMO5YXQ9DAXx2JEjYh6TEnuJiPg=s185" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="185" data-original-width="180" height="185" src="https://blogger.googleusercontent.com/img/a/AVvXsEhdwNXlHyvn1XLVC90BKc-NCHhQ3mYiKlKotNyaqM7FN4nKv3h-p45sVmBWH_OxrAXfo_VJ_idN-pw5U0rkC00E6DpZmtWGnnqiYC36Uehxxom9dULm-FuZqpzw9Kb8OBYzEt1PsGUuxglP08PZbljFqsUK7YLn2LMO5YXQ9DAXx2JEjYh6TEnuJiPg" width="180" /></a></p><p style="text-align: left;">I won't be embedding the actual sequence in this post, just in case you haven't updated your iPhone software since 2018. 
But when presented to Apple's library before the fix, "Core Text" attempts to perform a memory optimization that ends up writing data to an invalid address, thereby usually crashing whichever application is running.</p><p style="text-align: left;">It turns out the ZWNJ is bogus and can be dropped:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjp7LM8BjGB5Ylpdqf13UzsKlvPeRDS9a5KCKofYVSlE7jn0WqV6xHhMJqTcQFXUIftRVG5Le1Wh_oiiPdn958DSqfrjp1AK-YzCF1GWmvKwozTQ-h3s_ZSxGVW6-6gyZwnWCZ1x6ymBp00kJRXoN_AQun1h_uI-WHA_2DsFjtcS8O3qPCxoiTysho-=s185" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="185" data-original-width="147" height="185" src="https://blogger.googleusercontent.com/img/a/AVvXsEjp7LM8BjGB5Ylpdqf13UzsKlvPeRDS9a5KCKofYVSlE7jn0WqV6xHhMJqTcQFXUIftRVG5Le1Wh_oiiPdn958DSqfrjp1AK-YzCF1GWmvKwozTQ-h3s_ZSxGVW6-6gyZwnWCZ1x6ymBp00kJRXoN_AQun1h_uI-WHA_2DsFjtcS8O3qPCxoiTysho-" width="147" /></a></div><p style="text-align: left;">But that four-codepoint sequence doesn't trigger the bug in "Core Text". It raises the interesting (but knotty) problem of what constitutes a "valid" sequence of codepoints. Whatever the result, crashing is probably not a good response under any circumstances.</p><p style="text-align: left;">The <a href="https://www.unicode.org/mail-arch/unicode-ml/y2018-m02/0103.html">Unicode mailing list has a thread</a> discussing the bug, with a <a href="https://docs.microsoft.com/en-gb/typography/script-development/bengali#reor">reference</a> to just how complicated glyph shaping for Indic fonts is to implement.</p><p style="text-align: left;">"Core Text" is proprietary Apple code, so we cannot inspect the source code, nor is it Apple's policy to explain fixes to critical security bugs.</p><div><p>P.S. 
Another codepoint I could have picked for the Telugu block trivia was the fabulously named U+0C78 "TELUGU FRACTION DIGIT ZERO FOR ODD POWERS OF FOUR", but I've already recently covered fractions, and Mark Jason Dominus <a href="https://blog.plover.com/math/telugu.html">describes it brilliantly</a>.</p></div>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-85990713901696060422022-02-04T10:28:00.132+00:002022-02-04T10:28:00.147+00:00Unicode Trivia U+0BD0<p><b>Codepoint:</b> U+0BD0 "TAMIL OM"<br /><b>Block:</b> U+0B80..0BFF "Tamil"</p><p>Looking for mildly interesting facts within a Unicode block means a bit of research. Take block U+0B80..0BFF "Tamil"; here are my go-to places for information:</p><p></p><ol style="text-align: left;"><li>Wikipedia (<a href="https://en.wikipedia.org/wiki/Tamil_language">language</a>, <a href="https://en.wikipedia.org/wiki/Tamil_script">script</a> and <a href="https://en.wikipedia.org/wiki/Tamil_(Unicode_block)">block</a>)</li><li>Unicode <a href="https://www.unicode.org/charts/PDF/U0B80.pdf">code chart</a></li><li>Unicode <a href="https://www.unicode.org/versions/Unicode14.0.0/ch12.pdf">core specification</a> (Section 12.6)</li><li><a href="https://scriptsource.org/cms/scripts/page.php?item_id=script_detail_use&key=Taml">ScriptSource</a></li><li><a href="https://omniglot.com/writing/tamil.htm">Omniglot</a></li><li>Richard Ishida's excellent <a href="https://r12a.github.io/scripts/tamil/">r12a</a></li></ol><p></p><p>Obviously, many of these sites link to other resources. And therein lies the "fun".</p><p>Looking at the official code chart I found U+0BD0 "TAMIL OM". Of this codepoint, <a href="https://r12a.github.io/scripts/tamil/#symbols">r12a</a> says:</p><blockquote><p><i>OM is a religious concept found in all three major religions born in India viz. 
Hinduism, Jainism and Buddhism.</i> ௐ [U+0BD0 TAMIL OM] <i>is widely used in Hindu religious texts, temple publications, and as neon lamps of sign boards in shops etc.</i></p></blockquote><p style="text-align: left;">Hmm. That's a bit jarring, isn't it? How did the reference to "<i>neon lamps of sign boards in shops</i>" make it into a list of sacred uses? A quick google of that exact phrase only turns up references to r12a. But I cannot imagine Richard Ishida conjuring up that phrase from thin air.</p><p style="text-align: center;"><span style="font-size: x-large;">ௐ</span></p><p>U+0BD0 "TAMIL OM" was added in Unicode 5.1 (April 2008); recently enough for there to be quite a good paper trail. Indeed, the proposal (<a href="https://www.unicode.org/wg2/docs/n3119.pdf">N3119</a>) to add it was submitted in April 2006 by the International Forum for Information Technology in Tamil (INFITT) Working Group 2 (WG02). Section 2.1 of the proposal says:</p><blockquote><p><i>Devanagari and Gujarati scripts have a sign named OM in their Unicode ranges. However in Tamil the corresponding slot is left vacant. Gurmukhi script also has an OM sign. Tamil OM sign is widely used in Hindu religious texts, temple publications, and as neon lamps of sign boards in shops etc. OM is a religious concept found in all three major religions born in India viz. Hinduism, Jainism and Buddhism. This document proposes to add the character TAMIL OM in Unicode Tamil range at U+0BD0.</i></p></blockquote><p>Surely this must be the source of the "<i>neon lamps</i>" narrative? Somehow Google haven't (yet) indexed it.</p><p>Written proposals to the Unicode committee usually have examples (known as "attestations") attached to their end. Alas, proposal N3119 does not provide a photograph of a neon shop sign.</p><p>It is also surprisingly difficult to find shop signage featuring the <i>Tamil</i> Om on the internet; though other Oms are available. 
The only good match I found was this:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEivT6byBc6hcWL-p1slEz2OUFBNTvDB3xJAtHwbcRqdchuP0IsyqQC9gE0mbTDxPR_7tfDFpUF1tE8rRAk5ylLGoLhaiQTu3UKLLxz8xTAHfPGGyIhAqbgqM6TlMCLY3_mwxKgLZ0lGM2CCJG2VVygntjO42z3ZgC1eYzUc8_Ys7nqBFQ0dqyWVguXT=s447" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="447" data-original-width="278" src="https://blogger.googleusercontent.com/img/a/AVvXsEivT6byBc6hcWL-p1slEz2OUFBNTvDB3xJAtHwbcRqdchuP0IsyqQC9gE0mbTDxPR_7tfDFpUF1tE8rRAk5ylLGoLhaiQTu3UKLLxz8xTAHfPGGyIhAqbgqM6TlMCLY3_mwxKgLZ0lGM2CCJG2VVygntjO42z3ZgC1eYzUc8_Ys7nqBFQ0dqyWVguXT=s16000" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://www.indiamart.com/sudhaneonlight/led-sign-board.html#neon-sign-board">source</a>]</td></tr></tbody></table><p style="text-align: left;">This is from the IndiaMART page of Sudha Neon Lights of Chennai, Tamil Nadu. The Tamil Om is in green; the red "spear" is <a href="https://en.wikipedia.org/wiki/Vel">Vel</a>, the divine javelin of Murugan, the Hindu God of war. The blue text underneath is, I believe, "முருகா" or "Muruga", an alternative spelling of <a href="https://en.wikipedia.org/wiki/Kartikeya">Murugan</a>.</p><p>Given the paucity of images of neon lamp signage in shops incorporating Tamil Om, I wonder just how common it is in Southern India and where the suggestion in N3119 actually comes from. 
Alas, INFITT/WG02 was <a href="https://www.infitt.org/workgroups/">dissolved</a> some time before May 2020, so we may never know.</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-46499743267231765632022-02-03T10:04:00.000+00:002022-02-03T10:04:00.153+00:00Unicode Trivia U+0B77<p><b>Codepoint:</b> U+0B77 "ORIYA FRACTION THREE SIXTEENTHS"<br /><b>Block:</b> U+0B00..0B7F "Oriya"</p><p>The Odia language (formerly named <a href="https://omniglot.com/writing/oriya.htm">Oriya</a>) is spoken in Odisha (formerly Orissa): </p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Odia_map.svg/713px-Odia_map.svg.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="768" data-original-width="713" height="640" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Odia_map.svg/713px-Odia_map.svg.png" width="594" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://commons.wikimedia.org/wiki/File:Odia_map.svg">source</a>]</td></tr></tbody></table><p>Unlike many Brahmic scripts, the head bar of each glyph is not a contiguous, straight line. As <a href="https://omniglot.com/writing/oriya.htm">Omniglot</a> says:</p><blockquote><p><i>The Odia script developed from the Kalinga script, one of the many descendants of the Brahmi script of ancient India. 
The earliest known inscription in the Odia language, in the Kalinga script, dates from 1051.</i></p><p><i>The curved appearance of the Odia script is a result of the practice of writing on palm leaves, which have a tendency to tear if you use too many straight lines.</i></p></blockquote><p>Six "fraction signs" were added to the "Oriya" block in Unicode 6.0 (October 2010):</p><p></p><ul style="text-align: left;"><li>U+0B72 "୲" ORIYA FRACTION ONE QUARTER</li><li>U+0B73 "୳" ORIYA FRACTION ONE HALF</li><li>U+0B74 "୴" ORIYA FRACTION THREE QUARTERS</li><li>U+0B75 "୵" ORIYA FRACTION ONE SIXTEENTH</li><li>U+0B76 "୶" ORIYA FRACTION ONE EIGHTH</li><li>U+0B77 "୷" ORIYA FRACTION THREE SIXTEENTHS</li></ul><p></p><p>The <a href="https://www.unicode.org/L2/L2008/08199-oriyafractions.pdf">original proposal</a> by Anshuman Pandey explains that they were primarily used to subdivide one rupee into sixteen annas. See also Section 9.5 of <a href="https://www.unicode.org/versions/Unicode6.0.0/ch09.pdf">South Asian Scripts-I (6.0)</a>.</p><p>Why does it stop at U+0B77 "ORIYA FRACTION THREE SIXTEENTHS"? At first glance, it looks like there must be some codepoints missing, but Anshuman Pandey explains that this is an additive base-4 system, where you can express "N/16" for N=1..15 with <i>at most</i> two of the above codepoints:</p><p></p><ul style="text-align: left;"><li>1/16 = "୵" = 1/16</li><li>2/16 = "୶" = 1/8</li><li>3/16 = "୷" = 3/16</li><li>4/16 = "୲" = 1/4</li><li>5/16 = "୲୵" = 1/4 + 1/16</li><li>6/16 = "୲୶" = 1/4 + 1/8</li><li>7/16 = "୲୷" = 1/4 + 3/16</li><li>8/16 = "୳" = 1/2</li><li>9/16 = "୳୵" = 1/2 + 1/16</li><li>10/16 = "୳୶" = 1/2 + 1/8</li><li>11/16 = "୳୷" = 1/2 + 3/16</li><li>12/16 = "୴" = 3/4</li><li>13/16 = "୴୵" = 3/4 + 1/16</li><li>14/16 = "୴୶" = 3/4 + 1/8</li><li>15/16 = "୴୷" = 3/4 + 3/16</li></ul><p></p><p>As supporting evidence, he also includes a passage from "First Lessons in Oriya" by A. H. Young (1953, revised by B. Das. 
Cuttack: Orissa Mission Press):</p><blockquote><p><i>The leading principle of Oriya arithmetic, to divide by </i>four<i> rather than any other number, pervades also the system of fractions.</i></p></blockquote><p>This suggests base-4 was used elsewhere in the region's number system. I haven't been able to find any other concrete examples for Odia, but <a href="https://en.wikipedia.org/wiki/Kharosthi#Numerals">Kharosthi numbers</a> have a base-4 component and most other Brahmic scripts have fractions built upon quarters or sixteenths, such as <a href="https://chilliant.blogspot.com/2022/01/unicode-trivia-u09f8.html">Bengali</a> that we saw earlier.</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-91623951348137071022022-02-02T10:16:00.009+00:002022-02-02T10:16:00.158+00:00Unicode Trivia U+0AF1<p><b>Codepoint:</b> U+0AF1 "GUJARATI RUPEE SIGN"<br /><b>Block:</b> U+0A80..0AFF "Gujarati"</p><p>Sometimes a codepoint loses its lustre. Take U+0AF1 "GUJARATI RUPEE SIGN" as an example.</p><ul><li>October 1991 — The "Gujarati" block is <a href="https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2.pdf">imported</a> from ISCII into Unicode 1.0 without a specific rupee symbol</li></ul><p>The rise...</p><p></p><p style="text-align: left;"></p><ul style="text-align: left;"><li>July 2001 — The Indian Ministry of Information Technology <a href="https://www.unicode.org/L2/L2001/01303-india-letter.pdf">suggests the addition</a> of a Gujarati rupee symbol</li></ul><ul style="text-align: left;"><li>November 2001 — The Unicode Technical Committee <a href="https://www.unicode.org/L2/L2001/01430R.pdf">agrees</a> to "<i>add this rupee sign for Gujarati to the list of proposed additions, since the symbol is not made from pieces that are already encoded Gujarati characters. 
The form of this character is very Gujarati-like, and it will be proposed for encoding at this location, rather than in the Currency Symbols block.</i>"</li></ul><ul style="text-align: left;"><li>April 2003 — U+0AF1 "GUJARATI RUPEE SIGN" is formally added to Unicode 4.0</li></ul><p></p><p style="text-align: left;"></p><div style="text-align: center;"><span style="font-size: xx-large;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://r12a.github.io/c/Gujarati/large/0AF1.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="190" data-original-width="190" src="https://r12a.github.io/c/Gujarati/large/0AF1.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://r12a.github.io/uniview/?char=0AF1">U+0AF1</a></td></tr></tbody></table></span><p style="text-align: left;">And fall...</p></div><p></p><p style="text-align: left;"></p><p></p><p style="text-align: left;"></p><ul style="text-align: left;"><li>October 2009 — Anshuman Pandey <a href="https://www.unicode.org/L2/L2009/09330-gujarati-abbrev.pdf">proposes</a> the addition of a Gujarati abbreviation sign</li></ul><ul style="text-align: left;"><li>October 2009 — Anshuman Pandey also <a href="https://www.unicode.org/L2/L2009/09331-gujarati-rupee-sign-deprec.pdf">proposes</a> that U+0AF1 be deprecated as, with the addition of the abbreviation sign, the Gujarati rupee can be rendered using the codepoint sequence:</li></ul><p></p><ul style="text-align: left;"><ul><li>U+0AB0 "GUJARATI LETTER RA"</li><li>U+0AC2 "GUJARATI VOWEL SIGN UU"</li><li>U+0AF0 "GUJARATI ABBREVIATION SIGN"</li></ul></ul><p></p><ul style="text-align: left;"><li>January 2012 — U+0AF0 "GUJARATI ABBREVIATION SIGN" is formally added to Unicode 6.1</li></ul><p></p><p style="text-align: left;">Of course, you cannot just remove an existing codepoint 
from the Unicode standard. What would you do with all the documents that had already embedded U+0AF1 as the rupee symbol? Instead, an annotation was added to U+0AF1 saying</p><div><i><blockquote>preferred spelling is 0AB0 0AC2 0AF0</blockquote></i></div><p style="text-align: left;">Job done? Not quite...</p><div><ul style="text-align: left;"><li>September 2018 — Charlotte Buff <a href="https://www.unicode.org/L2/L2018/18301-deprecation.pdf">points out</a> an inconsistency. She "<i>identified the following 18 characters </i>[including U+0AF1] <i>that are strongly implied to be deprecated in the code charts, but actually aren’t in the UCD</i>". She also raises the point that "<i>U+0AF1 does not decompose into its preferred representation</i>"</li></ul><p style="text-align: left;">Should U+0AF1 be <i>formally</i> <a href="https://en.wikipedia.org/wiki/Unicode_character_property#Deprecated">deprecated</a>? Or should its usage be "discouraged"? Should codepoints in general be decomposed into their preferred spellings?</p></div><p>Personally, I think this is a case that's getting beyond the purview of the core Unicode Standard. Let's face it, U+0AF1 is already out there. Of course, it's difficult to know how prevalent it is; but even one occurrence makes it irrevocable.</p><p>And how exactly do you discourage the use of a codepoint, let alone deprecate it? Do you raid people's homes in the middle of the night and confiscate all the Gujarati Rupee codepoints?</p><p>The keen-eyed reader will have noticed I haven't actually used codepoint U+0AF1 in this post. 
I don't want to be woken up at 2am, thank you very much!</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-57152889417453848612022-02-01T10:39:00.135+00:002022-02-01T10:39:00.169+00:00Unicode Trivia U+0A70<p><b>Codepoint:</b> U+0A70 "GURMUKHI TIPPI"<br /><b>Block:</b> U+0A00..0A7F "Gurmukhi"</p><p>Sometimes the information in the <a href="https://unicode.org/ucd/">Unicode Character Database</a> (UCD) is either insufficient for some purpose or requires clarification. This is the role of <a href="https://www.unicode.org/reports/">Unicode Technical Reports</a> (UTR) and <a href="http://unicode.org/notes/">Unicode Technical Notes</a> (UTN).</p><p>Like other Brahmic scripts, the <a href="https://unicode.org/charts/PDF/U0A00.pdf">Gurmukhi</a> script was imported into Unicode 1.0 as part of <a href="http://varamozhi.sourceforge.net/iscii91.pdf">ISCII</a>, where it was known as Punjabi.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhDES0pRABJty9HbeclD_iVGF6UPFw5QJolCoF7gmB-QWN8x5sqTCeYv9ixs1-Jm5kLIgN3NFgSay6Ytgn3eyB6Tw7ZpI1ZYijNIa2A8Zwn04MqJF7Gp8zfj2NshiiCahCrgKNkr69kV0XqhVwwp9Nv0Bm5tS9Fp8LiK3I8Wf6aaoPXRRZ1yjI_vQQc=s955" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="955" data-original-width="673" height="640" src="https://blogger.googleusercontent.com/img/a/AVvXsEhDES0pRABJty9HbeclD_iVGF6UPFw5QJolCoF7gmB-QWN8x5sqTCeYv9ixs1-Jm5kLIgN3NFgSay6Ytgn3eyB6Tw7ZpI1ZYijNIa2A8Zwn04MqJF7Gp8zfj2NshiiCahCrgKNkr69kV0XqhVwwp9Nv0Bm5tS9Fp8LiK3I8Wf6aaoPXRRZ1yjI_vQQc=w453-h640" width="453" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">U+0A02 and U+0A70 highlighted in yellow [<a 
href="https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2.pdf">source</a>]<br /></td></tr></tbody></table><p>However, Gurmukhi differs in having <a href="https://r12a.github.io/scripts/gurmukhi/#nasalisation">two diacritics for nasalisation</a>, the Bindi and Tippi:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgqJ9jDOCX4wZsOkJk1TJroxGPb-eAGnIVQvZ2iAqRkHoXiBdQMk7uHmGL3t48HygCLHLLYw12Z88EHlDhUHkWuul89t_T_ixmC3d78CmhnuRSXpxGc_q0Id7FaA-MEA-YBis7VyQbVoTc03mz0oiOL0KuFfahDYWYYKAygW6aHB9Pb9q5Zi5hsglT5=s190" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="190" data-original-width="190" height="190" src="https://blogger.googleusercontent.com/img/a/AVvXsEgqJ9jDOCX4wZsOkJk1TJroxGPb-eAGnIVQvZ2iAqRkHoXiBdQMk7uHmGL3t48HygCLHLLYw12Z88EHlDhUHkWuul89t_T_ixmC3d78CmhnuRSXpxGc_q0Id7FaA-MEA-YBis7VyQbVoTc03mz0oiOL0KuFfahDYWYYKAygW6aHB9Pb9q5Zi5hsglT5" width="190" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">U+0A02 "GURMUKHI SIGN BINDI"<br /></td></tr></tbody></table><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhwubUz1QvorgqEFb7yeZmFsctLhCBJCaBOskkoVCtDzMURdweuNU4RRU5-MQz17w5_xj3gbgwNx_Jp-7TJQOS-BC5dcaUZYX7dxXgii_oHLKM463BLWovyUo8cdayup3mit0p50VeisT8lM9mczDocjFI2SuIcHRn06ZXnpX6g8Oyx1p8u2GEIKOwF=s190" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="190" data-original-width="190" height="190" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEhwubUz1QvorgqEFb7yeZmFsctLhCBJCaBOskkoVCtDzMURdweuNU4RRU5-MQz17w5_xj3gbgwNx_Jp-7TJQOS-BC5dcaUZYX7dxXgii_oHLKM463BLWovyUo8cdayup3mit0p50VeisT8lM9mczDocjFI2SuIcHRn06ZXnpX6g8Oyx1p8u2GEIKOwF" width="190" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">U+0A70 "GURMUKHI TIPPI"</td></tr></tbody></table><p style="text-align: left;">The ISCII codepage for Punjabi uses character 0xA2 for both diacritics with the expectation that the correct combining glyph will be rendered according to context. This logic is clarified in <a href="https://www.unicode.org/notes/tn30/utn30-gurmukhi.pdf">UTN #30</a> from Sukhjinder Sidhu (2006):</p><blockquote><p><i>Bindi and Tippi are encoded using a single code point in ISCII (0xA2) and the underlying rendering engine selects the correct glyph. However, in Unicode they are given two separate code points.</i></p><p><i>Thus, 0xA2 should be converted to U+0A70 (Tippi) when:</i></p><p></p><ul style="text-align: left;"><li><i>The preceding letter is a consonant (ignoring any Nuktas)</i></li><li><i>The preceding letter is Vowel Sign I (U+0A3F), Vowel Sign U (U+0A41), Vowel Sign UU (U+0A42)</i></li><li><i>The preceding letter is Letter A (U+0A05), Letter I (U+0A07)</i></li></ul><p></p><p><i>In all other cases, the sign should remain a Bindi (U+0A02).</i></p><p><i>When converting from Unicode to ISCII, both Bindi and Tippi should be converted to Bindi (0xA2).</i></p></blockquote><p style="text-align: left;">This special case logic isn't part of the core Unicode Standard; it is advisory only. 
But Sukhjinder Sidhu points out:</p><blockquote><p><i>If the advice in this document is not heeded, any resulting conversion will not be legible to readers of the Gurmukhi script,</i></p></blockquote><p style="text-align: left;">One would hope that software vendors would take heed, but a casual read of Microsoft's .NET core library source reveals no implementation (or even mention) of UTN #30 in <a href="https://referencesource.microsoft.com/#mscorlib/system/text/isciiencoding.cs">ISCIIEncoding.cs</a>. The code maps 0xA2 to and from U+0A02 (Bindi) but provides no transformations for Tippi. At the top of the C# source file is a comment:</p><blockquote><p><i>Ported from windows c_iscii. If you find bugs here, there're likely similar bugs in the windows version</i></p></blockquote><div><p></p><p></p></div><div><p></p><p>I decided not to look any further.</p></div>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-74061512618321373942022-01-31T10:13:00.385+00:002022-01-31T10:13:00.146+00:00Unicode Trivia U+09F8<p><b>Codepoint:</b> U+09F8 "BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR"<br /><b>Block:</b> U+0980..09FF "Bengali"</p><p><a href="https://www.royalmint.com/discover/decimalisation/the-story-of-decimalisation/">Decimal Day</a>. 15 February 1971. A Monday. The day the United Kingdom and the Republic of Ireland converted to decimal currency. Before that, each pound was divided into twenty shillings and each shilling into twelve pence. We'll ignore farthings.</p><p>So, if I bought something worth one penny with an old, pre-decimalisation five pound note, I'd get a dirty look and the following change:</p><p style="text-align: center;">£4 19/11 = £4. 19s. 11d = 4 pounds, 19 shillings, 11 pence</p><p>Such mixed-radix currencies were not uncommon. 
In British India, the rupee had been <a href="https://www.tribuneindia.com/news/musings/mind-boggling-annas-and-pice-98432">divided</a> into sixteen annas, each anna into four pice (paisa), and each pice into three pies. The change from five rupees for a one pie item would be:</p><p style="text-align: center;">Rs. 4/15/3/2 = 4 rupees, 15 annas, 3 pice, 2 pies</p><p>In pre-decimal Bengal, the taka (rupee) had been <a href="https://www.pramukhime.com/blog/type-bengali-currency-ana-ganda-rupaya">divided</a> into sixteen ana, and each ana into twenty ganda. The change from five taka for a one ganda item would be:</p><p style="text-align: center;">Tk. 4/15/19 = 4 taka, 15 ana, 19 ganda</p><p>Of course, that's in English using the Latin script and Western Arabic numerals. In Bengali, one <i>could</i> have written:</p><p style="text-align: center;"><span style="font-size: x-large;">৪৲৸৶৹৻১৯</span></p><p style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi0BgdNp2A9OwrXSTCYTK1Jd506kbIwNDObfumXRzLIwdkWL0-kj-CbK71rUNl_JZfTc44P6EwuwuMc1VHTVoFYB55q6AHleFgL_xVCCE4r9wKtdPSuB5yM_TQBNw-D8ChtyEvd69I_zNU-2rPebl6NQ6cgSVuymslFDjcp9MXNKhBK9dKjS1Tx6vt9=s279" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="86" data-original-width="279" height="86" src="https://blogger.googleusercontent.com/img/a/AVvXsEi0BgdNp2A9OwrXSTCYTK1Jd506kbIwNDObfumXRzLIwdkWL0-kj-CbK71rUNl_JZfTc44P6EwuwuMc1VHTVoFYB55q6AHleFgL_xVCCE4r9wKtdPSuB5yM_TQBNw-D8ChtyEvd69I_zNU-2rPebl6NQ6cgSVuymslFDjcp9MXNKhBK9dKjS1Tx6vt9" width="279" /></a></p><p style="text-align: center;">U+09EA U+09F2 U+09F8 U+09F6 U+09F9 U+09FB U+09E7 U+09EF</p><p style="text-align: left;">As <a href="http://www.unicode.org/L2/L2007/07192-bengali-ganda.pdf">Anshuman Pandey</a> points out, only one currency mark was actually used when multiple units were written. 
We'll return to this in due course, but in the meantime I've left that refinement out of the example above.</p><p style="text-align: left;">Bengali is a Brahmic script written left-to-right, so in Unicode this example is:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>U+09EA "BENGALI DIGIT FOUR"</li><li>U+09F2 "BENGALI RUPEE MARK"</li><li>U+09F8 "BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR"</li><li>U+09F6 "BENGALI CURRENCY NUMERATOR THREE"</li><li>U+09F9 "BENGALI CURRENCY DENOMINATOR SIXTEEN"</li><li>U+09FB "BENGALI GANDA MARK"</li><li>U+09E7 "BENGALI DIGIT ONE"</li><li>U+09EF "BENGALI DIGIT NINE"</li></ol><p></p><p style="text-align: left;">The first two glyphs ("৪৲") represent "4 taka" in decimal; the Bengali digit four just happens to look like a Western Arabic digit eight. The next three glyphs ("৸৶৹") represent "15 ana". This is complicated by the fact that, traditionally, ana were written as <i>fractions</i> of a taka. Finally, the last three glyphs ("৻১৯") represent "19 ganda" in decimal where, just to confuse us further, the ganda mark comes <i>before</i> the digits, not after them as with taka and ana.</p><p style="text-align: left;">The ana component is the most perplexing. The Unicode codepoint name for U+09F8, "BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR", doesn't really help. 
Fortunately, there's an explanation within the much later <a href="http://www.unicode.org/L2/L2007/07192-bengali-ganda.pdf">proposal to add the ganda mark</a> in 2007.</p><p style="text-align: left;">The fifteen possible quantities of ana are:</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>৴৹ = 1 ana (Numerator 1)</li><li>৵৹ = 2 ana (Numerator 2)</li><li>৶৹ = 3 ana (Numerator 3)</li><li>৷৹ = 4 ana (Numerator 4)</li><li>৷৴৹ = 5 ana</li><li>৷৵৹ = 6 ana</li><li>৷৶৹ = 7 ana</li><li>৷৷৹ = 8 ana</li><li>৷৷৴৹ = 9 ana</li><li>৷৷৵৹ = 10 ana</li><li>৷৷৶৹ = 11 ana</li><li>৸৹ = 12 ana (Numerator One Less Than the Denominator)</li><li>৸৴৹ = 13 ana</li><li>৸৵৹ = 14 ana</li><li>৸৶৹ = 15 ana</li></ul><p></p><p style="text-align: left;">This looks like a modified base-4 tally mark system. But, thinking back to what Anshuman Pandey said about elided currency marks, I wonder if this scheme didn't originate in a finer-grained positional system.</p><p style="text-align: left;">Imagine that instead of the taka being divided directly into sixteen ana, it was divided into four virtual "beta", which were themselves divided into four virtual "alpha". 
Obviously:</p><p style="text-align: center;"></p><ul><li style="text-align: left;">ana = alpha + beta * 4</li></ul><p></p><p style="text-align: left;">But now we have the following encoding:</p><p style="text-align: left;"></p><p style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"></p><p></p><ul style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><li>৴ = 1 alpha (Numerator 1)</li><li>৵ = 2 alpha (Numerator 2)</li><li>৶ = 3 alpha (Numerator 3)</li><li>৷ = 1 beta</li><li>৷৷ = 2 beta</li><li>৸ = 3 beta (Numerator One Less Than the Denominator)</li></ul><p style="text-align: left;">For beta, the "denominator" is indeed four, so the mysterious U+09F8 "BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR" suddenly makes sense.</p><p style="text-align: left;">We can now come up with an algorithm for writing out a currency amount according to the scheme described by Anshuman Pandey:</p><p style="text-align: left;"></p><ul style="text-align: left;"><li><span style="font-family: courier; font-size: x-small;">Let T, A, G be the number of taka (0 or more), ana (0 to 15), ganda (0 to 19) respectively</span></li><li><span style="font-family: courier; font-size: x-small;">If T is <i>not</i> zero 
then</span></li><ul><li><span style="font-family: courier; font-size: x-small;">Write out the Bengali decimal representation of T</span></li><li><span style="font-family: courier; font-size: x-small;">If both A and G are zero</span></li><ul><li><span style="font-family: courier; font-size: x-small;">Write out U+09F2 "BENGALI RUPEE MARK"</span></li><li><span style="font-family: courier; font-size: x-small;">We're finished</span></li></ul></ul><li><span style="font-family: courier; font-size: x-small;">Let α be A modulo 4 (0 to 3)</span></li><li><span style="font-family: courier; font-size: x-small;">Let β be A divided by 4, rounded down (0 to 3)</span></li><li><span style="font-family: courier; font-size: x-small;">If β is 1, write out U+09F7 "BENGALI CURRENCY NUMERATOR FOUR"</span></li><li><span style="font-family: courier; font-size: x-small;">If β is 2, write out U+09F7 "BENGALI CURRENCY NUMERATOR FOUR" <i>twice</i></span></li><li><span style="font-family: courier; font-size: x-small;">If β is 3, write out U+09F8 "BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR"</span></li><li><span style="font-family: courier; font-size: x-small;">If α is 1, write out U+09F4 "BENGALI CURRENCY NUMERATOR ONE"</span></li><li><span style="font-family: courier; font-size: x-small;">If α is 2, write out U+09F5 "BENGALI CURRENCY NUMERATOR TWO"</span></li><li><span style="font-family: courier; font-size: x-small;">If α is 3, write out U+09F6 "BENGALI CURRENCY NUMERATOR THREE"</span></li><li><span style="font-family: courier; font-size: x-small;">If G is zero then</span></li><ul><li><span style="font-family: courier; font-size: x-small;">Write out U+09F9 "BENGALI CURRENCY DENOMINATOR SIXTEEN"</span></li><li><span style="font-family: courier; font-size: x-small;">We're finished</span></li></ul><li><span style="font-family: courier; font-size: x-small;">Write out U+09FB "BENGALI GANDA MARK"</span></li><li><span style="font-family: courier; font-size: x-small;">Write out the 
Bengali decimal representation of G</span></li><li><span style="font-family: courier; font-size: x-small;">We're finished</span></li></ul><p style="text-align: left;">For our example, T=4, A=15, G=19, α=3, β=3 and the output is:</p><p style="text-align: center;"><span style="font-size: x-large;">৪৸৶৻১৯</span></p><p style="text-align: center;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjtRk9K5VMeCiTu3HSKvGkoNiEFrqt_q_YxsbfWpCF5BFSYyU8Z1GELYcjjs5giB6cxM4OR_2QIYjwkWLl57SssxBjlM0RLcB6SWBgBCA3nViJ_YtKoTYeoesZz4hnt705K7TfZP6kBE-ksoS-A00Pcp7UQ2MsOFBqC-7ay8T5RhyKBx5hzp1e4hEhB" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="86" data-original-width="213" src="https://blogger.googleusercontent.com/img/a/AVvXsEjtRk9K5VMeCiTu3HSKvGkoNiEFrqt_q_YxsbfWpCF5BFSYyU8Z1GELYcjjs5giB6cxM4OR_2QIYjwkWLl57SssxBjlM0RLcB6SWBgBCA3nViJ_YtKoTYeoesZz4hnt705K7TfZP6kBE-ksoS-A00Pcp7UQ2MsOFBqC-7ay8T5RhyKBx5hzp1e4hEhB=s16000" /></a></div><br />U+09EA U+09F8 U+09F6 U+09FB U+09E7 U+09EF<p></p><p style="text-align: left;">This representation is surprisingly concise and totally unambiguous.</p><p></p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-46259363590159746412022-01-30T10:31:00.001+00:002022-01-30T10:31:00.143+00:00Unicode Trivia U+0950<p><b>Codepoint:</b> U+0950 "DEVANAGARI OM"<br /><b>Block:</b> U+0900..097F "Devanagari"</p><p>Next, we come to our first <a href="https://en.wikipedia.org/wiki/Brahmic_scripts">Brahmic script</a>: <a href="https://omniglot.com/writing/devanagari.htm">Devanagari</a>. Devanagari is the most <a href="https://en.wikipedia.org/wiki/List_of_writing_systems#List_of_writing_systems_by_adoption">widely-used</a> Brahmic script. Almost <a href="https://www.unicode.org/L2/L2003/03102-indic-ov.pdf">50%</a> of the Indian population use it to write their native language. 
It is a left-to-right abugida used to write dozens of languages.</p><p>It is sometimes glibly called a "washing line" script because, unlike Latin/Greek/Cyrillic scripts that "sit" on top of their baselines, Devanagari also "hangs" from a head line:</p><p></p><p style="text-align: center;"><span style="font-size: x-large;">Devanagari script<br />देवनागरी लिपि</span></p><p></p><p>Devanagari typography is non-trivial, even when the letterforms are isolated:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjyZXy5QnA93KchWrxMhNYbPqcyiVKcRkTIIXt8uKx8frqBOmotpsI_EpKM505M0_djI00In8vimLNpNdqvlfTMDP7NjdTbN--rGPYwQcDDHWZJUJ2fZBLnzVqovmpy9atg1kCHVpyWLfCOFCZSojLfcFbRKB3Vg9GaEHFj9os_WcTIFp-BimDzMO8m=s955" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="614" data-original-width="955" height="412" src="https://blogger.googleusercontent.com/img/a/AVvXsEjyZXy5QnA93KchWrxMhNYbPqcyiVKcRkTIIXt8uKx8frqBOmotpsI_EpKM505M0_djI00In8vimLNpNdqvlfTMDP7NjdTbN--rGPYwQcDDHWZJUJ2fZBLnzVqovmpy9atg1kCHVpyWLfCOFCZSojLfcFbRKB3Vg9GaEHFj9os_WcTIFp-BimDzMO8m=w640-h412" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>Devanagari Type Anatomy</i> [<a href="https://www.type-together.com/devanagari-type-anatomy">source</a>]</td></tr></tbody></table><p style="text-align: left;">Of course, <a href="https://medium.com/s/about-face/eyes-to-c-arms-to-e-a034793cbf49">anthropomorphism in typography</a> isn't limited to Brahmic scripts, but one can take "type anatomy" quite literally:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a 
href="https://www.type-together.com/resources/-2016/editorial/articles/2018_devatypeanatomy/2018_deva_type_anatomy_4.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="547" data-original-width="800" height="438" src="https://www.type-together.com/resources/-2016/editorial/articles/2018_devatypeanatomy/2018_deva_type_anatomy_4.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>Design Parameters of Devanagari</i> (Gokhale, 1983) [<a href="https://www.type-together.com/devanagari-type-anatomy">source</a>]</td></tr></tbody></table><p style="text-align: left;">The flipside of grammatology is phonology.</p><p style="text-align: left;">"<a href="https://en.wikipedia.org/wiki/Om">Om</a>" is the <i>sound </i>of a sacred spiritual symbol in Indic religions. In Devanagari, in its simplest form, it is written:</p><p style="text-align: left;"><span style="text-align: center;"></span></p><div style="text-align: center;"><span style="font-size: x-large;">Om</span><br /><span style="font-size: x-large;">ओम<br /></span>U+0913 U+092E</div><p style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">However, there is also a ligatured codepoint in the Devanagari block named "DEVANAGARI OM":</p><p style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-size: x-large; text-align: center;"></span></p><div style="text-align: center;"><span style="font-size: x-large; text-align: center;">Om (sign)<br />ॐ</span></div><span style="text-align: center;"><div style="text-align: center;">U+0950</div></span><p></p><p style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;"><span style="font-style: normal;">There are currently <a href="https://chilliant.com/universe/universe.html#name=/%5CbOM%5Cb/u">many Oms</a> encoded into Unicode. The Devanagari Om was added right at the outset in <a href="https://www.unicode.org/versions/Unicode1.0.0/CodeCharts2.pdf">Unicode 1.0</a> (1991). This incorporated a wholesale import of the <a href="http://varamozhi.sourceforge.net/iscii91.pdf">ISCII</a> (Indian Script Code for Information Interchange, 1988) character sets which </span>encode Om as a multi-byte sequence (0xA1 0xE9). The Unicode Consortium allocated U+0950 "DEVANAGARI OM" to allow one-to-one mapping of ISCII on this basis. Of course, once you allow one Om through...</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-50018354025772532822022-01-29T10:09:00.014+00:002022-01-29T10:09:00.171+00:00Unicode Trivia U+08C8<b>Codepoint:</b> U+08C8 "ARABIC LETTER GRAF"<br /><b>Block:</b> U+08A0..08FF "Arabic Extended-A"<p>Unicode blocks are <i>not</i> allocated sequentially. 
Consequently, "<a href="https://www.unicode.org/charts/PDF/U08A0.pdf">Arabic Extended-A</a>" (U+08A0..08FF, originating in Unicode 6.1, January 2012) comes numerically <i>after</i> "<a href="https://www.unicode.org/charts/PDF/U0870.pdf">Arabic Extended-B</a>" (U+0870..089F, originating in Unicode 14.0, September 2021).</p><p>Even more confusing, codepoints within a block can be allocated at different times. For example, U+08C8 "ARABIC LETTER GRAF" was <a href="https://www.unicode.org/charts/PDF/Unicode-14.0/U140-08A0.pdf">assigned in Unicode 14.0</a> (September 2021); but its neighbour, U+08C7 "ARABIC LETTER LAM WITH SMALL ARABIC LETTER TAH ABOVE" was assigned in Unicode 13.0 (March 2020).</p><p>U+08C8 "ARABIC LETTER GRAF" is an addition to the Arabic script for writing the Balti language:</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg21rYZyppZoZM9v4FwrrUXKTPpKIVXe8GfGRhkXpkqWpf3jK1ID8BZIC0kgfRsgz7F-oO8c3PP2DnvdvuVkOCtcIv6vopwQYxeUIoGCDoF-m0whrsH-fG-GPMQ02ohmBc2bOOAgAsdls8/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="190" data-original-width="190" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg21rYZyppZoZM9v4FwrrUXKTPpKIVXe8GfGRhkXpkqWpf3jK1ID8BZIC0kgfRsgz7F-oO8c3PP2DnvdvuVkOCtcIv6vopwQYxeUIoGCDoF-m0whrsH-fG-GPMQ02ohmBc2bOOAgAsdls8/" width="240" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>Isolated form of U+08C8 </i>[<a href="https://r12a.github.io/uniview">source</a>]</td></tr></tbody></table></p><p>For example, the Balti word for knife (U+08C8 U+06CC):</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/a/AVvXsEjSiPyTZfGdxm7LqBgAFtI9ja1awtgOH-bUJm7EJfNFOY93ui9drUmltzslmnUVFNHoN09mkcLsM8f49nX8l_YrwM-euOfCJkgldWYM__B7BRuu2o3pXzBlpjnnVA3lx59h4wPBT-7CBtrMNrWbcM8Q5RrX7it7fvGC6Ycgn8KfbDLexUea304l0hVz=s381" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="381" data-original-width="356" src="https://blogger.googleusercontent.com/img/a/AVvXsEjSiPyTZfGdxm7LqBgAFtI9ja1awtgOH-bUJm7EJfNFOY93ui9drUmltzslmnUVFNHoN09mkcLsM8f49nX8l_YrwM-euOfCJkgldWYM__B7BRuu2o3pXzBlpjnnVA3lx59h4wPBT-7CBtrMNrWbcM8Q5RrX7it7fvGC6Ycgn8KfbDLexUea304l0hVz=s16000" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="http://karachvi.com/books/6/Balti_qaida.pdf">source</a>]</td></tr></tbody></table><p style="text-align: left;">U+08C8 is the only <i>specific </i>addition required for the Arabic script to be able to write Balti, although <a href="https://chilliant.com/universe/universe.html#C+0F6B">two other codepoints</a> (U+0F6B "TIBETAN LETTER KKA" and U+0F6C "TIBETAN LETTER RRA", added in Unicode 5.1, April 2008) are needed to write Balti using the Tibetan script.</p><p style="text-align: left;"><a href="https://omniglot.com/writing/balti.htm">Balti</a> is spoken in <a href="https://en.wikipedia.org/wiki/Baltistan">Baltistan</a>, or Little Tibet, a mountainous region in the Gilgit-Baltistan part of Pakistan-administered Kashmir. 
This <a href="https://en.wikipedia.org/wiki/Balti_(food)#Origin,_history_and_etymology">may</a> be the origin of the balti curry dish popular in the UK since the nineteen-seventies.</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-81361037098872545352022-01-28T10:57:00.002+00:002022-01-28T10:57:00.176+00:00Unicode Trivia U+0891<p><b>Codepoint:</b> U+0891 "ARABIC PIASTRE MARK ABOVE"<br /><b>Block:</b> U+0870..089F "Arabic Extended-B"</p><p>Consider this photograph by Tinou Bao:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhk-SlSmRJcmhG13LQPhLnHIC_5-LSzvGUUdUpXfaReaLRAiSeCS6TJ6j-uXrE6Ds1ZmDOryFGtFczyVBO_I_4D8Cf_FD3MWWAc-hNHFk87BtH8spRtoDdFpZ6psAJAoi4bBnHCfu34Lkjdakv7uBVsv-rrbblgOwmml1bUqdlFxyLsv68qT6nOx5lv=s640" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="640" data-original-width="512" height="640" src="https://blogger.googleusercontent.com/img/a/AVvXsEhk-SlSmRJcmhG13LQPhLnHIC_5-LSzvGUUdUpXfaReaLRAiSeCS6TJ6j-uXrE6Ds1ZmDOryFGtFczyVBO_I_4D8Cf_FD3MWWAc-hNHFk87BtH8spRtoDdFpZ6psAJAoi4bBnHCfu34Lkjdakv7uBVsv-rrbblgOwmml1bUqdlFxyLsv68qT6nOx5lv=w512-h640" width="512" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="https://www.flickr.com/photos/57621379@N00/2066076908">Fruit Seller</a><br /><i>"The guy asked to be photographed"</i></td></tr></tbody></table><p style="text-align: left;">It could elicit any number of reactions:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>What wonderfully colourful fruit!</li><li>Are those dates expensive?</li><li>I hope he doesn't drop cigarette ash on that fruit</li><li>His fruit are suspiciously glossy</li><li>Why did he <i>want</i> his <a 
href="https://www.cheapflights.co.uk/news/top-10-travel-scams-to-watch-out-for">photo</a> taken?</li><li>That's an interesting symbol above the price</li></ol><p></p><p>You're in the right place if your reaction was number six.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEijYmnQksNANajlAHqE8GIJcKPkdlHSoKMJQtcQF42nxPAomdpyAD26tcsI6jcUmxsshQxrDZRJiTg0YHQFGq19Ev5g3rcLZ4Jh0y4zRTqOCl0ul0Nlh7USY-wyrUCMIJwxlvf2uBZRGZOgkDsC515PQ7IskiF0V-yTdSh5UV4Zx8GqQ4VBGsn2PMPD=s285" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="285" data-original-width="263" height="285" src="https://blogger.googleusercontent.com/img/a/AVvXsEijYmnQksNANajlAHqE8GIJcKPkdlHSoKMJQtcQF42nxPAomdpyAD26tcsI6jcUmxsshQxrDZRJiTg0YHQFGq19Ev5g3rcLZ4Jh0y4zRTqOCl0ul0Nlh7USY-wyrUCMIJwxlvf2uBZRGZOgkDsC515PQ7IskiF0V-yTdSh5UV4Zx8GqQ4VBGsn2PMPD" width="263" /></a></div><p>That symbol, circled in blue, is an Arabic supertending currency symbol for <a href="https://en.wikipedia.org/wiki/Egyptian_pound">Egyptian piastres</a>. 
The photo was used as part of the <a href="https://www.unicode.org/L2/L2020/20245-three-arabic-symbols.pdf">proposal</a> for the addition of two new currency codepoints:</p><p></p><ul style="text-align: left;"><li>U+0890 "ARABIC POUND MARK ABOVE"</li><li>U+0891 "ARABIC PIASTRE MARK ABOVE"</li></ul><p></p><p>The proposal was formally submitted in August 2020, <a href="https://www.unicode.org/L2/L2020/20250-script-adhoc-rept.pdf">accepted</a> in October 2020 and <a href="http://www.unicode.org/versions/Unicode14.0.0/">released</a> as part of Unicode 14.0 in September 2021.</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhXYdk-sbhKKO5NWT8kaldkUoCetlKOqOtd7le4hsp3xqNzzR8Y0SUBzTkkjjW65IxLSbbOcKuF8QO5l1RSY-1yJD05EBbH-gPgGeg0x_Y2Wa5ERWWJtZ7MwTGICB2jr1nShwDmFnFiGo/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="190" data-original-width="190" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhXYdk-sbhKKO5NWT8kaldkUoCetlKOqOtd7le4hsp3xqNzzR8Y0SUBzTkkjjW65IxLSbbOcKuF8QO5l1RSY-1yJD05EBbH-gPgGeg0x_Y2Wa5ERWWJtZ7MwTGICB2jr1nShwDmFnFiGo/s16000/image.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://r12a.github.io/uniview/?char=0891">source</a>]</td></tr></tbody></table><br /><br /><p></p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-42198841260075580222022-01-27T10:13:00.048+00:002022-01-27T10:13:00.157+00:00Unicode Trivia U+0861<p><b>Codepoint:</b> U+0861 "SYRIAC LETTER MALAYALAM JA"<br /><b>Block:</b> U+0860..086F "Syriac Supplement"</p><p>The <a href="https://www.unicode.org/charts/PDF/U0860.pdf">Syriac Supplement</a> block contains letters used for writing <a 
href="https://omniglot.com/writing/suriyanimalayalam.htm">Suriyani Malayalam</a>, also known as Syriac Malayalam. This is an Eastern Syriac script with eleven new letters added to capture Malayalam sounds:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgXdrRRmGE7BvvBP46gHnPIYMV74GUloAPeh2I6SlAgypRT70613NE0SqivAVb8yJWLAaw-wr6USgJ0_fFdI7qD0xfa9t04BeWS40e2aqF_yFUWvHvJmrXsZ8Wq7s1TeOmT0DTHA0kkxtLNKPZOU7GmK6BB4dH-NfKYEzbk7KnIDB1iNqn9lSmPkAvw=s364" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="34" data-original-width="364" src="https://blogger.googleusercontent.com/img/a/AVvXsEgXdrRRmGE7BvvBP46gHnPIYMV74GUloAPeh2I6SlAgypRT70613NE0SqivAVb8yJWLAaw-wr6USgJ0_fFdI7qD0xfa9t04BeWS40e2aqF_yFUWvHvJmrXsZ8Wq7s1TeOmT0DTHA0kkxtLNKPZOU7GmK6BB4dH-NfKYEzbk7KnIDB1iNqn9lSmPkAvw=s16000" /></a></div><p>The Syriac and Malayalam scripts are almost entirely unrelated; the former is a right-to-left abjad from the Middle East:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh5F3qWvWSFEEAZ8_TzbV6pW9HrYKdUGciWIq-NqbWB97uA3uSGbIQzrmC9Ca2qBYipyh-75d5CUlrUwPk3MSAvaAeuxuSRcyFBe-4BZoc9C7HqBiAglx861wTg0BwsYFu8T5VsHIk4ZX4egQIhN-BAAx776ZUBLcbl-IQhIBCvElDSfsDZI1qwtuWG=s727" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="34" data-original-width="727" src="https://blogger.googleusercontent.com/img/a/AVvXsEh5F3qWvWSFEEAZ8_TzbV6pW9HrYKdUGciWIq-NqbWB97uA3uSGbIQzrmC9Ca2qBYipyh-75d5CUlrUwPk3MSAvaAeuxuSRcyFBe-4BZoc9C7HqBiAglx861wTg0BwsYFu8T5VsHIk4ZX4egQIhN-BAAx776ZUBLcbl-IQhIBCvElDSfsDZI1qwtuWG=s16000" /></a></div><p>Whilst the latter is a left-to-right abugida from Southern Asia:</p><p><a 
href="https://blogger.googleusercontent.com/img/a/AVvXsEiikvTJC7cSoaqvb1oQBuvQQ6y21jEjb4iB6EeOsUv76IOK_uMD9wiTQwgI1llFHm_yfMbQf0DJa4eelwMpttAb-BtNqz-F0SQ46btjZJDZLts1vBoOgbmNEXQWe2-ckcu9hj4GYOD9zUcv7YuVAKAfJJrj2-r4sgnh59Fsax-QkWjuR1rSnKYmB1-f=s1189" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="34" data-original-width="1189" src="https://blogger.googleusercontent.com/img/a/AVvXsEiikvTJC7cSoaqvb1oQBuvQQ6y21jEjb4iB6EeOsUv76IOK_uMD9wiTQwgI1llFHm_yfMbQf0DJa4eelwMpttAb-BtNqz-F0SQ46btjZJDZLts1vBoOgbmNEXQWe2-ckcu9hj4GYOD9zUcv7YuVAKAfJJrj2-r4sgnh59Fsax-QkWjuR1rSnKYmB1-f=s16000" /></a></p><p>So the "mashing together" of the two scripts is somewhat surprising and problematic.</p><p>For example, the Suriyani Malayalam letter "<i>ja</i>" only appears in isolated form, so the "standard" U+0D1C "MALAYALAM LETTER JA" could have been used; however, the <a href="https://www.unicode.org/L2/L2015/15156-syriac-malayalam.pdf">decision was taken</a> to encode a separate U+0861 "SYRIAC LETTER MALAYALAM JA":</p><blockquote><p><i>Although it may be possible to use U+0D1C within a Syriac environment, a separate encoding is needed</i> [...] <i>so that Syriac vowel marks can be combined with the letter. Furthermore the differing directionalities of the Malayalam and Syriac scripts may cause problems for introducing a Malayalam character directly in Syriac sequences.</i></p></blockquote><p>Anyone who has tried editing text with mixed left-to-right and right-to-left script will appreciate that last comment.</p><p>Suriyani Malayalam is used by Saint Thomas Christians of Kerala in India as a liturgical language. <a href="https://en.wikipedia.org/wiki/Saint_Thomas_Christians#History">According to tradition</a>, Thomas the Apostle voyaged to Muziris on the Malabar coast (Kerala) in 52 CE, bringing Christianity to the region. 
This may sound <a href="https://www.smithsonianmag.com/travel/how-christianity-came-to-india-kerala-180958117/">implausible</a>, but Kerala had an established Jewish community at around that time, particularly in Cochin. So it is possible for an Aramaic-speaking Jew, such as Saint Thomas from Galilee, to make a trip to Kerala via the maritime Silk Road routes:</p><div class="separator" style="clear: both; text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgFdI0cgrWOJ_tFFkxTwExFC8ERtGRDAhZvyfOMhPbuysxUqjY7veMR3BimjPnefQpZibl1X8TL0tB9y-oYOmfseT61RfAYL2A9fVhh31gUNsO1-vMeeBeFtgOkZLvqmoec3gg25p1qxbc4BIUDcR1M0LsQIbkCV-PxWlh21RA78BCO1f4_7ucpaDc1=s850" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="432" data-original-width="850" height="326" src="https://blogger.googleusercontent.com/img/a/AVvXsEgFdI0cgrWOJ_tFFkxTwExFC8ERtGRDAhZvyfOMhPbuysxUqjY7veMR3BimjPnefQpZibl1X8TL0tB9y-oYOmfseT61RfAYL2A9fVhh31gUNsO1-vMeeBeFtgOkZLvqmoec3gg25p1qxbc4BIUDcR1M0LsQIbkCV-PxWlh21RA78BCO1f4_7ucpaDc1=w640-h326" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="http://www.historyshistories.com/silk-road-maritime-route.html">source</a>]</td></tr></tbody></table><p style="text-align: left;">Perhaps not surprisingly, after almost two thousand years, the Saint Thomas Christians have experienced schisms and (sadly fewer) reunifications:</p><p style="text-align: justify;"><span style="text-align: left;"></span></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/a/AVvXsEg1Cw9ooUnzrzer0eyYxddYCPhs2ueRStKz6lhfpW0fLJraY2YEyj1Hv7eZbRIV1J8TxcJWQ7MdGYa5Sj0G20Fu7Dux_5lJoyYDugHVM7g0aFkA_2Pxc1_hBqe4OMDSUt9uptKU8PTVg8VPz4cyif0Vu2NIqDCarxCRQPfeqAlu4MCM_XpCyQTSruq6=s1024" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="673" data-original-width="1024" height="420" src="https://blogger.googleusercontent.com/img/a/AVvXsEg1Cw9ooUnzrzer0eyYxddYCPhs2ueRStKz6lhfpW0fLJraY2YEyj1Hv7eZbRIV1J8TxcJWQ7MdGYa5Sj0G20Fu7Dux_5lJoyYDugHVM7g0aFkA_2Pxc1_hBqe4OMDSUt9uptKU8PTVg8VPz4cyif0Vu2NIqDCarxCRQPfeqAlu4MCM_XpCyQTSruq6=w640-h420" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://commons.wikimedia.org/wiki/File:SaintThomasChristian%27sDivisionsHistoryFinal-en.svg">source</a>]</td></tr></tbody></table></div>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-52320004025503846122022-01-26T10:02:00.000+00:002022-01-26T10:02:00.159+00:00Unicode Trivia U+0840<p><b>Codepoint:</b> U+0840 "MANDAIC LETTER HALQA"<br /><b>Block:</b> U+0840..085F "Mandaic"</p><p style="text-align: left;">The <a href="https://omniglot.com/writing/mandaic.htm">Mandaic alphabet</a> contains 22 letters (in the same order as the Aramaic alphabet) and one digraph:</p><p style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhcIg7l1Fsz3HWOQB2TEF0ARJljzHJvzXO773GkRQp8x4PNQNt9Q3yUe8KmjAa1zDdQi2vZW_PUpFjvyEe1CL_Xxa1uqJ1g9nfoJVUjYy1IggfK1GPhQpm_Jo5tXiB6PHHyP472-vZlFA-tHqINIJigwxONeGRBN8JSMseyN4hV1zCQZHCDZwWup5Ca=s760" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="34" data-original-width="760" height="29" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEhcIg7l1Fsz3HWOQB2TEF0ARJljzHJvzXO773GkRQp8x4PNQNt9Q3yUe8KmjAa1zDdQi2vZW_PUpFjvyEe1CL_Xxa1uqJ1g9nfoJVUjYy1IggfK1GPhQpm_Jo5tXiB6PHHyP472-vZlFA-tHqINIJigwxONeGRBN8JSMseyN4hV1zCQZHCDZwWup5Ca=w640-h29" width="640" /></a></p><p style="text-align: left;">The alphabet is "rounded up" to a symbolic count of 24 letters by repeating the first letter, U+0840 "MANDAIC LETTER HALQA". It is unusual for a Semitic script in being a true alphabet with letters for both consonants and vowels:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>U+0840 "Halqa" = a [<i>vowel</i>]</li><li>U+0841 "Ab" = ba</li><li>U+0842 "Ag" = ga</li><li>U+0843 "Ad" = da</li><li>U+0844 "Ah" = ha</li><li>U+0845 "Ushenna" = wa [<i>vowel</i>]</li><li>U+0846 "Az" = za</li><li>U+0847 "It" = eh</li><li>U+0848 "Att" = ṭa</li><li>U+0849 "Aksa" = ya [<i>vowel</i>]</li><li>U+084A "Ak" = ka</li><li>U+084B "Al" = la</li><li>U+084C "Am" = ma</li><li>U+084D "An" = na</li><li>U+084E "As" = sa</li><li>U+084F "In" = e [<i>vowel</i>]</li><li>U+0850 "Ap" = pa</li><li>U+0851 "Asz" = ṣa</li><li>U+0852 "Aq" = qa</li><li>U+0853 "Ar" = ra</li><li>U+0854 "Ash" = ša</li><li>U+0855 "At" = ta</li><li>U+0856 "Dushenna" = ḏ</li></ol><p></p><p style="text-align: left;">The eighteenth letter was renamed from "Ass" to "Asz" as part of the <a href="https://www.unicode.org/wg2/docs/n3485.pdf">original proposal</a>, presumably to stop the giggling at the back of the classroom.</p><p style="text-align: left;">The <a href="https://en.wikipedia.org/wiki/Mandaic_language">Classical Mandaic language</a> is still used by Mandaean priests in liturgical rites. It is estimated that there are about 5,500 native speakers. <a href="https://en.wikipedia.org/wiki/Neo-Mandaic">Neo-Mandaic</a> is a modern evolution of Mandaic but generally unwritten. 
Only a few hundred Mandaeans, located mainly in Iran, speak Neo-Mandaic as a first language.</p><p style="text-align: left;">One of the unintended consequences of the <a href="https://en.wikipedia.org/wiki/2003_invasion_of_Iraq">2003 invasion of Iraq</a> was the diaspora of over 60,000 Iraqi Mandaeans. Today, <a href="https://en.wikipedia.org/wiki/Mandaeans_in_Sweden">Sweden</a> has the largest community of <i>any</i> country.</p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-2928620844722640142022-01-25T10:29:00.010+00:002022-01-25T10:29:00.156+00:00Unicode Trivia U+0837<p><b>Codepoint:</b> U+0837 "SAMARITAN PUNCTUATION MELODIC QITSA"<br /><b>Block:</b> U+0800..083F "Samaritan"</p><p>The <a href="https://omniglot.com/writing/samaritan.htm">Samaritan script</a> was derived from the Paleo-Hebrew <i>circa</i> 600 BCE and was used alongside the Aramaic script in Judaism until the latter was repurposed as the Hebrew alphabet <i>circa</i> 100 BCE.</p><p>Samaritan is a right-to-left abjad with 22 basic consonants and diacritics to mark vowels:</p><p></p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiIzdbdet4rLdfUJjR4dOgiJy_-FgWysYPVYuj6HR6hhmFKs2Yls0vf2-79N1WOfZjgyGeHpZz7CwYsUyjcH3SqcQimQPvpeZsoVEVTEFVJc14OdL6cp7fcm65zVHi0PmXiLLuAjUwqLx4caNY-NScVEwxTVa5VIaCDqfm3-rJ2jWGVi1scBuEKXjvw=s727" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="34" data-original-width="727" src="https://blogger.googleusercontent.com/img/a/AVvXsEiIzdbdet4rLdfUJjR4dOgiJy_-FgWysYPVYuj6HR6hhmFKs2Yls0vf2-79N1WOfZjgyGeHpZz7CwYsUyjcH3SqcQimQPvpeZsoVEVTEFVJc14OdL6cp7fcm65zVHi0PmXiLLuAjUwqLx4caNY-NScVEwxTVa5VIaCDqfm3-rJ2jWGVi1scBuEKXjvw=s16000" /></a></div></div><p style="text-align: left;">Much is made of the extensive <a 
href="https://chilliant.com/universe/universe.html#C+0837">punctuation</a> in the Samaritan script. Here are the fifteen codepoints of the "Punctuation" column (U+0830 to U+083E):</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj1dinKK-JEFccH0RSasQQkvEHtF5OM5rqvDXseH_k556K183eKxr3G5QrRnPyTOMgJxHtJYtziZSR7lXTjZ9JZb-BKGMFmyXuOFtqv38bU2ttQ7dw-MWTlVOLzQcK_9vk_NM3mCzBgKdSQlRafqJ3LnYOSUMXUZktHgd0Vpi5Fw1Hbduwbqfn-L0k8=s496" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="67" data-original-width="496" src="https://blogger.googleusercontent.com/img/a/AVvXsEj1dinKK-JEFccH0RSasQQkvEHtF5OM5rqvDXseH_k556K183eKxr3G5QrRnPyTOMgJxHtJYtziZSR7lXTjZ9JZb-BKGMFmyXuOFtqv38bU2ttQ7dw-MWTlVOLzQcK_9vk_NM3mCzBgKdSQlRafqJ3LnYOSUMXUZktHgd0Vpi5Fw1Hbduwbqfn-L0k8=s16000" /></a></div><br /><p style="text-align: left;"><b><span></span></b></p><a name='more'></a><b>a. U+0830 "SAMARITAN PUNCTUATION NEQUDAA"</b><p></p><p>A central dot. Confusingly, the <a href="https://www.unicode.org/charts/PDF/U0800.pdf">annotation</a> for this codepoint says it's a "word separator". However, the original Unicode <a href="https://www.unicode.org/wg2/docs/n3377.pdf">submission notes</a> suggest that, along with U+0831, it should be treated as a sentence terminal with U+2E31 "WORD SEPARATOR MIDDLE DOT" used for Samaritan word separation. But the <a href="https://util.unicode.org/UnicodeJsps/character.jsp?a=0830">UCD</a> confirms that U+0830 <i>was not</i> marked as a sentence terminator. <a href="https://www.unicode.org/versions/Unicode14.0.0/ch09.pdf">Section 9.4 of the Unicode Standard</a> again suggests using U+2E31 to separate Samaritan words and that U+0830 is actually analogous to an English semicolon. Not to be confused with the <i>combining </i>U+082D "SAMARITAN MARK NEQUDAA" which is an editorial mark indicating that there is a variant reading of the word.</p><p><b>b. 
U+0831 "SAMARITAN PUNCTUATION AFSAAQ"</b></p><p>Literally "<i>interruption</i>". <a href="https://www.unicode.org/versions/Unicode14.0.0/ch09.pdf">Section 9.4 of the Unicode Standard</a> describes it as a "pause"; slightly longer than a semicolon. It does not terminate a sentence. Looks and acts like an English colon.</p><p><b>c. U+0832 "SAMARITAN PUNCTUATION ANGED"</b></p><p>Literally "<i>restraint</i>". A shorter pause than U+0831. Possibly analogous to an English comma.</p><p><b>d. U+0833 "SAMARITAN PUNCTUATION BAU"</b></p><p>Literally "<i>prayer</i>". Performative punctuation indicating that the preceding sentence is a request, prayer or humble petition.</p><p><b>e. U+0834 "SAMARITAN PUNCTUATION ATMAAU"</b></p><p>Literally "<i>surprise</i>". Performative punctuation indicating that the preceding sentence is an expression of surprise. Analogous to the English exclamation mark.</p><p><b>f. U+0835 "SAMARITAN PUNCTUATION SHIYYAALAA"</b></p><p>Literally "<i>question</i>". Performative punctuation indicating that the preceding sentence is a question. Analogous to the English question mark.</p><p><b>g. U+0836 "SAMARITAN ABBREVIATION MARK"</b></p><p>Follows an abbreviation. Analogous to an English full stop used in this way.</p><p><b>h. U+0837 "SAMARITAN PUNCTUATION MELODIC QITSA"</b></p><p>An end of section, like U+0839. It indicates the end of a sentence which should be read melodically.</p><p><b>i. U+0838 "SAMARITAN PUNCTUATION ZIQAA"</b></p><p>Literally "<i>shouting</i>". Performative punctuation indicating that the preceding sentence is shouted or cried out. Analogous to the English exclamation mark or, just possibly, <a href="https://meh.com/forum/topics/capital-crimes-part-1--shout-shout-let-it-all-out">SHOUTY CAPS</a>.</p><p><b>j. U+0839 "SAMARITAN PUNCTUATION QITSA"</b></p><p>An end of section. It may be followed by a blank line. Analogous to a paragraph break.</p><p><b>k. U+083A "SAMARITAN PUNCTUATION ZAEF"</b></p><p>Literally "<i>outburst</i>". 
Performative punctuation indicating that the preceding sentence is an outburst or said in anger. Analogous to the English exclamation mark.</p><p><b>l. U+083B "SAMARITAN PUNCTUATION TURU"</b></p><p>Literally "<i>teaching</i>". Performative punctuation indicating that the preceding sentence is a didactic expression or teaching.</p><p><b>m. U+083C "SAMARITAN PUNCTUATION ARKAANU"</b></p><p>Literally "<i>submissiveness</i>". Performative punctuation indicating that the preceding sentence is an expression of submissiveness.</p><p><b>n. U+083D "SAMARITAN PUNCTUATION SOF MASHFAAT"</b></p><p>An end of sentence. Analogous to an English full stop.</p><p><b>o. U+083E "SAMARITAN PUNCTUATION ANNAAU"</b></p><p>Literally "<i>rest</i>". Like U+0839 but indicates that a longer time has passed between actions narrated in the sentences it separates. Analogous to a paragraph break with a <a href="https://en.wikipedia.org/wiki/Dinkus">dinkus</a>.</p><span><!--more--></span><p>I agree that this is an impressive repertoire of punctuation. 
<a href="https://en.wikipedia.org/wiki/List_of_typographical_symbols_and_punctuation_marks">Latin typography</a> is similarly rich, so I've tried to come up with an equivalence table:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgMB6Nu24958uhmLr44e7Q0Jzkgedld3YhltQT66aBk2IrFR_6gCievmqU1FXaSH5Zle6YXibDimChKc5KC046slbp-y8IAIKC3JUbJFBaCYCAujGYZEFyb4GON8o66UjujFsBuJk0ooptidqt0fp7r8jy0-r-eYE7TdwvHwHL9pB1aDgq1wAJ48-vS=s537" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="531" data-original-width="537" src="https://blogger.googleusercontent.com/img/a/AVvXsEgMB6Nu24958uhmLr44e7Q0Jzkgedld3YhltQT66aBk2IrFR_6gCievmqU1FXaSH5Zle6YXibDimChKc5KC046slbp-y8IAIKC3JUbJFBaCYCAujGYZEFyb4GON8o66UjujFsBuJk0ooptidqt0fp7r8jy0-r-eYE7TdwvHwHL9pB1aDgq1wAJ48-vS=s16000" /></a></div><p>Obviously, it's not a one-to-one mapping and these days you'd probably use symbols like "♫" or emoji for some of them.</p><p>I smiled when I saw that U+0837 "SAMARITAN PUNCTUATION MELODIC QITSA" comes at the <i>end</i> of the sentence. I had visions of performers suddenly coming across the marker and realising, too late, that they were meant to be singing or chanting the preceding text. 
I guess that's why <a href="https://bbc.github.io/subtitle-guidelines/#Indicate-song-lyrics-with">subtitles use hash symbols</a> at the beginning <i>and</i> end of lyrics.</p><p>In a similar vein, I've always found the English convention of putting exclamation marks and question marks <i>only</i> at the end of sentences bad for reading:</p><p></p><blockquote><p>Have you finished?</p><p>Yes!</p></blockquote><p></p><p>The <a href="https://en.wikipedia.org/wiki/Inverted_question_and_exclamation_marks">Spanish convention</a> seems much more sensible:</p><p></p><blockquote><p>¿Has terminado?</p><p>¡Sí!</p></blockquote><p></p>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-39876342687430427602022-01-24T09:25:00.346+00:002022-01-24T09:25:00.156+00:00Unicode Trivia U+07C1<p><b>Codepoint:</b> U+07C1 "NKO DIGIT ONE"<br /><b>Block:</b> U+07C0..07FF "NKo"</p><p>As reported by Dr Dianne White Oyler in "<a href="https://www.jstor.org/stable/3097555">A Cultural Revolution in Africa: Literacy in the Republic of Guinea since Independence</a>" (2001), the N'Ko script was developed by <a href="https://en.wikipedia.org/wiki/Solomana_Kante">Souleymane Kanté</a> in 1949, partly in response to</p><blockquote><p><i>a 1944 challenge posed by the Lebanese journalist Kamal Marwa in an Arabic-language publication, </i>Nahnu fi Afrikiya<i> [We Are in Africa]. Marwa argued that Africans were inferior because they possessed no indigenous written form of communication. His statement that "African voices [languages] are like those of the birds, impossible to transcribe" reflected the prevailing views of many colonial Europeans. 
Although the journalist acknowledged that the Vai had created a syllabary, he discounted its cultural relevancy because he deemed it incomplete.</i> [Page 588]</p></blockquote><p>Kanté discarded both Arabic and Latin scripts as unable to transcribe all the characteristics of the <a href="https://en.wikipedia.org/wiki/Mande_languages">Mande languages</a>. Having developed a completely novel alphabet instead,</p><blockquote><p><i>he called together children and illiterates and asked them to draw a line in the dirt; he noticed that seven out of ten drew the line from right to left. For that reason he chose a right-to-left orientation. In all Mande languages the pronoun </i>n-<i> means "I" and the verb </i>ko<i> represents the verb "to say". </i>[Page 589]</p></blockquote><p>So "N'Ko" means "I say" in all the target languages.</p><p>The right-to-left mantra extends not only to words, but to digits and numbers too. The ten digits zero to nine (U+07C0 "NKO DIGIT ZERO" to U+07C9 "NKO DIGIT NINE") face right:</p><p></p><div class="separator" style="clear: both; text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh014LREWidG_UMrbMMQLWYNVAaSTc65hfpZcNUGfmzxj9g-zllXll3iJaygHjTC1Y2Zi7rA1OwcItuoJTR1QOwYsfF9wwmd0KsG5j9vob0Ev33YhOmzO73N8lNuAY0V0XNyt082RVpuxaLwFfhfDrd1T1fb3P8GmiXzWciHNnGJgaXY-NBTDepbezp=s331" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="100" data-original-width="331" height="97" src="https://blogger.googleusercontent.com/img/a/AVvXsEh014LREWidG_UMrbMMQLWYNVAaSTc65hfpZcNUGfmzxj9g-zllXll3iJaygHjTC1Y2Zi7rA1OwcItuoJTR1QOwYsfF9wwmd0KsG5j9vob0Ev33YhOmzO73N8lNuAY0V0XNyt082RVpuxaLwFfhfDrd1T1fb3P8GmiXzWciHNnGJgaXY-NBTDepbezp=s320" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><i>N'Ko digits (top), 
Western Arabic (middle), Eastern Arabic (bottom)</i></td></tr></tbody></table></div><p></p><p>This is particularly noticeable with U+07C1 "NKO DIGIT ONE": '߁'</p><p>Not only that, but the least significant digits of multi-digit N'Ko numbers are on the left, unlike almost all other writing systems. Latin, Greek, Arabic and Hebrew numbers place the least significant digit on the right, even though the latter two scripts are written right-to-left.</p><p>Consider the improbable phrase "There are 12345 eggs":</p><p style="text-align: center;"><span style="font-family: Roboto Mono; font-size: large;">There are <span style="color: red;">12345</span> eggs = <i>English</i></span></p><p style="text-align: center;"><span style="font-family: Roboto Mono; font-size: large;">Υπάρχουν <span style="color: red;">12345</span> αυγά = <i>Greek</i></span></p><p style="text-align: center;"><span style="font-family: Roboto Mono; font-size: large;"> يوجد <span style="color: red;">١٢٣٤٥</span> بيضة = <i>Arabic</i></span></p><p style="text-align: center;"><span style="font-family: Roboto Mono; font-size: large;">יש <span style="color: red;">12345</span> ביצים = <i>Hebrew</i></span></p><p style="text-align: center;"><span style="font-family: Roboto Mono; font-size: large;"><span style="color: red;">߁߂߃߄߅</span> ߞߟߌ߫ ߦߋ߫ ߦߋ߲߬ = <i>N’Ko</i></span></p><p>In case of tofu:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg8jUZkpa4I-SSx-b7Ocf6aivbdzNQkZ4X6BuLnOjFsnYvjcFXWYiFdR_6-dWJG8cH4kLcJ0oWO6is4jWXokWIQTkHuM8qv9JMw7mbu3wTFW0nX2vJ8BD3wOiXicLjM5iamUEMvSRVD73Wa4h-k3Gn1qW-0osPyzqd-Q_-ewu65dsvzGQV1_Y47R1_6=s493" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="295" data-original-width="493" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEg8jUZkpa4I-SSx-b7Ocf6aivbdzNQkZ4X6BuLnOjFsnYvjcFXWYiFdR_6-dWJG8cH4kLcJ0oWO6is4jWXokWIQTkHuM8qv9JMw7mbu3wTFW0nX2vJ8BD3wOiXicLjM5iamUEMvSRVD73Wa4h-k3Gn1qW-0osPyzqd-Q_-ewu65dsvzGQV1_Y47R1_6=s16000" /></a></div><p style="text-align: left;">Note that the codepoints for "1", "2", "3", "4" and "5" occur in ascending memory order in all cases. For example:</p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_3O7qvkvKSGbDARlZMkBwMVeEmsY3ldet-xMBi6FsVIk6CMc9lkPi2XrHiT4Kj6L3AvOVmLUBm4eYKfH0bE_vImmrv5wIe6fAklmHtDBa0iDEsr-6Qh1TbYnIqdVEJFpHo4alD_ksv0E/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="67" data-original-width="628" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_3O7qvkvKSGbDARlZMkBwMVeEmsY3ldet-xMBi6FsVIk6CMc9lkPi2XrHiT4Kj6L3AvOVmLUBm4eYKfH0bE_vImmrv5wIe6fAklmHtDBa0iDEsr-6Qh1TbYnIqdVEJFpHo4alD_ksv0E/s16000/image.png" /></a></div><br />At first, I wasn't sure how much "support" the Unicode standard gives for this type of anomaly. UCD's sister project <a href="https://cldr.unicode.org/">CLDR</a> (Common Locale Data Repository) has <a href="https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/numberingSystems.json#L248">very little</a> to say about N'Ko. 
There <i>is</i> scope for <a href="https://unicode-org.github.io/cldr/ldml/tr35-numbers.html#Rule-Based_Number_Formatting">algorithmic number formatting</a>, but I didn't find anything specific.<p></p><p style="text-align: left;">However, after a bit of thought I realised that, because directionality is a property of each codepoint and <i>not</i> of the script of the codepoints, digit ordering in N'Ko works "out of the box".</p><p style="text-align: left;">Consider these bidirectional class fields ("bc") from the UCD:</p><p style="text-align: left;"></p><ul style="text-align: left;"><li>Latin</li><ul><li>"A" (U+0041 "LATIN CAPITAL LETTER A") = "L" = strong left-to-right</li><li>"1" (U+0031 "DIGIT ONE") = "EN" = European number (left-to-right)</li></ul><li>Greek</li><ul><li>"α" (U+03B1 "GREEK SMALL LETTER ALPHA") = "L" = strong left-to-right</li></ul><li>Arabic</li><ul><li>"ا" (U+0627 "ARABIC LETTER ALEF") = "AL" = Arabic letter (right-to-left)</li><li>"١" (U+0661 "ARABIC-INDIC DIGIT ONE") = "AN" = Arabic number (left-to-right)</li></ul><li>Hebrew</li><ul><li>"א" (U+05D0 "HEBREW LETTER ALEF") = "R" = strong right-to-left</li></ul><li>N'Ko</li><ul><li>"ߊ" (U+07CA "NKO LETTER A") = "R" = strong right-to-left</li><li>"߁" (U+07C1 "NKO DIGIT ONE") = "R" = strong right-to-left</li></ul></ul><p></p><p style="text-align: left;">Unlike the other digits, N'Ko digits are marked as strongly right-to-left. 
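</p><p style="text-align: left;">These classes can be checked without opening the UCD files by hand. What follows is only a rough sketch, assuming Python's standard <code>unicodedata</code> module, whose tables track whichever Unicode version the interpreter ships with (not necessarily 14.0). The names <code>SAMPLES</code>, <code>bidi_classes</code> and <code>rtl_decimal_digits</code> are hypothetical, chosen purely for illustration:</p>

```python
import sys
import unicodedata

# The sample characters from the list above, written as escapes so that
# the source file itself contains no right-to-left text.
SAMPLES = ["A", "1", "\u03b1", "\u0627", "\u0661", "\u05d0", "\u07ca", "\u07c1"]


def bidi_classes(chars):
    """Map each character (as 'U+XXXX') to its bidirectional class ("bc")."""
    return {f"U+{ord(c):04X}": unicodedata.bidirectional(c) for c in chars}


def rtl_decimal_digits():
    """Return every decimal digit (gc=Nd) that is strongly right-to-left (bc=R).

    The result depends on the Unicode version bundled with the interpreter,
    so it is an assumption that it matches any particular release.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Nd"
        and unicodedata.bidirectional(chr(cp)) == "R"
    ]


if __name__ == "__main__":
    for codepoint, bc in bidi_classes(SAMPLES).items():
        print(codepoint, bc)
    for digit in rtl_decimal_digits():
        print(f"U+{ord(digit):04X}", unicodedata.name(digit))
```

<p style="text-align: left;">The second function is a local approximation of a query for decimal digits whose class is "R"; because it reads the interpreter's own tables, its output may differ from one Python release to the next.</p><p style="text-align: left;">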
The only other examples in Unicode 14.0 I could <a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3Agc%3ANd%3A%5D%26%5B%3Abc%3AR%3A%5D%5D">find</a> were <a href="https://en.wikipedia.org/wiki/Adlam_script#Digits">Adlam digits</a> (1989).</p><p style="text-align: left;">Another interesting codepoint from the Unicode "NKo" block is <a href="https://r12a.github.io/scripts/nko/block#char07F7">U+07F7</a> "NKO SYMBOL GBAKURUNEN":</p><p style="text-align: center;"><span style="font-size: x-large;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-size: x-large;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnLKkP7MsSRRUlkHW5-_zvknZpfcqGYgxPthPLyJ2k8bh2PmkfeO0eT3mD2LjkmpboICtR9uLIW27GLvAFJT6is01kEOQoWyt_Lq1r1K01P4P_HcGE0ejKr1j6nB2WEKRyp_Mrnd-6ISg/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="190" data-original-width="190" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnLKkP7MsSRRUlkHW5-_zvknZpfcqGYgxPthPLyJ2k8bh2PmkfeO0eT3mD2LjkmpboICtR9uLIW27GLvAFJT6is01kEOQoWyt_Lq1r1K01P4P_HcGE0ejKr1j6nB2WEKRyp_Mrnd-6ISg/s16000/image.png" /></a></span></div><p></p><p style="text-align: left;">It's a decorative punctuation symbol used to mark the end of a major section of text and represents the <a href="https://www.appropedia.org/Three_stone_cooking_fire">three stones holding a cooking pot</a> over a fire:</p><p style="text-align: left;"></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEio9NEeIJrby9-v3IUsQDwqx8ZwvUpntWxb5oJBYr2D_TVHsDE8ZrLkZ5iJCsMutfhT612ZfePa7gqipbj-o160AFlTA9KVacu7ZaXbzsHv_C7OiPBXdxvq2uE84sqDo2gMwQczs99EnV5k5YJE8zA0L9AjLqapaw5uqHwtmhmdr2uwGa0QUMd1bknW=s572" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="572" 
data-original-width="571" src="https://blogger.googleusercontent.com/img/a/AVvXsEio9NEeIJrby9-v3IUsQDwqx8ZwvUpntWxb5oJBYr2D_TVHsDE8ZrLkZ5iJCsMutfhT612ZfePa7gqipbj-o160AFlTA9KVacu7ZaXbzsHv_C7OiPBXdxvq2uE84sqDo2gMwQczs99EnV5k5YJE8zA0L9AjLqapaw5uqHwtmhmdr2uwGa0QUMd1bknW=s16000" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://www.lowtechmagazine.com/2014/06/thermal-efficiency-cooking-stoves.html">source</a>]</td></tr></tbody></table><p></p><p style="text-align: left;">Finally, there can't be many alphabets that have their own day: <a href="https://anydayguide.com/calendar/1899">April 14</a>.</p><div>[Many thanks to <a href="https://www.ankataa.com/about">Coleman Donaldson</a> for help with the N'Ko language]</div><div><p></p></div>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-16663556973658796922022-01-23T10:40:00.003+00:002022-01-23T10:40:00.168+00:00Unicode Trivia U+0780<p><b>Codepoint:</b> U+0780 "THAANA LETTER HAA"<br /><b>Block:</b> U+0780..07BF "Thaana"</p><p>The <a href="https://omniglot.com/writing/thaana.htm">Thaana script</a> is used to write the Maldivian language. According to <a href="https://en.wikipedia.org/wiki/Thaana">Wikipedia</a>, it's an abugida with no inherent vowel. According to the <a href="https://unicode.org/iso15924/iso15924-codes.html">ISO standard</a>, it's a right-to-left-written alphabet (as indicated by the hundreds digit of its numeric ISO-15924 code "170").</p><p style="text-align: left;">It first appeared in about 1705 CE and seems to have been developed with obfuscation in mind. 
The alphabet order is arbitrary and the consonant letterforms are derived from numeric figures:</p><p style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhm_fw-J8sdIPNnqKxA5dayFls0IXyUp8RuwvXuxH2C1qJvcgjmxyhQ1l5KKES8xRh68kcMxyDiVQmbGqunPujUT9mjvZVhGj_zkCtigiRYlo6YYtwTLRK3Cb4X9395u2RIPqNTI74hCE3w-C9kZV6vVTPjLqGTNh6N_qW9-KLVrxlGjjgk1fo_kdNn=s793" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="67" data-original-width="793" height="54" src="https://blogger.googleusercontent.com/img/a/AVvXsEhm_fw-J8sdIPNnqKxA5dayFls0IXyUp8RuwvXuxH2C1qJvcgjmxyhQ1l5KKES8xRh68kcMxyDiVQmbGqunPujUT9mjvZVhGj_zkCtigiRYlo6YYtwTLRK3Cb4X9395u2RIPqNTI74hCE3w-C9kZV6vVTPjLqGTNh6N_qW9-KLVrxlGjjgk1fo_kdNn=w640-h54" width="640" /></a></p>On the top row, in white, are the 24 basic consonants in Thaana alphabetical order. These are the 24 consecutive Unicode codepoints U+0780 "THAANA LETTER HAA" to U+0797 "THAANA LETTER CHAVIYANI".<div><p style="text-align: left;">The second row shows the Arabic-Indic digits one to nine in blue and the Dhives Akuru digits one to six in red. Dhives Akuru was a Maldivian script used before Thaana. The main part of the alphabet looks very much like a simple replacement cipher.</p><p style="text-align: left;">An early version of the Thaana script, Gabulhi Thaana, was written <a href="https://en.wikipedia.org/wiki/Scriptio_continua">scriptio continua</a>, that is, without inter-word spacing or punctuation. This sounds like an absolute nightmare but was quite common in classical Greek and Latin. Before mechanical printing, Arabic was also written without spacing. 
This is, perhaps, why many writing systems have distinct letterforms for final letters in words.</p><p style="text-align: left;">According to "<a href="http://www.qaumiyyath.gov.mv/docs/whitepapers/history/Scripts%20of%20Maldives.pdf">Scripts of Maldives</a>", the early Thaana script, Gabulhi Thaana, got its name from the Maldivian word "<a href="https://www.bliss.mv/en/2020/06/12-words-for-coconut-in-dhivehi/">gabulhi</a>" meaning the in-between stage of a coconut, when it is neither fully ripe nor quite tender. Hence the idea of "immature" or "not fully-formed".</p></div>Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-55786419001622966022022-01-22T10:47:00.008+00:002022-01-22T10:47:00.162+00:00Unicode Trivia U+0753 <p><b>Codepoint:</b> U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE"<br /><b>Block:</b> U+0750..077F "Arabic Supplement"</p>
<p>Syriac is not the only script that makes extensive use of diacritics. The spread of the Arabic script throughout the world means it is used for diverse languages, many of which have sounds not found in Arabic. Part of the "Arabic Supplement" block contains a column "Extended Arabic letters" with the annotation:</p>
<blockquote><p><i>These are primarily used in Arabic-script orthographies of African languages.</i></p></blockquote>
<p>One codepoint, U+0753, has the somewhat precise name of "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE". When I render that codepoint using "<a href="https://fonts.google.com/noto/specimen/Noto+Sans+Arabic">Noto Sans Arabic</a>" on my PC, I get this:</p>
<p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy3-kjcRJRaQZ4mV7gGQ2WmRmCUGalKP_QHjMk81VJyJ3nt1Mg9JhC22Y41XO4kmjy4WYn4Q2BuOyin85560F-71sGx4F2BbjxFmhS3hOY4ir9p7Ru0dzY9b2wmUr9HEJA1PXLjt9V-1c/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="128" data-original-width="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiy3-kjcRJRaQZ4mV7gGQ2WmRmCUGalKP_QHjMk81VJyJ3nt1Mg9JhC22Y41XO4kmjy4WYn4Q2BuOyin85560F-71sGx4F2BbjxFmhS3hOY4ir9p7Ru0dzY9b2wmUr9HEJA1PXLjt9V-1c/s16000/image.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><span style="text-align: left;">Noto Sans Arabic (2.004)</span></p></td></tr></tbody></table><p></p>
<p style="text-align: left;">When I render it with a default local font, I get:</p>
<p style="text-align: left;"></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJuZYvfCvIbWqYu_bcT2Ch_iY0syCxr1YsnHGDFJCLyv_q5V4D4moHRPmvs3gSO-Ao7z0dI3mkg41w7lm0k0C1EwPl_PZzqYzWI4pBS3A7Nxk6tc78mJ4RHarGDYucu6xwP5kVTY40ioY/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="128" data-original-width="89" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJuZYvfCvIbWqYu_bcT2Ch_iY0syCxr1YsnHGDFJCLyv_q5V4D4moHRPmvs3gSO-Ao7z0dI3mkg41w7lm0k0C1EwPl_PZzqYzWI4pBS3A7Nxk6tc78mJ4RHarGDYucu6xwP5kVTY40ioY/s16000/image.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p>Arial (7.00)</p></td></tr></tbody></table><p></p>
<p style="text-align: left;">Spot the difference!</p>
<p>There's definitely a discrepancy in the orientation of the lower dots, but which is correct? I came up with four possibilities:</p>
<p></p><ol><li>I have old/corrupt font files installed on my PC</li><li>The name of the Unicode codepoint is incorrect</li><li>The orientation of the lower dots doesn't really matter, so there is no issue</li><li>One of the font glyphs is incorrect</li></ol><p></p>
<p style="text-align: left;">Initially, I did indeed think it was an old version of Noto Sans Arabic installed on my machine. But I updated my local version of Noto Sans Arabic to 2.009 with the same results. <a href="https://fonts.google.com/?query=arabic&preview.text=%DD%93&preview.text_type=custom">Google web font specimens</a> confirmed the issue is with Noto Sans Arabic in general:</p>
<p style="text-align: left;"></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZg5cyignprhI5pG7BQnqQWqQsIvHYRS9LEnM-J_mmaaNuYKo6KsTyQTga_jmxxUyY84S34JVwfBDoukqMg5_Dp1Jo8pGUNS6QQiIssVlHlW0sug_OJJN2Gb-w6C0s6V8Fb9pDfLvVkt4/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="551" data-original-width="641" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZg5cyignprhI5pG7BQnqQWqQsIvHYRS9LEnM-J_mmaaNuYKo6KsTyQTga_jmxxUyY84S34JVwfBDoukqMg5_Dp1Jo8pGUNS6QQiIssVlHlW0sug_OJJN2Gb-w6C0s6V8Fb9pDfLvVkt4/s16000/image.png" /></a></div>
<p style="text-align: left;">Three of the four specimens suggest the name of the Unicode codepoint is probably correct. I <a href="https://chilliant.com/universe/universe.html#name=/ARABIC%20LETTER%20BEH%20WITH%20.*/">checked</a> that there are no similarly-named codepoints; there is no "ARABIC LETTER BEH WITH THREE DOTS POINTING <i>DOWNWARDS</i> BELOW AND TWO DOTS ABOVE".</p>
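<p style="text-align: left;">The same kind of name search can be approximated locally. This is a minimal sketch, assuming Python's standard <code>re</code> and <code>unicodedata</code> modules; the function name <code>names_matching</code> is hypothetical, and the results reflect whatever Unicode version the interpreter bundles, so they may not match any particular release exactly:</p>

```python
import re
import sys
import unicodedata


def names_matching(pattern):
    """Return {codepoint: name} for every character whose name fully matches.

    Characters without a name (unassigned, controls, surrogates) are skipped
    by giving unicodedata.name() an empty-string default.
    """
    regex = re.compile(pattern)
    hits = {}
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")
        if name and regex.fullmatch(name):
            hits[cp] = name
    return hits


if __name__ == "__main__":
    # List every "BEH WITH ..." letter so the names can be compared by eye.
    for cp, name in sorted(names_matching(r"ARABIC LETTER BEH WITH .*").items()):
        print(f"U+{cp:04X} {name}")
```

<p style="text-align: left;">Printing the whole family of names side by side makes it easy to see which dot arrangements have been encoded and which have not.</p>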
<p style="text-align: left;">I couldn't really imagine that, carefully named as it is, the orientation of the lower dots in U+0753 was unimportant.</p>
<p style="text-align: left;">I then checked <a href="https://unicode.org/errata/">Unicode Updates and Errata</a> but found no references to this or nearby codepoints.</p>
<p style="text-align: left;">So the finger of suspicion fell on the glyph within the Noto Sans Arabic font being incorrect. <a href="https://fontforge.org/">FontForge</a> confirmed this:</p>
<div>
<p style="text-align: left;"></p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYEEc-k6REQ33EN-N0tAWKzL7z520QPtGLb4JzDhw49nMDo8blcfWA62rg23TMR37BD_NMOt-D-sRV-ahQ5cMcAonD2ErAFEZlvHoqmJ4EbVB1zjNaT1Y2tiiWzIVFzWZt_j3PQyEstsc/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="744" data-original-width="689" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYEEc-k6REQ33EN-N0tAWKzL7z520QPtGLb4JzDhw49nMDo8blcfWA62rg23TMR37BD_NMOt-D-sRV-ahQ5cMcAonD2ErAFEZlvHoqmJ4EbVB1zjNaT1Y2tiiWzIVFzWZt_j3PQyEstsc/w592-h640/image.png" width="592" /></a></div>
<br />I looked through the <a href="https://github.com/googlefonts/noto-fonts/issues">issues reported for Noto fonts</a>, but found nothing, so I submitted a <a href="https://github.com/googlefonts/noto-fonts/issues/2216">new one</a>.<p></p><p style="text-align: left;">Of course, this has only a passing connection to the Unicode standard. But one can easily imagine the amount of noise that has to be ploughed through by the committee along the lines of "My text doesn't get displayed how I expected" just to get to genuine issues with the Unicode standard itself.</p><p style="text-align: left;">According to <a href="https://en.wiktionary.org/wiki/%DD%93">Wiktionary</a>, U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" is</p><blockquote><p style="text-align: left;"><i>The third letter of the Hausa alphabet in ajami script, equivalent to Latin script <b>c</b>.</i></p></blockquote><p style="text-align: left;">I was initially a bit suspicious of this. Both <a href="https://omniglot.com/writing/hausa.htm">Omniglot</a> and <a href="https://en.wikipedia.org/wiki/Hausa_language#Ajami_(Arabic)">Wikipedia</a> suggest that the three dots go <i>above</i> that letter, making it more like U+062B "ARABIC LETTER THEH". However, Richard Ishida <a href="https://r12a.github.io/scripts/arabic/hausa#variants">points out</a> that there are lots of subtle local variations and the <a href="https://www.unicode.org/L2/L2003/03168-african-chars.pdf">initial Unicode proposal</a> shows an "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" in Figure 5. 
The proposal cites "<a href="https://www.google.co.uk/books/edition/Using_Arabic_Script_in_Writing_the_Langu/8SG6vgEACAAJ?hl=en&kptab=overview">Using Arabic Script in Writing the Languages of the Peoples of Muslim Africa</a>" (1992) by Mohamed Chtatou:</p><p style="text-align: left;"></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfHYNn_zOqVMhX4ocan_rqdeeVl8c9Qhanhwe2QSyvFPp4XBEKzcRYp84j_PGChDcUuspf7HHehNZdLdSoLWI8KqNA9cse_E8-alrEppN8GswcZrTfkrCK_NrwuVvab_YQMWmDlVHKgjU/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="153" data-original-width="375" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfHYNn_zOqVMhX4ocan_rqdeeVl8c9Qhanhwe2QSyvFPp4XBEKzcRYp84j_PGChDcUuspf7HHehNZdLdSoLWI8KqNA9cse_E8-alrEppN8GswcZrTfkrCK_NrwuVvab_YQMWmDlVHKgjU/s16000/image.png" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">"Figure 5" (Chtatou, 1992)</td></tr></tbody></table><p></p><p style="text-align: left;">Richard Ishida again:</p><blockquote><p style="text-align: left;"><i>Unicode policy for the Arabic script is to encode fully precomposed characters rather than to use combining characters for <a href="https://en.wikipedia.org/wiki/Arabic_diacritics">ijam</a>.</i></p></blockquote>It would appear that the task of supporting more obscure and/or infrequent Arabic script glyphs in Unicode (and in fonts) can only get harder.<p></p>
</div>
Ian Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0tag:blogger.com,1999:blog-5181746871086541575.post-13240856290443747542022-01-21T10:06:00.031+00:002022-01-21T10:06:00.145+00:00Unicode Trivia U+0740<p><b>Codepoint:</b> U+0740 "SYRIAC FEMININE DOT"<br /><b>Block:</b> U+0700..074F "Syriac"</p><p><a href="https://chilliant.com/universe/tour.html#Syrc">Syriac</a> has got to be one of the dottiest scripts in Unicode. The fact that there's a 232-page book devoted to Syriac diacritics says a lot:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://images-na.ssl-images-amazon.com/images/I/81v0LTa6MyL.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="646" height="800" src="https://images-na.ssl-images-amazon.com/images/I/81v0LTa6MyL.jpg" width="646" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">[<a href="https://www.amazon.co.uk/Syriac-Dot-Short-History/dp/1463241003">source</a>]</td></tr></tbody></table><blockquote><p style="text-align: left;"><i>The dot is used for everything in Syriac from tense to gender, number, and pronunciation, and unsurprisingly represents one of the biggest obstacles to learning the language.</i></p></blockquote><p style="text-align: left;">Section 9.3 of the <a href="https://www.unicode.org/versions/Unicode14.0.0/ch09.pdf">Core Specification 14.0.0</a> gives an introduction to some of these complexities. Within the sub-section concerning exceptions to the diabolical diacritical rules, is this:</p><blockquote><p style="text-align: left;"><i>The feminine dot is usually placed to the left of a final taw.</i></p></blockquote><p>This refers to codepoint U+0740 "SYRIAC FEMININE DOT". 
According to <a href="https://r12a.github.io/scripts/syriac/#feminine">Richard Ishida</a>, this non-spacing mark:</p><p></p><blockquote>[...] <i>is a feminine marker used with "ܬ" [U+072C SYRIAC LETTER TAW] to indicate a feminine suffix. East Syriac fonts should render as two dots below the base letter, whereas West Syriac fonts render as a single dot to the left of the base.</i></blockquote><p></p><p>So far as I can tell, this is the only diacritic <i>currently</i> in Unicode that distinguishes (or elucidates) the underlying word's gender.</p><p>Below are va<span style="font-family: inherit;">riations of "<span style="font-family: inherit;">ܩܛܠܬ</span>" (= kill) distinguished solely by diacritics ("<span style="font-size: 11pt;">ܩ̇ܛܠܬ</span>", "<span style="font-size: 11pt;">ܩܛ</span><span style="font-size: 11pt;">̣</span><span style="font-size: 11pt;">ܠܬ</span>" and "<span style="font-size: 11pt;">ܩܛܠܬ݀</span>") rendered with "<a href="https://fonts.google.com/noto/specimen/Noto+Sans+Syriac">Noto Sans Syriac</a>":</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLZlzXlY3F5VIIA4niJkNx-hQL52qch-vnAy91KhX3vtY3SUHtdYahwPOoj8T_umtMGMb1GBVX6YXfhBPo0A9xvP4HbXardSRNsxyvDMwHR5AOs-D3y9bJiopPT1G3rtuwxqhEBD5bdR8/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="178" data-original-width="631" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLZlzXlY3F5VIIA4niJkNx-hQL52qch-vnAy91KhX3vtY3SUHtdYahwPOoj8T_umtMGMb1GBVX6YXfhBPo0A9xvP4HbXardSRNsxyvDMwHR5AOs-D3y9bJiopPT1G3rtuwxqhEBD5bdR8/s16000/image.png" /></a></div><p style="text-align: left;">Notice U+0740 "SYRIAC FEMININE DOT" at the end (left) of the last line.</p><p style="text-align: left;">[Thanks to Richard Ishida and, indirectly, J F Coakley at <a href="https://www.jericho-press.com/about">Jericho Press</a>]</p><p></p>Ian 
Taylorhttp://www.blogger.com/profile/06869762490434824010noreply@blogger.com0