Friday, 17 December 2021

Universe 3: Fonts

The Universe project uses Google Noto fonts as much as possible. As the Noto project page says:

The name is also short for "no tofu", as the project aims to eliminate 'tofu': blank rectangles shown when no font is available for your text.

According to the Universe Statistic page UCD 14.0.0, there are 144,762 "used" codepoints. This breaks down as follows when you include "unused" codepoints such as "private", "surrogate" and "noncharacter":

Kind Criteria Codepoints
format gc is "Cf", "Zl" or "Zp" 165
control gc is "Cc" 65
private gc is "Co" 137,468
surrogate gc is "Cs" 2,048
noncharacter gc is "Cn" 66
modifier gc is "Mc", "Me" or "Mn" 2,408
emoji EPres is "Y" 1,185
graphic otherwise 140,939
Total 284,344

Note that this "Kind" classification is slightly more fine-grained than Unicode's "basic type" but less so than "general category". Also note that the codepoints classified as "private", "surrogate" and "noncharacter" are fixed and will not change in subsequent Unicode versions.

This still leaves a great many codepoints that need to be rendered. The relationship between Unicode codepoints and character glyphs within a font is non-trivial, to say the least; but it can be useful, in a character browser, to render an "archetype" glyph of each codepoint for illustrative purposes.

In Universe, each "used" codepoint has a 32-by-32 pixel bitmap glyph. The aim is to use Google Noto fonts wherever possible to construct these bitmaps, because:

  1. Noto fonts are "free and open source".
  2. Their codepoint coverage is relatively good.
  3. They try to adhere to a consistent look and feel.

Consequently, I named the set of bitmap glyphs "Notoverse". One can think of Notoverse as an alternative to GNU's Unifont initiative, except:

  • The glyphs are 32-by-32, not 16-by-16.
  • The glyphs are 24-bit RGB, not 1-bit monochrome.
  • There is full Unicode 14.0.0 coverage.

The construction of the Notoverse glyphs was tedious and exhausting. I now know why so many similar projects run out of steam. The final breakdown of sources for each of the 144,762 glyphs is as follows:

It turns out that Noto contributes to about 54% of the glyphs; more if we include the glyphs manually constructed from Noto elements.

U+0061: LATIN SMALL LETTER A

Here is the list of the Google Noto fonts I used (in priority order):

  • Noto Sans
  • Noto Sans Armenian
  • Noto Sans Hebrew
  • Noto Sans Arabic
  • Noto Sans Syriac
  • Noto Sans Thaana
  • Noto Sans NKo/Noto Sans N Ko
  • Noto Sans Samaritan
  • Noto Sans Mandaic
  • Noto Sans Malayalam
  • Noto Sans Devanagari
  • Noto Sans Bengali
  • Noto Sans Gurmukhi
  • Noto Sans Gujarati
  • Noto Sans Oriya
  • Noto Sans Tamil
  • Noto Sans Tamil Supplement
  • Noto Sans Telugu
  • Noto Sans Kannada
  • Noto Sans Malayalam
  • Noto Sans Sinhala
  • Noto Sans Thai
  • Noto Sans Myanmar
  • Noto Sans Georgian
  • Noto Sans Cherokee
  • Noto Sans Canadian Aboriginal
  • Noto Sans Ogham
  • Noto Sans Runic
  • Noto Sans Tagalog
  • Noto Sans Hanunoo
  • Noto Sans Buhid
  • Noto Sans Tagbanwa
  • Noto Sans Khmer
  • Noto Sans Mongolian
  • Noto Sans Limbu
  • Noto Sans Tai Le
  • Noto Sans New Tai Lue
  • Noto Sans Buginese
  • Noto Sans Tai Tham
  • Noto Sans Balinese
  • Noto Sans Sundanese
  • Noto Sans Batak
  • Noto Sans Lepcha
  • Noto Sans Ol Chiki
  • Noto Sans Glagolitic
  • Noto Sans Coptic
  • Noto Sans Tifinagh
  • Noto Sans Yi
  • Noto Sans Lisu
  • Noto Sans Vai
  • Noto Sans Bamum
  • Noto Sans Syloti Nagri
  • Noto Sans PhagsPa
  • Noto Sans Saurashtra
  • Noto Sans Kayah Li
  • Noto Sans Rejang
  • Noto Sans Javanese
  • Noto Sans Cham
  • Noto Sans Tai Viet
  • Noto Sans Ethiopic
  • Noto Sans Linear A
  • Noto Sans Linear B
  • Noto Sans Phoenician
  • Noto Sans Lycian
  • Noto Sans Carian
  • Noto Sans Old Italic
  • Noto Sans Gothic
  • Noto Sans Old Permic
  • Noto Sans Ugaritic
  • Noto Sans Old Persian
  • Noto Sans Deseret
  • Noto Sans Shavian
  • Noto Sans Osmanya
  • Noto Sans Osage
  • Noto Sans Elbasan
  • Noto Sans Caucasian Albanian
  • Noto Sans Cypriot
  • Noto Sans Imperial Aramaic
  • Noto Sans Palmyrene
  • Noto Sans Nabataean
  • Noto Sans Hatran
  • Noto Sans Lydian
  • Noto Sans Meroitic
  • Noto Sans Kharoshthi
  • Noto Sans Old South Arabian
  • Noto Sans Old North Arabian
  • Noto Sans Manichaean
  • Noto Sans Avestan
  • Noto Sans Inscriptional Parthian
  • Noto Sans Inscriptional Pahlavi
  • Noto Sans Psalter Pahlavi
  • Noto Sans Old Turkic
  • Noto Sans Old Hungarian
  • Noto Sans Hanifi Rohingya
  • Noto Sans Old Sogdian
  • Noto Sans Sogdian
  • Noto Sans Elymaic
  • Noto Sans Brahmi
  • Noto Sans Kaithi
  • Noto Sans Sora Sompeng
  • Noto Sans Chakma
  • Noto Sans Mahajani
  • Noto Sans Sharada
  • Noto Sans Khojki
  • Noto Sans Multani
  • Noto Sans Khudawadi
  • Noto Sans Grantha
  • Noto Sans Newa
  • Noto Sans Tirhuta
  • Noto Sans Siddham
  • Noto Sans Modi
  • Noto Sans Takri
  • Noto Sans Warang Citi
  • Noto Sans Zanabazar Square
  • Noto Sans Soyombo
  • Noto Sans Pau Cin Hau
  • Noto Sans Bhaiksuki
  • Noto Sans Marchen
  • Noto Sans Masaram Gondi
  • Noto Sans Gunjala Gondi
  • Noto Sans Cuneiform
  • Noto Sans Egyptian Hieroglyphs
  • Noto Sans Anatolian Hieroglyphs
  • Noto Sans Mro
  • Noto Sans Bassa Vah
  • Noto Sans Pahawh Hmong
  • Noto Sans Medefaidrin
  • Noto Sans Miao
  • Noto Sans Nushu
  • Noto Sans Duployan
  • Noto Sans SignWriting
  • Noto Sans Wancho
  • Noto Sans Mende Kikakui
  • Noto Sans Meetei Mayek/Noto Sans MeeteiMayek
  • Noto Sans Adlam Unjoined
  • Noto Sans Indic Siyaq Numbers
  • Noto Serif Tibetan
  • Noto Serif Vithkuqi
  • Noto Serif Yezidi
  • Noto Serif Ahom
  • Noto Serif Dogra
  • Noto Serif Tangut
  • Noto Serif Hmong Nyiakeng/Noto Serif Nyiakeng Puachue Hmong
  • Noto Sans Symbols
  • Noto Sans Symbols2/Noto Sans Symbols 2
  • Noto Sans Math
  • Noto Sans Display
  • Noto Looped Lao
  • Noto Sans Lao
  • Noto Sans CJK SC/Noto Sans SC
  • Noto Music
Where there are two font names with a slash between them, the first is the local font name and the second the web font name. Alas, the naming is somewhat lax. See the HTML source and CSS for more details.

I hope I haven't trodden on anyone's toes by using their fonts in this way. The individual glyphs are down-sampled to 32-by-32 pixels and used for illustrative purposes only. I trust you'll agree that's "fair use".

So, it looks like we're still a long way from getting a pan-Unicode font, or even a set of fonts that achieve the same goal.

For completeness, here's a list of attempts at providing good Unicode coverage:

No comments:

Post a Comment