Sunday 19 December 2021

Universe 5: Font Detection

Any Unicode codepoint browser ends up as an exercise in font wrangling. Universe is no exception.

The threat of fingerprinting means a browser script cannot typically enumerate the locally installed fonts. There is also currently no native way of determining if a font supports a particular codepoint. One would think that determining font support for codepoints would therefore be problematic, but it turns out to be relatively easy for our use-case with Chrome.

Consider the following JavaScript snippet:

var canvas = document.createElement("canvas");
var context = canvas.getContext("2d");
context.font = "10px font-a,font-b,font-c";
var measure = context.measureText(text);

If a character in the "text" string is not supported by "font-a", Chrome text rendering falls back to using "font-b". If "font-b" also doesn't support the character, "font-c" is used. If the character is not supported by "font-c" either, a system default is used.

We can take advantage of this fall back mechanism by using a "blank" font that guarantees to render any glyph as zero-width/zero-mark. Fortunately, there's just such a font already out there: Adobe Blank:

@font-face {
  font-family: "blank";
  src: url("https://raw.githubusercontent.com/adobe-fonts/
         adobe-blank/master/AdobeBlank.otf.woff") format("woff");
}

Now, we can write a function to test a font for supported characters:

function IsBlank(font, text) {
  var canvas = document.createElement("canvas");
  var context = canvas.getContext("2d");
  context.font = `10px "${font}",blank`;
  var measure = context.measureText(text);
  return (measure.width <= 0) &&
    (measure.actualBoundingBoxRight <= -measure.actualBoundingBoxLeft);
}

The actual code in universe.js has some optimisations and additional features; see "MeasureTextContext()" and "TextIsBlank()".

Using this technique, we can iterate around some "well-known" fonts and render them where appropriate. For our example of codepoint U+0040:

The origins of each of the glyphs above are:

  1. "notoverse" is the 32-by-32 pixel bitmap glyph described previously.
  2. "(default)" is the default font used for this codepoint by Chrome. In my case, it's "Times New Roman".
  3. "(sans-serif)" is the default sans serif font: "Arial".
  4. "(serif)" is the default serif font: "Times New Roman".
  5. "(monospace)" is the default monospace font: "Consolas".
  6. "r12a.io" is the PNG from Richard Ishida's UniView online app.
  7. "glyphwiki.org" is the SVG from GlyphWiki. It looks a bit squashed because it's a half-width glyph; GlyphWiki is primarily concerned with CJK glyphs.
  8. "unifont" is the GNU Unifont font. I couldn't find a webfont-friendly source for version 14, so I had to piggyback version 12 from Terence Eden.
  9. "noto" is a composite font of several dozen Google Noto fonts. See ".noto" in universe.css.
  10. Subsequent "Noto ..." glyphs are from the corresponding, individual Noto font, in priority order.

Saturday 18 December 2021

Universe 4: User Interface

The Universe character browser has a very basic user interface. In addition to the HTML5 presentation, there is a URL scheme that allows you to jump to various subpages.

Landing Page

universe.html#

The landing page gives a brief introduction and a collapsible table of statistics. As with all pages, a header contains breadcrumbs and a footer lists some useful jumping-off points.

Codepoints

To examine a specific codepoint, expressed in hexadecimal between "0000" and "10FFFF" inclusive, use:

universe.html#U+<hex>

In this case, "#U+0040" examines "@". The subpage lists the block, column and codepoint details, as well as font support, UCD data fields and external links, where appropriate.

Planes

To list all 17 Unicode planes, use:

universe.html#P

To examine details of the plane containing codepoint "U+<hex>", use:

universe.html#P+<hex>

Sheets

In this context, a "sheet" is a 32-by-32 grid of 1024 contiguous codepoints; it is how the Notoverse glyphs are organised.

To list all 161 sheets which contain allocated codepoints, use:

universe.html#S

To examine the sheet containing codepoint "U+<hex>", use:

universe.html#S+<hex>

Blocks

To list all 320 named Unicode blocks, use:

universe.html#B

To examine the block containing codepoint "U+<hex>", use:

universe.html#B+<hex>

This will list constituent codepoints, organised by column.

Columns

To examine just the column containing codepoint "U+<hex>", use:

universe.html#C+<hex>

This will list constituent codepoints in more detail.

Queries

To query the UCD for matching codepoints, use:

universe.html#<key>=<value>

The "<key>" can be:

  • One of the abbreviated UCD field names, such as "gc" or "na".
  • "id" to search the codepoint id (e.g. "U+0040").
  • "basic" to search the computed codepoint basic category.
  • "kind" to search the computed codepoint kind.
  • "script" to search the codepoint script list (see below).
  • "extra" to search NamesList.txt annotations.
  • "text" to search all the above.

The "<value>" can be simple text or a JavaScript regular expression in the form "/<expr>/<flags>". For example:

universe.html#na=/\bANT\b/i

This searches for codepoints whose name field contains the whole word "ANT" case-insensitively.

Scripts

Queries with the special key "script" search the fields "sc" and "scx". To query for the 29 codepoints of the Ogham script, use:

universe.html#script=Ogam

List all the 210 scripts of the ISO-15924 standard, use:

universe.html#script

Search

Queries with the special key "search" perform full-text searches. To bring up a search dialog, use:

universe.html#search

Gears

As mentioned earlier, loading the full UCD database and glyph sheets for the first time can take quite a few minutes. Searches can also take a few seconds. For long-running JavaScript functions, we display animated gears:

To keep the page responsive, we wrap the long-running functionality inside a call to the "Gears()" function in universe.js:

function Gears(parent, asynchronous, complete) {
  var gears = document.createElement("img");
  gears.className = "gears";
  gears.title = "Please wait...";
  gears.src = "gears.svg";
  parent.appendChild(gears);
  gears.onload = async () => {
    var result = await asynchronous();
    if (gears.parentNode === parent) {
      parent.removeChild(gears);
    }
    if (complete) {
      complete(result);
    }
  };
}

Inside the asynchronous, long-running function we have to make sure we periodically call "YieldAsync()":

function YieldAsync(milliseconds) {
  // Yield every few milliseconds 
  // (or every time this function is called if argument is missing)
  var now = performance.now();
  if (!YieldAsync.already) {
    // First call
    YieldAsync.already = Promise.resolve(undefined);
    YieldAsync.channel = new MessageChannel();
  } else if (milliseconds && ((now - YieldAsync.previous) < milliseconds)) {
    // Resolve immediately
    return YieldAsync.already;
  }
  YieldAsync.previous = now;
  return new Promise(resolve => {
    YieldAsync.channel.port1.addEventListener("message",
      () => resolve(), { once: true });
    YieldAsync.channel.port1.start();
    YieldAsync.channel.port2.postMessage(null);
  });
}

This was inspired by a much-underrated StackOverflow answer.

Friday 17 December 2021

Universe 3: Fonts

The Universe project uses Google Noto fonts as much as possible. As the Noto project page says:

The name is also short for "no tofu", as the project aims to eliminate 'tofu': blank rectangles shown when no font is available for your text.

According to the Universe Statistic page UCD 14.0.0, there are 144,762 "used" codepoints. This breaks down as follows when you include "unused" codepoints such as "private", "surrogate" and "noncharacter":

Kind Criteria Codepoints
format gc is "Cf", "Zl" or "Zp" 165
control gc is "Cc" 65
private gc is "Co" 137,468
surrogate gc is "Cs" 2,048
noncharacter gc is "Cn" 66
modifier gc is "Mc", "Me" or "Mn" 2,408
emoji EPres is "Y" 1,185
graphic otherwise 140,939
Total 284,344

Note that this "Kind" classification is slightly more fine-grained than Unicode's "basic type" but less so than "general category". Also note that the codepoints classified as "private", "surrogate" and "noncharacter" are fixed and will not change in subsequent Unicode versions.

This still leaves a great many codepoints that need to be rendered. The relationship between Unicode codepoints and character glyphs within a font is non-trivial, to say the least; but it can be useful, in a character browser, to render an "archetype" glyph of each codepoint for illustrative purposes.

In Universe, each "used" codepoint has a 32-by-32 pixel bitmap glyph. The aim is to use Google Noto fonts wherever possible to construct these bitmaps, because:

  1. Noto fonts are "free and open source".
  2. Their codepoint coverage is relatively good.
  3. They try to adhere to a consistent look and feel.

Consequently, I named the set of bitmap glyphs "Notoverse". One can think of Notoverse as an alternative to GNU's Unifont initiative, except:

  • The glyphs are 32-by-32, not 16-by-16.
  • The glyphs are 24-bit RGB, not 1-bit monochrome.
  • There is full Unicode 14.0.0 coverage.

The construction of the Notoverse glyphs was tedious and exhausting. I now know why so many similar projects run out of steam. The final breakdown of sources for each of the 144,762 glyphs is as follows:

It turns out that Noto contributes to about 54% of the glyphs; more if we include the glyphs manually constructed from Noto elements.

U+0061: LATIN SMALL LETTER A

Here is the list of the Google Noto fonts I used (in priority order):

  • Noto Sans
  • Noto Sans Armenian
  • Noto Sans Hebrew
  • Noto Sans Arabic
  • Noto Sans Syriac
  • Noto Sans Thaana
  • Noto Sans NKo/Noto Sans N Ko
  • Noto Sans Samaritan
  • Noto Sans Mandaic
  • Noto Sans Malayalam
  • Noto Sans Devanagari
  • Noto Sans Bengali
  • Noto Sans Gurmukhi
  • Noto Sans Gujarati
  • Noto Sans Oriya
  • Noto Sans Tamil
  • Noto Sans Tamil Supplement
  • Noto Sans Telugu
  • Noto Sans Kannada
  • Noto Sans Malayalam
  • Noto Sans Sinhala
  • Noto Sans Thai
  • Noto Sans Myanmar
  • Noto Sans Georgian
  • Noto Sans Cherokee
  • Noto Sans Canadian Aboriginal
  • Noto Sans Ogham
  • Noto Sans Runic
  • Noto Sans Tagalog
  • Noto Sans Hanunoo
  • Noto Sans Buhid
  • Noto Sans Tagbanwa
  • Noto Sans Khmer
  • Noto Sans Mongolian
  • Noto Sans Limbu
  • Noto Sans Tai Le
  • Noto Sans New Tai Lue
  • Noto Sans Buginese
  • Noto Sans Tai Tham
  • Noto Sans Balinese
  • Noto Sans Sundanese
  • Noto Sans Batak
  • Noto Sans Lepcha
  • Noto Sans Ol Chiki
  • Noto Sans Glagolitic
  • Noto Sans Coptic
  • Noto Sans Tifinagh
  • Noto Sans Yi
  • Noto Sans Lisu
  • Noto Sans Vai
  • Noto Sans Bamum
  • Noto Sans Syloti Nagri
  • Noto Sans PhagsPa
  • Noto Sans Saurashtra
  • Noto Sans Kayah Li
  • Noto Sans Rejang
  • Noto Sans Javanese
  • Noto Sans Cham
  • Noto Sans Tai Viet
  • Noto Sans Ethiopic
  • Noto Sans Linear A
  • Noto Sans Linear B
  • Noto Sans Phoenician
  • Noto Sans Lycian
  • Noto Sans Carian
  • Noto Sans Old Italic
  • Noto Sans Gothic
  • Noto Sans Old Permic
  • Noto Sans Ugaritic
  • Noto Sans Old Persian
  • Noto Sans Deseret
  • Noto Sans Shavian
  • Noto Sans Osmanya
  • Noto Sans Osage
  • Noto Sans Elbasan
  • Noto Sans Caucasian Albanian
  • Noto Sans Cypriot
  • Noto Sans Imperial Aramaic
  • Noto Sans Palmyrene
  • Noto Sans Nabataean
  • Noto Sans Hatran
  • Noto Sans Lydian
  • Noto Sans Meroitic
  • Noto Sans Kharoshthi
  • Noto Sans Old South Arabian
  • Noto Sans Old North Arabian
  • Noto Sans Manichaean
  • Noto Sans Avestan
  • Noto Sans Inscriptional Parthian
  • Noto Sans Inscriptional Pahlavi
  • Noto Sans Psalter Pahlavi
  • Noto Sans Old Turkic
  • Noto Sans Old Hungarian
  • Noto Sans Hanifi Rohingya
  • Noto Sans Old Sogdian
  • Noto Sans Sogdian
  • Noto Sans Elymaic
  • Noto Sans Brahmi
  • Noto Sans Kaithi
  • Noto Sans Sora Sompeng
  • Noto Sans Chakma
  • Noto Sans Mahajani
  • Noto Sans Sharada
  • Noto Sans Khojki
  • Noto Sans Multani
  • Noto Sans Khudawadi
  • Noto Sans Grantha
  • Noto Sans Newa
  • Noto Sans Tirhuta
  • Noto Sans Siddham
  • Noto Sans Modi
  • Noto Sans Takri
  • Noto Sans Warang Citi
  • Noto Sans Zanabazar Square
  • Noto Sans Soyombo
  • Noto Sans Pau Cin Hau
  • Noto Sans Bhaiksuki
  • Noto Sans Marchen
  • Noto Sans Masaram Gondi
  • Noto Sans Gunjala Gondi
  • Noto Sans Cuneiform
  • Noto Sans Egyptian Hieroglyphs
  • Noto Sans Anatolian Hieroglyphs
  • Noto Sans Mro
  • Noto Sans Bassa Vah
  • Noto Sans Pahawh Hmong
  • Noto Sans Medefaidrin
  • Noto Sans Miao
  • Noto Sans Nushu
  • Noto Sans Duployan
  • Noto Sans SignWriting
  • Noto Sans Wancho
  • Noto Sans Mende Kikakui
  • Noto Sans Meetei Mayek/Noto Sans MeeteiMayek
  • Noto Sans Adlam Unjoined
  • Noto Sans Indic Siyaq Numbers
  • Noto Serif Tibetan
  • Noto Serif Vithkuqi
  • Noto Serif Yezidi
  • Noto Serif Ahom
  • Noto Serif Dogra
  • Noto Serif Tangut
  • Noto Serif Hmong Nyiakeng/Noto Serif Nyiakeng Puachue Hmong
  • Noto Sans Symbols
  • Noto Sans Symbols2/Noto Sans Symbols 2
  • Noto Sans Math
  • Noto Sans Display
  • Noto Looped Lao
  • Noto Sans Lao
  • Noto Sans CJK SC/Noto Sans SC
  • Noto Music
Where there are two font names with a slash between them, the first is the local font name and the second the web font name. Alas, the naming is somewhat lax. See the HTML source and CSS for more details.

I hope I haven't trodden on anyone's toes by using their fonts in this way. The individual glyphs are down-sampled to 32-by-32 pixels and used for illustrative purposes only. I trust you'll agree that's "fair use".

So, it looks like we're still a long way from getting a pan-Unicode font, or even a set of fonts that achieve the same goal.

For completeness, here's a list of attempts at providing good Unicode coverage:

Thursday 16 December 2021

Universe 2: Loading Resources

When the Universe web page first loads into a browser, about 100MB of data are pulled over the network in over 250 HTTP requests. This includes 77 web fonts (6MB), 161 glyph image sheets (77MB) and the Unicode Character Database, UCD, as a tab-separated value file (28MB).

It can take three or four minutes the first time around with a slow network connection, but all these resources are usually cached by the browser, so subsequent page loads perform very little actual data transfer. However, decoding the UCD is another matter.

Even the subset of the UCD data we're actually interested in takes up a quarter of a gigabyte when expressed as JSON. So caching the raw JSON file is problematic. I elected to transfer the data as sequential key-value deltas in TSV format. This reduces the size down to under 30MB and only 6.5MB when compressed "on the wire". It is also relatively quick to reconstitute the full UCD object hierarchy in JavaScript: it takes about seven seconds on my machine.

Here are the lines describing codepoint U+0040 ("@"):

na<tab>COMMERCIAL AT
lb<tab>AL
STerm<tab>N
Term<tab>N
SB<tab>XX
0040
<tab>= at sign

The first five lines (in "<key><tab><value>" form) list only the differences in UCD properties (keyed by short property aliases) compared to the preceding codepoint (i.e. U+003F "?").

The next line (in "<hex>" form) creates a new codepoint record for "U+0040".

The final line (in "<tab><extra>" form) adds an extra line of narrative to the preceding record. In this case, it refers to a NamesList.txt alias.

The ucd.14.0.0.tsv file is constructed using a NodeJS script from the following sources:

It is read and reconstituted by the "LoadDatabaseFileAsync()" function of universe.js. Notice that the final action of that function is to store the data in an IndexedDB database. This allows the JavaScript to test for the existence of that data in subsequent page reloads, obviating the need to fetch and decode each time. This saves several seconds each time. See "IndexedDatabaseGet()" and "IndexedDatabasePut()" in universe.js.

The downside of using this client-side caching is that it takes up 260MB of disk space:


We rely on the browser (Chrome, in this case) to manage the IndexedDB storage correctly, but even if the cache is purged, we fall back over to parsing the TSV again.

Wednesday 15 December 2021

Universe 1: Introduction

Version 14.0.0 of Unicode was released back in September. There are an plethora of online resources for browsing the Unicode Character Database (UCD):

But they usually have something lacking: the ability to render the codepoints and/or keeping up-to-date with the ever-evolving Unicode standard.

Of course, you can just download the official Unicode code charts as a single PDF and manually cross-reference them with the UCD. But the PDF is over 100MB and the UCD is distributed as a collection of human-unfriendly text files.


Universe is my attempt at producing a client-side HTML/CSS/JavaScript Unicode browser. It's obviously a work in progress, but the main areas of interest (that I plan to cover in subsequent blog posts) are:

  1. A user interface to navigate around Unicode codepoints, blocks, planes, etc.
  2. Loading the UCD into the client (this is my biggest concern at present as it takes about ten seconds to load and parse the database).
  3. A flexible search mechanism.
  4. Font support.
  5. A representative rendering of all the codepoints.