I've been working with historical data recently as part of my Whim project. The thorny issue of how to store time series (and time-varying relationships) inevitably came up, but I'll restrict this post to some thoughts on storing (uncertain) dates, times and timespans.
There are many good texts on the broader subject (Richard T. Snodgrass's "Developing Time-oriented Database Applications in SQL" is old but entertaining) and a whole industry has grown up around it. There's also the tangential subject of temporal databases. But for our discussion, we'll talk about instants, timespans and uncertainty in historical data; in particular, the rise and fall of empires, nations and polities.
Instants
An instant is a point in time. One attribute of an instant is its resolution. For historical events spanning millennia, a resolution of a year may be sufficient. So we can say that "Alexander III of Macedon was born in 356 BCE". A higher resolution would be of a day (i.e. a specific date): "Winston Churchill was born on 30 November 1874". Beyond a specific date, we can add higher and higher resolutions of time: "Britain declared war on Germany at 11 p.m. (GMT) on 4 August 1914". Interestingly, the prepositions change for these three cases: "in", "on" and "at".
Storing Years
It's tempting to store historical years as signed integers (e.g. -356 for 356 BCE), but beware of year zero: the year following 1 BCE is 1 CE, not 0 BCE/CE. So the number of years "between" 356 BCE and 1874 CE is 2229, not 2230 (1874 minus -356). For human-readable data, unadjusted signed integers are intuitive; for calculations, adjusting negative values (such that -355 is used for 356 BCE) reduces the probability of off-by-one errors in arithmetic.
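As a minimal sketch of that adjustment (the function names are my own, and negative integers are assumed to mean BCE years):

```python
def to_astronomical(year: int) -> int:
    """Map a historical year (negative = BCE, no year zero) to astronomical numbering."""
    if year == 0:
        raise ValueError("there is no year zero in BCE/CE numbering")
    # 1 BCE becomes 0, 2 BCE becomes -1, ..., 356 BCE becomes -355.
    return year if year > 0 else year + 1


def years_between(start: int, end: int) -> int:
    """Elapsed years between two historical years, allowing for the missing year zero."""
    return to_astronomical(end) - to_astronomical(start)


print(years_between(-356, 1874))  # 2229
print(years_between(-1, 1))       # 1
```

Keeping the human-readable value in storage and converting only for arithmetic is one option; storing the adjusted value directly is another.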
Storing Dates
Instants with a resolution of one day are often natively supported by data stores. Many people think that this completely circumvents the knotty problems of timezones and leap seconds. Alas, it doesn't!
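Consider Howland Island of USA (UTC−12) and the Line Islands of Kiribati (UTC+14). Their local clocks are 26 hours apart, so the two never share a calendar date, and for two hours in every day their local dates differ by two days.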
For this reason, it may still be prudent to store timezone/location information alongside local dates.
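The Howland Island and Line Islands case can be sketched with fixed UTC offsets. The offsets (UTC−12 and UTC+14) and the sample instant are assumptions chosen for illustration; a real dataset would more likely use named timezones.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets assumed for illustration (no daylight saving in either place).
HOWLAND = timezone(timedelta(hours=-12), "Howland")
LINE_ISLANDS = timezone(timedelta(hours=14), "Line Islands")


def local_dates(instant_utc: datetime):
    """Return the local calendar date of one instant in both places."""
    return (
        instant_utc.astimezone(HOWLAND).date(),
        instant_utc.astimezone(LINE_ISLANDS).date(),
    )


# One single instant, two local dates that are two days apart.
instant = datetime(2000, 1, 2, 11, 0, tzinfo=timezone.utc)
howland, line_islands = local_dates(instant)
print(howland, line_islands)  # 2000-01-01 2000-01-03
```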
Another issue is calendars. The "standard" Western Gregorian calendar is difficult enough to convert to and from serial (integer) dates, but consider the calendars where dates change at midday, not midnight.
Storing Times
Instants with a resolution finer than one day are typically stored as a count of seconds, or multiples/divisions thereof. However, beware of daylight saving time (some days can have more or fewer seconds in them) and leap seconds (a UTC minute can have 59, 60 or 61 seconds in it).
Storing a timezone with each date goes some way to alleviate these problems. One could also use UTC or (to handle leap seconds) TAI with an appropriate adjustment flag. Leap seconds are particularly nasty because they follow no fixed schedule and are only announced a few months in advance ("time-varying time"!?).
Storing ISO 8601
ISO 8601 can be useful for storing instants. This can be as simple as storing the instant as a string where the length of the string denotes the resolution of the data point:
- "yyyy" for years.
- "yyyy-mm-dd" for days.
- "yyyy-mm-ddThh:mm:ss" for seconds.
- "yyyy-mm-ddThh:mm:ss.ffffff" for microseconds.
- etc.
Timezones and UTC can be indicated by the appropriate suffixes.
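A minimal sketch of reading the resolution back from the string length, assuming the strings carry no timezone suffix (the table and function name are hypothetical):

```python
# Length of the stored string -> resolution of the instant.
RESOLUTION_BY_LENGTH = {
    4: "year",          # yyyy
    10: "day",          # yyyy-mm-dd
    19: "second",       # yyyy-mm-ddThh:mm:ss
    26: "microsecond",  # yyyy-mm-ddThh:mm:ss.ffffff
}


def resolution(instant: str) -> str:
    """Return the resolution implied by the length of an ISO 8601 string."""
    try:
        return RESOLUTION_BY_LENGTH[len(instant)]
    except KeyError:
        raise ValueError(f"unsupported instant format: {instant!r}") from None


print(resolution("1874"))                 # year
print(resolution("1874-11-30"))           # day
print(resolution("1914-08-04T23:00:00"))  # second
```

A suffix such as "Z" or "+01:00" changes the length, so it would need to be stripped or stored separately before this lookup.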
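Notice that the standard only covers four-digit years (0000 to 9999) by default. Years outside that range must be supported by what is euphemistically called "prior arrangement between parties". This can include a signed, wider "-YYYYY" form for BCE. Note also that ISO 8601 uses astronomical year numbering, so year "0000" is 1 BCE and 356 BCE becomes "-00355". Unfortunately, negative years negate one very useful property of ISO 8601 strings: lexicographical sorting no longer equates to chronological sorting.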
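The sorting problem is easy to demonstrate with year-only strings. The values below are arbitrary, and the sort key is a hypothetical workaround that only handles year-resolution strings:

```python
# Zero-padded positive years: string order matches chronological order.
positive_years = ["1874", "1066", "0356"]
print(sorted(positive_years))  # ['0356', '1066', '1874']

# Signed five-digit years: string order is the reverse of chronological order.
negative_years = ["-00356", "-00044", "-00753"]
print(sorted(negative_years))  # ['-00044', '-00356', '-00753']


def chronological_key(year_string: str) -> int:
    """Sort key for year-only strings: compare as integers, not as text."""
    return int(year_string)


print(sorted(negative_years, key=chronological_key))
# ['-00753', '-00356', '-00044']
```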
Instant Operators
Instants belong to a strict total order:
- "a < b" implies that instant "a" is strictly before instant "b" chronologically.
- If not "a < b", then "b ≤ a" ("b" is the same as or before "a").
- If neither "a < b" nor "b < a", then "a ≡ b" ("a" is the same instant as "b").
- If "a < b" and "b < c", then "a < c".
- etc.
Instant Sentinels
Although there is no notion of "zero" for instants, "+∞" can be used to denote an instant infinitely far in the future. Similarly, "-∞" is infinitely far in the past.
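If instants are held as plain numbers (years, say), floating-point infinities make convenient sentinels because they already respect the ordering. This is only a sketch; the constant names are my own:

```python
import math

DISTANT_PAST = -math.inf    # "-∞": before every real instant
DISTANT_FUTURE = math.inf   # "+∞": after every real instant

years = [1874, -356, 1066]

# The sentinels compare correctly against ordinary (integer) instants.
print(all(DISTANT_PAST < y < DISTANT_FUTURE for y in years))  # True

# Useful as identity values when searching for earliest/latest instants.
earliest = min(years, default=DISTANT_FUTURE)
latest = max(years, default=DISTANT_PAST)
print(earliest, latest)  # -356 1874
```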
Timespans
If instant "a" is non-strictly before instant "b", i.e. "a ≤ b", then the difference between the two instants, "b - a", is the elapsed time between them, measured in the same units as "a" and "b".
Zero-length timespans imply "a ≡ b". We call these "instantaneous timespans".
Open or Closed
When storing timespans, we must decide whether to include or exclude each endpoint:
- "t ∊ [a, b)" implies "a ≤ t < b" (half-open)
- "t ∊ (a, b]" implies "a < t ≤ b" (half-open)
- "t ∊ (a, b)" implies "a < t < b" (open)
- "t ∊ [a, b]" implies "a ≤ t ≤ b" (closed)
The first scheme (half-open, including the start) is often used and we'll use "a ~ b" to denote "[a, b)", i.e. the timespan from (and including) instant "a" to (but excluding) instant "b". We always assume "a ≤ b".
One problem with this formulation is that the expression "a ~ a" does not denote an instantaneous timespan; it denotes an empty (or null) timespan.
Three "infinite" timespans can also be defined:
- "a ~ +∞" denotes "a ≤ t"
- "-∞ ~ b" denotes "t < b"
- "-∞ ~ +∞" denotes "t" is unconstrained
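A half-open timespan, including the empty and infinite cases, might be sketched like this. The class name is hypothetical and instants are assumed to be plain numbers:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Timespan:
    """Half-open timespan a ~ b, i.e. [a, b)."""
    start: float
    end: float

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("start must not be after end")

    def __contains__(self, t: float) -> bool:
        # Start is included, end is excluded.
        return self.start <= t < self.end

    def is_empty(self) -> bool:
        return self.start == self.end


print(1939 in Timespan(1939, 1946))            # True
print(1946 in Timespan(1939, 1946))            # False
print(1939 in Timespan(1939, 1939))            # False: a ~ a is empty
print(1874 in Timespan(1874, math.inf))        # True
print(-356 in Timespan(-math.inf, math.inf))   # True
```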
Allen Intervals
Timespans can be thought of as Allen intervals. For "X = a ~ b" and "Y = c ~ d", we have:
| Sketch | Constraints* | Allen Interval | Allen Meaning | Combined |
|---|---|---|---|---|
| X ├───┤······· | b < c | X < Y | X precedes Y | Does not combine |
| Y ·······├───┤ | | Y > X | Y is preceded by X | |
| X ├───┤···· | b ≡ c | X m Y | X meets Y | a ~ d |
| Y ····├───┤ | | Y mi X | Y is met by X | |
| X ├───┤·· | a < c, c < b, b < d | X o Y | X overlaps with Y | a ~ d |
| Y ··├───┤ | | Y oi X | Y is overlapped by X | |
| X ├───┤·· | a ≡ c, b < d | X s Y | X starts Y | a ~ d |
| Y ├─────┤ | | Y si X | Y is started by X | |
| X ·├───┤· | c < a, b < d | X d Y | X during Y | c ~ d |
| Y ├─────┤ | | Y di X | Y contains X | |
| X ··├───┤ | c < a, b ≡ d | X f Y | X finishes Y | c ~ d |
| Y ├─────┤ | | Y fi X | Y is finished by X | |
| X ├───┤ | a ≡ c, b ≡ d | X = Y | X is equal to Y | a ~ d |
| Y ├───┤ | | | | |
* In addition to "a ≤ b" and "c ≤ d".
This leads to a whole algebra along with sanity checks that can be performed on related timespans.
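As a sketch of one such sanity check, the function below classifies the Allen relation between "X = a ~ b" and "Y = c ~ d". The function names and return strings are my own, and both timespans are assumed to be non-empty ("a < b" and "c < d").

```python
def allen_relation(a, b, c, d):
    """Classify X = a ~ b against Y = c ~ d (assumes a < b and c < d)."""
    if b < c:
        return "precedes"
    if b == c:
        return "meets"
    if d < a:
        return "preceded by"
    if d == a:
        return "met by"
    # From here on the two timespans share some interior.
    if a == c and b == d:
        return "equals"
    if a == c:
        return "starts" if b < d else "started by"
    if b == d:
        return "finishes" if c < a else "finished by"
    if a < c and b < d:
        return "overlaps"
    if c < a and d < b:
        return "overlapped by"
    if c < a and b < d:
        return "during"
    return "contains"  # remaining case: a < c and d < b


def combine(a, b, c, d):
    """Return the combined timespan as (start, end), or None if there is a gap."""
    if b < c or d < a:
        return None
    return (min(a, c), max(b, d))


print(allen_relation(1, 3, 3, 5), combine(1, 3, 3, 5))  # meets (1, 5)
print(allen_relation(2, 3, 1, 5), combine(2, 3, 1, 5))  # during (1, 5)
print(allen_relation(1, 3, 5, 7), combine(1, 3, 5, 7))  # precedes None
```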
Timespan Resolution
The resolution of a timespan corresponds to the resolution used for both the endpoints. This can sometimes lead to suspicious-looking (but valid) expressions for low resolutions:
- "World War II started in 1939 and ended in 1945"
- 1939 ≤ year ≤ 1945
- 1939 ≤ year < 1946
- 1939 ~ 1946
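The conversion from an inclusive year range to the half-open form is just "add one to the last year". A tiny sketch, assuming integer years in astronomical numbering (so adding one never lands on a missing year zero); the function name is hypothetical:

```python
def closed_years_to_half_open(first_year: int, last_year: int) -> tuple:
    """Turn the inclusive range [first_year, last_year] into first_year ~ last_year + 1."""
    if first_year > last_year:
        raise ValueError("first_year must not be after last_year")
    return (first_year, last_year + 1)


start, end = closed_years_to_half_open(1939, 1945)
print(start, end)   # 1939 1946
print(end - start)  # 7 calendar years touched by the timespan
```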
Storing Timespans
It is tempting to always store timespans as pairs of instants on the grounds that storing absolute (non-relative) values is somehow better (e.g. storing date of birth instead of age). Thus, "a ~ b" is stored as two values: "a" and "b". However, it is sometimes better to store "a" (start) and "b-a" (duration) instead. This allows the resolutions of the two quantities to be different and solves the problem of representing instantaneous timespans.
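A sketch of the start-plus-duration layout, with a hypothetical class name and numeric instants assumed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredTimespan:
    """A timespan stored as a start instant plus a non-negative duration."""
    start: float
    duration: float

    def __post_init__(self):
        if self.duration < 0:
            raise ValueError("duration must not be negative")

    @property
    def end(self) -> float:
        return self.start + self.duration

    @property
    def is_instantaneous(self) -> bool:
        return self.duration == 0


war = StoredTimespan(start=1939, duration=7)
print(war.end)                    # 1946

battle = StoredTimespan(start=1066, duration=0)
print(battle.is_instantaneous)    # True
```

Here a zero duration is unambiguous, whereas the pair "1066 ~ 1066" would read as an empty timespan under the half-open convention.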
Uncertainty
Consider the following statement:
"The Voynich manuscript was written between 1404 and 1438."
This could be interpreted in at least two ways:
- A certain timespan: the writing of the manuscript was started in 1404 and completed in 1438.
- An uncertain timespan: the writing of the manuscript was started between 1404 and 1438 and completed within the same timespan.
The latter interpretation is more likely, in this case. The timespan, "[a, b]", for the writing of the manuscript is given by:
1404 ≤ a ≤ b ≤ 1438
Care must be taken to distinguish certain timespans (Interpretation 1) from uncertain instants or uncertain timespans (Interpretation 2). Sometimes, this can only be achieved with extra contextual information. Consider:
"Germany expanded into Denmark, Norway, Belgium, the Netherlands, Luxembourg and France between April and June 1940."
Storing Uncertain Instants
Within historical datasets, we can use chronologically-ordered lists of values to store uncertain instants. Here, we use the notation "<x, y, z>" for such lists.
- Totally unknown instants are stored as an empty list: "< >".
- Certain instants are stored as singletons: "<x>" means the event unequivocally took place at instant "x".
- Uncertain instants with a uniform range are stored as pairs: "<x, y>" means the event took place between instant "x" and instant "y".
- Uncertain instants with a non-uniform range are stored as triplets: "<x, y, z>" means the event took place between instant "x" and instant "z" with a median value of "y", where x ≤ y ≤ z.
- More quantiles can be added to the distribution by adding elements to the list. The more elements we add, the more precise the probability distribution becomes.
A nice property of these lists is that the average (median) value of the uncertain instant is the "central" value:
- <1066> → 1066
- <1404, 1438> → 1421
- <1404, 1415, 1438> → 1415
- <1404, 1413, 1417, 1438> → 1415 = (1413+1417)÷2
- etc.
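With the lists held as ordinary Python lists, the central value falls out of the standard library's median function. The wrapper name is hypothetical, and returning None for a totally unknown instant is an assumption on my part:

```python
from statistics import median


def central_value(instants):
    """Median of a chronologically ordered list of quantiles; None if unknown."""
    if not instants:
        return None  # "< >": totally unknown instant
    return median(instants)


print(central_value([]))                        # None
print(central_value([1066]))                    # 1066
print(central_value([1404, 1438]))              # 1421.0
print(central_value([1404, 1415, 1438]))        # 1415
print(central_value([1404, 1413, 1417, 1438]))  # 1415.0
```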
Storing Uncertain Timespans
As with certain timespans, we could store uncertain timespans as two uncertain instants. However, care must be taken to prevent "negative" durations.
For example, if we assume the Voynich manuscript was actually written in the interval "[a, b]", some time between 1404 and 1438 inclusive, then merely constraining "a ∊ [1404, 1438]" and "b ∊ [1404, 1438]" opens up the possibility of violating "a ≤ b" (consider "a=1420" and "b=1410").
Representing uncertain timespans as "[a, a+d]" where "d ≥ 0" may be more appropriate and/or natural. This is particularly true for historical references where the duration of the event is more or less certain than its start or end date.
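To see why the start-plus-duration form avoids negative durations, here is a small sketch that draws candidate timespans for the Voynich example. The sampling scheme (uniform start, then uniform duration within what remains) is purely an illustrative assumption, not a claim about the real distribution:

```python
import random


def sample_timespan(earliest: int, latest: int, rng: random.Random):
    """Draw one candidate (a, b) with earliest <= a <= b <= latest."""
    a = rng.randint(earliest, latest)   # floating start
    d = rng.randint(0, latest - a)      # flexible, never negative, duration
    return a, a + d


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_timespan(1404, 1438, rng) for _ in range(1000)]

# Every candidate respects a <= b by construction.
print(all(1404 <= a <= b <= 1438 for a, b in samples))  # True
```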
Gantt Charts
Once we start treating uncertain timespans as a "floating" start instant and a "flexible" duration, relationships/dependencies between timespans begin to look a lot like Gantt chart analysis. There is a wealth of literature and algorithms that we can leverage from this field.
Charmingly, the Wikipedia page for Gantt charts starts with an uncertain timespan:
"It was designed and popularized by Henry Gantt c. 1910–1915."



