chilliant

Friday, 31 December 2010

Computus 3

After reminiscing with Dr J R Stockton about old computers, it got to converting my 16-bit algorithm for Easter to 8-bit Z80 assembly language. I spent quite some time hand-optimising it (which is an enjoyable past-time in itself, if there's no deadline involved!) and posted the code with a short commentary.

I had thought about writing a bespoke super-optimiser to try to find the "ultimate" Z80 code, but with just under 500 bytes of machine code, that's over 10¹⁰⁰⁰ permutations. I think I might have to divide and conquer!

Sunday, 26 December 2010

Modulo Arithmetic in Z80 assembler

I've done a little 8-bit assembly coding in the past, mostly Z80 (on Sinclair machines) and 6502 (on BBC Micro), but I've never really thought about arithmetic optimisations for those platforms as much as I have the last couple of months.

Imagine that you are (in the 21st century) writing Z80 date manipulation functions where the day of the week was important. You'd naturally want to perform "modulo 7" arithmetic on 16-bit quantities. If you had a routine to perform integer division by 7, you might find "z = x mod 7" given "y = floor(x / 7)" for unsigned 16-bit values as follows:

; Z80 assembly to compute 'x mod 7' from 'x' and 'x div 7'
; 'x' in HL
; 'x div 7' in DE
LD B, D ; 4T 1B -- Take a copy of DE
LD C, E   ; 4T 1B -- BC := x div 7
EX DE, HL ; 4T 1B -- DE := x, HL := x div 7
ADD HL, HL    ; 11T 1B -- HL := (x div 7) * 2
ADD HL, HL    ; 11T 1B -- HL := (x div 7) * 4
ADD HL, HL    ; 11T 1B -- HL := (x div 7) * 8
AND A       ; 4T 1B -- Clear carry flag
SBC HL, BC ; 15T 2B -- HL := (x div 7) * 7
EX DE, HL    ; 4T 1B -- HL := x, DE := (x div 7) * 7
AND A       ; 4T 1B -- Clear carry flag
SBC HL, BC ; 15T 2B -- HL := x - (x div 7) * 7
LD A, L ; 4T 1B -- A:= x mod 7
; 'x mod 7' in A

At first glance, this seems fairly fast, compact code (91 T-cycles and 14 code bytes), but notice that, by the end of it, one side-effect is setting H to zero. Not very surprising as "x mod 7" is clearly an 8-bit quantity. So why go to the trouble of computing full 16-bit intermediates that have only partial bearing on the final result? Providing that the upper eight bits don't "pollute" the lower eight (e.g. during division or change of sign), you need not compute them. Especially as 16-bit arithmetic is disproportionately expensive on Z80:

Obviously uses more precious registers,
Addition of 16-bit quantites takes at least 11 T-cycles compared to 4 T-cycles for 8-bit quantities,
Subtraction via 'SBC' requires worrying about the carry flag as there is no 16-bit 'SUB' instruction,
Shifting 16-bit quantities (multiplication and division by two) other than 'ADD HL, HL' requires two, long instructions.

So the 8-bit equivalent could be:

; Z80 assembly to compute 'x mod 7' from 'x' and 'x div 7'
; 'x' in HL
; 'x div 7' in DE
LD A, E ; 4T 1B -- A := x div 7 [low bits]
ADD A, A    ; 4T 1B -- A := (x div 7) * 2 [low bits]
ADD A, A    ; 4T 1B -- A := (x div 7) * 4 [low bits]
ADD A, A    ; 4T 1B -- A := (x div 7) * 8 [low bits]
SUB E       ; 4T 1B -- A := (x div 7) * 7 [low bits]
NEG      ; 8T 2B -- A := (x div 7) * -7 [low bits]
ADD A, L ; 4T 1B -- A := x mod 7
; 'x mod 7' in A

This is both faster and smaller (32 T-cycles and 8 code bytes) with no loss of accuracy, so I'm a bit bemused that it's taken me nearly three decades to stumble across it. And a lot of good it will do me too!

P.S. Why is 'NEG' so expensive on Z80? Given that 'CPL' is only 4 T-cycles and 2 code bytes, it seems a waste to dedicate precious silicon on a similar instruction that can be represented with existing instructions ('CPL' then 'INC A') for the same cost. Bah humbug!

Friday, 10 December 2010

Computus 2

Of course, if you're as old as I am, you'll remember programming on machines with no native division instruction (neither integer nor floating-point), so all those divisions and modulos aren't going to get you very far on a "old school" 16-bit computer. Using the information from Hacker's Delight, you can re-cast these operations to multiplications and shifts:

void easter_u16_nodiv(u16 year, u16* month, u16* day)
{
    u16 a = year >> 1;
    u16 b = year >> 2;
    u16 c = year >> 4;
    u16 d = b + c;
    u16 e = (a + d + (d >> 4) + (d >> 5) + (d >> 10)) >> 4;
    u16 f = year - e * 19;
    u16 g = f - ((f + 13) >> 5) * 19;
    d = (b >> 1) + (b >> 3);
    e = (d + (d >> 6) + (d >> 7)) >> 4;
    f = b - e * 25 + 39;
    d = e + (f >> 5);
    u16 h = (d * 3) >> 2;
    d = (d * 8) + 5;
    e = (d >> 1) + (d >> 3);
    e = (e + (e >> 6) + (e >> 7)) >> 4;
    f = d - (e * 25) + 7;
    d = (g * 19) + h - e - (f >> 5) + 15;
    e = ((d >> 1) + (d >> 5)) >> 4;
    f = d - (e * 30);
    d = f - ((f + 2) >> 5) * 30;
    g = d + ((29578 - g - (d * 32)) >> 10);
    f = b - h + g + 2;
    d = a + c + (f >> 1) + (f >> 4);
    e = (d + (d >> 6) + (d >> 12)) >> 2;
    f = year + f - (e * 7);
    e = g - f + ((f * 2341) >> 14) * 7;
    d = e >> 5;
    *day = e - d * 31;
    *month = d + 3;
}

Again, if you want to verify the algorithm (only valid for Gregorian years 0 to 65535), you can do so with Dr J R Stockton's test harness with the following Javascript snippet:

    var a, b, c, d, e, f, g, h;
    a = YR >>> 1;
    b = YR >>> 2;
    c = YR >>> 4;
    d = b + c;
    e = (a + d + (d >>> 4) + (d >>> 5) + (d >>> 10)) >>> 4;
    f = YR - e * 19;
    g = f - ((f + 13) >>> 5) * 19;
    d = (b >>> 1) + (b >>> 3);
    e = (d + (d >>> 6) + (d >>> 7)) >>> 4;
    f = b - e * 25 + 39;
    d = e + (f >>> 5);
    h = (d * 3) >>> 2;
    d = (d * 8) + 5;
    e = (d >>> 1) + (d >>> 3);
    e = (e + (e >>> 6) + (e >>> 7)) >>> 4;
    f = d - (e * 25) + 7;
    d = (g * 19) + h - e - (f >>> 5) + 15;
    e = ((d >>> 1) + (d >>> 5)) >>> 4;
    f = d - (e * 30);
    d = f - ((f + 2) >>> 5) * 30;
    g = d + ((29578 - g - (d * 32)) >>> 10);
    f = b - h + g + 2;
    d = a + c + (f >>> 1) + (f >>> 4);
    e = (d + (d >>> 6) + (d >>> 12)) >>> 2;
    f += YR - (e * 7);
    e = g - f + ((f * 2341) >>> 14) * 7;
    return e;

Computus 1

Computus is the calculation of Easter. Calendrical calculations being a bit of an obsession of mine, I've been looking at this with a view to an implementation using limited-range integer arithmetic. There's a lot of interesting stuff still being researched on this, but I haven't seen a good implementation for 16-bit unsigned arithmetic. So here's my attempt:

void easter_u16(u16 year, u16* month, u16* day)
{
    u16 a = year % 19;
    u16 b = year >> 2;
    u16 c = (b / 25) + 1;
    u16 d = (c * 3) >> 2;
    u16 e = ((a * 19) - ((c * 8 + 5) / 25) + d + 15) % 30;
    e += (29578 - a - e * 32) >> 10;
    e -= ((year % 7) + b - d + e + 2) % 7;
    d = e >> 5;
    *day = e - d * 31;
    *month = d + 3;
}

This is valid for all Gregorian years 0 to 65535. In fact, the core algorithm is valid for all non-negative years: just change the integer type to a wider, signed one.

If you want to test it using Dr J R Stockton's wonderful script tester, use the following Javascript snippet:

    var a = YR % 19;
    var b = YR >>> 2;
    var c = ((b / 25)|0) + 1;
    var d = (c * 3) >>> 2;
    var e = ((a * 19) - (((c * 8 + 5) / 25)|0) + d + 15) % 30;
    e += (29578 - a - (e << 5)) >>> 10;
    e -= ((YR % 7) + b - d + e + 2) % 7;
    return e;

Sunday, 14 November 2010

RGB/HSV in HLSL

My present work has brought me into contact with HLSL pixel shaders. Many of these involve colour manipulations, so a fast and efficient conversions between RGB and HSV (or HSL) colour spaces would seem to be readily available and optimised to death. Strangely, this doesn't seem to be the case. Even shaders given as examples by NVIDIA and ATI are somewhat simplistic. The Wikipedia page's pseudo-code suggested the following HLSL code:

float3 HSVtoRGB(float3 HSV) { float3 RGB = 0; float C = HSV.z * HSV.y; float H = HSV.x * 6; float X = C * (1 - abs(fmod(H, 2) - 1)); if (HSV.y != 0) { float I = floor(H); if (I == 0) { RGB = float3(C, X, 0); } else if (I == 1) { RGB = float3(X, C, 0); } else if (I == 2) { RGB = float3(0, C, X); } else if (I == 3) { RGB = float3(0, X, C); } else if (I == 4) { RGB = float3(X, 0, C); } else { RGB = float3(C, 0, X); } } float M = HSV.z - C; return RGB + M; } float3 RGBtoHSV(float3 RGB) { float3 HSV = 0; float M = min(RGB.r, min(RGB.g, RGB.b)); HSV.z = max(RGB.r, max(RGB.g, RGB.b)); float C = HSV.z - M; if (C != 0) { HSV.y = C / HSV.z; float3 D = (((HSV.z - RGB) / 6) + (C / 2)) / C; if (RGB.r == HSV.z) HSV.x = D.b - D.g; else if (RGB.g == HSV.z) HSV.x = (1.0/3.0) + D.r - D.b; else if (RGB.b == HSV.z) HSV.x = (2.0/3.0) + D.g - D.r; if ( HSV.x < 0.0 ) { HSV.x += 1.0; } if ( HSV.x > 1.0 ) { HSV.x -= 1.0; } } return HSV; }

Even to my eyes, these looked far than optimal. However, a quick glance at the HSV graph on the Wiki page suggests an optimisation for 'HSVtoRGB' which doesn't involve branching or conditional predicates.

float3 Hue(float H) { float R = abs(H * 6 - 3) - 1; float G = 2 - abs(H * 6 - 2); float B = 2 - abs(H * 6 - 4); return saturate(float3(R,G,B)); } float3 HSVtoRGB(in float3 HSV) { return ((Hue(HSV.x) - 1) * HSV.y + 1) * HSV.z; }

This is particularly efficient because 'abs' and 'saturate' are "free" operations on a lot of GPU pipelines.

The reverse conversion is a bit more tricky, and if you're really obsessed with performance on Xbox360, or the like, you'll need to drop down into shader assembler to utilise the vector 'max4' instruction instead of the scalar alternatives:

float3 RGBtoHSV(in float3 RGB) { float3 HSV = 0; #if NO_ASM HSV.z = max(RGB.r, max(RGB.g, RGB.b)); float M = min(RGB.r, min(RGB.g, RGB.b)); float C = HSV.z - M; #else float4 RGBM = RGB.rgbr; asm { max4 HSV.z, RGBM }; asm { max4 RGBM.w, -RGBM }; float C = HSV.z + RGBM.w; #endif if (C != 0) { HSV.y = C / HSV.z; float3 Delta = (HSV.z - RGB) / C; Delta.rgb -= Delta.brg; Delta.rg += float2(2,4); if (RGB.r >= HSV.z) HSV.x = Delta.b; else if (RGB.g >= HSV.z) HSV.x = Delta.r; else HSV.x = Delta.g; HSV.x = frac(HSV.x / 6); } return HSV; }

Although we haven't managed to get rid of the 'if' statements, they typically compile down to one conditionally predicated block and two conditional assignments.

Even against the startlingly successful optimisations produced by the current batch of HLSL compilers, these refactorings produce excellent results. The round-trip conversions (RGB-to-HSV-to-RGB) are typically three times faster than the simplistic implementations. For a pixel shader, that's not to be sneezed at.

Sunday, 7 November 2010

SketchUp to Reality

Google's SketchUp is an amazing bit of tech; but how many people use it in anger? From drawing to real-life, instead of the other way round? The work-in-progress known as Pear Tree House has actually benefited from the application of SketchUp to produce working plans for the building of a cedar and glass "loggia" (a posh word for "a lean-to without sides"). The original technical drawings and load calcs were done professionally, but I transferred the plans to SketchUp to help us get a feel for how everything fits together, and, more importantly, getting the roof lines correct.

The real-life results are surprisingly close to the plans I put together back in March 2010:

Saturday, 6 November 2010

ZX Spectrum Fine Art

I've been trawling through old demoscene, low-resolution art lately; in particular, images that can be displayed on a bog-standard Sinclair ZX Spectrum. These were limited to 256-by-192 pixels, with each 8-by-8 patch limited to only two colours from a very small palette. Most artwork of the 1980s was confined to loading screens, some very clever indeed:

But, even today, people still dabble in the arcane skills involved in getting something half-decent out of the huge constraints. As a quick exercise, I looked for an iconic painting to recreate in ZX Spectrum graphics. Something simplistic with blocks of solid colour would be ideal. I chose David Hockney's "A Bigger Splash":

Oh dear...