But first, I tried to wring some more mileage out of the polynomial approximation of the sRGB gamma curve, which, as you'll remember, is:
if (srgb <= 0.04045)
lin = srgb / 12.92;
else
lin = pow((srgb + 0.055) / 1.055, 2.4);
It is common that the input 'srgb' values are bytes (0..255 for each of red, green and blue channels) and so are the output 'lin' values. My initial thought was to use a quartic integer polynomial of the form:
uint32 x = srgb;
uint32 y = a*x*x*x*x + b*x*x*x + c*x*x + d*x + e;
lin = (uint8)(y >> 24);
That is, pick the constants 'a', 'b', 'c', 'd' and 'e' such that, after multiplying through as 32-bit unsigned integers, the result is conveniently in the uppermost eight bits of 'y'. I dutifully prepared a table of values and fed them into an on-line tool such as the excellent Polynomial Regression Data Fit page by Paul Lutus. It came up with the following polynomial for 'y' in terms of 'x':
y
= -1.296·10-1 x4 + b x3 + c x2 + d x + e
This was a little disappointing: the value 'a' turned out to be about -0.1296. That's definitely not an integer coefficient. So I thought I'd be clever and approximate the first term as
-(x * x * x * x / 8)
Serendipitously, the division can be performed using a right-shift. This approximation is all very well, but the remaining coefficients are no longer correct. So I rearranged the table so that the residuals could be approximated as a cubic polynomial:
y + x*x*x*x/8 = b*x*x*x + c*x*x + d*x + e
The next data fit suggested a polynomial with a value of 'b' of approximately 142. I continued the process of fitting, rounding to an integer coefficient and computing the residuals until I got all the values:
a = -1/8
b = 142
c = 35322
d = 602000
e = 11918016
Plugging these numbers back into the formula and testing all 256 input possibilities, I got exact results for 238 inputs and a maximum error of ±1 for the remaining 18. That's not too bad, but hardly stellar, especially as a lookup table of 256 bytes would be more accurate and almost certainly faster!
However, if the input/output integers were 16 bits, instead of 8, a polynomial solution might be worthwhile as the size of a 65536-element array would be cache-unfriendly. This is not as unlikely as it seems: high quality graphics processing is expected to deal with 16-bits-per-channel sRGB inputs.
So, going through the whole process again, but this time using 16- and 64-bit integers, I came up with:
uint64 x1 = srgb;
uint64 x2 = x1 * x1;
uint64 x4 = x2 * x2;
uint64 y = (x1 + 3441) * (x1 + 72046) * x1 * 32627 - x4 / 10;
lin = (uint16)(y >> 48);
Finally, I bit the bullet and went for the somewhat more prosaic piecewise linear approximation. This is an undervalued technique that allows you to balance the speed of table-based lookups with the immediacy of full computation. Here's the algorithm:
const uint32 lut[256][2];
uint8 i = (uint8)(srgb >> 8);
uint32 y = lut[i][0] * srgb + lut[i][1];
lin = (uint16)(y >> 16);
One kilobyte of constant lookup table data isn't too cache-unfriendly and the code has a certain beautiful simplicity.
I divided the 65536-element sRGB curve table into 256 equal parts. Each part was fitted to the appropriate portion of the curve using simple linear regression, being careful to handle rounding correctly. The gradient coefficients are in 'lut[i][0]' and the y-intercepts are in 'lut[i][1]'. They are biased so that the required 16-bit result is in the upper half of 'y'. For completeness, the full table is listed at the end of this post.
So, how does it perform? Actually, very well. The error is never more than ±1 in 65535, with 61009 of the 65536 mappings being exact results. As an aside, if you replaced the integer entries in 'lut' with appropriate floating-point values, you could also construct a fairly accurate integer-sRGB-to-float-linear conversion.
I'll leave the reverse transformation (linear to sRGB) as an exercise for the reader.
Here's that table...
const uint32 lut[256][2] = {
{ 5076, 0x00007CEA },
{ 5068, 0x000086E6 },
{ 5071, 0x00008368 },
{ 5069, 0x00008C66 },
{ 5078, 0x000067EB },
{ 5073, 0x00007B68 },
{ 5071, 0x00008A68 },
{ 5070, 0x000090E7 },
{ 5065, 0x0000C064 },
{ 5078, 0x00004BEB },
{ 5192, 0xFFFBB824 },
{ 5497, 0xFFEEA33C },
{ 5798, 0xFFE08653 },
{ 6103, 0xFFD1136C },
{ 6415, 0xFFC00C08 },
{ 6739, 0xFFAD16AA },
{ 7059, 0xFF99184A },
{ 7391, 0xFF830BF0 },
{ 7719, 0xFF6BF894 },
{ 8054, 0xFF531BBB },
{ 8390, 0xFF38D863 },
{ 8734, 0xFF1C9F0F },
{ 9079, 0xFEFEF63C },
{ 9428, 0xFEDF956A },
{ 9778, 0xFEBEC519 },
{ 10138, 0xFE9B9CCD },
{ 10500, 0xFE76D482 },
{ 10870, 0xFE4FC93B },
{ 11227, 0xFE28B46E },
{ 11600, 0xFDFE71A8 },
{ 11972, 0xFDD2D862 },
{ 12347, 0xFDA5709E },
{ 12726, 0xFD7612DB },
{ 13114, 0xFD440F9D },
{ 13505, 0xFD1021E0 },
{ 13891, 0xFCDB5BA2 },
{ 14287, 0xFCA3AB68 },
{ 14673, 0xFC6BE828 },
{ 15082, 0xFC2F3775 },
{ 15489, 0xFBF134C0 },
{ 15894, 0xFBB1EC0B },
{ 16273, 0xFB753D48 },
{ 16699, 0xFB2F7A1E },
{ 17126, 0xFAE7C973 },
{ 17549, 0xFA9F13C6 },
{ 17963, 0xFA565196 },
{ 18384, 0xFA0AAFE8 },
{ 18823, 0xF9BA1644 },
{ 19246, 0xF96AC597 },
{ 19679, 0xF917E4F0 },
{ 20107, 0xF8C44FC6 },
{ 20545, 0xF86D16A0 },
{ 20989, 0xF812EA7E },
{ 21418, 0xF7BA20D5 },
{ 21845, 0xF7603D2A },
{ 22327, 0xF6F8AE1C },
{ 22784, 0xF694B180 },
{ 23228, 0xF631D25E },
{ 23678, 0xF5CBE33F },
{ 24141, 0xF56135A6 },
{ 24608, 0xF4F3C210 },
{ 25070, 0xF485A8F7 },
{ 25541, 0xF4139362 },
{ 25986, 0xF3A611C1 },
{ 26477, 0xF32B5836 },
{ 26945, 0xF2B48320 },
{ 27423, 0xF2394610 },
{ 27896, 0xF1BD7A7C },
{ 28387, 0xF13B0CF2 },
{ 28864, 0xF0BA7360 },
{ 29347, 0xF0365ED2 },
{ 29830, 0xEFB06A43 },
{ 30318, 0xEF272B37 },
{ 30815, 0xEE9972B0 },
{ 31308, 0xEE0AEF26 },
{ 31800, 0xED7ACB1C },
{ 32285, 0xECEAD38E },
{ 32770, 0xEC592501 },
{ 33305, 0xEBB6168C },
{ 33798, 0xEB1DF803 },
{ 34309, 0xEA7E5182 },
{ 34817, 0xE9DD9A80 },
{ 35333, 0xE9385682 },
{ 35844, 0xE892AB02 },
{ 36361, 0xE7E90B84 },
{ 36886, 0xE73AB80B },
{ 37403, 0xE68D058E },
{ 37919, 0xE5DDA990 },
{ 38447, 0xE5282998 },
{ 38970, 0xE472551D },
{ 39493, 0xE3BA7BA2 },
{ 40021, 0xE2FED4AA },
{ 40556, 0xE23E9636 },
{ 41099, 0xE1794FC6 },
{ 41633, 0xE0B53CD0 },
{ 42167, 0xDFEF0FDC },
{ 42710, 0xDF236F6B },
{ 43255, 0xDE54E7FC },
{ 43812, 0xDD7F9892 },
{ 44339, 0xDCB3CD1A },
{ 44892, 0xDBDBC4AE },
{ 45439, 0xDB03F040 },
{ 45983, 0xDA2B3150 },
{ 46545, 0xD9491468 },
{ 47082, 0xD86EEDF5 },
{ 47654, 0xD7845613 },
{ 48211, 0xD69DB2AA },
{ 48764, 0xD5B6923E },
{ 49313, 0xD4CF0AD0 },
{ 49891, 0xD3D8F7F2 },
{ 50459, 0xD2E4EE0E },
{ 51033, 0xD1EC112C },
{ 51599, 0xD0F47248 },
{ 52156, 0xCFFE9BDE },
{ 52745, 0xCEF85C84 },
{ 53324, 0xCDF44726 },
{ 53896, 0xCCF11344 },
{ 54487, 0xCBE2FBEC },
{ 55055, 0xCADD2508 },
{ 55641, 0xC9CCC12C },
{ 56217, 0xC8BEC54C },
{ 56799, 0xC7ABB970 },
{ 57394, 0xC6903619 },
{ 57990, 0xC571DA43 },
{ 58585, 0xC4519FEC },
{ 59169, 0xC3347710 },
{ 59770, 0xC20CA4BD },
{ 60357, 0xC0E97262 },
{ 60947, 0xBFC2798A },
{ 61554, 0xBE90A439 },
{ 62151, 0xBD6180E4 },
{ 62758, 0xBC2AEA93 },
{ 63359, 0xBAF50C40 },
{ 63976, 0xB9B482F4 },
{ 64565, 0xB880359A },
{ 65313, 0xB6F5B510 },
{ 65594, 0xB6609D1D },
{ 66392, 0xB4B5DDAC },
{ 67048, 0xB35433F4 },
{ 67654, 0xB20B1C23 },
{ 68259, 0xB0C03BD2 },
{ 68881, 0xAF69A808 },
{ 69496, 0xAE1485BC },
{ 70120, 0xACB7FAF4 },
{ 70740, 0xAB5B3F2A },
{ 71367, 0xA9F822E4 },
{ 71987, 0xA896921A },
{ 72614, 0xA72E93D3 },
{ 73252, 0xA5BDC612 },
{ 73882, 0xA44F214D },
{ 74524, 0xA2D6F48E },
{ 75163, 0xA15E054E },
{ 75799, 0x9FE4628C },
{ 76435, 0x9E683ECA },
{ 77063, 0x9CEE7B04 },
{ 77702, 0x9B6B9DC3 },
{ 78338, 0x99E81901 },
{ 79002, 0x9850E64D },
{ 79643, 0x96C5460E },
{ 80292, 0x95322DD2 },
{ 80938, 0x939E6E15 },
{ 81602, 0x91FCD061 },
{ 82225, 0x90728F18 },
{ 82889, 0x8ECBCE64 },
{ 83541, 0x8D2A20AA },
{ 84197, 0x8B8352F2 },
{ 84856, 0x89D806BC },
{ 85512, 0x882C1B04 },
{ 86176, 0x86785C50 },
{ 86832, 0x84C75298 },
{ 87509, 0x8305BC6A },
{ 88164, 0x81503732 },
{ 88831, 0x7F901500 },
{ 89506, 0x7DC7EAD1 },
{ 90164, 0x7C08B21A },
{ 90840, 0x7A3A906C },
{ 91493, 0x7879A832 },
{ 92187, 0x7699D98E },
{ 92859, 0x74C695DE },
{ 93538, 0x72EBD0B1 },
{ 94212, 0x7111E102 },
{ 94887, 0x6F34A1D4 },
{ 95566, 0x6D51ECA7 },
{ 96251, 0x6B683D7E },
{ 96941, 0x69784BD6 },
{ 97615, 0x67913128 },
{ 98304, 0x659CA500 },
{ 98993, 0x63A547D8 },
{ 99670, 0x61B421AB },
{ 100369, 0x5FB01288 },
{ 101055, 0x5DB2ECE0 },
{ 101744, 0x5BB0E1B8 },
{ 102443, 0x59A4A496 },
{ 103138, 0x5798AD71 },
{ 103835, 0x55887B4E },
{ 104527, 0x53795EA8 },
{ 105232, 0x515D9D88 },
{ 105935, 0x4F40A268 },
{ 106634, 0x4D23F945 },
{ 107327, 0x4B094C20 },
{ 108041, 0x48DB7F84 },
{ 108731, 0x46BDC1DE },
{ 109469, 0x4477684E },
{ 110164, 0x4250472A },
{ 110872, 0x401C138C },
{ 111573, 0x3DEABC6A },
{ 112282, 0x3BB03E4D },
{ 113000, 0x396BB7B4 },
{ 113717, 0x3725289A },
{ 114417, 0x34E9B4F8 },
{ 115149, 0x32914B66 },
{ 115865, 0x30432DCC },
{ 116593, 0x2DE85338 },
{ 117317, 0x2B8DEAA2 },
{ 118038, 0x2933358B },
{ 118760, 0x26D4D1F4 },
{ 119487, 0x246F66E0 },
{ 120214, 0x220721CB },
{ 120938, 0x1F9E9935 },
{ 121667, 0x1D2EF422 },
{ 122401, 0x1AB82990 },
{ 123141, 0x18394902 },
{ 123858, 0x15CB7DE9 },
{ 124586, 0x13515B55 },
{ 125319, 0x10D00244 },
{ 126063, 0x0E421EB8 },
{ 126796, 0x0BBB07A6 },
{ 127535, 0x092BC398 },
{ 128289, 0x068C3910 },
{ 129043, 0x03E9AC8A },
{ 129778, 0x01553C79 },
{ 130452, 0xFEF517CA },
{ 131275, 0xFC0B4AE6 },
{ 131988, 0xF9825BCA },
{ 132739, 0xF6D3F2C2 },
{ 133496, 0xF41D06BC },
{ 134231, 0xF17774AC },
{ 134991, 0xEEB7E328 },
{ 135745, 0xEBFAE3A0 },
{ 136491, 0xE9426F16 },
{ 137238, 0xE686280B },
{ 137998, 0xE3BAB987 },
{ 138758, 0xE0EC5103 },
{ 139518, 0xDE1AE87F },
{ 140265, 0xDB52F474 },
{ 141036, 0xD8711D76 },
{ 141788, 0xD59E83EE },
{ 142558, 0xD2B7A06F },
{ 143327, 0xCFCEAA70 },
{ 144092, 0xCCE6966E },
{ 144853, 0xC9FF696A },
{ 145619, 0xC71066EA },
{ 146384, 0xC41F5CE8 },
{ 147190, 0xC102CA7B },
{ 147924, 0xBE2A77EA },
{ 148707, 0xBB1E88F2 }
};
Isn't the final table two kilobytes, not one?
ReplyDeleteThis is interesting, but it's kind of unfortunate that your focus is on converting to linear instead of from linear. I'm not sure what your use case is for this because I don't know when you would ever have 16-bit data in the sRGB gamma, let alone needing to convert it to linear as fast as possible. In the real world, most input data would be 8-bit sRGB, and you want to convert it to 16-bit linear to do high-precision linear transformations. This can be done directly with a 512 byte LUT. The hard part is converting 16-bit linear back down to 8-bit sRGB efficiently without a huge 64kb LUT.
I imagine your solution here would work just as well though, and since typically we only care about 8-bit sRGB, it would probably work with just a 16-bit LUT for the linear piecewise approximation, so it really would be 1kb. It's not exactly straightforward to compute the tables (and get rounding right) but I can't find any better solution.
If you're still reading after all these years, this is my attempt:
ReplyDeletehttps://github.com/ncruces/go-image/blob/master/imageutil/srgb.go
My approach is different, with two 512 byte LUTs for the forward and reverse transforms.
Linear to sRGB is the hard part here. The linear to sRGB conversion is reminiscent of a square root and doesn't work very well with polynomial approximations, making it extremely difficult to round trip all 256 values. The lookup table method involves a 3295 byte lookup table that maps linear 0—3294 to sRGB 0—255, with the 1/12.92 slope of linear mapped to step size of 1. The bit scan reverse or find first set might be used to construct exponent and mantissa of linear value to possibly compress the lookup table, but that doesn't seem to be applicable to vectorized implementations such as for MMX or SSE2. One possible way to approximate sRGB to linear from 8-bit i is by int16_t j=i<<7; int16_t a=10810; a=(a*j)>>16; a+=7424; a=(a*j)>>16; a+=498; a=(a*j)>>16; which converts to a range of 0—3424 and may be implemented in terms of PUNPCKLBW, PSLLW, PMULHW, and PADDW instructions. However, the reverse doesn't seem to be numerically stable enough for a single polynomial, and a piecewise method by PCMPGTW instruction might run out of registers if you tried to splice two quadratics and store the constants in registers. I'm new to SIMD/vector instructions so I'm still trying to figure things out here. I've left the GNU/LLVM duopoly so I'm trying to optimize all I can.
ReplyDelete