Attila T. Áfra's blog about pixels, voxels and threads

Improving BVH Ray Tracing Speed Using the AVX Instruction Set

with 2 comments

Last week I’ve attended Eurographics 2011 in Llandudno, UK and I’ve presented my poster entitled Improving BVH Ray Tracing Speed Using the AVX Instruction Set. If you are interested, you can download a personal copy of the paper and the poster. My work, along with all the other posters presented at the conference, can also be found at the Eurographics Digital Library.

Update: now you can download an interactive demo of the methods described in the paper!

The beautiful town of Llandudno in Wales

Written by Attila Áfra

April 19, 2011 at 1:22 pm

Doboz: compression library with very fast decompression

with 2 comments

Another side project of mine is Doboz (Hungarian for ‘box’), a small LZ-based data compression library written in C++ with very high decompression speed and close to zlib compression ratio. The decompression speed is typically between 700-1200 MB/s on an Intel Core i7-620M processor. However, compression is quite slow: about 2-3 MB/s. Both compression and decompression are memory safe.

There are many similar libraries (e.g., QuickLZ, FastLZ, LZO), but most of them were designed for both fast compression and decompression. Doboz is different: its compressor is really slow, but it can achieve better compression ratios and much higher decompression speeds than other libraries.

You can read more information (including detailed performance comparisons) about Doboz and grab the source code from here. I’ve made the source available under the zlib/libpng license. You can also download a package containing Windows binaries of a simple command-line compressor and a benchmark tool from here. The benchmark tool compares the performance of Doboz, QuickLZ, and zlib using the user-specified test file. I recommend using the 64-bit version because it’s slightly faster.

I took many ideas from QuickLZ and 7-zip (the dictionary), and combined them with my own tricks. Doboz is a really simple LZSS compressor that uses a relatively large, 2 MB dictionary and variable-length matches. Instead of decoding the matches with lots of branches, I use a small lookup table. This provides a nice performance boost.

static const struct {
	uint32_t mask; // the mask for the entire encoded match
	uint8_t offsetShift;
	uint8_t lengthMask;
	uint8_t lengthShift;
	int8_t size; // the size of the encoded match in bytes
} lut[] = {
	{0xff,        2,   0, 0, 1}, // (0)00
	{0xffff,      2,   0, 0, 2}, // (0)01
	{0xffff,      6,  15, 2, 2}, // (0)10
	{0xffffff,    8,  31, 3, 3}, // (0)11
	{0xff,        2,   0, 0, 1}, // (1)00 = (0)00
	{0xffff,      2,   0, 0, 2}, // (1)01 = (0)01
	{0xffff,      6,  15, 2, 2}, // (1)10 = (0)10
	{0xffffffff, 11, 255, 3, 4}, // 111

uint32_t word = fastRead(source, 4);
uint32_t i = word & 7;

match.offset = (word & lut[i].mask) >> lut[i].offsetShift;
match.length = ((word >> lut[i].lengthShift) & lut[i].lengthMask) + MIN_MATCH_LENGTH;

Another crucial part of the decompression algorithm is match copying. We must be careful with that because matches may overlap. Obviously, copying the bytes one by one is not the fastest method, but at least it works correctly for overlapping matches. Both QuickLZ and Doboz copy matches in chunks of 4 bytes, but they solve the overlap problem differently. Copying in chunks is valid as long as the distance between the source and destination pointers is not less than the chunk size. QuickLZ’s approach is to limit the minimum match offset to 2 and advance the pointers with only 3 bytes in every copy iteration. This means that, theoretically, 25% of the performance is wasted, even if the match is non-overlapping. In order to avoid this, I first check whether the match offset is less than the chunk size. If it is, I copy the first 3 bytes (the minimum match length) one by one and move back the source pointer a few bytes to increase the distance to the destination pointer. Then, I copy the rest of the match in chunks. This method works without limiting the minimum match offset.

int i = 0;

if (match.offset < 4) {
	do {
		fastWrite(outputIterator + i, fastRead(matchString + i, 1), 1);
	} while (i < 3);
	matchString -= 2 + (match.offset & 1);

do {
	fastWrite(outputIterator + i, fastRead(matchString + i, 4), 4);
	i += 4;
} while (i < match.length);

outputIterator += match.length;

Doboz is memory safe, which means that it will never read or write beyond the specified buffers, even if the compressed data is corrupted. Of course, this slows down the decompression because of the additional buffer checks. I’ve managed to decrease the amount of necessary checks by appending a few dummy bytes to the compressed data. This doesn’t really hurt the compression ratio and it makes safe decompression so fast it’s not worth disabling (about 5-7% slower than unsafe).

Writing this little library was quite fun and I’ve learned a lot during the process. I hope you’ll find it useful in some way. 🙂

Written by Attila Áfra

March 19, 2011 at 5:27 pm

Posted in Compression, Doboz

ICC color management in Media Player Classic Home Cinema

with 26 comments

Display calibration and profiling are becoming more and more popular thanks to relatively cheap colorimeters and wide gamut monitors. It’s important to know that an application must support color management in order to display accurate colors.

While most of the popular graphics editors, viewers and web browsers have this feature, I haven’t heard about any video players with ICC color management. So, I’ve decided to implement this in one of the most popular open source media players for Windows: Media Player Classic Home Cinema, or simply MPC-HC.

The problem with current color management systems (CMSs) is that they use the CPU. Even with lookup tables and SIMD optimizations, they are quite slow. Way too slow for real-time HD video playback. Fortunately, color management can be implemented very efficiently on the GPU.

The ideal solution would be to write a GPU-optimized CMS from scratch, but that’s a lot of work and I’m too busy right now with my ray tracing stuff. A much easier way is to build a 3D LUT with an existing CMS (I’ve opted for Little CMS), which you can sample in a little pixel shader to transform the pixels. 3D LUTs are frequently used in the film industry and are starting to get serious attention in the gaming industry too.

I’m using a 64x64x64 LUT with 16-bit per channel floating point entries, which provides results virtually indistinguishable from those obtained directly with Little CMS. Trilinear interpolation is crucial, and it’s natively supported for this texture format by most (if not all) GPUs released in the past few years. I’ve also added dithering as a final pass to avoid introducing banding artifacts.

GPU accelerated color management has been introduced in the latest stable version (1.4) of MPC-HC. If you experience some problems with that, you should download a recent SVN build. Make sure to select the EVR Custom (recommended for Vista and 7 users) or the VMR-9 (renderless) renderer in View -> Options -> Playback -> Output. Then, you can enable color management in the View menu as shown above. For a detailed description of the renderer settings, please check out the wiki.

Written by Attila Áfra

September 20, 2010 at 10:36 am

VoxLOD global illumination video (pre-rendered)

with 3 comments

Work in progress. Runs at about 0.5-1 fps at 1280×720 on my Intel Core i7-620m and NVIDIA NVS 5100M.

Written by Attila Áfra

September 5, 2010 at 1:31 am

Posted in GI, Graphics, Ray Tracing, VoxLOD

Shadows and global illumination

with 6 comments

I’ve been quite busy in the past few months writing my Master’s dissertation (which is about my real-time massive model visualization method), so I didn’t have time to update my blog.

First of all, I’ve implemented shadows in VoxLOD, which has thus become a ray tracer. Of course, level-of-detail is applied to the shadow rays too. For example, this is how the ray traced 354 million triangle Mandelbulb model looks like:

Currently, only point light sources are supported, which cast hard shadows. In the future, I would like to implement area lights too. It’s not easy to render soft shadows with ray tracing at high quality and speed, so I will have to do some research on this.

While shadows make the rendered image a lot more realistic, the parts in shadow are completely flat, devoid of any details, thanks to the constant ambient light. One possible solution is ambient occlusion, but I wanted to go further: global illumination in real-time.

GI in VoxLOD is very experimental and unoptimized for now. It’s barely interactive: it runs at only 1-2 fps at 640×480 on my dual-core Core i7 notebook. Fortunately, there are lots of optimization opportunities. Now let’s see an example image:

Note that most of the scene is not directly lit, and color bleeding caused by indirect lighting is clearly visible. There are two light sources: a point light (the Sun) and a hemispherical one (the sky). I use Monte Carlo integration to compute the GI with one bounce of indirect lighting. Nothing is precomputed (except the massive model data structure of course).

I trace only two GI rays per pixel, and therefore, the resulting image must be heavily filtered in order to eliminate the extreme noise. While all the ray tracing is done on the CPU, the noise filter runs on the GPU and is implemented in CUDA. Since diffuse indirect lighting is quite low frequency, it is adequate to use low LODs for the GI rays.

Here are some more screenshots:

Written by Attila Áfra

July 25, 2010 at 6:14 pm

Mandelbulb walkthrough with VoxLOD

leave a comment »

354 million triangles
Recorded in real-time on a Core 2 Duo T5500

Since I couldn’t find any publicly available, truly huge data sets, I’ve written a simple triangle model generator, which extracts an isosurface of a procedural scalar field. I’ve used this program to generate the detailed Mandelbulb fractal model shown in the video. So far, this is the most complex data set I’ve tested VoxLOD with. Unfortunately, it’s too artificial. I would be really grateful if somebody could provide me with a real-world massive model!

Written by Attila Áfra

March 1, 2010 at 11:34 pm

XYZ RGB Dragon walkthrough with VoxLOD

leave a comment »

7 million triangles
Recorded in real-time on a Core 2 Quad Q9300

Written by Attila Áfra

March 1, 2010 at 1:52 pm