Attila T. Áfra's blog about pixels, voxels and threads

Embree Ray Tracing Kernels: Overview and New Features


Attila T. Áfra, Ingo Wald, Carsten Benthin, Sven Woop
SIGGRAPH 2016 Talk


Embree is an open source ray tracing library consisting of high-performance kernels optimized for modern CPUs with increasingly wide SIMD units. Since its original release, it has become the state-of-the-art for professional CPU-based rendering applications. In the first half of this talk, we will give a brief overview of the Embree framework and how to use it. In the second half, we will present recent improvements and features introduced since the initial publication of the system. These additions include new geometry types commonly used in production renderers (quads, subdivision surfaces, and hair), improved motion blur support, and ray streams that can be traversed more efficiently than single rays and ray packets.

Abstract (personal copy): PDF
Abstract (definitive version): available at dl.acm.org
Slides: PDF
Citation: BibTeX

Written by Attila Áfra

January 13, 2017 at 4:03 pm

Local Shading Coherence Extraction for SIMD-Efficient Path Tracing on CPUs


Attila T. Áfra, Carsten Benthin, Ingo Wald, Jacob Munkberg (Intel Corporation)
High-Performance Graphics 2016


Accelerating ray traversal on data-parallel hardware architectures has received widespread attention over the last few years, but much less research has focused on efficient shading for ray tracing. This is unfortunate since shading for many applications is the single most time-consuming operation. To maximize rendering performance, it is therefore crucial to effectively use the processor’s wide vector units not only for the ray traversal step itself, but also during shading. This is non-trivial as incoherent ray distributions cause control flow divergence, making high SIMD utilization difficult to maintain. In this paper, we propose a local shading coherence extraction algorithm for CPU-based path tracing that enables efficient SIMD shading. Each core independently traces and sorts small streams of rays that fit into the on-chip cache hierarchy, which allows extracting coherent ray batches that require similar shading operations with very low overhead. We show that operating on small independent ray streams instead of a large global stream is sufficient to achieve high SIMD utilization in shading (90% on average) for complex scenes, while avoiding unnecessary memory traffic and synchronization. For a set of scenes with many different materials, our approach reduces the shading time by 1.9-3.4x compared to simple structure-of-arrays (SoA) based packet shading. The total rendering speedup varies between 1.2x and 3x, depending also on the ratio of the traversal and shading times.
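To give a flavor of the extraction step, here is a minimal sketch of sorting a small ray stream by material ID so that shading can run on full SIMD batches. All names (`Ray`, `materialId`, `extractBatches`) are made up for illustration; the paper's streams are cache-sized, SoA-laid-out, and sorted per core.

```cpp
#include <algorithm>
#include <vector>

// Toy version of the coherence-extraction idea: group the indices of a small
// ray stream by material ID with a counting sort, so each batch can be shaded
// with (ideally) all SIMD lanes running the same shader.
struct Ray { int materialId; /* origin, direction, hit data, ... */ };

// Fills sortedIndices with ray indices grouped by material, and batchStarts
// with the start offset of each material's batch inside sortedIndices.
void extractBatches(const std::vector<Ray>& stream,
                    std::vector<int>& sortedIndices,
                    std::vector<int>& batchStarts) {
    int maxId = 0;
    for (const Ray& r : stream) maxId = std::max(maxId, r.materialId);

    // Counting sort: one pass to histogram, one prefix sum, one scatter pass.
    std::vector<int> count(maxId + 2, 0);
    for (const Ray& r : stream) ++count[r.materialId + 1];
    for (int m = 1; m < (int)count.size(); ++m) count[m] += count[m - 1];

    batchStarts.assign(count.begin(), count.end() - 1);
    sortedIndices.resize(stream.size());
    std::vector<int> cursor = batchStarts;
    for (int i = 0; i < (int)stream.size(); ++i)
        sortedIndices[cursor[stream[i].materialId]++] = i;
}
```

Because the sort only touches index arrays that fit in cache, the overhead stays small compared to shading itself.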

Paper (personal copy): PDF
Paper (definitive version): available at diglib.eg.org
Slides: PDF
Citation: BibTeX

Written by Attila Áfra

May 25, 2016 at 1:39 pm

Stackless Multi-BVH Traversal for CPU, MIC and GPU Ray Tracing


Attila T. Áfra and László Szirmay-Kalos
Computer Graphics Forum 33(1), February 2014


Stackless traversal algorithms for ray tracing acceleration structures require significantly less storage per ray than ordinary stack-based ones. This advantage is important for massively parallel rendering methods, where there are many rays in flight. On SIMD architectures, a commonly used acceleration structure is the multi bounding volume hierarchy (MBVH), which has multiple bounding boxes per node for improved parallelism. It scales to branching factors higher than two, for which, however, only stack-based traversal methods have been proposed so far.
In this paper, we introduce a novel stackless traversal algorithm for MBVHs with up to 4-way branching. Our approach replaces the stack with a small bitmask, supports dynamic ordered traversal, and has a low computation overhead. We also present efficient implementation techniques for recent CPU, MIC (Intel Xeon Phi), and GPU (NVIDIA Kepler) architectures.
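As a rough illustration of the bitmask idea (a toy sketch with invented names, not the paper's kernel): a 4-ary tree can be traversed without a stack by keeping one 4-bit nibble of "siblings still to visit" per depth level inside a single 64-bit mask. The real algorithm also intersects the ray with the child bounding boxes and orders children dynamically by hit distance; here we simply visit them left to right.

```cpp
#include <cstdint>
#include <vector>

struct Node {
    const Node* parent = nullptr;
    const Node* children[4] = {};
    int value = -1; // leaf payload for the demo
};

// Visits all leaves of a 4-ary tree using a bitmask instead of a stack.
std::vector<int> traverse(const Node* root) {
    std::vector<int> visited;
    if (!root->children[0]) { visited.push_back(root->value); return visited; }

    auto pendingBits = [](const Node* n) {
        uint64_t p = 0;
        for (int i = 0; i < 4; ++i)
            if (n->children[i]) p |= 1ull << i;
        return p;
    };

    const Node* node = root; // parent whose children are pending at `level`
    int level = 1;
    uint64_t mask = pendingBits(node) << (4 * level);

    while (level > 0) {
        uint64_t nib = (mask >> (4 * level)) & 0xF;
        if (nib == 0) {              // this level is exhausted: ascend ("pop")
            --level;
            node = node->parent;
            continue;
        }
        int i = 0;                   // next pending sibling (lowest set bit)
        while (!(nib & (1ull << i))) ++i;
        mask &= ~(1ull << (4 * level + i));
        const Node* child = node->children[i];
        if (!child->children[0]) {
            visited.push_back(child->value); // leaf
        } else {
            node = child;            // descend ("push" one nibble)
            ++level;
            mask |= pendingBits(node) << (4 * level);
        }
    }
    return visited;
}
```

The per-ray state is just a node pointer, a level counter, and one 64-bit word, which is what makes the approach attractive when many rays are in flight.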

Paper (personal copy): PDF
Paper (definitive version): available at wileyonlinelibrary.com and diglib.eg.org
Poster (EGSR 2014): PDF
Citation: BibTeX

If you have any questions, feel free to contact me!

Written by Attila Áfra

November 21, 2013 at 3:11 pm

Faster Incoherent Ray Traversal Using 8-Wide AVX Instructions


Attila T. Áfra
Technical Report


Efficiently tracing randomly distributed rays is a highly challenging problem on wide-SIMD processors. The MBVH (multi bounding volume hierarchy) is an acceleration structure specifically designed for incoherent ray tracing on processors with explicit SIMD architectures like the CPU. Existing MBVH traversal methods for CPUs target 4-wide SIMD architectures using the SSE instruction set. Recently, a new 8-wide SIMD instruction set called AVX has been introduced as an extension to SSE. Adapting a data-parallel algorithm to AVX can lead to significant, albeit not necessarily linear, speed improvements, but this is often not straightforward. In this paper we present an improved MBVH ray traversal algorithm optimized for AVX, which outperforms the state-of-the-art SSE-based method by up to 25%.
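To give a flavor of the data layout involved (an illustrative sketch, not the report's code): an 8-wide MBVH node stores its child boxes in structure-of-arrays form, so one ray can be slab-tested against all eight boxes in a single pass. It is written here as a plain scalar loop that a compiler can map to 8-wide AVX; the report itself uses intrinsics.

```cpp
#include <algorithm>
#include <cstdint>

struct MBVHNode8 {
    float minX[8], minY[8], minZ[8];
    float maxX[8], maxY[8], maxZ[8];
};

struct Ray {
    float orgX, orgY, orgZ;
    float invDirX, invDirY, invDirZ; // precomputed reciprocal direction
    float tNear, tFar;
};

// Returns a bitmask of the children whose boxes the ray intersects.
uint32_t intersectNode(const MBVHNode8& n, const Ray& r) {
    uint32_t hits = 0;
    for (int i = 0; i < 8; ++i) { // one iteration per SIMD lane
        float t0x = (n.minX[i] - r.orgX) * r.invDirX;
        float t1x = (n.maxX[i] - r.orgX) * r.invDirX;
        float t0y = (n.minY[i] - r.orgY) * r.invDirY;
        float t1y = (n.maxY[i] - r.orgY) * r.invDirY;
        float t0z = (n.minZ[i] - r.orgZ) * r.invDirZ;
        float t1z = (n.maxZ[i] - r.orgZ) * r.invDirZ;
        float tMin = std::max({std::min(t0x, t1x), std::min(t0y, t1y),
                               std::min(t0z, t1z), r.tNear});
        float tMax = std::min({std::max(t0x, t1x), std::max(t0y, t1y),
                               std::max(t0z, t1z), r.tFar});
        if (tMin <= tMax) hits |= 1u << i;
    }
    return hits;
}
```

With AVX, the six multiplies and the min/max reductions each become a single 8-wide instruction, which is where the speedup over 4-wide SSE traversal comes from.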

Paper: PDF
Citation: BibTeX

Any feedback is welcome!

Written by Attila Áfra

August 14, 2013 at 5:27 pm

Massive model ray tracing using an SSD


SSDs are simply amazing. They can make a huge difference in certain cases. One of these is massive model rendering.

VoxLOD, my out-of-core ray tracer, streams data from an SSD more than 8x faster than from two 7200 RPM HDDs in RAID! It runs pretty well with the HDDs too, but you can easily see some streaming artifacts if you move too fast (see the original video for the paper). With the SSD the streaming is almost completely seamless.

Here’s a video, recorded in real time at 720p, on a freshly booted (so nothing was precached) Intel Core i7-3770 system with an SSD. The model is the Boeing 777, which contains more than 300 million triangles and occupies about 30 GB of space on the SSD. You only need about 4-6 GB of RAM. This is an updated version of the renderer, which now uses the AVX instruction set.

Written by Attila Áfra

August 31, 2012 at 4:46 pm

Incoherent Ray Tracing without Acceleration Structures


At Eurographics 2012, I presented an efficient incoherent ray tracing method that does not use any acceleration structures, and is optimized for AVX. It is based on the algorithm introduced in the HPG 2011 poster Efficient Ray Tracing without Auxiliary Acceleration Data Structure by Alexander Keller and Carsten Wächter.

Attila T. Áfra
Eurographics 2012 Short Paper

Recently, a new family of dynamic ray tracing algorithms, called divide-and-conquer ray tracing, has been introduced. This approach partitions the primitives on-the-fly during ray traversal, which eliminates the need for an acceleration structure. We present a new ray traversal method based on this principle, which efficiently handles incoherent rays, and takes advantage of the SSE and AVX instruction sets of the CPU. Our algorithm offers notable performance improvements over similar existing solutions, and it is competitive with powerful static ray tracers.
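For intuition, here is a deliberately tiny 1-D sketch of the divide-and-conquer principle. All names and simplifications are mine: the actual method works on 3-D scenes, partitions many rays together with the primitives, and vectorizes the tests with SSE/AVX.

```cpp
#include <cfloat>
#include <vector>

struct Hit { float t = FLT_MAX; int prim = -1; };

// Primitives are points on a line; the ray starts at rayOrg and marches in +x.
// Space is partitioned on the fly; no acceleration structure is ever built.
void dacrt(const std::vector<float>& prims, std::vector<int> primIds,
           float rayOrg, float lo, float hi, Hit& hit) {
    if (rayOrg >= hi || primIds.empty()) return; // region behind the ray
    if (primIds.size() <= 2 || hi - lo < 1e-4f) {
        for (int id : primIds) {                 // leaf: brute-force test
            float t = prims[id] - rayOrg;
            if (t >= 0 && t < hit.t) { hit.t = t; hit.prim = id; }
        }
        return;
    }
    float mid = 0.5f * (lo + hi);                // split space at the midpoint
    std::vector<int> left, right;
    for (int id : primIds)
        (prims[id] < mid ? left : right).push_back(id);
    dacrt(prims, left, rayOrg, lo, mid, hit);    // near half first
    if (hit.t == FLT_MAX)                        // early out: a near hit beats
        dacrt(prims, right, rayOrg, mid, hi, hit); // anything in the far half
}
```

The ordered recursion with early termination is what makes the approach competitive: most rays never touch the far halves of the partition.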

Paper (personal copy): PDF
Paper (definitive version): available at diglib.eg.org
Slides: PPTX (with notes, video), PDF

Written by Attila Áfra

April 24, 2012 at 1:51 am

Interactive Ray Tracing of Large Models Using Voxel Hierarchies


Attila T. Áfra
Computer Graphics Forum 31(1)
Presented at Eurographics 2013

We propose an efficient approach for interactive visualization of massive models with CPU ray tracing. A voxel-based hierarchical level-of-detail (LOD) framework is employed to minimize rendering time and required system memory. In a preprocessing phase, a compressed out-of-core data structure is constructed, which contains the original primitives of the model and the LOD voxels, organized into a kd-tree. During rendering, data is loaded asynchronously to ensure a smooth inspection of the model regardless of the available I/O bandwidth. With our technique, we are able to explore data sets consisting of hundreds of millions of triangles in real-time on a desktop PC with a quad-core CPU.
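As a sketch of the kind of LOD decision such a framework makes (my own simplification, not the paper's exact criterion): the traversal stops at a voxel instead of descending to the original triangles once the node's projected screen size drops below roughly one pixel.

```cpp
#include <cmath>

// Projected size in pixels of a node of world-space extent `nodeSize` at
// distance `distance`, for a camera with vertical field of view `fovY`
// (radians) and a screen of `screenHeight` pixels.
float projectedPixels(float nodeSize, float distance,
                      float fovY, float screenHeight) {
    float pixelsPerUnit =
        screenHeight / (2.0f * distance * std::tan(0.5f * fovY));
    return nodeSize * pixelsPerUnit;
}

// Use the voxel LOD once the node would cover at most `threshold` pixels.
bool useVoxelLOD(float nodeSize, float distance,
                 float fovY, float screenHeight, float threshold = 1.0f) {
    return projectedPixels(nodeSize, distance, fovY, screenHeight) <= threshold;
}
```

This keeps the working set roughly proportional to the screen resolution rather than to the model size, which is what makes out-of-core streaming feasible.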

Paper (personal copy, high-res): PDF
Paper (definitive version): Wiley, Eurographics
EG2013 slides (with new videos): PPTX
Video (real-time): MP4, YouTube
Citation: BibTeX
Project page

The Boeing 777 data set was provided by and used with permission of The Boeing Company.

Written by Attila Áfra

January 31, 2012 at 8:48 pm

Improving BVH Ray Tracing Speed Using the AVX Instruction Set


Last week I attended Eurographics 2011 in Llandudno, UK, where I presented my poster entitled Improving BVH Ray Tracing Speed Using the AVX Instruction Set. If you are interested, you can download a personal copy of the paper and the poster. My work, along with all the other posters presented at the conference, can also be found at the Eurographics Digital Library.

Update: now you can download an interactive demo of the methods described in the paper!

The beautiful town of Llandudno in Wales

Written by Attila Áfra

April 19, 2011 at 1:22 pm

Doboz: compression library with very fast decompression


Another side project of mine is Doboz (Hungarian for ‘box’), a small LZ-based data compression library written in C++ with very high decompression speed and a compression ratio close to zlib’s. The decompression speed is typically between 700 and 1200 MB/s on an Intel Core i7-620M processor. However, compression is quite slow: about 2-3 MB/s. Both compression and decompression are memory safe.

There are many similar libraries (e.g., QuickLZ, FastLZ, LZO), but most of them were designed for both fast compression and decompression. Doboz is different: its compressor is really slow, but it can achieve better compression ratios and much higher decompression speeds than other libraries.

You can find more information (including detailed performance comparisons) about Doboz and grab the source code from here. I’ve made the source available under the zlib/libpng license. You can also download a package containing Windows binaries of a simple command-line compressor and a benchmark tool from here. The benchmark tool compares the performance of Doboz, QuickLZ, and zlib on a user-specified test file. I recommend using the 64-bit version because it’s slightly faster.

I took many ideas from QuickLZ and 7-zip (the dictionary), and combined them with my own tricks. Doboz is a really simple LZSS compressor that uses a relatively large, 2 MB dictionary and variable-length matches. Instead of decoding the matches with lots of branches, I use a small lookup table. This provides a nice performance boost.

static const struct {
	uint32_t mask; // the mask for the entire encoded match
	uint8_t offsetShift;
	uint8_t lengthMask;
	uint8_t lengthShift;
	int8_t size; // the size of the encoded match in bytes
} lut[] = {
	{0xff,        2,   0, 0, 1}, // (0)00
	{0xffff,      2,   0, 0, 2}, // (0)01
	{0xffff,      6,  15, 2, 2}, // (0)10
	{0xffffff,    8,  31, 3, 3}, // (0)11
	{0xff,        2,   0, 0, 1}, // (1)00 = (0)00
	{0xffff,      2,   0, 0, 2}, // (1)01 = (0)01
	{0xffff,      6,  15, 2, 2}, // (1)10 = (0)10
	{0xffffffff, 11, 255, 3, 4}, // 111
};

uint32_t word = fastRead(source, 4);
uint32_t i = word & 7;

match.offset = (word & lut[i].mask) >> lut[i].offsetShift;
match.length = ((word >> lut[i].lengthShift) & lut[i].lengthMask) + MIN_MATCH_LENGTH;

Another crucial part of the decompression algorithm is match copying. We must be careful with that because matches may overlap. Obviously, copying the bytes one by one is not the fastest method, but at least it works correctly for overlapping matches. Both QuickLZ and Doboz copy matches in chunks of 4 bytes, but they solve the overlap problem differently. Copying in chunks is valid as long as the distance between the source and destination pointers is not less than the chunk size. QuickLZ’s approach is to limit the minimum match offset to 2 and advance the pointers by only 3 bytes in every copy iteration. This means that, theoretically, 25% of the performance is wasted, even if the match is non-overlapping. In order to avoid this, I first check whether the match offset is less than the chunk size. If it is, I copy the first 3 bytes (the minimum match length) one by one and move back the source pointer a few bytes to increase the distance to the destination pointer. Then, I copy the rest of the match in chunks. This method works without limiting the minimum match offset.

int i = 0;

if (match.offset < 4) {
	do {
		fastWrite(outputIterator + i, fastRead(matchString + i, 1), 1);
		++i;
	} while (i < 3);
	matchString -= 2 + (match.offset & 1);
}

do {
	fastWrite(outputIterator + i, fastRead(matchString + i, 4), 4);
	i += 4;
} while (i < match.length);

outputIterator += match.length;

Doboz is memory safe, which means that it will never read or write beyond the specified buffers, even if the compressed data is corrupted. Of course, this slows down the decompression because of the additional buffer checks. I’ve managed to decrease the number of necessary checks by appending a few dummy bytes to the compressed data. This doesn’t really hurt the compression ratio, and it makes safe decompression so fast that it’s not worth disabling (it is only about 5-7% slower than unsafe decompression).

Writing this little library was quite fun and I’ve learned a lot during the process. I hope you’ll find it useful in some way. 🙂

Written by Attila Áfra

March 19, 2011 at 5:27 pm

Posted in Compression, Doboz

ICC color management in Media Player Classic Home Cinema


Display calibration and profiling are becoming more and more popular thanks to relatively cheap colorimeters and wide gamut monitors. It’s important to know that an application must support color management in order to display accurate colors.

While most of the popular graphics editors, viewers, and web browsers have this feature, I haven’t heard of any video players with ICC color management. So, I’ve decided to implement it in one of the most popular open source media players for Windows: Media Player Classic Home Cinema, or simply MPC-HC.

The problem with current color management systems (CMSs) is that they use the CPU. Even with lookup tables and SIMD optimizations, they are quite slow. Way too slow for real-time HD video playback. Fortunately, color management can be implemented very efficiently on the GPU.

The ideal solution would be to write a GPU-optimized CMS from scratch, but that’s a lot of work and I’m too busy right now with my ray tracing stuff. A much easier way is to build a 3D LUT with an existing CMS (I’ve opted for Little CMS), which you can sample in a little pixel shader to transform the pixels. 3D LUTs are frequently used in the film industry and are starting to get serious attention in the gaming industry too.

I’m using a 64x64x64 LUT with 16-bit-per-channel floating-point entries, which provides results virtually indistinguishable from those obtained directly with Little CMS. Trilinear interpolation is crucial, and it’s natively supported for this texture format by most (if not all) GPUs released in the past few years. I’ve also added dithering as a final pass to avoid introducing banding artifacts.

GPU-accelerated color management has been introduced in the latest stable version (1.4) of MPC-HC. If you experience any problems with it, you should download a recent SVN build. Make sure to select the EVR Custom (recommended for Vista and 7 users) or the VMR-9 (renderless) renderer in View -> Options -> Playback -> Output. Then you can enable color management in the View menu. For a detailed description of the renderer settings, please check out the wiki.

Written by Attila Áfra

September 20, 2010 at 10:36 am