Local Shading Coherence Extraction for SIMD-Efficient Path Tracing on CPUs

Attila T. Áfra, Carsten Benthin, Ingo Wald, Jacob Munkberg (Intel Corporation)
High-Performance Graphics 2016


Accelerating ray traversal on data-parallel hardware architectures has received widespread attention over the last few years, but much less research has focused on efficient shading for ray tracing. This is unfortunate, since for many applications shading is the single most time-consuming operation. To maximize rendering performance, it is therefore crucial to use the processor’s wide vector units effectively not only for the ray traversal step itself, but also during shading. This is non-trivial, as incoherent ray distributions cause control flow divergence, making high SIMD utilization difficult to maintain. In this paper, we propose a local shading coherence extraction algorithm for CPU-based path tracing that enables efficient SIMD shading. Each core independently traces and sorts small streams of rays that fit into the on-chip cache hierarchy, which allows coherent ray batches requiring similar shading operations to be extracted with very low overhead. We show that operating on small independent ray streams instead of a large global stream is sufficient to achieve high SIMD utilization in shading (90% on average) for complex scenes, while avoiding unnecessary memory traffic and synchronization. For a set of scenes with many different materials, our approach reduces the shading time by a factor of 1.9-3.4x compared to simple structure-of-arrays (SoA) based packet shading. The total rendering speedup ranges from 1.2x to 3x, depending also on the ratio of the traversal and shading times.
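To make the utilization argument concrete, here is a minimal sketch (not the implementation from the paper) that compares SIMD lane utilization when a small stream of ray hits is shaded in arrival order versus after grouping by material. The function names, the per-hit material ID list, and the cost model (one full-width SIMD pass per distinct material in a packet) are assumptions made for illustration only.

```python
from collections import Counter
from math import ceil

def packet_utilization(material_ids, width):
    """Utilization when hits are shaded in arrival order.

    Assumed cost model: each packet of `width` consecutive hits needs
    one full-width SIMD pass per distinct material it contains.
    """
    passes = 0
    for start in range(0, len(material_ids), width):
        packet = material_ids[start:start + width]
        passes += len(set(packet))
    return len(material_ids) / (passes * width)

def sorted_utilization(material_ids, width):
    """Utilization after grouping the stream by material.

    Each material is shaded in ceil(count / width) passes, so only the
    last pass of each material can have idle lanes.
    """
    counts = Counter(material_ids)
    passes = sum(ceil(n / width) for n in counts.values())
    return len(material_ids) / (passes * width)

# A tiny, maximally incoherent stream: 16 hits cycling through 4 materials.
stream = [0, 1, 2, 3] * 4
print(packet_utilization(stream, 4))  # 0.25
print(sorted_utilization(stream, 4))  # 1.0
```

In this toy model, grouping only needs to happen within a stream small enough to stay in cache, which is the property the abstract relies on; the real algorithm and its measured numbers are described in the paper itself.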

Paper (personal copy): PDF

Stackless Multi-BVH Traversal for CPU, MIC and GPU Ray Tracing

Attila T. Áfra and László Szirmay-Kalos
Computer Graphics Forum 33(1), February 2014


Stackless traversal algorithms for ray tracing acceleration structures require significantly less storage per ray than ordinary stack-based ones. This advantage is important for massively parallel rendering methods, where there are many rays in flight. On SIMD architectures, a commonly used acceleration structure is the multi bounding volume hierarchy (MBVH), which has multiple bounding boxes per node for improved parallelism. It scales to branching factors higher than two, for which, however, only stack-based traversal methods have been proposed so far.
In this paper, we introduce a novel stackless traversal algorithm for MBVHs with up to 4-way branching. Our approach replaces the stack with a small bitmask, supports dynamic ordered traversal, and has a low computation overhead. We also present efficient implementation techniques for recent CPU, MIC (Intel Xeon Phi), and GPU (NVIDIA Kepler) architectures.
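The general idea of replacing a traversal stack with a bitmask can be sketched as follows. This is a simplified illustration and not the algorithm from the paper: it assumes every node stores a parent pointer, visits children in a fixed order rather than a dynamic one, and uses a hypothetical `hit` predicate in place of a real ray-box test. The only per-ray state besides the current node is an integer holding 4 bits per tree level, which mark the siblings still waiting to be visited.

```python
class Node:
    """Tree node with up to 4 children and a parent pointer (assumed layout)."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def lowest_bit(mask):
    """Index of the lowest set bit of a non-zero mask."""
    return (mask & -mask).bit_length() - 1

def traverse(root, hit):
    """Visit all leaves reachable through nodes accepted by `hit`, without a stack."""
    visited = []
    if not hit(root):
        return visited
    node = root
    trail = 1  # sentinel bit; 4 bits are pushed below it per level
    while True:
        if node.children:
            mask = 0
            for i, child in enumerate(node.children):
                if hit(child):
                    mask |= 1 << i
            if mask:
                first = lowest_bit(mask)
                # Remember the other hit children as pending siblings.
                trail = (trail << 4) | (mask & ~(1 << first))
                node = node.children[first]
                continue
        else:
            visited.append(node.name)
        # Backtrack: go to the next pending sibling, or climb to the parent.
        while True:
            if trail == 1:
                return visited  # back at the root with nothing pending
            pending = trail & 0xF
            if pending:
                nxt = lowest_bit(pending)
                trail &= ~(1 << nxt)
                node = node.parent.children[nxt]
                break
            trail >>= 4
            node = node.parent

tree = Node("root", [
    Node("A", [Node("a1"), Node("a2"), Node("a3")]),
    Node("b"),
    Node("C", [Node("c1"), Node("c2"), Node("c3"), Node("c4")]),
])
print(traverse(tree, lambda n: True))
```

A Python integer grows without bound, so the sketch has no depth limit; a fixed-width register would cap the depth at the number of 4-bit groups it can hold.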

Paper (personal copy): PDF
Paper (definitive version): available at wileyonlinelibrary.com and diglib.eg.org
Poster (EGSR 2014): PDF
Citation: BibTeX

Faster Incoherent Ray Traversal Using 8-Wide AVX Instructions

Attila T. Áfra
Technical Report


Efficiently tracing randomly distributed rays is a highly challenging problem on wide-SIMD processors. The MBVH (multi bounding volume hierarchy) is an acceleration structure specifically designed for incoherent ray tracing on processors with explicit SIMD architectures like the CPU. Existing MBVH traversal methods for CPUs target 4-wide SIMD architectures using the SSE instruction set. Recently, a new 8-wide SIMD instruction set called AVX has been introduced as an extension to SSE. Adapting a data-parallel algorithm to AVX can lead to significant, albeit not necessarily linear, speed improvements, but this is often not straightforward. In this paper we present an improved MBVH ray traversal algorithm optimized for AVX, which outperforms the state-of-the-art SSE-based method by up to 25%.
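The core operation of MBVH traversal is testing one ray against all child boxes of a node at once. The sketch below shows the standard slab test over boxes stored in structure-of-arrays form, written as scalar Python loops where a SIMD implementation would process one lane per box. It is an illustration only, not the code from the report: the function name and data layout are assumptions, and the ray direction is assumed to have no zero components so that its reciprocal is finite.

```python
def intersect_boxes_soa(org, inv_dir, t_max, lo, hi):
    """Slab test of one ray against N boxes stored in SoA layout.

    org, inv_dir: 3-tuples (ray origin, reciprocal of ray direction).
    lo, hi: 3-tuples of lists; lo[axis][lane] is the minimum corner
            coordinate of box `lane` on that axis (likewise hi).
    Returns a bitmask with bit `lane` set if box `lane` is hit.
    """
    n = len(lo[0])
    t_near = [0.0] * n
    t_far = [t_max] * n
    for axis in range(3):
        for lane in range(n):  # one SIMD lane per box in a vector version
            t0 = (lo[axis][lane] - org[axis]) * inv_dir[axis]
            t1 = (hi[axis][lane] - org[axis]) * inv_dir[axis]
            t_near[lane] = max(t_near[lane], min(t0, t1))
            t_far[lane] = min(t_far[lane], max(t0, t1))
    mask = 0
    for lane in range(n):
        if t_near[lane] <= t_far[lane]:
            mask |= 1 << lane
    return mask

# Three boxes: in front of the ray, off to the side, and containing the origin.
lo = ([1.0, 1.0, -1.0], [1.0, -2.0, -1.0], [1.0, 1.0, -1.0])
hi = ([2.0, 2.0, 1.0], [2.0, -1.0, 1.0], [2.0, 2.0, 1.0])
print(bin(intersect_boxes_soa((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 100.0, lo, hi)))  # 0b101
```

With 8 boxes per node, the inner loop maps onto a single 8-wide vector operation per axis, which is where the wider registers pay off for a single incoherent ray.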

Paper: PDF
Citation: BibTeX

Massive model ray tracing using an SSD

SSDs are simply amazing. They can make a huge difference in certain cases. One of these is massive model rendering.

VoxLOD, my out-of-core ray tracer, streams data from an SSD more than 8x faster than from two 7200 RPM HDDs in RAID! It runs pretty well with the HDDs too, but you can easily see some streaming artifacts if you move too fast (see the original video for the paper). With the SSD the streaming is almost completely seamless.

Here’s a video which was recorded in real time at 720p on a freshly booted (so nothing was precached) Intel Core i7-3770 system with an SSD. The model is the Boeing 777, which contains more than 300 million triangles and occupies about 30 GB of space on the SSD. You only need about 4-6 GB of RAM. This is an updated version of the renderer, which now uses the AVX instruction set.

Incoherent Ray Tracing without Acceleration Structures

At Eurographics 2012, I presented an efficient incoherent ray tracing method that does not use any acceleration structures, and is optimized for AVX. It is based on the algorithm introduced in the HPG 2011 poster Efficient Ray Tracing without Auxiliary Acceleration Data Structure by Alexander Keller and Carsten Wächter.

Attila T. Áfra
Eurographics 2012 Short Paper

Recently, a new family of dynamic ray tracing algorithms, called divide-and-conquer ray tracing, has been introduced. This approach partitions the primitives on-the-fly during ray traversal, which eliminates the need for an acceleration structure. We present a new ray traversal method based on this principle, which efficiently handles incoherent rays, and takes advantage of the SSE and AVX instruction sets of the CPU. Our algorithm offers notable performance improvements over similar existing solutions, and it is competitive with powerful static ray tracers.
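The divide-and-conquer principle can be sketched in a few lines: instead of building a hierarchy up front, each recursive call filters the rays against the bounds of the current primitive set, splits the primitives, and recurses. The code below is a minimal illustration under several assumptions and is not the method from the paper: primitives are spheres given as (center, radius) pairs, the split is a simple object-median split along the widest axis, and there is no front-to-back ordering or early termination. All function names are hypothetical.

```python
import math

def ray_sphere(org, d, center, radius):
    """Smallest positive hit distance along the ray, or inf if there is none."""
    oc = [org[i] - center[i] for i in range(3)]
    a = sum(x * x for x in d)
    b = 2.0 * sum(oc[i] * d[i] for i in range(3))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return math.inf
    s = math.sqrt(disc)
    for t in ((-b - s) / (2.0 * a), (-b + s) / (2.0 * a)):
        if t > 0.0:
            return t
    return math.inf

def bounds(spheres):
    """Axis-aligned bounding box of a non-empty list of spheres."""
    lo = [min(c[i] - r for c, r in spheres) for i in range(3)]
    hi = [max(c[i] + r for c, r in spheres) for i in range(3)]
    return lo, hi

def ray_hits_box(org, d, lo, hi):
    """Slab test that also handles zero direction components."""
    t_near, t_far = 0.0, math.inf
    for i in range(3):
        if d[i] == 0.0:
            if not lo[i] <= org[i] <= hi[i]:
                return False
            continue
        t0 = (lo[i] - org[i]) / d[i]
        t1 = (hi[i] - org[i]) / d[i]
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    return t_near <= t_far

def dacrt(rays, spheres, ray_ids, nearest, leaf_size=2):
    """Recursively partition primitives while filtering the active rays."""
    lo, hi = bounds(spheres)
    active = [i for i in ray_ids if ray_hits_box(rays[i][0], rays[i][1], lo, hi)]
    if not active:
        return
    if len(spheres) <= leaf_size:
        for i in active:  # few primitives left: test them directly
            for center, radius in spheres:
                t = ray_sphere(rays[i][0], rays[i][1], center, radius)
                nearest[i] = min(nearest[i], t)
        return
    axis = max(range(3), key=lambda a: hi[a] - lo[a])
    ordered = sorted(spheres, key=lambda s: s[0][axis])
    mid = len(ordered) // 2
    dacrt(rays, ordered[:mid], active, nearest, leaf_size)
    dacrt(rays, ordered[mid:], active, nearest, leaf_size)

def trace(rays, spheres):
    """Nearest hit distance per ray; inf means the ray hits nothing."""
    nearest = [math.inf] * len(rays)
    if spheres:
        dacrt(rays, spheres, list(range(len(rays))), nearest)
    return nearest
```

Nothing is stored between calls: the partitioning exists only on the call stack while a batch of rays is being traced, which is what removes the need for a separate acceleration structure.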

Paper (personal copy): PDF
Paper (definitive version): available at diglib.eg.org
Slides: PPTX (with notes, video), PDF

Interactive Ray Tracing of Large Models Using Voxel Hierarchies

Attila T. Áfra
Computer Graphics Forum 31(1)
Presented at Eurographics 2013

We propose an efficient approach for interactive visualization of massive models with CPU ray tracing. A voxel-based hierarchical level-of-detail (LOD) framework is employed to minimize rendering time and required system memory. In a preprocessing phase, a compressed out-of-core data structure is constructed, which contains the original primitives of the model and the LOD voxels, organized into a kd-tree. During rendering, data is loaded asynchronously to ensure a smooth inspection of the model regardless of the available I/O bandwidth. With our technique, we are able to explore data sets consisting of hundreds of millions of triangles in real-time on a desktop PC with a quad-core CPU.

Paper (personal copy, high-res): PDF
Paper (definitive version): Wiley, Eurographics
EG2013 slides (with new videos): PPTX
Video (real-time): MP4, YouTube
Citation: BibTeX
Project page

The Boeing 777 data set was provided by and used with permission of The Boeing Company.

Improving BVH Ray Tracing Speed Using the AVX Instruction Set

Last week I attended Eurographics 2011 in Llandudno, UK, where I presented my poster entitled Improving BVH Ray Tracing Speed Using the AVX Instruction Set. If you are interested, you can download a personal copy of the paper and the poster. My work, along with all the other posters presented at the conference, can also be found in the Eurographics Digital Library.

Update: now you can download an interactive demo of the methods described in the paper!

The beautiful town of Llandudno in Wales

