As I found myself looking into the enormous glowing eye of Smaug, whose head, looming just a few feet in front of me, appeared to be about the size of a Mack truck, the furthest thing from my mind was how many polygons and pixel shaders the developers could squeeze into this demo while maintaining 70 frames per second with vsync enabled.
As the public learns to expect a certain level of performance and comfort in any given VR experience, developers often find themselves in the stressful position of figuring out which tricks, shortcuts, illusions, loopholes and trade-offs they need to use to meet both the expectations of the end user and the performance standards of VR best practices. In many ways, developers find themselves in shoes similar to those of an indie filmmaker trying to make a studio-quality picture. How can they wow the audience with a severely limited budget? The budget, in this case, is not money but the finite resources of the CPU and GPU.
The artists and engineers at Epic Games and WETA Digital, including Alasdair Coull, head of R&D at WETA; Daniel Smith, software developer at WETA; Nick Donaldson, lead designer at Epic Games; and Tim Elek, senior VFX artist at Epic Games, faced these problems head-on in developing “A Thief in the Shadows,” a large-scale, high-fidelity VR demo experienced from the perspective of Bilbo Baggins just as he disturbs Smaug from his slumber.
The problems were manifold. Their mission was to replicate the scene from the movie as faithfully as possible, despite the fact that rendering it in VR required major changes to basic scene composition, visual tonality and perspective. They had to maintain a narrative flow, with key audio and visual cues contributing to the dramatic build-up, while accommodating the viewer’s freedom to look wherever they pleased. And they had to take all of the stunning visual assets of the film and rebuild them from the ground up to run on a high-end home PC, keeping the loss in visual fidelity to a minimum.
Certain key lessons of successfully transplanting a traditional audio/visual storytelling experience into VR are being learned and re-learned by industry professionals in real time. Trial and error is the norm. At first, the devs had Bilbo conversing with Smaug, as he does in the film. They soon found that having the viewer inhabit the body of a character that moves or speaks independently of the viewer’s volition causes immediate immersion-breaking detachment. Someone else is speaking and moving, and the viewer is just along for the ride like an ethereal passenger in another character’s body. An interesting concept, but one that inhibits the participant from feeling a sense of self in the virtual world, which is important in establishing the ever-coveted feeling of bodily Presence.
Another, perhaps more obvious lesson was that traditional 2D filming techniques, including basic principles of scene and shot composition, all go out the window. Cinematic flourishes we’re all used to seeing in theaters, like sweeping establishing shots and quick cuts from close-ups to wide shots to establish scale and surroundings, are disorienting and, more critically, simply don’t make sense for an experience locked to a first-person perspective. On film, a sense of scale is best communicated through a shot taken from a great distance. In VR, the opposite is true. On film, a sense of movement and dynamic action is often heightened by rapid changes in viewing perspective. In VR, the best experience usually requires a single fixed perspective no matter what is going on in the environment.
Once all of the narrative and storytelling issues are worked out, the task of building something visually compelling on a limited computational budget begins. Fans of the source material have probably already seen a very expensive, very high-quality rendering of the main visual attraction. How do you meet those expectations at sub-20-millisecond latency, at 70 Hz, in a game engine, rendered in stereo in real time?
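To get a feel for what that budget actually means, consider a rough back-of-the-envelope calculation. The sketch below is purely illustrative, not Epic’s or WETA’s code; it simply divides the refresh rate into a per-frame time slice:

```cpp
#include <cstdio>

int main() {
    // Illustrative numbers only: the article cites a ~70 Hz target with
    // sub-20 ms motion-to-photon latency; exact headset specs vary.
    const double refresh_hz      = 70.0;
    const double frame_budget_ms = 1000.0 / refresh_hz;  // ~14.3 ms per frame
    const int    views_per_frame = 2;                     // stereo: one render per eye

    // Animation, physics, culling, lighting and two full renders all have to
    // fit inside this window every single frame, or the headset drops a frame.
    std::printf("Total frame budget: %.1f ms\n", frame_budget_ms);
    std::printf("Rough per-eye budget if the GPU splits the work evenly: %.1f ms\n",
                frame_budget_ms / views_per_frame);
    return 0;
}
```

At 70 Hz the engine gets roughly 14 milliseconds to simulate the scene and render it twice, once per eye. Miss that window and the headset skips a frame, and the viewer feels it immediately.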
According to the folks from WETA, the work started with rebuilding every asset at a greatly reduced resolution. Then they had to address screen real estate. Apparently the average number of frames of CGI animation in any given scene from a traditional blockbuster film is around 300, roughly twelve seconds at 24 frames per second. In VR, because the perspective never cuts away, the average tops 5,000 frames. For their five-minute demo, the time spent processing all of the graphics assets, the “bake” time, was over three days for every iteration.
Real-time CG rendering has always required decisions to be made about trade-offs between quality and performance. The high minimum performance standards of VR raise the stakes of those decisions and put pressure on developers to squeeze every drop of performance from the CPU and GPU. They also reward coding shortcuts that create the illusion of visual complexity without actually taxing the GPU with complex computations. Dynamic lighting and shadows, for example, use an outsize amount of processing power, leaving that much less to render everything else. By simplifying the lighting and shadow models, the programmers at Epic and WETA were able to save vital resources without sacrificing too much of the demo’s visual quality.
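To make that trade-off concrete, here is a deliberately tiny sketch; it is not Epic’s or WETA’s code, and the scene, the light, and the Surfel type are invented for illustration. The “baked” path evaluates the lighting once up front and stores the result, while the “dynamic” path repeats the same work every frame:

```cpp
#include <array>
#include <cstdio>

// Hypothetical stand-in for a point on a surface with a normal.
struct Surfel { float nx, ny, nz; };

// Simple Lambertian term for one directional light.
float Lambert(const Surfel& s, float lx, float ly, float lz) {
    float d = s.nx * lx + s.ny * ly + s.nz * lz;
    return d > 0.0f ? d : 0.0f;
}

int main() {
    // Pretend scene: a handful of surface points and one fixed light direction.
    std::array<Surfel, 4> scene = {{{0, 1, 0}, {1, 0, 0}, {0, 0, 1}, {0.7f, 0.7f, 0}}};
    const float lx = 0.0f, ly = 1.0f, lz = 0.0f;

    // Baked approach: because the light never moves, evaluate it once up front
    // (the multi-day "bake") and just look the result up at runtime.
    std::array<float, 4> lightmap{};
    for (size_t i = 0; i < scene.size(); ++i)
        lightmap[i] = Lambert(scene[i], lx, ly, lz);

    // Dynamic approach: re-evaluate the same math for every surface point,
    // every frame. With real shadow maps and many lights, this per-frame cost
    // is exactly what eats into the VR frame budget.
    for (int frame = 0; frame < 3; ++frame)
        for (const Surfel& s : scene)
            (void)Lambert(s, lx, ly, lz);  // work repeated each frame

    std::printf("Baked value at point 0: %.2f\n", lightmap[0]);
    return 0;
}
```

In a real engine the per-frame work also includes shadow maps and many more lights, which is why moving even part of the lighting into a precomputed bake frees up so much of the frame budget for everything else.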
The makers of “A Thief in the Shadows” were operating under specific requirements that made their job harder than it might have been otherwise. In sharing the experience of their attempt to translate a super-high-end, pre-rendered piece of traditional media into an equally compelling VR experience, they provide other development teams with an educational service that pushes the medium forward.