‘Jump’ing in the Deep End: Diving Deep on Google’s VR

by Ryan Damm • May 29th, 2015

VR took another big Jump forward (sorry, had to) with Google’s announcement at Google I/O yesterday.  The Jump platform promises to completely reshape the VR content production landscape, from capture to display.  Understanding how and why requires really grokking Jump itself.  So based on screenshots and published specs, we’ll do our best.

First, Jump is a physical platform.  It’s a GoPro array, like you might get from 360 Heroes or any one of a dozen other GoPro-holder companies.  Google VP of Product Management and head of VR/AR Clay Bavor claimed the rig works with ‘off the shelf cameras,’ and the presentation implied it was camera agnostic, but practically speaking it’s a GoPro rig.  Google isn’t selling it, but they are releasing the CAD drawings so anyone can manufacture them.  Here’s a rendering, but they also had photos, so this isn’t vaporware:

Google couldn't afford that many GoPros, so they had to render this image

Comes with NSA software pre-installed on all 16 cameras.

And yes, they even made one out of cardboard ($8,000 worth of camera and 8 cents worth of cardboard, like a Ferrari engine in, well, a car like mine).  But before you go start a company manufacturing these things, know that GoPro is also going to sell this Jump array, supposedly by ‘this summer.’  GoPro has been pretty good at hitting published delivery dates, if not speculated ones, so this is probably accurate.  But you can still make your own if you want it in a different color, say, or a different material.  I can almost guarantee there will be a half-dozen carbon fiber versions available right away, for example, and plastic ones will be on Alibaba pretty much instantly.  Just keep in mind: any rig probably has to be built to tight tolerances, because the camera positions are important.  See below, where we discuss ‘calibrated camera arrays.’

At first glance, the rig is a little bigger than equivalent GoPro holders.  It holds 16 cameras, and there are no up-facing or down-facing cameras at all.  Based on the GoPro HERO4 Black’s specs, the rig captures a full 360 degrees laterally, but just over 120 degrees vertically.  So it will be missing about 30 degrees around each pole, directly above your head and at your feet.  (Probably not a big deal, I think, but maybe a small differentiator for one of the existing companies? Like Mao’s Great Leap, Google’s is going to create a few casualties.)  Also, Jump is arranged like a mono rig, where each camera is pointed in a different direction.  Most 3D GoPro rigs have two cameras pointed in each direction like this:
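For the curious, the pole arithmetic is easy to check. This is only a back-of-the-envelope sketch: it assumes the cameras are centered on the horizon and takes the roughly 120 degree vertical coverage at face value. The function names are mine, not anything from Google.

```python
import math

def polar_gap_degrees(vertical_fov_deg: float) -> float:
    """Uncovered angle around each pole for a ring of horizon-centered cameras."""
    return (180.0 - vertical_fov_deg) / 2.0

def missing_sphere_fraction(gap_deg: float) -> float:
    """Fraction of the full sphere that falls inside the two polar caps.

    One cap of angular radius g covers 2*pi*(1 - cos g) steradians out of
    4*pi, so the two caps together cover a fraction of 1 - cos g.
    """
    return 1.0 - math.cos(math.radians(gap_deg))

gap = polar_gap_degrees(120.0)           # assumed vertical coverage
print(gap, round(missing_sphere_fraction(gap), 3))
```

Under those assumptions the two blind caps add up to roughly 13 percent of the sphere, which sounds like more than "30 degrees" does, but it is all directly overhead and underfoot.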

This used to matter, yesterday

A 3D GoPro rig from 360 Heroes. It holds 12 GoPros, 10 in stereo pairs.  So 5 cameras per eye.

By having two cameras pointed in each direction, you get a different view for each eye, which gives you parallax and your 3D effect.  You also get seams, because the ‘left’ camera pointing in one direction is pretty far from the ‘left’ camera facing another direction.  There’s an entire micro industry built around painting together the seams left by that arrangement (and the smaller seams in mono 360, though that’s usually more tractable).

But this is where Google blows everything else out of the water: their rig is a 3D rig.  But the cameras aren’t dedicated left eye or right eye; each camera captures pixels that are used for both eyes.  Google gets their 3D effect the same way Jaunt does, by using machine vision to infer 3D positions, then remapping those 3D positions so there is no seam.  None.  (In theory.)  That’s a big deal, because now with $10k in semi-professional gear you can produce flawless 360 3D VR.  That’s amazing.  (It’s also not the whole story — I’ll discuss limitations shortly.)

Here’s how it works: because their software knows exactly where each camera is relative to the other, if they can find a patch of pixels that matches between two cameras, they can figure out where that point must be in 3D space in the real world.  Using that 3D data, they can bend and warp the footage to give you what’s known as a ‘stereo-orthogonal’ view: basically, each vertical strip of footage has exactly the correct parallax.  For normal rigs, that’s only the case with subject matter directly between two of the taking lenses, and it falls off with the cosine of the angle that the subject matter makes with the lens axis.  (In other words, the further you are off to the side, the less stereo depth you normally have.)  Now, this requires a ton of computation, but Google can do this pretty fast — probably faster than existing off-the-shelf stitchers — because they are dictating how and where you put the cameras.
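To make the "figure out where that point must be" step concrete, here is a minimal top-down (2D) sketch of triangulation: two cameras at known positions each see the same feature along a known bearing, and the point sits where the two rays cross. The camera positions, the bearings, and the function name are all made up for illustration; the real pipeline works in 3D and has to cope with noise, which this ignores.

```python
import math

def triangulate_2d(cam_a, bearing_a_deg, cam_b, bearing_b_deg):
    """Intersect two rays in the plane and return the crossing point.

    cam_a, cam_b: (x, y) camera positions, known from the rig calibration.
    bearing_*_deg: world-space direction from each camera to the feature.
    """
    ax, ay = cam_a
    bx, by = cam_b
    dax, day = math.cos(math.radians(bearing_a_deg)), math.sin(math.radians(bearing_a_deg))
    dbx, dby = math.cos(math.radians(bearing_b_deg)), math.sin(math.radians(bearing_b_deg))

    # Solve cam_a + t*da == cam_b + s*db for t using 2D cross products.
    denom = dax * dby - day * dbx
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point is effectively at infinity")
    t = ((bx - ax) * dby - (by - ay) * dbx) / denom
    return (ax + t * dax, ay + t * day)

# Two hypothetical cameras 20 cm apart, both looking at a point 1 m away.
left, right = (-0.1, 0.0), (0.1, 0.0)
bearing_left = math.degrees(math.atan2(1.0, 0.1))
bearing_right = math.degrees(math.atan2(1.0, -0.1))
print(triangulate_2d(left, bearing_left, right, bearing_right))
```

The important part is that the camera positions are inputs, not unknowns. With an uncalibrated rig you would have to estimate them from the footage before you could intersect anything.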

Here’s the deal, in depth (sorry again): when you shift from one camera in the array to another, the perspective shifts by a little bit.  To stitch the images (or build a 3D model), you find matching regions in the image from adjacent cameras.  These are called ‘point correspondences.’  So if you start with a few pixels from camera A, and you’re looking in camera B’s footage for a similar patch of pixels, you actually know where to start and which direction to start looking in.  This speeds up the search process massively, compared to an uncalibrated rig.  To be fair, existing stitching software like Videostitch or Kolor (conveniently part of GoPro now) probably makes some pretty good algorithmic assumptions about your camera rig, but the pure mathematical advantage of a calibrated rig is huge.
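Here is a toy version of that constrained search, as a sketch only. It assumes the two images have already been rectified so that a match can only slide sideways along the same row, and it scores candidates with a plain sum of absolute differences. I have no idea what matching cost Google actually uses; the function and the sample data are invented for illustration.

```python
def best_disparity(left_row, right_row, x, half_window=2, max_disparity=8):
    """Find the horizontal shift that best matches a patch between two rows.

    Compares the patch centered on left_row[x] with patches centered on
    right_row[x - d] for d = 0..max_disparity and returns the d with the
    lowest sum of absolute differences.
    """
    patch = left_row[x - half_window : x + half_window + 1]
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        start = x - d - half_window
        if start < 0:
            break  # candidate patch would fall off the edge of the image
        candidate = right_row[start : start + len(patch)]
        cost = sum(abs(a - b) for a, b in zip(patch, candidate))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# One row of brightness values with a bump at x=8; in the second camera
# the same bump shows up 3 pixels to the left.
left_row = [0, 0, 0, 0, 0, 0, 10, 50, 90, 50, 10, 0, 0, 0, 0, 0]
right_row = left_row[3:] + [0, 0, 0]
print(best_disparity(left_row, right_row, x=8))
```

The loop checks at most a handful of candidates on one row. Without calibration you would be comparing that patch against every position in the other image, which is the difference between a few comparisons and millions per pixel.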

So, once you find the matching patches, and you know how far you had to ‘travel’ in the source image to find them, you can compute the distance to the camera array for that region.  If the object is at infinity, it will be some small distance away in the two images (it’s not in the same region of the images because the cameras are pointed in slightly different directions).  As the subject moves closer to the camera, that distance will drop to zero and then start to get larger as the subject approaches the rig.  The exact distance in the images is fully dictated by the distance of the subject, so you can compute 3D positions with scary accuracy.
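The textbook version of "offset tells you distance" is simplest for two parallel, rectified cameras, so that is what this sketch assumes. In that idealized setup a subject at infinity has zero offset, rather than the small nonzero offset you get from cameras angled apart as on the real rig. The focal length and baseline below are placeholder numbers, not Jump specs.

```python
import math

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Distance to a point from its pixel offset between two parallel cameras.

    Uses the standard pinhole relation depth = focal * baseline / disparity.
    Zero (or negative) disparity is treated as a point at infinity.
    """
    if disparity_px <= 0:
        return math.inf
    return focal_px * baseline_m / disparity_px

# Placeholder values: 1000 px focal length, cameras 6 cm apart.
for d in (1, 5, 30):
    print(d, "px ->", round(depth_from_disparity(1000.0, 0.06, d), 2), "m")
```

Because depth goes as one over the offset, a one-pixel error matters a lot for distant subjects and very little for close ones, which is part of why the accuracy feels scary at close range.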

Getting' all mathy on you, yo

For a given point at X in the left view, the corresponding point must lie along a single epipolar line in the right view, indicated by the line Er-Xr. Obviously that helps when you’re looking for point correspondences: you only have to search a line instead of the entire image. (Image shamelessly cribbed from the excellent Wikipedia article on epipolar geometry.)

To be totally precise, you’re searching along epipolar lines in the second image.  (Because GoPro lenses distort heavily, the lines are actually arcs, unless they’re remapped in memory… but that diagram would totally break my head.)  Note that the hyper-efficient algorithms behind this are only a few years old; those interested in geeking out heavily should consider reading “Multiple View Geometry in Computer Vision” by Hartley and Zisserman, which covers both the specific case here and the general case of uncalibrated camera rigs.  Full disclosure: I’ve only read the first couple hundred pages, but I probably read each page ten times.  It’s dense, and you’ll spend some time trying to visualize things like ‘planes at infinity.’  Plus it’s heavy on linear algebra; enjoy.
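If you want a taste of that linear algebra without the book, the epipolar line for a pixel comes from multiplying the pixel (in homogeneous coordinates) by a 3x3 'fundamental matrix' that encodes how the two cameras sit relative to each other. The sketch below uses the simplest possible matrix, the one for an idealized, distortion-free pair of cameras offset purely sideways. That matrix is an assumption chosen for illustration; a real Jump pair would have a messier one, plus lens distortion on top.

```python
def epipolar_line(F, point):
    """Return (a, b, c) for the line a*u + b*v + c = 0 in the second image.

    F is a 3x3 fundamental matrix (nested lists); point is (x, y) in the
    first image. The line is F times the homogeneous point (x, y, 1).
    """
    h = (point[0], point[1], 1.0)
    return tuple(sum(F[r][c] * h[c] for c in range(3)) for r in range(3))

def on_line(line, point, tol=1e-9):
    """True if the (u, v) point in the second image lies on the line."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) <= tol

# Fundamental matrix for two identical cameras offset purely along x.
F_SIDE_BY_SIDE = [
    [0, 0, 0],
    [0, 0, -1],
    [0, 1, 0],
]

line = epipolar_line(F_SIDE_BY_SIDE, (120.0, 45.0))
print(line)                                  # the line v = 45
print(on_line(line, (97.0, 45.0)))           # same row: a legal match
print(on_line(line, (97.0, 46.0)))           # one row off: not a candidate
```

In this tidy case the epipolar line is just "the same row," which is why rectifying the images first makes the search so cheap.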

More about what this all means after the jump….

