Filmmaker Josh Gladstone has recently begun working with light field video, an incredible technology that allows him to capture volumetric video with a single camera rig and produce content viewable in virtual reality (VR) and on Looking Glass volumetric displays.
This isn’t Gladstone’s first foray into novel video technology and volumetric content. Nearly a year ago, he published a video about multiplane video, or volumetric video, and how he used machine learning to improve his workflow.
In the video above, Gladstone discusses breakthrough research by Google that created incredible volumetric video results but required a massive rig with many cameras and an obscene amount of computing time — 28.5 CPU hours per frame of video.
Gladstone has been developing ways to create incredible volumetric video with less equipment and less computation, and a new project scales the gear down to five GoPro Hero8 Black cameras mounted in a frame that is primarily 3D-printed.
Gladstone says that the specific camera model doesn’t matter. “The software is camera agnostic, so there’s nothing special about the GoPro cameras other than that they’re portable. I’m also downsampling to 1080p in order to run the neural network, so larger cameras might be overkill. But sharpness is a factor, so it definitely is something I’m interested in testing with other cameras and lenses,” he explains.
Gladstone’s software is a custom pipeline that he wrote based on open-source projects. The software “takes the images from each camera, computes their camera poses, and then uses AI to render the Layered Depth Images (LDI). This LDI consists of 8 layers of RGB + Alpha + Depth. It’s similar to the Multiplane Images implementation. It stretches each layer into the z-axis using per-layer depth information. In the multiplane implementation, I was using 32 layers of flat images. This layered depth implementation is eight layers, so it’s more efficient,” he explains to PetaPixel.
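To make the LDI idea concrete, here is a minimal sketch of what an eight-layer LDI might look like in memory. The array names, shapes, and dtypes are hypothetical illustrations based on Gladstone’s description, not his actual code.

```python
import numpy as np

# Hypothetical sketch of a Layered Depth Image (LDI) as described above:
# 8 layers, each carrying RGB color, an alpha mask, and per-pixel depth.
NUM_LAYERS = 8
HEIGHT, WIDTH = 1080, 1920  # footage is downsampled to 1080p before inference

color = np.zeros((NUM_LAYERS, HEIGHT, WIDTH, 3), dtype=np.float32)  # RGB
alpha = np.zeros((NUM_LAYERS, HEIGHT, WIDTH), dtype=np.float32)     # transparency
depth = np.zeros((NUM_LAYERS, HEIGHT, WIDTH), dtype=np.float32)     # per-pixel depth

# Unlike a 32-layer multiplane image, where each layer is a flat plane at a
# fixed distance, every pixel in an LDI layer carries its own depth value,
# so eight layers can represent the same scene more efficiently.
```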
The LDIs are “arranged into a grid with the color images on top, and the combined depth and alpha on the bottom,” Gladstone adds. The video below shows the rendering process.
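One plausible way such a grid could be packed into a single video frame is sketched below, with the color tiles across the top half and the combined depth and alpha across the bottom. This is only a guess at a workable atlas layout based on Gladstone’s description, not his exact format.

```python
import numpy as np

def pack_ldi_frame(color, alpha, depth):
    """Pack an LDI into one image: color tiles on top, depth+alpha tiles below.

    color: (L, H, W, 3) in [0, 1]; alpha, depth: (L, H, W) in [0, 1].
    Hypothetical layout for illustration only.
    """
    layers, h, w, _ = color.shape
    frame = np.zeros((2 * h, layers * w, 3), dtype=np.float32)
    for i in range(layers):
        frame[:h, i * w:(i + 1) * w] = color[i]
        # Store depth and alpha in color channels so they survive video encoding.
        frame[h:, i * w:(i + 1) * w, 0] = depth[i]
        frame[h:, i * w:(i + 1) * w, 1] = alpha[i]
    return frame
```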
PetaPixel asked Gladstone how important AI is to the rendering process and if it’s something that could be done manually.
“I don’t think it’s something that could be done manually. Or at least I wouldn’t want to. The nice thing is that as long as it gets a good camera pose solve, it just kind of goes on its own. It takes about 15 seconds per frame on my 3090 graphics card,” he explains.
The process requires forward-facing inputs, so there is no benefit to capturing footage from additional perspectives, such as from the side. Adding more forward-facing cameras to the rig, however, does offer plenty of benefits, and Gladstone says he is currently dialing in the optimal number of cameras and testing to determine the best spacing between them.
While volumetric video demands more computation simply because it comprises many frames, there is nothing inherently more challenging about it than a volumetric photo.
“Each frame is rendered independently, so video footage doesn’t complicate anything for the neural network. Of course, on the flip side of that, it also means that it’s not taking advantage of the information from other frames,” Gladstone says.
Capturing and rendering volumetric video is one thing; playing it back is something else entirely. While it is possible to get a sense of volumetric video on a flat screen, Gladstone’s project is best viewed in a VR headset or on a Looking Glass display.
Gladstone says playback is delivered through Unity, and the final file is an MP4. The Unity project uses custom shaders to decode the volumetric video layers and project them into three-dimensional space.
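The actual decoding happens per-fragment in Unity shaders, but the underlying math can be sketched conceptually in Python: unpack each layer tile from the packed frame, read depth and alpha back out of the lower half, and back-project each visible pixel into 3D along its camera ray. The layout, channel assignments, and pinhole intrinsics here are assumptions carried over from the packing sketch above, not Gladstone’s shader code.

```python
import numpy as np

def unpack_and_project(frame, layers, fx, fy, cx, cy, near=0.3, far=100.0):
    """Conceptual decode of a packed LDI video frame back into 3D points.

    frame: (2H, L*W, 3) image laid out as in the packing sketch above.
    fx, fy, cx, cy: hypothetical pinhole intrinsics of the virtual camera.
    """
    h2, lw, _ = frame.shape
    h, w = h2 // 2, lw // layers
    ys, xs = np.mgrid[0:h, 0:w]
    points, colors = [], []
    for i in range(layers):
        tile_rgb = frame[:h, i * w:(i + 1) * w]
        tile_aux = frame[h:, i * w:(i + 1) * w]
        depth = near + tile_aux[..., 0] * (far - near)  # red channel -> depth
        alpha = tile_aux[..., 1]                        # green channel -> alpha
        mask = alpha > 0.01
        # Back-project each visible pixel along its camera ray to its depth.
        z = depth[mask]
        x = (xs[mask] - cx) / fx * z
        y = (ys[mask] - cy) / fy * z
        points.append(np.stack([x, y, z], axis=-1))
        colors.append(tile_rgb[mask])
    return np.concatenate(points), np.concatenate(colors)
```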
Josh Gladstone is doing incredible work in many high-tech video segments, including light field video. His work is available on his website, YouTube, and Instagram.
Image credits: Josh Gladstone