GENERATING A 3D EFFECT FROM A STILL IMAGE
You must've seen Google Photos' new feature: the Cinematic Effect. It produces a result quite similar to the dolly zoom used in many movies: the effect you get when the camera physically moves away from the subject while simultaneously zooming in. We wondered how to create a similar effect from a still image.
DOLLY ZOOM EFFECT
Using advanced image- and video-editing tools, artists can enhance photographs with depth information and animate a virtual camera over the still scene, creating motion parallax. This cinematic style, known as the 3D Ken Burns effect, has grown in popularity in documentaries, advertising, and other forms of media.
Done by hand, the photo must be divided into segments according to depth, and those segments must be arranged correctly in a virtual 3D space. After that, the gaps that open up in the image need to be filled in for the final output.
Let’s explore how we can automatically generate the 3D Ken Burns effect from a single image in this blog.
What is the 3D Ken Burns Effect?
Slowly cropping and zooming into a still photo is a common technique in video production, known as the Ken Burns effect. To give images a more cinematic feel, we employ the 3D Ken Burns effect, which yields a result similar to a video shot between two different viewpoints, thereby adding parallax (a minimal sketch of such a camera path follows the list below). Other methods that produce this effect automatically require multiple input images from varying viewpoints. The main challenges in generating this effect from a single input image are:
- Depth Estimation: Estimating the depth from a single image so we can separate layers and map them to a point cloud.
- Novel View Synthesis: Generating the final image from a different viewpoint and inpainting the blank areas.
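To make the "two viewpoints" idea concrete, here is a minimal sketch of the virtual camera path behind the effect: the camera position and zoom are simply interpolated between a start and an end viewpoint, and each intermediate pose is later used to render one frame. The helper name and the start/end values are illustrative assumptions; the actual pipeline chooses its start and end views automatically.

```python
import numpy as np

# Sketch of a dolly-style camera path: linearly interpolate the camera
# position and zoom between two viewpoints. Values are illustrative only.
def camera_path(start_pos, end_pos, start_zoom, end_zoom, num_frames=75):
    positions = [
        (1 - t) * np.asarray(start_pos, dtype=float) + t * np.asarray(end_pos, dtype=float)
        for t in np.linspace(0.0, 1.0, num_frames)
    ]
    zooms = np.linspace(start_zoom, end_zoom, num_frames)
    return list(zip(positions, zooms))

# e.g. push the camera 0.3 scene units forward while easing the zoom back a little
path = camera_path(start_pos=(0.0, 0.0, 0.0), end_pos=(0.0, 0.0, 0.3),
                   start_zoom=1.0, end_zoom=0.9)
```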
Depth Estimation:
To synthesize the 3D Ken Burns effect, we first need to estimate the depth of the input image. We identified three issues when applying existing depth estimation techniques:
- These techniques have trouble understanding geometric relations (for example, the edges of a wall might be assigned inconsistent values, causing them to distort in 3D space).
- They may also assign different depth values within a single object, which can lead to parts of the same object being torn apart in 3D space.
- The depth is predicted at a low resolution, so depth boundaries may not line up accurately with object boundaries in the full-resolution image.
This pipeline uses a VGG-19 model to find the rough depth boundaries: it extracts information from the pool_4 layer of VGG-19, which lets the model preserve some information about the geometry of large-scale structures in the image.
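As a concrete illustration, the snippet below shows one way to grab pool_4 features from torchvision's pre-trained VGG-19 (the fourth max-pool sits at index 27 of `vgg19.features`, so slicing up to 28 includes it). How the pipeline actually consumes these features inside its depth estimator is more involved than this, so treat it purely as a sketch of the feature-extraction step.

```python
import torch
import torchvision

# Build a feature extractor that stops right after VGG-19's pool_4 layer.
# Requires a reasonably recent torchvision (for the `weights=` argument).
vgg19 = torchvision.models.vgg19(weights="DEFAULT")
pool4_extractor = torch.nn.Sequential(*list(vgg19.features.children())[:28]).eval()

with torch.no_grad():
    image = torch.rand(1, 3, 384, 384)        # stand-in for an ImageNet-normalised image
    pool4_features = pool4_extractor(image)   # (1, 512, 24, 24): coarse structural cues
```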
To avoid semantic distortions, we try to mimic what artists do when creating this effect manually:
Identify the object segments and approximate each object with a frontal plane positioned upright on the ground plane. We use instance-level segmentation masks extracted from Mask R-CNN, select the masks of important objects such as humans, cars, and animals, and adjust the estimated depth values so that each selected object lies at a consistent depth. Although this approach is not perfect, it is effective in producing realistic results in the majority of cases.
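A sketch of this semantic adjustment step is shown below: instance masks come from torchvision's pre-trained Mask R-CNN, and the depth inside each selected mask is flattened to its median so the object sits on a single plane. The COCO label subset, the score threshold, and the median rule are illustrative assumptions rather than the exact adjustment used in the paper.

```python
import torch
import torchvision

# Pre-trained Mask R-CNN for instance masks (recent torchvision).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
SALIENT = {1, 3, 17, 18}  # COCO ids for person, car, cat, dog (example choice)

def flatten_salient_objects(image, depth, score_thresh=0.7):
    """image: (3, H, W) float tensor in [0, 1]; depth: (H, W) depth tensor."""
    with torch.no_grad():
        pred = model([image])[0]
    depth = depth.clone()
    for mask, label, score in zip(pred["masks"], pred["labels"], pred["scores"]):
        if score < score_thresh or int(label) not in SALIENT:
            continue
        m = mask[0] > 0.5                 # soft mask -> boolean pixel mask
        if m.any():
            depth[m] = depth[m].median()  # push the whole object onto one plane
    return depth

adjusted = flatten_salient_objects(torch.rand(3, 240, 320), torch.rand(240, 320))
```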
Lastly, we feed this to a depth refinement network that upscales the depth predicted on the low-resolution image and ensures accurate depth boundaries.
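The refinement network itself is outside the scope of this post, but a toy version of the idea could look like the following: bilinearly upsample the coarse depth to full resolution and let a small CNN, guided by the full-resolution image, predict a residual correction that sharpens the depth around object boundaries. All layer sizes here are assumptions, and the real network is considerably more sophisticated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthRefinementSketch(nn.Module):
    """Toy refinement: upsample coarse depth, then predict a residual
    correction from the full-resolution image so depth edges follow image edges."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, image, coarse_depth):
        up = F.interpolate(coarse_depth, size=image.shape[-2:],
                           mode='bilinear', align_corners=False)
        return up + self.net(torch.cat([image, up], dim=1))

refine = DepthRefinementSketch()
refined = refine(torch.rand(1, 3, 512, 768),   # full-resolution input image
                 torch.rand(1, 1, 128, 192))   # coarse, low-resolution depth
# refined has shape (1, 1, 512, 768)
```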
Novel View Synthesis:
To synthesize the 3D Ken Burns effect from the estimated depth, this method maps the input image to points in a point cloud. This point cloud, however, captures only the part of the geometry visible from the input viewpoint. Therefore, renders from new viewpoints are incomplete, with holes that require inpainting. We could perform content-aware inpainting on each video frame separately, but that is computationally expensive and lacks consistency over time.
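A rough sketch of the unprojection step follows: each pixel is lifted into 3D along its camera ray using the estimated depth and a pinhole camera model. The focal length below is an illustrative guess; in practice the pipeline also has to estimate or assume the camera intrinsics.

```python
import numpy as np

def unproject_to_point_cloud(image, depth, focal_length):
    """image: (H, W, 3) array, depth: (H, W) array in scene units."""
    h, w = depth.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0         # assume the principal point is the centre
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / focal_length           # pinhole back-projection
    y = (v - cy) * depth / focal_length
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)  # one 3D point per pixel
    colors = image.reshape(-1, image.shape[-1])
    return points, colors

points, colors = unproject_to_point_cloud(
    np.random.rand(480, 640, 3), np.random.rand(480, 640) + 1.0, focal_length=550.0)
# Re-rendering these points from a shifted camera leaves holes at disocclusions,
# which is what the inpainting network below has to fill in.
```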
A GridNet architecture is employed for the inpainting network due to its ability to learn how to combine representations at multiple scales. Specifically, a grid with four rows and four columns and per-row channel sizes of 32, 64, 128, and 256 is used. It accepts the colour, depth, and context of the incomplete novel-view rendering and returns the inpainted colour and depth, which allows us to fill the holes in the point cloud.
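The sketch below captures the spirit of that network rather than its exact architecture: four scales with 32/64/128/256 channels, an input that concatenates the colour, depth, and a context feature map of the incomplete render, and an output holding the inpainted colour and depth. The real GridNet contains a full grid of lateral, downsampling, and upsampling blocks; the 64-channel context input and the residual wiring here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.PReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.PReLU())

class GridNetSketch(nn.Module):
    """Simplified stand-in for the inpainting network: 4 scales (rows) with
    32/64/128/256 channels; features flow down in resolution and back up."""
    def __init__(self, in_ch=3 + 1 + 64, out_ch=3 + 1, chans=(32, 64, 128, 256)):
        super().__init__()
        self.inp = block(in_ch, chans[0])
        self.down = nn.ModuleList(block(chans[i], chans[i + 1]) for i in range(3))
        self.lat = nn.ModuleList(block(c, c) for c in chans)   # one lateral column per row
        self.up = nn.ModuleList(block(chans[i + 1], chans[i]) for i in range(3))
        self.out = nn.Conv2d(chans[0], out_ch, 3, padding=1)

    def forward(self, x):
        rows = [self.inp(x)]
        for d in self.down:                                    # move down the grid
            rows.append(d(F.avg_pool2d(rows[-1], 2)))
        rows = [lat(r) + r for lat, r in zip(self.lat, rows)]  # lateral refinement
        y = rows[-1]
        for i in range(2, -1, -1):                             # move back up the grid
            y = rows[i] + self.up[i](
                F.interpolate(y, scale_factor=2, mode='bilinear', align_corners=False))
        return self.out(y)

# Warped colour/depth and a context feature map come from rendering the point
# cloud into the novel view; the 64 context channels are an assumption.
net = GridNetSketch()
x = torch.cat([torch.rand(1, 3, 256, 256),    # colour of the incomplete render
               torch.rand(1, 1, 256, 256),    # depth of the incomplete render
               torch.rand(1, 64, 256, 256)],  # context features
              dim=1)
colour_and_depth = net(x)                     # (1, 4, 256, 256): inpainted colour + depth
```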
Check out the full code on this repo:
https://github.com/sniklaus/3d-ken-burns
Implement it yourself!
https://replicate.com/sniklaus/3d-ken-burns
LIMITATIONS:
Estimating the depth of a single image is extremely difficult, and the depth estimation network used here is not perfect. Although we observe little to no distortion in most cases, in some situations, such as those involving reflective surfaces or thin structures, the predicted depth maps may be inaccurate. This can be observed in the top-right part of the image above, where some of the leaves end up on a different plane than the tree.
Although the combined colour and depth inpainting is an excellent approach for extending the predicted scene geometry, it has only been trained on synthetic data, so inaccuracies appear whenever the input image is not similar to that training data. This could be improved by training on a larger dataset of real images.
CONCLUSION
In this article, we explored how to generate a 3D parallax effect from a single image. The problem was split into two sub-tasks: estimating the depth of the scene and completing the scene geometry using inpainting. This approach enables users to achieve cinema-quality results with little effort. We hope this brief article gave you an idea of how the technique works.