Neuralangelo: Where Code becomes Canvas

13 min read · Sep 5, 2024

Introduction:

Last year, NVIDIA published a paper promising that we would be able to take simple pictures or images we already have and transform them into highly accurate, detailed 3D models. NVIDIA has now delivered on that promise.

In the rapidly evolving field of computer vision and 3D modeling, the ability to accurately reconstruct high-fidelity 3D surfaces from simple RGB images has remained a significant challenge. Traditional methods often struggle with balancing detail, accuracy, and computational efficiency, leading to compromises in the final output.

Enter Neuralangelo — a groundbreaking approach that redefines what’s possible in neural surface reconstruction. Building upon Instant NeRF, this new approach pushes the boundaries of visual fidelity and realism.

Instant NeRF, introduced a year ago, impressed everyone with its ability to transform images into stunning 3D scenes in seconds instead of hours. It improved on the quality of the best NeRF models while being extremely efficient. However, the results fell short of the crispness and intricate structures found in the real world. Now, with Neuralangelo, NVIDIA has taken up the challenge of surpassing these limitations.

Neuralangelo is a new AI model by NVIDIA Research for 3D reconstruction using neural networks, which turns 2D video clips into detailed 3D structures — generating lifelike virtual replicas of buildings, sculptures, and other real-world objects. Using a 2D video of an object or scene filmed from various angles, the model selects several frames that capture different viewpoints — like an artist considering a subject from multiple sides to get a sense of depth, size, and shape. Once the camera position of each frame is determined, Neuralangelo’s AI creates a rough 3D representation of the scene, as if a sculptor is starting to chisel the subject’s shape.

In this blog, we’ll delve into the innovative mechanisms behind Neuralangelo, explore its implications for fields like virtual reality, gaming, and digital content creation, and examine how it is set to transform the future of 3D modeling.

Neuralangelo: What it’s about

Neuralangelo's name is derived from the name of the great Italian sculptor and painter Michelangelo, who created stunning, life-like visions from blocks of marble.

Neuralangelo generates 3D structures with intricate details and textures. It promises that we will be able to take a simple picture or images that we already have and transform them into highly accurate and detailed 3D models. Creative professionals can then import these 3D objects into design applications, editing them further for use in art, video game development, robotics, and industrial digital twins.

The aim of 3D image reconstruction is to recover the structure of a scene from multiple images taken from different viewpoints. The generated surfaces are used for 3D modeling in augmented, virtual, and mixed reality. Classically, multi-view stereo algorithms have been the method of choice for sparse 3D reconstruction. However, an inherent drawback of these algorithms is their inability to handle ambiguous observations, e.g., regions with large areas of homogeneous colors, repetitive texture patterns, or strong color variations. This results in inaccurate reconstructions with noisy or missing surfaces.

Recently, neural surface reconstruction methods have shown great potential in addressing these limitations. This new class of methods uses coordinate-based multi-layer perceptrons (MLPs) to represent the scene as an implicit function, such as occupancy fields or signed distance functions (SDF). Leveraging the inherent continuity of MLPs and neural volume rendering, these techniques allow the optimized surfaces to meaningfully interpolate between spatial locations, resulting in smooth and complete surface representations. Neuralangelo is the latest model in this line of work. It adopts Instant NGP as a neural SDF representation of the underlying 3D scene, optimized from multi-view image observations via neural surface rendering.

Neuralangelo’s ability to translate the textures of complex materials (including roof shingles, panes of glass, and smooth marble) from 2D videos to 3D assets significantly surpasses prior methods. This high fidelity makes it easier for developers and creative professionals to rapidly create usable virtual objects for their projects from footage captured on a smartphone, rather than modeling them by hand in computer-aided design software like AutoCAD or Fusion 360.

Related Works

Before getting into how Neuralangelo itself works, let’s first look at the earlier work it builds on and draws inspiration from.

Multi-view surface reconstruction: Early image-based photogrammetry techniques use a volumetric occupancy grid to represent the scene. The basic idea of the occupancy grid is to represent a map of the environment as an evenly spaced field of binary random variables, each representing the presence of an obstacle at that location in the environment. A voxel is marked as occupied when the image pixels it projects to agree in color across views. However, this photometric consistency assumption typically fails due to auto-exposure or non-Lambertian materials, which are aplenty in nature. Relaxing such color constraints is important for realistic 3D reconstruction.

Neural Radiance Fields (NeRF): NeRF is very adept at photorealistic view synthesis. It encodes a 3D scene with an MLP that maps 3D spatial locations to color and volume density. These predictions are composited into pixel colors using neural volume rendering. However, NeRF also has drawbacks. It is unclear how an isosurface of the volume density should be defined to represent the underlying 3D geometry. Current practice often relies on heuristic thresholding of the density values, but the resulting surfaces are often noisy and may not model the scene structures accurately. Therefore, methods that model surfaces directly are preferred.

Neural surface reconstruction: For well-defined representations of 3D surfaces, implicit functions such as occupancy grids or SDFs are preferred over simple volume density fields. For easy integration with neural volume rendering, various techniques have been proposed to reparametrize these representations back to volume density. These designs enable more accurate surface prediction.

How It Works:

Now that we have discussed the related work that Neuralangelo builds on, let us get into the nitty-gritty and understand how the model itself works. The model is complex and has several layers, so we will go through each of them in depth, one by one.

Preliminaries:

So, for Neuralangelo to create these 3D models from pictures, we need to teach a neural network to understand how light behaves as it passes through a 3D object. Neuralangelo uses neural volume rendering. We train it with examples of the object from different angles. It uses a continuous function that maps 3D spatial locations. So, any 3D space will be broken down to many points in space, and it will associate each point in this 3D space with a particular color and density value so that when we render the object, it looks exactly like the original. The neural network is trained to learn this function from inputs like images or even videos. Once this function has been learned by the neural network, it can then be used to generate new images of the same object but from new points of view or under different lighting conditions.

Neural volume rendering: NeRF models a 3D scene as fields of volume density and color. With a specified camera position and ray direction, the volume rendering process combines the color radiance from sampled points along the ray’s path. The i-th sampled 3D position x_i is at a distance t_i from the camera center. The volume density σ_i and color c_i of each sampled point are predicted using a coordinate MLP. For a ray with camera center o and direction d, the rendered color of a given pixel is approximated as the Riemann sum over the N samples:

$$\hat{c}(o, d) = \sum_{i=1}^{N} w_i c_i, \quad \text{where } w_i = T_i \alpha_i \tag{1}$$

Here, α_i = 1 − exp(−σ_i δ_i) is the opacity of the i-th ray segment, δ_i = t_{i+1} − t_i is the distance between adjacent samples, and T_i = Π_{j=1}^{i−1} (1 − α_j) is the accumulated transmittance, indicating the fraction of light that reaches the camera. To supervise the network, a color loss is used between input images c and rendered images ĉ:

$$\mathcal{L}_{RGB} = \lVert \hat{c} - c \rVert_1$$

However, surfaces are not clearly defined using even this density formulation.
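The compositing step can be sketched in a few lines of plain Python. This is a minimal illustration for a single ray, not NVIDIA's implementation: the function names and the toy densities and colors are made up for this example, and the color loss is assumed to be an L1 difference.

```python
import math

def composite_ray(sigmas, colors, ts):
    """Riemann-sum volume rendering along one ray.

    sigmas: densities sigma_i at the samples
    colors: RGB tuples c_i at the samples
    ts:     sample distances t_i (one more entry than sigmas)
    """
    transmittance = 1.0  # T_1 = 1: nothing blocks the first sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        delta = ts[i + 1] - ts[i]               # delta_i = t_{i+1} - t_i
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of segment i
        weight = transmittance * alpha          # w_i = T_i * alpha_i
        weights.append(weight)
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha            # T_{i+1} = T_i * (1 - alpha_i)
    return pixel, weights

def l1_color_loss(rendered, target):
    """L1 difference between a rendered pixel and the ground-truth pixel."""
    return sum(abs(r - c) for r, c in zip(rendered, target))

# Toy ray: empty space, then a dense green region, then a red region behind it.
sigmas = [0.0, 50.0, 50.0]
colors = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
ts = [0.0, 1.0, 2.0, 3.0]
pixel, weights = composite_ray(sigmas, colors, ts)
loss = l1_color_loss(pixel, (0.0, 1.0, 0.0))
```

In this toy ray the empty first segment gets zero weight and the dense green segment absorbs almost all the light, so the red region behind it barely contributes to the pixel.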

Volume rendering of SDF: One widely used surface representation is the signed distance function (SDF). The surface S of an SDF can be implicitly defined by its zero-level set, expressed as S = { x ∈ R^3 ∣ f(x) = 0 }, where f(x) represents the SDF value. In the realm of neural SDFs, volume density predictions in NeRF are converted into SDF representations using a logistic function, enabling optimization through neural volume rendering. Given a 3D point x_i and SDF value f(x_i), the corresponding opacity value α_i used in Eq. 1 is computed as:

$$\alpha_i = \max\left( \frac{\Phi_s(f(x_i)) - \Phi_s(f(x_{i+1}))}{\Phi_s(f(x_i))},\ 0 \right)$$

where Φ_s is the sigmoid function.
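As a rough sketch of this conversion, the snippet below turns SDF values sampled along a ray into opacities, assuming the NeuS-style formula stated in the code comment. The sharpness value s = 10 and the sample values are arbitrary choices for illustration; in practice the sharpness is usually a learned parameter.

```python
import math

def logistic(x, s):
    """Sigmoid Phi_s with sharpness s (a fixed, hypothetical value here)."""
    return 1.0 / (1.0 + math.exp(-s * x))

def sdf_to_alphas(sdf_values, s):
    """alpha_i = max((Phi_s(f_i) - Phi_s(f_{i+1})) / Phi_s(f_i), 0)."""
    alphas = []
    for f_i, f_next in zip(sdf_values, sdf_values[1:]):
        phi_i = logistic(f_i, s)
        phi_next = logistic(f_next, s)
        alphas.append(max((phi_i - phi_next) / phi_i, 0.0))
    return alphas

def render_weights(alphas):
    """Compositing weights w_i = T_i * alpha_i from per-segment opacities."""
    weights, transmittance = [], 1.0
    for alpha in alphas:
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

# SDF values along a ray that enters the surface between the 3rd and 4th sample.
sdf_along_ray = [0.5, 0.3, 0.1, -0.1, -0.3, -0.5]
alphas = sdf_to_alphas(sdf_along_ray, s=10.0)
weights = render_weights(alphas)
peak = weights.index(max(weights))  # index of the segment with the largest weight
```

With these samples the largest rendering weight lands on the segment that straddles the zero crossing, which is what ties the rendered color to the surface. A ray moving away from the surface (increasing SDF) gets zero opacity because of the clamp.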

Neuralangelo builds upon a multi-resolution hash grid representation with SDF-based volume rendering. The 3D space is divided into small cubes called voxels, the 3D analogue of pixels in a 2D image. Each voxel stores information about the shape, color, and other characteristics of the 3D object. Because the grids come at several resolutions, Neuralangelo can capture both the big picture and the finest details of the object. It then uses neural surface rendering, that is, neural networks that render surfaces from this 3D data.

Multi-resolution hash encoding: Multi-resolution hash encoding recovers fine-grained details in tasks such as neural scene representation, and Neuralangelo uses it to recover high-fidelity surfaces. It uses grids at multiple resolutions, with each grid cell corner mapped to a hash entry. Each hash entry stores an encoding feature, and the encoded features are then passed to an MLP. The alternative, sparse voxel structures, uniquely defines grid corners without collisions but requires hierarchical spatial decomposition (e.g., octrees) to manage memory usage, since memory grows cubically with spatial resolution. This hierarchy limits the ability to recover surfaces that were misrepresented at coarser resolutions. In contrast, hash encoding avoids a spatial hierarchy and resolves collisions automatically through gradient averaging.
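To make the lookup concrete, here is a small pure-Python sketch of a multi-resolution hash encoding. It assumes points in the unit cube, tiny tables of 16 entries with 2 features each, and the XOR-of-primes spatial hash described in the Instant NGP paper. The function names and sizes are illustrative only, and real implementations run this on the GPU.

```python
import itertools
import math
import random

# Per-axis multipliers of the spatial hash described in the Instant NGP paper.
PRIMES = (1, 2654435761, 805459861)

def hash_corner(corner, table_size):
    """Map an integer grid corner (non-negative coordinates) to a table index."""
    h = 0
    for coord, prime in zip(corner, PRIMES):
        h ^= coord * prime
    return h % table_size

def encode_level(x, resolution, table):
    """Trilinearly interpolate the features stored at the 8 corners of x's cell."""
    scaled = [v * resolution for v in x]
    base = [math.floor(v) for v in scaled]
    beta = [v - b for v, b in zip(scaled, base)]
    out = [0.0] * len(table[0])
    for offset in itertools.product((0, 1), repeat=3):
        corner = [b + o for b, o in zip(base, offset)]
        weight = 1.0
        for o, t in zip(offset, beta):
            weight *= t if o else 1.0 - t
        feature = table[hash_corner(corner, len(table))]
        for k in range(len(out)):
            out[k] += weight * feature[k]
    return out

def encode(x, resolutions, tables):
    """Concatenate the interpolated features of every resolution level."""
    features = []
    for resolution, table in zip(resolutions, tables):
        features.extend(encode_level(x, resolution, table))
    return features

random.seed(0)
resolutions = [2, 4, 8]
tables = [[[random.uniform(-1, 1) for _ in range(2)] for _ in range(16)]
          for _ in resolutions]
features = encode((0.3, 0.6, 0.9), resolutions, tables)  # 3 levels x 2 features
```

The concatenated `features` vector is what would be fed to the MLP. Several corners can land on the same table index, and nothing in the lookup tries to prevent that, which is the collision behavior the text describes.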

What makes it better?

Previous methods often needed auxiliary information, such as segmentation or depth, for a vision algorithm to create a 3D model. Segmentation means dividing the image into multiple regions, usually separated by color or texture, and collecting this kind of data can be expensive and time-consuming. Neuralangelo produces its results from RGB images alone, which eliminates the need for additional equipment or data collection.

Numerical Gradient Computation

The analytical gradient of hash encoding w.r.t. position is local: it depends only on the hash entries of the grid cell that contains the point. Optimization updates therefore only propagate to local hash grid cells, lacking non-local smoothness. The paper proposes a simple fix for this locality problem: numerical gradients.

Figure: Numerical gradient computation. Using numerical gradients for higher-order derivatives distributes the back-propagation updates beyond the local hash grid cell, thus becoming a smoothed version of analytical gradients.

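To enforce the optimized neural representation to be a valid SDF, the eikonal loss [8] is typically imposed on the SDF predictions:

$$\mathcal{L}_{eik} = \frac{1}{N} \sum_{i=1}^{N} \left( \lVert \nabla f(x_i) \rVert_2 - 1 \right)^2$$

where N is the total number of sampled points.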

To allow for end-to-end optimization, a double backward operation on the SDF prediction f(x) is required.
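The eikonal loss penalizes SDF gradients whose length differs from 1. A minimal sketch, using an analytic sphere SDF as a stand-in for the network (the helper names are hypothetical):

```python
import math

def eikonal_loss(gradients):
    """Mean of (||grad f(x_i)||_2 - 1)^2 over the sampled points."""
    total = 0.0
    for grad in gradients:
        norm = math.sqrt(sum(c * c for c in grad))
        total += (norm - 1.0) ** 2
    return total / len(gradients)

def sphere_sdf_gradient(p):
    """Gradient of f(p) = ||p|| - r, which is the unit vector p / ||p||."""
    norm = math.sqrt(sum(c * c for c in p))
    return [c / norm for c in p]

points = [(2.0, 0.0, 0.0), (0.0, -3.0, 4.0), (1.0, 1.0, 1.0)]
valid = [sphere_sdf_gradient(p) for p in points]
stretched = [[2.0 * c for c in g] for g in valid]  # gradient of 2 * f, not a valid SDF

loss_valid = eikonal_loss(valid)
loss_stretched = eikonal_loss(stretched)
```

A true distance field has unit-length gradients everywhere, so `loss_valid` is essentially zero, while the stretched field, whose gradients have length 2, is penalized with a loss of about 1.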

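The de facto method for computing the surface normals of an SDF, ∇f(x), is to use analytical gradients. However, the analytical gradients of hash encoding w.r.t. position are not continuous across space. To find the sampling location, each 3D point x_i is first scaled by the grid resolution V_l, written as x_{i,l} = x_i · V_l.

Let the coefficient for (tri-)linear interpolation be β = x_{i,l} − ⌊x_{i,l}⌋. The resulting feature vectors are:

$$\gamma_l(x_{i,l}) = \gamma_l(\lfloor x_{i,l} \rfloor) \cdot (1 - \beta) + \gamma_l(\lceil x_{i,l} \rceil) \cdot \beta$$

where γ_l is the hash-encoded feature at level l, and ⌊·⌋ and ⌈·⌉ round down and up to the neighboring grid corners. The derivative of hash encoding w.r.t. the position can be obtained as:

$$\frac{\partial \gamma_l(x_{i,l})}{\partial x_i} = \gamma_l(\lfloor x_{i,l} \rfloor) \cdot \left( -\frac{\partial \beta}{\partial x_i} \right) + \gamma_l(\lceil x_{i,l} \rceil) \cdot \frac{\partial \beta}{\partial x_i} = -\gamma_l(\lfloor x_{i,l} \rfloor) \cdot V_l + \gamma_l(\lceil x_{i,l} \rceil) \cdot V_l$$

The derivative therefore depends only on the hash entries of the cell that contains x_i, and it changes abruptly when x_i crosses a cell border.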

To overcome the locality of the analytical gradient of hash encoding, the use of numerical gradients has been proposed. If the step size of the numerical gradient is smaller than the grid size of hash encoding, the numerical gradient is equivalent to the analytical gradient; otherwise, hash entries of multiple grid cells participate in the surface normal computation. Backpropagating through the surface normals thus allows hash entries of multiple grid cells to receive optimization updates simultaneously.
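A one-dimensional toy version makes the difference visible. The sketch below uses a plain array of values in place of a hash table (no hashing, a single level), and the grid values and step sizes are arbitrary numbers chosen for illustration.

```python
import math

def interp(grid, x, resolution):
    """Linear interpolation of grid values at position x in [0, 1)."""
    scaled = x * resolution
    lo = math.floor(scaled)
    beta = scaled - lo
    return grid[lo] * (1.0 - beta) + grid[lo + 1] * beta

def analytic_grad(grid, x, resolution):
    """Derivative of interp w.r.t. x: touches only the two corners of x's cell."""
    lo = math.floor(x * resolution)
    return (grid[lo + 1] - grid[lo]) * resolution

def numerical_grad(grid, x, resolution, eps):
    """Central difference with step size eps."""
    return (interp(grid, x + eps, resolution)
            - interp(grid, x - eps, resolution)) / (2.0 * eps)

grid = [0.0, 1.0, 0.0, 2.0, 1.0]  # 4 cells of size 0.25
x = 0.375                          # inside the second cell

exact = analytic_grad(grid, x, resolution=4)
small_step = numerical_grad(grid, x, resolution=4, eps=0.05)  # stays inside the cell
large_step = numerical_grad(grid, x, resolution=4, eps=0.25)  # reaches the neighboring cells
```

With the small step both samples fall in the same cell and the result matches the analytical value of -4. With the large step the samples land in the neighboring cells, so four grid entries influence the result (1.0 here) and all four would receive gradient updates.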

To compute the surface normals using the numerical gradient, additional SDF samples are needed. Given a sampled point X_i = (x_i, y_i, z_i), we additionally sample two points along each axis of the canonical coordinate system around X_i, at a distance of one step size ϵ. For example, the x-component of the surface normal can be found as

$$\nabla_x f(X_i) = \frac{f(\gamma(X_i + \epsilon_x)) - f(\gamma(X_i - \epsilon_x))}{2\epsilon}$$

where γ denotes the hash encoding and ϵ_x = [ϵ, 0, 0]. In total, six additional SDF samples are required for numerical surface normal computation.
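The six-sample scheme is ordinary central differencing. A short sketch, with an analytic sphere SDF standing in for the network and hash encoding (the function names are hypothetical):

```python
import math

def sphere_sdf(p, radius=1.0):
    """Signed distance to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def numerical_normal(sdf, p, eps):
    """Central differences: two extra SDF samples per axis, six in total."""
    grad = []
    for axis in range(3):
        plus, minus = list(p), list(p)
        plus[axis] += eps
        minus[axis] -= eps
        grad.append((sdf(plus) - sdf(minus)) / (2.0 * eps))
    return grad

evaluated = []  # records every point at which the SDF is queried

def counted_sdf(p):
    evaluated.append(tuple(p))
    return sphere_sdf(p)

normal = numerical_normal(counted_sdf, (2.0, 0.0, 0.0), eps=0.01)
```

For a point on the x-axis outside the sphere, the estimated normal points along +x, and `evaluated` holds exactly the six extra sample locations.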

Progressive levels of details

Coarse-to-fine optimization can be used to better shape the loss landscape and avoid falling into false local minima. This strategy has been used in computer vision, for example in image-based registration. Neuralangelo also adopts a coarse-to-fine optimization scheme to reconstruct surfaces with progressive levels of detail, and it does so from two perspectives.

Step size ϵ: As discussed previously, numerical gradients can be seen as a smoothing operation in which the step size ϵ controls the resolution and the amount of recovered detail. Imposing the eikonal loss L_eik with a larger ϵ ensures that the surface normal is consistent at a larger scale, therefore producing more consistent and continuous surfaces. On the other hand, imposing L_eik with a smaller ϵ affects a smaller region and avoids smoothing out details.

In practice, step size ϵ is initialized to the coarsest hash grid size and exponentially decreases to match different hash grid sizes in the optimization process.

Hash grid resolution V: If all hash grids are activated from the start of the optimization, then to capture geometric details the fine hash grids must first “unlearn” what they picked up during the coarse optimization with a large step size ϵ and then “relearn” with a smaller ϵ. If this process fails, geometric details are lost. Therefore, only an initial set of coarse hash grids is enabled, and finer hash grids are progressively activated throughout the optimization as ϵ decreases to their spatial size. In practice, weight decay is also applied over all parameters to avoid single-resolution features dominating the final results.
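The two coarse-to-fine ingredients can be combined into one small schedule. The sketch below is a simplification under stated assumptions, not the paper's exact schedule: grid resolutions grow geometrically, only the coarsest level is active at the start, and one finer level is unlocked every fixed number of iterations. All names and numbers are made up for illustration.

```python
def grid_resolutions(coarsest, growth, num_levels):
    """Resolutions V_l that grow geometrically from the coarsest level."""
    return [coarsest * growth ** level for level in range(num_levels)]

def coarse_to_fine_state(step, steps_per_level, resolutions):
    """Return (eps, active) for a training step.

    eps is the cell size of the finest active level, so it shrinks
    exponentially as levels are unlocked; active flags the enabled levels.
    """
    finest_active = min(step // steps_per_level, len(resolutions) - 1)
    eps = 1.0 / resolutions[finest_active]
    active = [level <= finest_active for level in range(len(resolutions))]
    return eps, active

resolutions = grid_resolutions(coarsest=32, growth=2, num_levels=4)
start = coarse_to_fine_state(0, 1000, resolutions)
middle = coarse_to_fine_state(2500, 1000, resolutions)
end = coarse_to_fine_state(10000, 1000, resolutions)
```

At step 0 only the 32-cell grid is active and ϵ equals its cell size, by step 2500 three levels are active with ϵ = 1/128, and late in training every level is active with ϵ at the finest cell size. Features of inactive levels would typically be zeroed out before being passed to the MLP.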

Quality comparison between Neuralangelo and other 3D generation models

As you can see in the image, Neuralangelo produces more accurate and higher-fidelity surfaces.

Optimization

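To encourage the smoothness of the reconstructed surfaces, a prior is imposed by regularizing the mean curvature of the SDF. The mean curvature is computed from a discrete Laplacian, similar to the surface normal computation. The curvature loss L_curv is defined as:

$$\mathcal{L}_{curv} = \frac{1}{N} \sum_{i=1}^{N} \left| \nabla^2 f(x_i) \right|$$

The total loss is defined as the weighted sum of losses:

$$\mathcal{L} = \mathcal{L}_{RGB} + w_{eik} \mathcal{L}_{eik} + w_{curv} \mathcal{L}_{curv}$$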

All network parameters, including MLPs and hash encoding, are trained jointly end-to-end.
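A discrete Laplacian can reuse the same axis-aligned samples as the numerical normals. The sketch below assumes a standard second-order finite difference and uses analytic SDFs in place of the network. The loss weights in the example are placeholders, not values taken from the paper.

```python
import math

def plane_sdf(p):
    """Signed distance to the plane z = 0 (zero curvature)."""
    return p[2]

def sphere_sdf(p, radius=1.0):
    """Signed distance to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def discrete_laplacian(sdf, p, eps):
    """Sum of second-order central differences along the three axes."""
    center = sdf(p)
    total = 0.0
    for axis in range(3):
        plus, minus = list(p), list(p)
        plus[axis] += eps
        minus[axis] -= eps
        total += (sdf(plus) - 2.0 * center + sdf(minus)) / (eps * eps)
    return total

def curvature_loss(sdf, points, eps):
    """Mean absolute Laplacian of the SDF over the sampled points."""
    return sum(abs(discrete_laplacian(sdf, p, eps)) for p in points) / len(points)

def total_loss(l_rgb, l_eik, l_curv, w_eik, w_curv):
    """Weighted sum of the color, eikonal, and curvature terms."""
    return l_rgb + w_eik * l_eik + w_curv * l_curv

flat = curvature_loss(plane_sdf, [(0.3, 0.2, 0.5)], eps=0.01)
curved = curvature_loss(sphere_sdf, [(2.0, 0.0, 0.0)], eps=0.01)
combined = total_loss(l_rgb=0.2, l_eik=0.05, l_curv=curved, w_eik=0.1, w_curv=0.001)
```

The plane gives a curvature loss of essentially zero, while the sphere sampled at distance 2 from its center gives a value close to 1, the analytic Laplacian 2/‖p‖ of a sphere SDF at that point.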

Applications of Neuralangelo

3D reconstruction from 2D images has applications across a wide range of industries. Here are some key areas where this technology is making a significant impact:

Architecture and Construction

Virtual Reality (VR) and Augmented Reality (AR): Creating immersive experiences for architects, designers, and clients to visualize building projects in 3D.
Heritage Preservation: Preserving historical structures by creating digital twins for documentation and restoration.
Interior Design: Visualizing interior spaces with realistic 3D models, aiding in design decisions.

Entertainment and Gaming

Movie and Game Development: Generating realistic 3D models for characters, environments, and props.
Animation: Creating lifelike animations and special effects.
Virtual Reality Gaming: Providing immersive experiences for gamers.

Medical Imaging

Surgical Planning: Creating patient-specific 3D models for pre-operative planning and simulation.
Prosthetic Design: Designing customized prosthetics based on patient-specific measurements.

Robotics and Automation

Autonomous Navigation: Enabling robots to navigate and map environments using 3D models.
Object Recognition: Identifying and interacting with objects in the real world.
Industrial Automation: Optimizing manufacturing processes by creating 3D models of machines and products.

Cultural Heritage

Digital Preservation & Archaeological Reconstruction: Creating digital archives of historical artifacts and sites.
Virtual Museums: Offering immersive experiences for visitors to explore cultural heritage.

Other Applications

Forensic Science: Analyzing crime scenes and reconstructing accidents.
Urban Planning: Simulating urban development and traffic patterns.
Real Estate: Creating virtual tours of properties for potential buyers.

In essence, 3D reconstruction from 2D images has the potential to transform industries by providing new insights, improving efficiency, and creating innovative experiences. As technology continues to advance, we can expect even more exciting applications to emerge in the future.