Kwag Cheolsu1
Hwang Sung Soo2
(Department of Computer Science and Electrical Engineering, Handong Global University, Pohang, South Korea, charse65@handong.ac.kr)
(School of Computer Science and Electrical Engineering, Handong Global University, Pohang, South Korea, sshwang@handong.edu)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Neural rendering, Quality advancement, Training/rendering speed advancement, 3D geometry reconstruction, Neural scene editing
1. Introduction
The pursuit of photorealistic rendering and novel view synthesis has been a fundamental
challenge in the domains of computer graphics and computer vision. This endeavor,
centered on the generation of high-quality images from 3D scenes, has undergone significant
evolutionary progress. Within this context, Image-Based Rendering (IBR) [1] emerged as a predominant technique, integrating geometric and photographic data to
render new perspectives. This method has been instrumental in synthesizing images
that closely resemble those found in real-world scenarios.
IBR has been a critical component in the field of view synthesis. By utilizing an array of scene images obtained from various angles and distances, it facilitated the generation of scenes from unexplored viewpoints. This methodology proved particularly efficacious
in applications demanding high photorealism, such as in video games and virtual reality
environments. Through the strategic reuse and manipulation of existing images, IBR
could achieve a significant degree of photorealism, enhancing its applicability across
various sectors.
Despite its usefulness, IBR had some inherent constraints. It struggled to capture
complex lighting conditions, transparency, and intricate details in 3D scenes. The
quality of synthesized views often suffered, leading to perceptible artifacts and
a lack of realism. Moreover, the computational demands of IBR could be quite high,
especially when it came to real-time or interactive rendering, limiting its scalability
and accessibility.
In this context, the advent of neural rendering, particularly marked by the introduction
of neural radiance fields (NeRF) [5] in 2020, represents a paradigm shift. Leveraging the capabilities of deep learning
models and neural networks, neural rendering has significantly advanced the process
of view synthesis. NeRF, in particular, transcends the traditional constraints of
IBR. Its novel approach to data-driven scene representation, predicated on 3D coordinates
and viewing directions, sets new standards for photorealism and quality in neural
rendering.
While NeRF significantly enhances the quality of synthesized views, it is also not
immune to certain limitations:
1) NeRF's computational demands can be substantial, especially in scenarios requiring
real-time or interactive rendering. Despite advances, achieving instantaneous rendering
remains a challenge for this method.
2) Achieving consistently high-quality results can be challenging in complex scenes
with intricate lighting conditions or transparent materials, where NeRF may still
exhibit artifacts.
3) While NeRF excels in view synthesis and scene representation, it may not provide
the level of accuracy required for detailed 3D geometry reconstruction. It is primarily
designed to capture scene appearance and rendering.
4) NeRF's core design focuses on representation and rendering, making it less amenable
to intuitive neural scene editing, which often necessitates specific tools and techniques
designed for this purpose.
These limitations highlight the ongoing challenges in the field of neural rendering.
Subsequent research efforts have been aimed at addressing these issues, focusing on:
• Rendering speed improvement: With the advent of neural networks specialized for
real-time or interactive applications, rendering speed has seen dramatic improvements,
allowing for seamless integration into various real-world scenarios and applications.
• Quality improvement: Neural rendering techniques have shown a remarkable ability to generate photorealistic images, addressing issues such as complex lighting, intricate materials, and object interactions. They can produce scenes that are nearly indistinguishable from real-world photographs.
• 3D Geometry reconstruction: Neural rendering goes beyond image-based methods by
enabling the reconstruction of accurate 3D geometry from 2D images, thereby enriching
the understanding of a scene's structure and spatial relationships.
• Neural scene editing: These methods empower users to manipulate and edit scenes
with unparalleled flexibility, allowing for the creation of entirely new and customized
visual content.
The integration of neural rendering [3] techniques marks a significant milestone in the evolution of computer graphics and
computer vision. This paradigm shift not only enhances the quality of synthesized
views but also catalyzes the development of novel applications previously beyond the
scope of traditional methods. The objective of this survey paper is to conduct a thorough
examination and analysis of the advancements in neural rendering, specifically focusing
on areas such as accelerated rendering speeds, quality enhancement, 3D geometry reconstruction,
and neural scene editing. In this context, the paper will explore the progressive
techniques, achievements, and challenges inherent within these domains. It aims to
elucidate the considerable strides neural rendering has made, underscoring its impact
and significance in the realm of computer graphics and vision research. Through an
in-depth investigation of these diverse facets of neural rendering, the survey intends
to provide a comprehensive overview and a nuanced understanding of its current state
of the art. Additionally, it seeks to offer valuable insights into the emerging trends
and potential challenges that shape the future trajectory of this transformative technology.
The remainder of this paper is organized as follows. Earlier studies on neural rendering
and the specifics of neural radiance fields (NeRF) are discussed in Section 2. Novel
contributions in four areas: efficiency in rendering and training speed, image quality,
3D geometry reconstruction, and scene editing capabilities will be presented in Section
3. The summary of our findings and thoughts on future research directions will be
in Section 4.
2. Related Work
2.1 Neural Rendering Before NeRF
Neural rendering [7] represents a groundbreaking shift in the fields of computer graphics and computer
vision, leveraging the capabilities of deep learning to create high-quality, photorealistic
images and novel views of 3D scenes. This approach significantly deviates from traditional
rendering techniques, such as image-based rendering (IBR) [1], offering a more data-driven and versatile method of scene synthesis. The fundamental
idea behind neural rendering is to learn a mapping function from a set of inputs,
typically comprising a 3D scene representation and a desired viewpoint, to generate
an image. Neural networks drive this process, trained on extensive datasets of images and associated scene information. These networks can take on various architectures, from convolutional neural networks (CNNs) to more specialized designs, such as NeRF
and its derivatives.
The development of neural rendering is marked by a historical progression of techniques
and innovations:
Early deep learning in computer graphics: The initial foray of deep learning into
computer graphics primarily focused on enhancing image quality. Convolutional Neural
Networks (CNNs), a class of deep neural networks, were pivotal in this phase. They
were applied to tasks such as image denoising and super-resolution, where the networks
learned to identify and correct errors in images, or to upscale and improve the resolution
of lower-quality images. This period set the groundwork for integrating deep learning
into more complex graphic applications.
Deep learning for 3D reconstruction: The intersection of deep learning and 3D reconstruction
marked a transformative phase. Techniques such as Multi-View Stereo (MVS) and Structure-from-Motion
(SfM) were adapted to incorporate neural networks, greatly enhancing their capabilities.
In the context of MVS, neural networks were used to improve the matching of corresponding
points across different views, significantly enhancing the depth estimation and reconstruction
accuracy. For SfM, deep learning algorithms were integrated to better interpret sequential
image data, facilitating more accurate extraction of 3D structure from motion patterns.
These adaptations allowed for more precise and detailed reconstructions of 3D scenes
from 2D images, overcoming some of the limitations inherent in purely algorithmic
approaches.
GQN and 3D generative models: The introduction of the Generative Query Network (GQN)
represented a significant leap in neural scene representation. GQN is a neural network
architecture that learns to represent scenes implicitly from 2D observations. It functions
by encoding scenes into a latent representation, which can then be queried to generate
new views of the scene, essentially synthesizing novel perspectives. However, while
GQN was a breakthrough in terms of learning scene dynamics and structure, it had its
limitations. It struggled with rendering complex scenes with high levels of detail
and achieving photorealistic output. The GQN's approach to scene understanding and
synthesis, despite its limitations, laid the groundwork for more advanced generative
models and neural rendering techniques, pushing the boundaries of what could be achieved
in synthesizing 3D environments from 2D data. Fig. 1 represents research directions before the introduction of NeRF on the left side.
In the middle, the overview of the NeRF network process is presented. On the input
side, $x$ represents the point on the ray, and $d$ represents the viewing direction.
On the output side, $c$ represents the predicted color and $\sigma$ represents the
volume density. On the right side, various research directions post-NeRF are presented.
Fig. 1. Research directions before and after the introduction of NeRF.
2.2 Neural Radiance Field (NeRF)
Neural Radiance Fields (NeRF) is a groundbreaking technique in neural rendering. This
method represents a substantial shift in approach by using a fully connected neural
network to model a continuous volumetric scene function. Unlike traditional methods,
NeRF does not directly map 2D pixel coordinates to 3D voxel coordinates. Instead,
it operates on a set of 5D coordinates, encompassing both spatial location $(x, y, z)$ and viewing direction $(\theta, \varphi)$.
The essence of NeRF lies in its ability to synthesize novel views of a scene. It does
this by querying the 5D coordinates corresponding to specific points in space, considering
the viewing direction. For each of these points, the network predicts two crucial
pieces of information: the color (RGB) and the volume density ($\sigma$). These predicted
values are integral to the process of volume rendering, which combines the color and
density along the path of a camera ray to construct the final 2D image.
During its training phase, NeRF is optimized against a set of training images with
known camera poses. The neural network learns to approximate a function that takes
as input the 5D coordinates and outputs the RGB color and volume density at each point.
This sophisticated modeling allows NeRF to render highly detailed, photorealistic
views of complex scenes. It significantly advances the field by surpassing previous
methods in neural rendering and view synthesis in terms of realism and detail in the
generated images.
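To make this mapping concrete, the following is a minimal sketch of such a network in PyTorch. It is an illustrative simplification rather than the architecture from the original paper: the class name and layer widths are our own, and the positional encoding, skip connections, and hierarchical sampling used by NeRF are omitted.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal sketch of the NeRF mapping (x, d) -> (c, sigma).

    Simplification: the original model applies positional encoding to the
    inputs and uses larger layers with skip connections, omitted here.
    """
    def __init__(self, hidden=128):
        super().__init__()
        # Density branch depends only on the 3D position (x, y, z).
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        # Color branch additionally receives the viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, x, d):
        h = self.trunk(x)                       # (N, hidden)
        sigma = torch.relu(self.sigma_head(h))  # non-negative density
        rgb = self.color_head(torch.cat([h, d], dim=-1))
        return rgb, sigma
```

The key structural point, which the sketch preserves, is that the density depends only on the position, while the predicted color additionally depends on the viewing direction.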
The process of view synthesis in NeRF involves several steps:
1) 5D coordinate input: NeRF takes a continuous 5D coordinate as input, comprising
a 3D location $(x, y, z)$ and a 2D viewing direction $(\theta, \varphi)$. This
input is used to predict the emitted color and volume density at that specific point
in space.
2) Ray marching: For each pixel in the desired image, NeRF casts a camera ray into
the scene. The path of each ray is determined by the camera's position and orientation.
The method then samples a set of points along each ray. These points represent potential
locations where light might interact with the scene, contributing to the final pixel
color.
3) Neural network prediction: Each sampled point along the ray, along with its corresponding
viewing direction, is fed into a fully connected neural network. The network functions
as a mapping function $F_\Theta: (x, d) \to (c, \sigma)$, where $x$ is the 3D coordinate, $d$ is the viewing direction, $c$ is the color, and $\sigma$ is the volume density. The network outputs the predicted color and volume density for each point. Fig. 1 illustrates this process: sampled 3D points along the ray and their viewing directions are fed into the MLP. On the input side, $x$ represents the point on the ray, and $d$ represents the viewing direction. On the output side, $c$ represents the predicted color and $\sigma$ represents the volume density.
4) Volume density and color rendering: The volume density $\sigma$ at a point can
be interpreted as the probability of a ray terminating at that point. The emitted
color at each point is dependent on both the viewing direction and the scene's content
at that point.
5) Accumulated transmittance: A function named $T(t)$ is calculated to accumulate
transmittance along the ray, representing the probability that light travels from
the start of the ray to each point without being absorbed. This function is crucial
for understanding how light interacts with the scene as it travels along the ray.
Fig. 2 is presented for better understanding. The figure shows the predicted colors and
densities accumulation along the ray to form the final pixel color on the image plane.
Volume rendering integrates these values to simulate the light transport through a
medium.
6) Volume rendering integral: The final color $C(r)$ of each camera ray is computed
using a volume rendering integral. This integral accounts for the color and density
of each sampled point along the ray, weighted by the accumulated transmittance (the corresponding equations are given after this list). NeRF numerically estimates this continuous integral using a quadrature rule based on stratified sampling, which allows for a better approximation of the continuous scene representation compared to deterministic quadrature.
7) Differentiable rendering: The entire rendering process is differentiable, allowing
the use of gradient descent to optimize the neural network. By minimizing the difference
between the rendered images and the ground truth images, the network learns to accurately
predict color and density values that produce photorealistic renderings of the scene
from novel viewpoints. Fig. 3 shows how the optimization is done. The loss is calculated by obtaining the difference
between rendered image pixels and the ground truth image pixels and repeating that
process until the loss value obtained draws near to zero value.
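For reference, the quantities in steps 5-7 can be written compactly in the form used by the original NeRF paper [5]. With a camera ray $r(t) = o + td$, near and far bounds $t_n$ and $t_f$, and sample spacing $\delta_i = t_{i+1} - t_i$:

$$T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right), \qquad C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt,$$

and the quadrature estimate over $N$ stratified samples is

$$\hat{C}(r) = \sum_{i=1}^{N} T_i\left(1 - \exp(-\sigma_i\delta_i)\right)c_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1}\sigma_j\delta_j\right).$$

The network is then optimized with the photometric loss $\mathcal{L} = \sum_{r\in\mathcal{R}} \lVert \hat{C}(r) - C(r)\rVert_2^2$ over a batch of rays $\mathcal{R}$ (the original method additionally trains coarse and fine networks, which we omit here for brevity).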
Fig. 2. Image of predicted colors and densities accumulation in the NeRF process.
Fig. 3. Loss computation from the difference between the rendered and ground truth
images in the NeRF process.
2.3 Benchmark Dataset
In evaluating the effectiveness of these techniques, researchers frequently utilize
a variety of benchmarks. Popular datasets such as the DTU Dataset [2], BlendedMVS Dataset [3], Tanks and Temples [4], NeRF Synthetic and Real-world Datasets [5], and the Redwood-3dscan Dataset [6] play a pivotal role. These datasets, encompassing scenarios from controlled laboratory
settings to complex real-world environments, offer comprehensive platforms for testing
and refining neural rendering algorithms.
3. Recent Advancements in Neural Rendering
3.1 Rendering or Training Speed
In the landscape of neural rendering, the need for swift rendering and training speed
is paramount, particularly in the context of real-time and interactive applications.
Although NeRF has impressive capabilities, it also has several limitations, particularly regarding rendering and training speed:
1) Ill-posed geometry estimation: NeRF struggles with estimating accurate scene geometries,
especially with a limited number of input views. While it can learn to render training
views accurately, it often generalizes poorly to novel views, a phenomenon known as
overfitting. This is because the traditional volumetric rendering does not enforce
constraints on the geometry, leading to wildly inaccurate reconstructions that only
look correct from the training viewpoints.
2) Time-consuming training: NeRF requires a lengthy optimization process to fit the
volumetric representation, which is a significant drawback. Training a single scene
can take anywhere from ten hours to several days on a single GPU. This slow training
is attributed to expensive ray-casting operations and the optimization process required
to learn the radiance and density functions within the volume.
The advancements post-NeRF have specifically targeted these limitations, seeking
to enhance both the efficiency and practicality of neural rendering to be accessible
in various real-world scenarios. This section unveils the latest pioneering research
in rendering and training speed advancements, where a focus on speed is the key objective.
3.1.1 Voxel-based Method for Rendering Efficiency
Voxel-based approaches represent a leap in efficiency by refining geometry estimation
and reducing the computational waste of traditional volumetric rendering. These methods
strategically focus on non-empty space, thereby addressing the issue of ill-posed
geometry estimation by concentrating on relevant scene areas. NSVF [8]'s sparse voxel octrees target only non-empty space, cutting down unnecessary calculations.
DVGO [9] optimizes voxel grids in a two-tiered process, capturing the broader scene structure
before zooming in on detail, avoiding the gradual and exhaustive refinement process
seen in previous methods. Plenoxels [10] forgo complex neural networks, instead using voxel grids and spherical harmonics to
directly optimize scenes, which translates to rapid convergence and the ability to
render in real time. These voxel-based methods signify a pivotal change from exhaustive
volume sampling to strategic, content-focused rendering, streamlining the process
significantly.
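To illustrate the shared idea of skipping empty space, the following is a minimal sketch (not taken from any of the cited papers) of how a binary occupancy grid can be used to discard samples in empty voxels before they reach the expensive radiance model; the function name, grid resolution, and bounds are illustrative assumptions.

```python
import numpy as np

def skip_empty_samples(points, occupancy, grid_min, grid_max):
    """Keep only ray samples that fall inside occupied voxels.

    points:    (N, 3) sample positions along the rays.
    occupancy: (R, R, R) boolean grid, True where the scene is non-empty.
    grid_min/grid_max: (3,) bounds of the axis-aligned scene box.
    Returns the surviving points and a boolean mask over the input.
    """
    res = np.array(occupancy.shape)
    # Map world coordinates to integer voxel indices.
    rel = (points - grid_min) / (grid_max - grid_min)      # in [0, 1]
    idx = np.clip((rel * res).astype(int), 0, res - 1)
    mask = occupancy[idx[:, 0], idx[:, 1], idx[:, 2]]
    return points[mask], mask

# Only the surviving samples are evaluated by the radiance model, which is
# where most of the speed-up of voxel-based methods comes from.
```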
3.1.2 Factorization Techniques for Real-time Rendering
Factorization techniques break new ground in rendering speed by deconstructing the scene into manageable elements, reducing training and rendering time.
FastNeRF [11]'s separation of static geometry from dynamic effects allows for precomputation and
caching, minimizing the on-the-fly calculations required during rendering. KiloNeRF
[12]'s division of scenes into smaller segments processed by individual MLPs introduces
a parallel computation model, drastically speeding up rendering times and enabling
interactive applications. TensoRF [13] utilizes tensor decomposition to reduce computational demands while preserving scene
detail, offering a solution that is both efficient and maintains high fidelity. These
methods excel by dissecting complex scenes into simpler, more computationally accessible
parts, facilitating a quicker rendering pipeline. Importantly, TensoRF's approach
to handling complex lighting and geometric details enhances the rendering quality,
capturing intricate scene elements with high fidelity. This aspect underscores its
significance in the domain of quality improvement in neural rendering, illustrating
its capacity to simultaneously expedite rendering and retain high-quality scene representations.
3.1.3 Data Structure Innovations for Fast Training and Rendering
Advances in data structures, such as Instant-NGP [14]'s multiresolution hash encoding and PlenOctrees [15]'s use of octrees, harness GPU capabilities more effectively. Instant-NGP accelerates feature access and updates, which is essential for iterative training and rendering, while PlenOctrees exploits precomputed information to reach real-time rendering speeds. Fig. 4 shows the space subdivided into octants; each octant in the scene can be subdivided again into eight voxel-based octants. Fig. 5 shows a ray passing through a neural scene that has been divided into octree voxel grids. The dots on the ray represent spherical harmonics (SH). This approach allows the model to skip large voxels that represent empty space and focus on small voxels that may contain color and density. These innovations are transformative, moving away from the dense, uniform data structures that have traditionally hindered performance to more nuanced and efficient storage models that exploit the parallel nature of modern GPUs.
Fig. 4. Space subdivision into octants in PlenOctree rendering technique.
Fig. 5. A ray passing through a neural scene in PlenOctree rendering technique.
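As a conceptual illustration of how spherical-harmonic coefficients stored in a voxel produce view-dependent color, the sketch below evaluates a degree-1 real SH basis for a given viewing direction. PlenOctrees itself uses higher-degree SH and an optimized octree traversal, and sign conventions for the degree-1 terms vary between implementations, so the function below is illustrative only.

```python
import numpy as np

# Real spherical harmonics constants for degrees 0 and 1.
SH_C0 = 0.28209479  # 1 / (2 * sqrt(pi))
SH_C1 = 0.48860251  # sqrt(3 / (4 * pi))

def sh_color(coeffs, direction):
    """View-dependent RGB from degree-1 SH coefficients stored in a voxel.

    coeffs:    (3, 4) coefficients, one row per color channel.
    direction: (3,) unit viewing direction (x, y, z).
    """
    x, y, z = direction
    # Basis ordering [Y00, Y1-1, Y10, Y11]; sign conventions may differ.
    basis = np.array([SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x])
    raw = coeffs @ basis                 # (3,) one value per RGB channel
    return 1.0 / (1.0 + np.exp(-raw))    # sigmoid keeps colors in [0, 1]

# A voxel whose color barely changes with the viewing direction would carry
# most of its energy in the constant (degree-0) coefficient.
rgb = sh_color(np.array([[ 2.0, 0.1, 0.0, 0.0],
                         [ 0.5, 0.0, 0.1, 0.0],
                         [-1.0, 0.0, 0.0, 0.1]]),
               np.array([0.0, 0.0, 1.0]))
```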
3.1.4 Adapting NeRF for Mobile and Low-Power Devices
Adapting NeRF for mobile devices, as MobileNeRF [16] does, brings efficient 3D rendering to less powerful hardware. By leveraging native
GPU rasterization pipelines, these methods reduce the complexity of rendering tasks,
enabling fluid performance in mobile VR/AR settings. This adaptation is pivotal, as
it brings the sophistication of NeRF to a broader range of devices, overcoming previous
limitations due to hardware constraints.
3.1.5 Enhancing Efficiency and Generalization
The methods in this section address the limitation of ill-posed geometry estimation caused by limited or arbitrary input views. MVSNeRF [17] introduces a method for reconstructing neural radiance fields from a small number
of views, significantly improving generalizability across different scenes and reducing
the per-scene optimization time. F2-NeRF (Fast-Free-NeRF) [18] focuses on novel view synthesis with arbitrary input camera trajectories. It efficiently
handles large, unbounded scenes with diverse camera paths, maintaining rapid convergence
speed and offering a general space-warping scheme applicable to arbitrary camera trajectories.
Depth-supervised NeRF (DS-NeRF) [19] addresses the issue of NeRF's susceptibility to overfitting and slow training by
incorporating depth supervision. This method uses sparse 3D points from structure-from-motion
as additional guidance, improving NeRF's ability to render realistic images from fewer
views and speeding up the training process by 2-3 times.
3.1.6 Scalability and Real-time Application
Real-time applications demand methods such as nVDB [20] and 3D Gaussian Splatting [21], which prioritize immediate rendering capabilities. nVDB's real-time scene optimization
and 3D Gaussian Splatting's efficient rendering strategy for high-resolution outputs
cater to the pressing need for scalability and instantaneity in applications such
as live video and interactive simulations. These methods are critical as they align
with the need for on-demand rendering, crucial in an era where delay can impede user
experience and system functionality.
3.2 Quality
The advancements in this field are poised to generate scenes that are virtually indistinguishable
from real-world photographs. Notably, isolating research papers dedicated to this specific topic proved challenging, since enhancing photorealism and visual fidelity is an overarching goal across the field. Thus, our focus here is on introducing papers that intensively
concentrate on improving rendering quality.
GANeRF (Leveraging discriminators to optimize neural radiance fields) [22] introduces the use of adversarial training, specifically discriminators, to optimize
Neural Radiance Fields. It incorporates the generative adversarial network (GAN) framework
where the generator aims to produce realistic images while the discriminator evaluates
them. By using a patch-based rendering constraint, GANeRF is designed to address typical
imperfections and rendering artifacts that arise from traditional NeRF methods. The
introduction of adversarial training allows GANeRF to push the boundaries of realism
in NeRF-rendered images, especially in regions with limited coverage or complex textures
where traditional NeRF might struggle. This approach highlights the potential for
combining rendering priors with novel view synthesis, leading to qualitative and quantitative
improvements in rendered scenes.
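For context, discriminator-based optimization of this kind builds on the standard adversarial objective below; GANeRF's actual formulation operates on rendered patches and includes additional regularization, so this is only the textbook form it extends:

$$\min_{G}\max_{D}\;\; \mathbb{E}_{x\sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{\hat{x}\sim p_{G}}\big[\log\big(1 - D(\hat{x})\big)\big],$$

where, in this setting, patches rendered from the radiance field play the role of the generator samples $\hat{x}$ and the discriminator $D$ is trained to separate them from real image patches.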
Tetra-NeRF (Representing neural radiance fields using tetrahedra) [23] presents a novel representation of Neural Radiance Fields by utilizing tetrahedral
tessellation of the scene. This method enhances the efficiency of NeRF by breaking
down the scene into a network of interconnected tetrahedra, which allows for faster
rendering and reduced computational load. The key significance of Tetra-NeRF lies
in its ability to improve memory efficiency and accelerate the rendering process,
which is particularly beneficial for dynamic scenes or applications requiring real-time
interaction. Using a tetrahedral structure can provide a more computationally tractable
approach to represent complex volumes, which is a step forward in addressing the scalability
challenges faced by traditional NeRF methods. Figs. 6-8 represent the overall stages of the Tetra-NeRF tetrahedral tessellation process.
Fig. 6 is an image of an example point cloud that is provided as input. Fig. 7 shows the intermediate stage, where the tetrahedral set is used to represent the radiance field; barycentric interpolation is used to interpolate the values stored at the tetrahedra vertices. Fig. 8 shows the final color Lego image output produced by the Tetra-NeRF model.
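As a small, self-contained illustration of barycentric interpolation inside a tetrahedron (the operation Fig. 7 refers to), the sketch below solves for the barycentric weights of a query point and blends per-vertex values with them; the names and values are illustrative and this is not code from the Tetra-NeRF implementation.

```python
import numpy as np

def barycentric_interpolate(vertices, values, point):
    """Interpolate per-vertex values at a point inside a tetrahedron.

    vertices: (4, 3) tetrahedron corner positions.
    values:   (4, F) features (e.g. colors or latent codes) at the corners.
    point:    (3,) query position.
    """
    # Barycentric weights w solve: sum_i w_i * v_i = p with sum_i w_i = 1.
    A = np.vstack([vertices.T, np.ones(4)])      # (4, 4) system matrix
    b = np.append(point, 1.0)                    # (4,)
    w = np.linalg.solve(A, b)                    # barycentric weights
    return w @ values                            # (F,) interpolated feature

verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
vals = np.eye(4)                                 # one-hot feature per vertex
print(barycentric_interpolate(verts, vals, np.array([0.25, 0.25, 0.25])))
# -> [0.25 0.25 0.25 0.25]: equal weights at the tetrahedron's centroid.
```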
DP-NeRF (deblurred neural radiance field with physical scene priors) [24] addresses issues that frequently occur in real-world photography, such as motion
blur and defocus blur, using physical scene priors to resolve NeRF's blur problems.
It leverages the actual physical blurring process that occurs during image acquisition
by a camera to produce renderings with more accurate 3D consistency. This is particularly
important for enhancing NeRF quality even when the quality of data is compromised.
Nerfbusters [25] focuses on removing ghostly artifacts that appear in casually captured NeRFs and
improving scene geometry. It introduces a local 3D geometric prior learned with a
diffusion model trained on synthetic data to encourage plausible geometry during NeRF
optimization, which helps clean up floaters and cloudy artifacts from NeRFs. This
approach is essential for producing clearer and more coherent scenes from casually
captured data.
NeRFLiX [26] is dedicated to high-quality neural view synthesis by learning a degradation-driven
inter-viewpoint mixer. It simulates NeRF-style degradations to create training data,
which is then used to train a restorer to enhance NeRF-rendered views. This method
is crucial for fusing high-quality training images to improve the performance of NeRF
models, leading to more photorealistic synthetic views. It demonstrates the necessity
of handling degradation in the rendered images to achieve higher fidelity in the final
output.
Fig. 6. Point cloud input of Tetra-NeRF's tetrahedral tessellation process.
Fig. 7. Tetrahedral set representing the radiance field in Tetra-NeRF's tessellation
process.
Fig. 8. Color Lego image result of Tetra-NeRF in Tetra-NeRF's tessellation process.
3.3 3D Geometry or 3D Reconstruction
3.3.1 Surface-Based Reconstruction and Rendering
Traditional NeRF methods can often result in volume rendering biases, leading to less
accurate surface definitions and difficulties in achieving water-tight models. The
advancement in surface-based reconstruction and rendering, as seen in NeuS [27] and NeRS [28], was designed to address NeRF's shortcomings in accurately capturing complex surface
geometries and intricate details, such as thin structures and self-occlusion. The
specific methods employed by NeuS and NeRS were necessary to improve the fidelity
of the reconstruction and to ensure that the models can capture both the geometry
and reflectance properties of surfaces, especially in uncontrolled real-world settings.
These methods enable more accurate reconstructions of objects and scenes with complex
surface interactions and detailed textural properties.
NeuS (Neural Surface Reconstruction) [27] advances neural surface reconstruction by interpreting surfaces as the zero-level
set of a signed distance function, robustly representing complex objects, and improving
upon traditional volume rendering techniques that can be biased and inaccurate. NeRS
[28] reconstructs 3D scenes from sparse-view images by learning bidirectional surface
reflectance functions, capturing not just shape but also texture and illumination
properties, providing water-tight reconstructions. Fig. 9(a) shows the fundamental idea of the Signed Distance Function. The signs in the image
indicate the position of the point relative to the surface. It is positive when the
point is outside, negative when the point is inside, and zero when the point is exactly
on the surface of the object. This method enables the formation of a robust object boundary. Using an SDF in 3D geometry reconstruction has benefits such as resolving intersections between different objects, as shown in Fig. 9(b), and constructing watertight models without openings on any side, as shown in Fig. 9(c), since natural object geometries do not have incomplete, open structures [29].
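To make the sign convention in Fig. 9(a) concrete, the following minimal example evaluates the analytic SDF of a sphere; the function name and radius are illustrative, and learned SDFs such as the one in NeuS replace this closed form with a neural network.

```python
import numpy as np

def sphere_sdf(points, center=np.zeros(3), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface,
    positive outside -- the convention described for Fig. 9(a)."""
    return np.linalg.norm(points - center, axis=-1) - radius

pts = np.array([[0.0, 0.0, 0.0],   # center  -> -1.0 (inside)
                [1.0, 0.0, 0.0],   # surface ->  0.0
                [2.0, 0.0, 0.0]])  # outside -> +1.0
print(sphere_sdf(pts))             # [-1.  0.  1.]
```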
Fig. 9. Image representing signed distance function (SDF) and its benefits. (a) The
fundamental idea of signed distance function (SDF). (b) Object with intersection problem.
(c) Object with incomplete structure at one side.
3.3.2 Enhanced Volume Rendering for Geometry and Quality
The motivation behind enhanced volume rendering techniques such as VolSDF [30] and Neural RGB-D Surface Reconstruction [31] was to overcome the limitations of NeRF in representing sophisticated scene geometry
and lighting effects. Traditional NeRF struggles to disentangle shape and appearance,
which can lead to inaccuracies in scenes with detailed geometric structures. By applying
Laplace's cumulative distribution function to SDFs, as in VolSDF [30], and utilizing both RGB and depth information, as in Neural RGB-D [31], these methods provide a more nuanced understanding of volume density and surface
geometry. This leads to higher-quality reconstructions with more precise sampling
and improved appearance modeling, essential for multi-view datasets with significant
viewpoint changes.
VolSDF (volume rendering of neural implicit surfaces) [30] utilizes Laplace's cumulative distribution function applied to SDFs to model volume
density as a function of geometry. This approach disentangles shape and appearance
in volumetric rendering, leading to more accurate geometric reconstructions. Neural RGB-D surface reconstruction [31] advances 3D reconstruction by using both RGB and depth data, reconstructing surface geometry more accurately. It refines camera poses and intrinsics through optimization, which enhances the reconstruction quality.
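To make the role of the Laplace CDF concrete, the density transformation used by VolSDF [30] can be summarized as follows, where $d_\Omega(x)$ is the signed distance to the surface (negative inside the object) and $\Psi_\beta$ is the cumulative distribution function of a zero-mean Laplace distribution with scale $\beta > 0$:

$$\sigma(x) = \alpha\,\Psi_\beta\big(-d_\Omega(x)\big), \qquad \Psi_\beta(s) = \begin{cases} \dfrac{1}{2}\exp\left(\dfrac{s}{\beta}\right), & s \le 0,\\[4pt] 1 - \dfrac{1}{2}\exp\left(-\dfrac{s}{\beta}\right), & s > 0. \end{cases}$$

The density thus rises smoothly toward $\alpha$ inside the surface and decays outside it, with $\beta$ controlling how sharp the transition is; this is what ties the learned volume density directly to the underlying geometry.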
3.3.3 Approaches for Single-View Reconstruction
Single-view reconstruction techniques such as Zero-1-to-3 [32] and Make-it-3D [33] were developed in response to NeRF's requirement for multiple views to reconstruct
a scene. This is a significant limitation when only a single image is available or
when capturing multiple views is impractical. The use of geometric priors and diffusion
models allows for the extrapolation of 3D information from a single viewpoint, enabling
the generation of new views and detailed reconstructions from limited data. These
approaches are crucial for applications where data is scarce and for creating 3D models
from images where no prior 3D information is available.
Zero-1-to-3 [32] tackles the challenge of synthesizing new views of a 3D object from a single RGB
image, using geometric priors from large-scale diffusion models to generate images
from various perspectives and enable 3D reconstruction from minimal data. Make-it-3D
[33] aims to create high-quality 3D content from a single image; this method first optimizes
a neural radiance field and then transforms it into textured point clouds. It uses
a well-trained 2D diffusion model to estimate accurate geometry and generate plausible
textures.
3.3.4 Shadow and Light Interaction for Scene Reconstruction
ShadowNeuS's [34] incorporation of shadow ray supervision directly tackles NeRF's difficulties in capturing
the intricate interplay between light and geometry, which is essential for reconstructing
scenes with realistic lighting conditions. Traditional NeRF can struggle with accurate
shadow modeling, which is critical for understanding the spatial relationships within
a scene. By using shadow information, ShadowNeuS [34] can optimize sample locations along rays more effectively, resulting in improved
SDF representations and more complete reconstructions from single-view images under
various lighting conditions.
ShadowNeuS (Neural SDF Reconstruction by Shadow Ray Supervision) [34] integrates shadow ray information to enhance shape reconstruction tasks, optimizing
both the samples along the ray and their locations for more effective reconstruction
of neural SDF representations.
3.3.5 Speed and Detail in Multi-View Reconstruction
The development of methods such as PermutoSDF [35] is a direct response to the computationally intensive nature of NeRF, which can be
a bottleneck for real-time applications and detailed reconstructions. By leveraging
permutohedral lattices and hash-based positional encoding, PermutoSDF [35] significantly accelerates the reconstruction process while focusing on recovering
fine details. Such improvements in speed and detail are vital for practical applications
that require quick processing times and high levels of detail, such as digital content
creation for virtual reality or visual effects industries.
PermutoSDF (fast multi-view reconstruction with implicit surfaces using permutohedral
lattices) [35] improves multi-view reconstruction speed and detail recovery, combining hash-based
positional encoding with density-based methods to recover fine geometric details efficiently.
3.4 Scene Editing
3.4.1 Text and Image-Driven Manipulation
Traditional NeRF methods have limited ability to control scene attributes directly
from high-level inputs such as text or reference images. This limitation inspired
the development of methods such as CLIP-NeRF [36] and NeRF-Art [37], which integrate language and image understanding models (e.g., CLIP [38]) with NeRF. These methods allow users to manipulate NeRFs using text prompts or images,
offering an intuitive interface for editing the shape and appearance of objects within
a 3D scene. The necessity for such methods arises from the desire to bridge the gap
between human language or visual inputs and the control over digital environments,
making the editing process more accessible and creative.
CLIP-NeRF [36] integrates text and image inputs to control the shape and appearance of objects within
a NeRF framework. Fig. 10 shows a simplified version of the CLIP-NeRF process, which uses
CLIP embeddings to map language and image inputs to a latent space, allowing users
to manipulate NeRFs with text prompts or exemplar images. The architecture employs
disentangled latent codes for shape and appearance, which are matched with CLIP embeddings
for precise control. Shape conditioning is achieved by applying learned deformation
fields to positional encoding while color conditioning is deferred until the volumetric
rendering stage. To bridge this disentangled latent representation with the CLIP embedding
space, two code mappers are designed. These mappers take input from CLIP embeddings
and update latent codes accordingly for targeted editing purposes. They are trained
with a matching loss based on CLIP embeddings to ensure accurate manipulation results.
Additionally, an inverse optimization method is proposed which accurately projects
an input image onto latent codes for manipulation purposes even with real-world images
as inputs. NeRF-Art [37] offers a method for altering appearance and geometry in pre-trained NeRF models using
text descriptions. It introduces a global-local contrastive learning strategy, a directional
constraint, and a weight regularization method to control the style transformation
process and maintain consistency across views. This allows for stylization changes
based on textual prompts without the need for mesh guidance.
Fig. 10. Simplified image of the CLIP-NeRF process.
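To give a flavor of how a CLIP embedding can steer a rendered appearance toward a text prompt, the sketch below computes a cosine-similarity loss between a rendered image and a target caption using OpenAI's CLIP package. It is only a conceptual example and does not reproduce CLIP-NeRF's disentangled code mappers or its matching loss; the function name and prompt handling are our own assumptions.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()                      # keep everything in float32
for p in model.parameters():               # CLIP stays frozen; only the
    p.requires_grad_(False)                # renderer receives gradients

def clip_guidance_loss(rendered_rgb, prompt):
    """1 - cosine similarity between a rendered image and a text prompt.

    rendered_rgb: (3, H, W) tensor in [0, 1] produced by the renderer.
    CLIP's usual preprocessing normalization is omitted here for brevity.
    """
    image = F.interpolate(rendered_rgb.unsqueeze(0), size=(224, 224),
                          mode="bilinear", align_corners=False)
    img_emb = model.encode_image(image.to(device))
    txt_emb = model.encode_text(clip.tokenize([prompt]).to(device))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return 1.0 - (img_emb * txt_emb).sum()
```

Because the loss is differentiable with respect to the rendered image, its gradients can propagate back into the parameters of the underlying scene representation.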
3.4.2 2D and 3D Style Transfer Consistency
Ensuring stylistic consistency across different views in a 3D scene is a challenge
that traditional NeRF does not address. StylizedNeRF [39] and 3D Cinemagraphy [40] were introduced to maintain stylistic coherence when applying 2D artistic effects
to 3D scenes rendered by NeRF. These advancements are critical for applications in
virtual reality, film, and gaming, where artistic style needs to be consistent regardless
of viewpoint.
StylizedNeRF [39] proposes a mutual learning framework that fuses 2D image stylization networks with
NeRFs, maintaining stylistic consistency across different viewpoints of a 3D scene.
It replaces NeRF's color prediction module with a style network and introduces consistency
loss and mimic loss to ensure spatial coherence between the stylized NeRF outputs
and 2D stylization results. 3D Cinemagraphy [40] aims to animate 2D images and create videos with camera motions, combining 2D animation
and 3D photography. It addresses the inconsistencies observed when integrating 2D
animation with NeRF-generated 3D scenes, providing a more natural and immersive experience.
3.4.3 Interactive Editing and Control Frameworks
NeRF traditionally does not support interactive manipulation or editing of the rendered
scenes. NeRFshop [41], Instruct-NeRF2NeRF [42] and CoNeRF [43] provide solutions that allow for deformations, modifications, and text-based scene
editing.
NeRFshop [41] presents an interactive method for editing NeRFs, allowing users to perform deformations
and modifications with a semi-automatic cage creation and volumetric manipulation.
It introduces a volumetric membrane interpolation inspired by Poisson image editing
to reduce artifacts from object deformations. Instruct-NeRF2NeRF [42] introduces a text-based editing framework that utilizes an image-conditioned diffusion
model alongside NeRF to iteratively edit and optimize 3D scenes based on textual instructions,
making scene editing more accessible to users without specialized knowledge. CoNeRF
[43] enhances NeRF with the ability to manipulate specific scene attributes with minimal
user input. It allows users to control attributes such as facial expressions or object
movements within a scene using a small number of annotations. The development of these
frameworks is vital for making 3D scene editing user-friendly and for expanding the
practical usage of NeRF in creative industries where rapid and intuitive control is
required.
4. Conclusion
This paper has comprehensively explored the advancements in neural rendering, particularly
following the advent of Neural Radiance Fields (NeRF). We observed significant progress
across multiple domains, including image quality enhancement, rendering speed, 3D
geometry reconstruction, and neural scene editing. These advancements have notably
elevated the standards of photorealism, enabled real-time and interactive applications,
and expanded the scope of scene understanding and manipulation.
However, our analysis indicates that current research exhibits a marked proficiency
in handling synthetic data as opposed to real-world data. This disparity underscores
the need for future research to focus on enhancing the robustness of neural rendering
techniques in real-world scenarios. While synthetic datasets offer controlled environments
for developing and testing algorithms, the complexity and unpredictability of real-world
data present unique challenges that are yet to be fully addressed.
Moreover, we encountered cases in which attempts to replicate the results documented in papers yielded different outcomes, which could be attributed to variations in GPU performance, hardware specifications, or differences in parameter settings. Figs. 11-13 illustrate a common issue in this field. Fig. 11 displays a ground-truth RGB image. Fig. 12 depicts a 3D surface reconstruction achieved using Neuralangelo [44], a state-of-the-art neural rendering model, employing the original researchers' hardware
configuration. This reconstruction is characterized by its high fidelity. In contrast,
Fig. 13 presents the surface reconstruction results obtained with the same model but using
an arbitrary commercial hardware setup. Even accounting for the fact that the data itself is imperfect owing to the complexity and unpredictability of real-world capture, the scene's details have undergone noticeable degradation in this instance. These
discrepancies in parameter settings, often overlooked, play a crucial role in determining
the output of neural rendering experiments. Such variability in outcomes underscores
the need for standardization in experimental setups or the development of techniques
less sensitive to hardware and parameter variations. This approach would facilitate
more consistent and reliable replication of results across different research setups,
advancing the field's overall robustness and reproducibility.
Another area requiring attention is the scalability of neural rendering methods. Current
techniques often struggle with large-scale scenes, limiting their applicability in
expansive environments. Future research should aim to increase the efficiency and
effectiveness of neural rendering in handling such large-scale scenes, possibly through
advanced data structures or more efficient rendering algorithms.
The requirement for initial pose estimation remains a foundational necessity in most
neural rendering pipelines. This dependency poses limitations in scenarios where obtaining
accurate initial pose information is challenging. Overcoming this dependency could
significantly broaden the applicability and ease of use of neural rendering technologies.
In conclusion, while the field of neural rendering has made remarkable strides, the
effort to achieve universally robust, scalable, and realistic rendering in all types
of environments continues. Future directions might include the integration of multi-sensory
data, the development of user-friendly scene editing tools, and ensuring the ethical
application of these rapidly evolving technologies. As we progress, the field of neural
rendering is poised to not only refine its technical capabilities but also broaden
its impact across various industries, revolutionizing our interaction with digital
content.
Fig. 11. Ground-truth RGB image of meeting room dataset.
Fig. 12. 3D surface reconstruction output of the scene original researchers' setting
with meeting room dataset.
Fig. 13. 3D surface reconstruction output of the scene using arbitrary setting with
meeting room dataset.
ACKNOWLEDGMENTS
This research was supported by the National Research Foundation of Korea (NRF)
grant funded by the Korean government (MSIT) (NRF-2022R1C1C1011084).
REFERENCES
H.-Y. Shum, S. B. Kang, and S.-C. Chan, ``Survey of image-based representations and
compression techniques,'' IEEE Transactions on Circuits and Systems for Video Technology,
vol. 13, no. 11, pp. 1020-1037, Nov. 2003.

R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs, ``Large scale multi-view
stereopsis evaluation,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition,
pp. 406-413, Jun. 2014.

Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, and L. Zhou, ``Blendedmvs: A large-scale
dataset for generalized multi-view stereo networks,'' Proc. of IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1790-1799, Jun. 2020.

A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, ``Tanks and temples: Benchmarking
large-scale scene reconstruction,'' ACM Transactions on Graphics, vol. 36, no. 4,
pp. 1-13, Jul. 2017.

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng,
``NeRF: Representing scenes as neural radiance fields for view synthesis,'' Communications
of the ACM, vol. 65, no. 1, pp. 99-106, Dec. 2021.

S. Choi, Q.-Y. Zhou, S. Miller, and V. Koltun, ``A large dataset of object scans,''
arXiv preprint arXiv:1602.02481, 2016.

A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, et al., ``State of the art on neural rendering,'' Computer Graphics Forum, vol. 39, no. 2, pp. 701-727, Jul. 2020.

L. Liu, J. Gu, K. Z. Lin, T.-S. Chua, and C. Theobalt, ``Neural sparse voxel fields,''
Advances in Neural Information Processing Systems, vol. 33, pp. 15651-15663, 2020.

C. Sun, M. Sun, and H.-T. Chen, ``Direct voxel grid optimization: Super-fast convergence
for radiance fields reconstruction,'' Proc. of IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 5459-5469, Jun. 2022.

S. Fridovich-Keil, A. Yu, M. Tancik, W. Chen, B. Recht, and A. Kanazawa, ``Plenoxels:
Radiance fields without neural networks,'' Proc. of IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 5501-5510, Jun. 2022.

S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, and J. Valentin, ``FastNeRF: High-fidelity
neural rendering at 200fps,'' Proc. of IEEE/CVF International Conference on Computer
Vision (ICCV), pp. 14346-14355, Oct. 2021.

C. Reiser, S. Peng, Y. Liao, and A. Geiger, ``KiloNeRF: Speeding up neural radiance
fields with thousands of tiny mlps,'' Proc. of IEEE/CVF International Conference on
Computer Vision (ICCV), pp. 14335-14345, Oct. 2021.

A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, ``TensoRF: Tensorial radiance fields,''
Proc. of European Conference on Computer Vision, vol. 13692, pp. 333-350, Nov. 2022.

T. Müller, A. Evans, C. Schied, and A. Keller, ``Instant neural graphics primitives
with a multiresolution hash encoding,'' ACM Transactions on Graphics, vol. 41, no.
4, pp. 1-15, Jul. 2022.

A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, ``PlenOctrees for real-time
rendering of neural radiance fields,'' Proc. of IEEE/CVF International Conference
on Computer Vision (ICCV), pp. 5752-5761, Oct. 2021.

Z. Chen, T. Funkhouser, P. Hedman, and A. Tagliasacchi, ``MobileNeRF: Exploiting the
polygon rasterization pipeline for efficient neural field rendering on mobile architectures,''
Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 16569-16578,
Jun. 2023.

A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, and J. Yu, ``MVSNeRF: Fast generalizable
radiance field reconstruction from multi-view stereo,'' Proc. of IEEE/CVF International
Conference on Computer Vision (ICCV), pp. 14124-14133, Oct. 2021.

P. Wang, Y. Liu, Z. Chen, L. Liu, Z. Liu, and T. Komura, ``F$^2$-NeRF: Fast neural
radiance field training with free camera trajectories,'' Proc. of IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4150-4159, Jun 2023.

K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, ``Depth-supervised NeRF: Fewer views and
faster training for free,'' Proc. of IEEE Conference on Computer Vision and Pattern
Recognition, pp. 12882-12891, Jun. 2022.

R. Clark, ``Volumetric bundle adjustment for online photorealistic scene capture,''
Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 6124-6132,
Jun. 2022.

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, ``3D Gaussian splatting for
real-time radiance field rendering,'' ACM Transactions on Graphics, vol. 42, no. 4,
pp. 1-14, Mar. 2023.

B. Roessle, N. Müller, L. Porzi, S. R. Bulò, P. Kontschieder, and M. Nießner, ``GANeRF:
Leveraging Discriminators to Optimize Neural Radiance Fields,'' arXiv preprint arXiv:2306.06044,
2023.

J. Kulhanek and T. Sattler, ``Tetra-NeRF: Representing neural radiance fields using
tetrahedra,'' arXiv preprint arXiv:2304.09987, 2023.

D. Lee, M. Lee, C. Shin, and S. Lee, ``Deblurred neural radiance field with physical
scene priors,'' arXiv preprint arXiv:2211.12046, 2022.

F. Warburg, E. Weber, M. Tancik, A. Holynski, and A. Kanazawa, ``Nerfbusters: Removing
ghostly artifacts from casually captured NeRFs,'' arXiv preprint arXiv:2304.10532,
2023.

K. Zhou, W. Li, Y. Wang, T. Hu, N. Jiang, and X. Han, ``NeRFLiX: High-quality neural
view synthesis by learning a degradation-driven inter-viewpoint MiXer,'' Proc. of
IEEE Conference on Computer Vision and Pattern Recognition, pp. 12363-12374, Jun.
2023.

P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, ``Neus: Learning neural
implicit surfaces by volume rendering for multi-view reconstruction,'' arXiv preprint
arXiv:2106.10689, 2021.

J. Y. Zhang, G. Yang, S. Tulsiani, and D. Ramanan, ``NeRS: Neural reflectance surfaces
for sparse-view 3d reconstruction in the wild,'' Advances in Neural Information Processing
Systems, vol. 34, pp. 29835-29847, 2021.

T. Takikawa, A. Glassner, and M. McGuire, ``A dataset and explorer for 3D signed distance
functions,'' Journal of Computer Graphics Techniques, vol. 11, no. 2, Apr. 2022.

L. Yariv, J. Gu, Y. Kasten, and Y. Lipman, ``Volume rendering of neural implicit surfaces,'' Advances in Neural Information Processing Systems, vol. 34, pp. 4805-4815, 2021.

D. Azinovic, R. Martin-Brualla, D. B. Goldman, M. Nießner, and J. Thies, ``Neural
RGB-D surface reconstruction,'' Proc. of IEEE Conference on Computer Vision and Pattern
Recognition, pp. 6290-6301, Jun. 2022.

R. Liu, R. Wu, B. V. Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick, ``Zero-1-to-3:
Zero-shot one image to 3d object,'' Proc. of IEEE/CVF International Conference on
Computer Vision (ICCV), pp. 9298-9309, Oct. 2023.

J. Tang, T. Wang, B. Zhang, T. Zhang, R. Yi, L. Ma, and D. Chen, ``Make-it-3d: High-fidelity
3d creation from a single image with diffusion prior,'' arXiv preprint arXiv:2303.14184,
2023.

J. Ling, Z. Wang, and F. Xu, ``ShadowNeuS: Neural sdf reconstruction by shadow ray
supervision,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition,
pp. 175-185, 2023.

R. A. Rosu and S. Behnke, ``PermutoSDF: Fast multi-view reconstruction with implicit
surfaces using permutohedral lattices,'' Proc. of IEEE Conference on Computer Vision
and Pattern Recognition, pp. 8466-8475, 2023.

C. Wang, M. Chai, M. He, D. Chen, and J. Liao, ``CLIP-NeRF: Text-and-image driven manipulation
of neural radiance fields,'' Proc. of IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3835-3844, 2022.

C. Wang, R. Jiang, M. Chai, M. He, D. Chen, and J. Liao, ``NeRF-Art: Text-driven neural
radiance fields stylization,'' IEEE Transactions on Visualization and Computer Graphics,
vol. 30, no. 8, pp. 4983-4996, 2024.

A. Radford et al., ``Learning transferable visual models from natural language supervision,''
Proc. of the 38th International Conference on Machine Learning, vol. 139, pp. 8748-8763,
Jul. 2021.

Y.-H. Huang, Y. He, Y.-J. Yuan, Y.-K. Lai, and L. Gao, ``StylizedNeRF: consistent
3D scene stylization as stylized nerf via 2D-3D mutual learning,'' Proc. of IEEE Conference
on Computer Vision and Pattern Recognition, pp. 18342-18352, Jun. 2022.

X. Li, Z. Cao, H. Sun, J. Zhang, K. Xian, and G. Lin, ``3D cinemagraphy from a single
image,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp.
4595-4605, Jun. 2023.

C. Jambon, B. Kerbl, G. Kopanas, S. Diolatzis, T. Leimkühler, and G. Drettakis, ``NeRFshop:
Interactive editing of neural radiance fields,'' Proc. of the ACM on Computer Graphics
and Interactive Techniques, vol. 6, no. 1, pp. 1-21, Mar. 2023.

A. Haque, M. Tancik, A. A. Efros, A. Holynski, and A. Kanazawa, ``Instruct-NeRF2NeRF:
Editing 3D scenes with instructions,'' arXiv preprint arXiv:2303.12789, 2023.

K. Kania, K. M. Yi, M. Kowalski, T. Trzciński, and A. Tagliasacchi, ``CoNeRF: Controllable
neural radiance fields,'' Proc. of IEEE Conference on Computer Vision and Pattern
Recognition, pp. 18623-18632, Jun. 2022.

Z. Li, T. Müller, A. Evans, R. H. Taylor, M. Unberath, and M.-Y. Liu, ``Neuralangelo:
High-fidelity neural surface reconstruction,'' Proc. of IEEE Conference on Computer
Vision and Pattern Recognition, pp. 8456-8465, Jun. 2023.

Author
Cheolsu Kwag received his B.S. degree in artificial intelligence, computer science,
and engineering from Handong Global University, Pohang, South Korea, in 2023. He is
currently pursuing his M.S. degree in the Computer Graphics and Vision Lab at
Handong Global University. His research interests include neural rendering, 3D reconstruction,
and digital twins.
Sung Soo Hwang received his B.S. degree in computer science and electrical engineering
from Handong Global University, Pohang, South Korea, in 2008 and his M.S. and Ph.D.
degrees in electrical engineering from Korea Advanced Institute of Science and Technology
(KAIST), Daejeon, South Korea, in 2010 and 2015, respectively. He is currently working
as an associate professor with the School of Computer Science and Electrical Engineering,
at Handong Global University, South Korea. His current research interests include
visual SLAM and neural rendering-based 3D reconstruction.