
  1. (Department of Computer Science and Electrical Engineering, Handong Global University, Pohang, South Korea; charse65@handong.ac.kr)
  2. (School of Computer Science and Electrical Engineering, Handong Global University, Pohang, South Korea; sshwang@handong.edu)



Neural rendering, Quality advancement, Training/rendering speed advancement, 3D geometry reconstruction, Neural scene editing

1. Introduction

The pursuit of photorealistic rendering and novel view synthesis has been a fundamental challenge in the domains of computer graphics and computer vision. This endeavor, centered on the generation of high-quality images from 3D scenes, has undergone significant evolutionary progress. Within this context, Image-Based Rendering (IBR) [1] emerged as a predominant technique, integrating geometric and photographic data to render new perspectives. This method has been instrumental in synthesizing images that closely resemble those found in real-world scenarios.

IBR has been a critical component in the field of view synthesis. Utilizing an array of scene images obtained from various angles and distances, IBR facilitated the generation of scenes from unexplored viewpoints. This methodology proved particularly efficacious in applications demanding high photorealism, such as video games and virtual reality environments. Through the strategic reuse and manipulation of existing images, IBR could achieve a significant degree of photorealism, enhancing its applicability across various sectors.

Despite its usefulness, IBR had some inherent constraints. It struggled to capture complex lighting conditions, transparency, and intricate details in 3D scenes. The quality of synthesized views often suffered, leading to perceptible artifacts and a lack of realism. Moreover, the computational demands of IBR could be quite high, especially when it came to real-time or interactive rendering, limiting its scalability and accessibility.

In this context, the advent of neural rendering, particularly marked by the introduction of neural radiance fields (NeRF) [5] in 2020, represents a paradigm shift. Leveraging the capabilities of deep learning models and neural networks, neural rendering has significantly advanced the process of view synthesis. NeRF, in particular, transcends the traditional constraints of IBR. Its novel approach to data-driven scene representation, predicated on 3D coordinates and viewing directions, sets new standards for photorealism and quality in neural rendering.

While NeRF significantly enhances the quality of synthesized views, it is not immune to certain limitations:

1) NeRF's computational demands can be substantial, especially in scenarios requiring real-time or interactive rendering. Despite advances, achieving instantaneous rendering remains a challenge for this method.

2) Achieving consistently high-quality results can be challenging in complex scenes with intricate lighting conditions or transparent materials, where NeRF may still exhibit artifacts.

3) While NeRF excels in view synthesis and scene representation, it may not provide the level of accuracy required for detailed 3D geometry reconstruction. It is primarily designed to capture scene appearance and rendering.

4) NeRF's core design focuses on representation and rendering, making it less amenable to intuitive neural scene editing, which often necessitates specific tools and techniques designed for this purpose.

These limitations highlight the ongoing challenges in the field of neural rendering. Subsequent research efforts have been aimed at addressing these issues, focusing on:

• Rendering speed improvement: With the advent of neural networks specialized for real-time or interactive applications, rendering speed has seen dramatic improvements, allowing for seamless integration into various real-world scenarios and applications.

• Quality improvement: Neural rendering techniques have shown an unparalleled ability to generate photorealistic images, addressing issues like complex lighting, intricate materials, and object interactions. They produce scenes that are virtually indistinguishable from real-world photographs.

• 3D Geometry reconstruction: Neural rendering goes beyond image-based methods by enabling the reconstruction of accurate 3D geometry from 2D images, thereby enriching the understanding of a scene's structure and spatial relationships.

• Neural scene editing: These methods empower users to manipulate and edit scenes with unparalleled flexibility, allowing for the creation of entirely new and customized visual content.

The integration of neural rendering [3] techniques marks a significant milestone in the evolution of computer graphics and computer vision. This paradigm shift not only enhances the quality of synthesized views but also catalyzes the development of novel applications previously beyond the scope of traditional methods. The objective of this survey paper is to conduct a thorough examination and analysis of the advancements in neural rendering, specifically focusing on areas such as accelerated rendering speeds, quality enhancement, 3D geometry reconstruction, and neural scene editing. In this context, the paper will explore the progressive techniques, achievements, and challenges inherent within these domains. It aims to elucidate the considerable strides neural rendering has made, underscoring its impact and significance in the realm of computer graphics and vision research. Through an in-depth investigation of these diverse facets of neural rendering, the survey intends to provide a comprehensive overview and a nuanced understanding of its current state of the art. Additionally, it seeks to offer valuable insights into the emerging trends and potential challenges that shape the future trajectory of this transformative technology.

The remainder of this paper is organized as follows. Earlier studies on neural rendering and the specifics of neural radiance fields (NeRF) are discussed in Section 2. Novel contributions in four areas, namely rendering and training efficiency, image quality, 3D geometry reconstruction, and scene editing capabilities, are presented in Section 3. A summary of our findings and thoughts on future research directions is given in Section 4.

2. Related Work

2.1 Neural Rendering Before NeRF

Neural rendering [7] represents a groundbreaking shift in the fields of computer graphics and computer vision, leveraging the capabilities of deep learning to create high-quality, photorealistic images and novel views of 3D scenes. This approach significantly deviates from traditional rendering techniques, such as image-based rendering (IBR) [1], offering a more data-driven and versatile method of scene synthesis. The fundamental idea behind neural rendering is to learn a mapping function from a set of inputs, typically comprising a 3D scene representation and a desired viewpoint, to generate an image. Neural networks drive this process and are trained on extensive datasets of images and associated scene information. These networks can take on various architectures, from convolutional neural networks (CNNs) to more specialized designs such as NeRF and its derivatives.

The development of neural rendering is marked by a historical progression of techniques and innovations:

Early deep learning in computer graphics: The initial foray of deep learning into computer graphics primarily focused on enhancing image quality. Convolutional Neural Networks (CNNs), a class of deep neural networks, were pivotal in this phase. They were applied to tasks such as image denoising and super-resolution, where the networks learned to identify and correct errors in images, or to upscale and improve the resolution of lower-quality images. This period set the groundwork for integrating deep learning into more complex graphic applications.

Deep learning for 3D reconstruction: The intersection of deep learning and 3D reconstruction marked a transformative phase. Techniques such as Multi-View Stereo (MVS) and Structure-from-Motion (SfM) were adapted to incorporate neural networks, greatly enhancing their capabilities. In the context of MVS, neural networks were used to improve the matching of corresponding points across different views, significantly enhancing the depth estimation and reconstruction accuracy. For SfM, deep learning algorithms were integrated to better interpret sequential image data, facilitating more accurate extraction of 3D structure from motion patterns. These adaptations allowed for more precise and detailed reconstructions of 3D scenes from 2D images, overcoming some of the limitations inherent in purely algorithmic approaches.

GQN and 3D generative models: The introduction of the Generative Query Network (GQN) represented a significant leap in neural scene representation. GQN is a neural network architecture that learns to represent scenes implicitly from 2D observations. It functions by encoding scenes into a latent representation, which can then be queried to generate new views of the scene, essentially synthesizing novel perspectives. However, while GQN was a breakthrough in terms of learning scene dynamics and structure, it had its limitations. It struggled with rendering complex scenes with high levels of detail and achieving photorealistic output. The GQN's approach to scene understanding and synthesis, despite its limitations, laid the groundwork for more advanced generative models and neural rendering techniques, pushing the boundaries of what could be achieved in synthesizing 3D environments from 2D data. The left side of Fig. 1 shows research directions before the introduction of NeRF; the middle presents an overview of the NeRF network process, where $x$ denotes a point on the ray, $d$ the viewing direction, $c$ the predicted color, and $\sigma$ the volume density; the right side presents various post-NeRF research directions.

Fig. 1. Research directions before and after the introduction of NeRF.

../../Resources/ieie/IEIESPC.2025.14.2.191/image1.png

2.2 Neural Radiance Field (NeRF)

Neural Radiance Fields (NeRF) is a groundbreaking technique in neural rendering. This method represents a substantial shift in approach by using a fully connected neural network to model a continuous volumetric scene function. Unlike traditional methods, NeRF does not directly map 2D pixel coordinates to 3D voxel coordinates. Instead, it operates on a set of 5D coordinates, encompassing both spatial location $(x, y, z)$ and viewing direction $(\theta, \varphi)$.

The essence of NeRF lies in its ability to synthesize novel views of a scene. It does this by querying the 5D coordinates corresponding to specific points in space, considering the viewing direction. For each of these points, the network predicts two crucial pieces of information: the color (RGB) and the volume density ($\sigma$). These predicted values are integral to the process of volume rendering, which combines the color and density along the path of a camera ray to construct the final 2D image.

During its training phase, NeRF is optimized against a set of training images with known camera poses. The neural network learns to approximate a function that takes as input the 5D coordinates and outputs the RGB color and volume density at each point. This sophisticated modeling allows NeRF to render highly detailed, photorealistic views of complex scenes. It significantly advances the field by surpassing previous methods in neural rendering and view synthesis in terms of realism and detail in the generated images.
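For concreteness, the following is a minimal PyTorch-style sketch of this learned mapping from a 5D input to color and density. It is a simplified stand-in, with fewer layers and no skip connection or hierarchical sampling, rather than the exact architecture of [5]; names such as TinyNeRF and positional_encoding are illustrative.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(p, num_freqs):
    """Map each coordinate to [p, sin(2^k * pi * p), cos(2^k * pi * p)] features."""
    feats = [p]
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * math.pi * p),
                  torch.cos((2.0 ** k) * math.pi * p)]
    return torch.cat(feats, dim=-1)

class TinyNeRF(nn.Module):
    """Simplified F_Theta: (x, d) -> (c, sigma)."""
    def __init__(self, x_freqs=10, d_freqs=4, width=256):
        super().__init__()
        x_dim = 3 * (1 + 2 * x_freqs)
        d_dim = 3 * (1 + 2 * d_freqs)
        self.x_freqs, self.d_freqs = x_freqs, d_freqs
        self.trunk = nn.Sequential(
            nn.Linear(x_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)      # density depends on position only
        self.feature = nn.Linear(width, width)
        self.color_head = nn.Sequential(           # color also depends on view direction
            nn.Linear(width + d_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid())

    def forward(self, x, d):
        h = self.trunk(positional_encoding(x, self.x_freqs))
        sigma = torch.relu(self.sigma_head(h))     # non-negative volume density
        h = torch.cat([self.feature(h), positional_encoding(d, self.d_freqs)], dim=-1)
        color = self.color_head(h)                 # RGB in [0, 1]
        return color, sigma
```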

The process of view synthesis in NeRF involves several steps:

1) 5D coordinate input: NeRF takes a continuous 5D coordinate as input, comprising a 3D location $(x, y, z)$ and a 2D viewing direction $(\theta, \varphi)$. This input is used to predict the emitted color and volume density at that specific point in space.

2) Ray marching: For each pixel in the desired image, NeRF casts a camera ray into the scene. The path of each ray is determined by the camera's position and orientation. The method then samples a set of points along each ray. These points represent potential locations where light might interact with the scene, contributing to the final pixel color.

3) Neural network prediction: Each sampled point along the ray, along with its corresponding viewing direction, is fed into a fully connected neural network. The network functions as a mapping function $F_\Theta: (x, d) \to (c, \sigma)$, where $x$ is the 3D coordinate, $d$ is the direction, $c$ is the color, and $\sigma$ is the volume density. The network outputs the predicted color and volume density for each point. Fig. 1 (middle) illustrates this step: sampled 3D points along the ray and their viewing directions are fed into the MLP, which outputs the predicted color $c$ and volume density $\sigma$.

4) Volume density and color rendering: The volume density $\sigma$ at a point can be interpreted as the probability of a ray terminating at that point. The emitted color at each point is dependent on both the viewing direction and the scene's content at that point.

5) Accumulated transmittance: A transmittance function $T(t)$ is computed along the ray, representing the probability that light travels from the start of the ray to each point without being absorbed. This function is crucial for understanding how light interacts with the scene as it travels along the ray. Fig. 2 illustrates this: the predicted colors and densities are accumulated along the ray to form the final pixel color on the image plane, and volume rendering integrates these values to simulate light transport through the medium.

6) Volume rendering integral: The final color $C(r)$ of each camera ray is computed using a volume rendering integral (written out after this list). This integral accounts for the color and density of each sampled point along the ray, weighted by the accumulated transmittance. NeRF numerically estimates this continuous integral using a quadrature rule based on stratified sampling. This method allows for a better approximation of the continuous scene representation compared to deterministic quadrature.

7) Differentiable rendering: The entire rendering process is differentiable, allowing the use of gradient descent to optimize the neural network. By minimizing the difference between the rendered images and the ground truth images, the network learns to predict color and density values that produce photorealistic renderings of the scene from novel viewpoints. Fig. 3 shows how this optimization proceeds: the loss is computed from the difference between rendered and ground-truth pixels, and the process is repeated until the loss approaches zero.
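Writing out the quantities referenced in steps 5) and 6), in the notation of [5]: the transmittance is $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\, ds\right)$, and the expected color of a ray $r(t) = o + td$ between near and far bounds $t_n$ and $t_f$ is $C(r) = \int_{t_n}^{t_f} T(t)\, \sigma(r(t))\, c(r(t), d)\, dt$. The stratified-sampling quadrature estimates this as $\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i$, with $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$ and $\delta_i = t_{i+1} - t_i$. A minimal sketch of this quadrature, assuming one ray's densities, colors, and sample depths are already available as tensors, is given below (the function name composite_along_ray is illustrative):

```python
import torch

def composite_along_ray(sigma, rgb, t_vals):
    """Quadrature estimate of the volume rendering integral for one ray.
    sigma: (N,) densities, rgb: (N, 3) colors, t_vals: (N,) increasing sample depths."""
    delta = t_vals[1:] - t_vals[:-1]                         # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, torch.tensor([1e10])])         # pad the final interval
    alpha = 1.0 - torch.exp(-sigma * delta)                  # per-sample opacity
    trans = torch.cumprod(                                   # T_i = prod_{j<i} (1 - alpha_j)
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha                                  # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)               # estimated pixel color C(r)
```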

Fig. 2. Accumulation of predicted colors and densities in the NeRF process.

../../Resources/ieie/IEIESPC.2025.14.2.191/image2.png

Fig. 3. Loss computation from the difference between the rendered and ground truth images in the NeRF process.

../../Resources/ieie/IEIESPC.2025.14.2.191/image3.png

2.3 Benchmark Dataset

In evaluating the effectiveness of these techniques, researchers frequently utilize a variety of benchmarks. Popular datasets such as the DTU Dataset [2], BlendedMVS Dataset [3], Tanks and Temples [4], NeRF Synthetic and Real-world Datasets [5], and the Redwood-3dscan Dataset [6] play a pivotal role. These datasets, encompassing scenarios from controlled laboratory settings to complex real-world environments, offer comprehensive platforms for testing and refining neural rendering algorithms.

3. Recent Advancements in Neural Rendering

3.1 Rendering or Training Speed

In the landscape of neural rendering, the need for swift rendering and training speed is paramount, particularly in the context of real-time and interactive applications. Although NeRF has impressive capabilities, it has several limitations, particularly regarding rendering and training speed:

1) Ill-posed geometry estimation: NeRF struggles with estimating accurate scene geometries, especially with a limited number of input views. While it can learn to render training views accurately, it often generalizes poorly to novel views, a phenomenon known as overfitting. This is because the traditional volumetric rendering does not enforce constraints on the geometry, leading to wildly inaccurate reconstructions that only look correct from the training viewpoints.

2) Time-consuming training: NeRF requires a lengthy optimization process to fit the volumetric representation, which is a significant drawback. Training a single scene can take anywhere from ten hours to several days on a single GPU. This slow training is attributed to expensive ray-casting operations and the optimization process required to learn the radiance and density functions within the volume.

The advancements post-NeRF have specifically targeted these limitations, seeking to enhance both the efficiency and practicality of neural rendering and make it accessible in various real-world scenarios. This section surveys the latest pioneering research in rendering and training speed advancements, where speed is the key objective.

3.1.1 Voxel-based Method for Rendering Efficiency

Voxel-based approaches represent a leap in efficiency by refining geometry estimation and reducing the computational waste of traditional volumetric rendering. These methods strategically focus on non-empty space, thereby addressing the issue of ill-posed geometry estimation by concentrating on relevant scene areas. NSVF [8]'s sparse voxel octrees target only non-empty space, cutting down unnecessary calculations. DVGO [9] optimizes voxel grids in a two-tiered process, capturing the broader scene structure before zooming in on detail, avoiding the gradual and exhaustive refinement process seen in previous methods. Plenoxels [10] forgo complex neural networks, instead using voxel grids and spherical harmonics to directly optimize scenes, which translates to rapid convergence and the ability to render in real time. These voxel-based methods signify a pivotal change from exhaustive volume sampling to strategic, content-focused rendering, streamlining the process significantly.
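As a rough illustration of this empty-space skipping idea (not the actual NSVF, DVGO, or Plenoxels implementations, which rely on sparse octrees or optimized grid traversal), one can discard ray samples that fall into empty voxels before querying the scene representation. The snippet below assumes points are already expressed in the grid's coordinate frame:

```python
import torch

def sample_nonempty(ray_o, ray_d, occupancy, t_vals, voxel_size=1.0):
    """Keep only the ray samples whose voxels are marked as non-empty.
    occupancy: (G, G, G) boolean grid; ray_o, ray_d: (3,); t_vals: (N,)."""
    pts = ray_o + t_vals[:, None] * ray_d               # (N, 3) sample points along the ray
    idx = torch.floor(pts / voxel_size).long()          # voxel index of each sample
    idx = idx.clamp(0, occupancy.shape[0] - 1)          # stay inside the grid bounds
    keep = occupancy[idx[:, 0], idx[:, 1], idx[:, 2]]   # True where the voxel is occupied
    return pts[keep], t_vals[keep]                      # only these are fed to the model
```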

3.1.2 Factorization Techniques for Real-time Rendering

Factorization techniques break new ground in rendering speed by deconstructing the scene into manageable elements, reducing training and rendering time. FastNeRF [11]'s separation of static geometry from dynamic effects allows for precomputation and caching, minimizing the on-the-fly calculations required during rendering. KiloNeRF [12]'s division of scenes into smaller segments processed by individual MLPs introduces a parallel computation model, drastically speeding up rendering times and enabling interactive applications. TensoRF [13] utilizes tensor decomposition to reduce computational demands while preserving scene detail, offering a solution that is efficient and maintains high fidelity. These methods excel by dissecting complex scenes into simpler, more computationally accessible parts, facilitating a quicker rendering pipeline. Importantly, TensoRF's approach to handling complex lighting and geometric details enhances rendering quality, capturing intricate scene elements with high fidelity. This aspect underscores its significance in the domain of quality improvement in neural rendering, illustrating its capacity to simultaneously expedite rendering and retain high-quality scene representations.
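The factorization idea can be sketched as follows: instead of storing a dense 3D feature grid, low-rank factors are stored, and their outer products reconstruct any voxel value on demand. The CP-style example below is only illustrative; TensoRF's published method uses a vector-matrix decomposition with trilinear interpolation rather than this simplified rank-R lookup.

```python
import torch

def cp_density(idx_xyz, U, V, W):
    """Evaluate a rank-R CP-factorized density volume at integer voxel indices.
    idx_xyz: (M, 3) long tensor; U, V, W: (R, Nx), (R, Ny), (R, Nz) factor matrices."""
    ix, iy, iz = idx_xyz.unbind(-1)
    # Sum the rank components of the outer-product reconstruction at each voxel.
    return (U[:, ix] * V[:, iy] * W[:, iz]).sum(dim=0)   # (M,) densities
```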

3.1.3 Data Structure Innovations for Fast Training and Rendering

Advances in data structures, such as Instant-NGP [14]'s multiresolution hash encoding and PlenOctrees' [15] use of octrees, harness GPU capabilities more effectively. Instant-NGP accelerates feature access and updates, which is essential for iterative training and rendering, while PlenOctrees exploits precomputed information for real-time rendering speeds. Fig. 4 represents the space subdivided into octants; each octant in the scene can be divided again into eight voxel-based octants. Fig. 5 shows a ray passing through a neural scene that is divided into octree voxel grids, where the dots on the ray represent spherical harmonics (SH). This approach allows the model to skip large voxels that represent empty space and focus on small voxels that may contain color and density. These innovations are transformative, moving away from the dense, uniform data structures that have traditionally hindered performance toward more nuanced and efficient storage models that exploit the parallel nature of modern GPUs.
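The core trick behind the hash-based encoding can be sketched as a single-level lookup: each point is snapped to a grid cell, the cell coordinate is hashed into a small table of learned feature vectors, and those features are fed to a tiny MLP. This is only an illustration of the idea; Instant-NGP's published encoding hashes the eight corners of the enclosing voxel at many resolutions and trilinearly interpolates their features.

```python
import torch

def hash_features(x, table, resolution=64):
    """Single-level spatial hash lookup for points x in [0, 1]^3.
    table: (T, F) learned feature embeddings; returns (N, F) features."""
    idx = (x.clamp(0.0, 1.0) * (resolution - 1)).long()   # integer cell coordinate
    primes = (1, 2654435761, 805459861)                   # per-axis hashing primes
    h = (idx[:, 0] * primes[0]) ^ (idx[:, 1] * primes[1]) ^ (idx[:, 2] * primes[2])
    return table[h % table.shape[0]]                      # features for a tiny MLP
```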

Fig. 4. Space subdivision into octants in PlenOctree rendering technique.

../../Resources/ieie/IEIESPC.2025.14.2.191/image4.png

Fig. 5. A ray passing through a neural scene in PlenOctree rendering technique.

../../Resources/ieie/IEIESPC.2025.14.2.191/image5.png

3.1.4 Adapting NeRF for Mobile and Low-Power Devices

Adapting NeRF for mobile devices, as MobileNeRF [16] does, brings efficient 3D rendering to less powerful hardware. By leveraging native GPU rasterization pipelines, these methods reduce the complexity of rendering tasks, enabling fluid performance in mobile VR/AR settings. This adaptation is pivotal, as it brings the sophistication of NeRF to a broader range of devices, overcoming previous limitations due to hardware constraints.

3.1.5 Enhancing Efficiency and Generalization

This subsection covers methods that address ill-posed geometry estimation caused by limited or arbitrary input views. MVSNeRF [17] introduces a method for reconstructing neural radiance fields from a small number of views, significantly improving generalizability across different scenes and reducing the per-scene optimization time. F2-NeRF (Fast-Free-NeRF) [18] focuses on novel view synthesis with arbitrary input camera trajectories. It efficiently handles large, unbounded scenes with diverse camera paths, maintaining rapid convergence and offering a general space-warping scheme applicable to arbitrary camera trajectories. Depth-supervised NeRF (DS-NeRF) [19] addresses NeRF's susceptibility to overfitting and slow training by incorporating depth supervision. This method uses sparse 3D points from structure-from-motion as additional guidance, improving NeRF's ability to render realistic images from fewer views and speeding up training by 2-3 times.
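The general idea behind depth supervision can be sketched as an extra loss term that compares NeRF's expected ray termination depth with a sparse depth estimate from structure-from-motion. The sketch below captures only this basic idea; DS-NeRF's published loss additionally models the uncertainty of the SfM depth.

```python
import torch

def depth_loss(weights, t_vals, sfm_depth):
    """Penalize disagreement with sparse SfM depth on rays through SfM keypoints.
    weights: (R, N) compositing weights, t_vals: (R, N) sample depths,
    sfm_depth: (R,) depth of the matched 3D point along each ray."""
    rendered_depth = (weights * t_vals).sum(dim=-1)    # expected termination depth per ray
    return ((rendered_depth - sfm_depth) ** 2).mean()  # simple squared-error penalty
```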

3.1.6 Scalability and Real-time Application

Real-time applications demand methods such as nVDB [20] and 3D Gaussian Splatting [21], which prioritize immediate rendering capabilities. nVDB's real-time scene optimization and 3D Gaussian Splatting's efficient rendering strategy for high-resolution outputs cater to the pressing need for scalability and instantaneity in applications such as live video and interactive simulations. These methods are critical as they align with the need for on-demand rendering, crucial in an era where delay can impede user experience and system functionality.

3.2 Quality

The advancements in this field are poised to generate scenes that are virtually indistinguishable from real-world photographs. Notably, the search for research papers on this specific topic proved challenging, as the overarching goal in the field is to enhance photorealism and visual fidelity. Thus, our focus here is on introducing papers that intensively concentrate on improving rendering quality.

GANeRF (Leveraging discriminators to optimize neural radiance fields) [22] introduces the use of adversarial training, specifically discriminators, to optimize Neural Radiance Fields. It incorporates the generative adversarial network (GAN) framework where the generator aims to produce realistic images while the discriminator evaluates them. By using a patch-based rendering constraint, GANeRF is designed to address typical imperfections and rendering artifacts that arise from traditional NeRF methods. The introduction of adversarial training allows GANeRF to push the boundaries of realism in NeRF-rendered images, especially in regions with limited coverage or complex textures where traditional NeRF might struggle. This approach highlights the potential for combining rendering priors with novel view synthesis, leading to qualitative and quantitative improvements in rendered scenes.

Tetra-NeRF (Representing neural radiance fields using tetrahedra) [23] presents a novel representation of Neural Radiance Fields by utilizing a tetrahedral tessellation of the scene. This method enhances the efficiency of NeRF by breaking down the scene into a network of interconnected tetrahedra, which allows for faster rendering and reduced computational load. The key significance of Tetra-NeRF lies in its ability to improve memory efficiency and accelerate the rendering process, which is particularly beneficial for dynamic scenes or applications requiring real-time interaction. Using a tetrahedral structure provides a more computationally tractable approach to representing complex volumes, which is a step forward in addressing the scalability challenges faced by traditional NeRF methods. Figs. 6-8 represent the overall stages of the Tetra-NeRF tetrahedral tessellation process. Fig. 6 shows an example point cloud that serves as input. Fig. 7 shows the intermediate stage in which the tetrahedral set is used to represent the radiance field; barycentric interpolation is used to interpolate features stored at the tetrahedra vertices. Fig. 8 shows the final color Lego image produced by the Tetra-NeRF model.
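The barycentric interpolation step mentioned above can be sketched as solving a small linear system for the weights of the four tetrahedron vertices and blending their stored features; this is an illustrative sketch of the interpolation idea rather than Tetra-NeRF's actual implementation.

```python
import torch

def barycentric_interpolate(p, verts, feats):
    """Interpolate per-vertex features at a point inside a tetrahedron.
    p: (3,) query point, verts: (4, 3) tetrahedron vertices, feats: (4, F)."""
    # Solve for weights w with sum_i w_i * v_i = p and sum_i w_i = 1.
    A = torch.cat([verts.T, torch.ones(1, 4)], dim=0)   # (4, 4) system matrix
    b = torch.cat([p, torch.ones(1)])                   # (4,) right-hand side
    w = torch.linalg.solve(A, b)                        # barycentric weights
    return w @ feats                                    # (F,) interpolated feature
```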

DP-NeRF (deblurred neural radiance field with physical scene priors) [24] addresses issues that frequently occur in real-world photography, such as motion blur and defocus blur, using physical-based priors to resolve NeRF's blur problems. It leverages the actual physical blurring process that occurs during image acquisition by a camera to produce renderings with more accurate 3D consistency. This is particularly important for enhancing NeRF quality even when the quality of data is compromised.

Nerfbusters [25] focuses on removing ghostly artifacts that appear in casually captured NeRFs and improving scene geometry. It introduces a local 3D geometric prior learned with a diffusion model trained on synthetic data to encourage plausible geometry during NeRF optimization, which helps clean up floaters and cloudy artifacts from NeRFs. This approach is essential for producing clearer and more coherent scenes from casually captured data.

NeRFLiX [26] is dedicated to high-quality neural view synthesis by learning a degradation-driven inter-viewpoint mixer. It simulates NeRF-style degradations to create training data, which is then used to train a restorer to enhance NeRF-rendered views. This method is crucial for fusing high-quality training images to improve the performance of NeRF models, leading to more photorealistic synthetic views. It demonstrates the necessity of handling degradation in the rendered images to achieve higher fidelity in the final output.

Fig. 6. Point cloud input of Tetra-NeRF's tetrahedral tessellation process.

../../Resources/ieie/IEIESPC.2025.14.2.191/image6.png

Fig. 7. Tetrahedral set representing the radiance field in Tetra-NeRF's tessellation process.

../../Resources/ieie/IEIESPC.2025.14.2.191/image7.png

Fig. 8. Color Lego image result of Tetra-NeRF in Tetra-NeRF's tessellation process.

../../Resources/ieie/IEIESPC.2025.14.2.191/image8.png

3.3 3D Geometry or 3D Reconstruction

3.3.1 Surface-Based Reconstruction and Rendering

Traditional NeRF methods can often result in volume rendering biases, leading to less accurate surface definitions and difficulties in achieving water-tight models. The advancement in surface-based reconstruction and rendering, as seen in NeuS [27] and NeRS [28], was designed to address NeRF's shortcomings in accurately capturing complex surface geometries and intricate details, such as thin structures and self-occlusion. The specific methods employed by NeuS and NeRS were necessary to improve the fidelity of the reconstruction and to ensure that the models can capture both the geometry and reflectance properties of surfaces, especially in uncontrolled real-world settings. These methods enable more accurate reconstructions of objects and scenes with complex surface interactions and detailed textural properties.

NeuS (Neural Surface Reconstruction) [27] advances neural surface reconstruction by interpreting surfaces as the zero-level set of a signed distance function, robustly representing complex objects and improving upon traditional volume rendering techniques that can be biased and inaccurate. NeRS [28] reconstructs 3D scenes from sparse-view images by learning bidirectional surface reflectance functions, capturing not just shape but also texture and illumination properties, and providing water-tight reconstructions. Fig. 9(a) shows the fundamental idea of the signed distance function: the sign indicates the position of a point relative to the surface, being positive when the point is outside, negative when the point is inside, and zero when the point lies exactly on the surface of the object. This formulation yields a robust object boundary. Using an SDF in 3D geometry reconstruction has benefits such as resolving intersections between different objects, as shown in Fig. 9(b), and constructing watertight models without openings on any side, as shown in Fig. 9(c), since natural object geometries do not have incomplete, open structures [29].

Fig. 9. Image representing signed distance function (SDF) and its benefits. (a) The fundamental idea of signed distance function (SDF). (b) Object with intersection problem. (c) Object with incomplete structure at one side.

../../Resources/ieie/IEIESPC.2025.14.2.191/image9.png
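As a concrete instance of the sign convention illustrated in Fig. 9(a), the signed distance to a sphere can be written as $d(p) = \lVert p - c \rVert - r$, which is negative inside, zero on the surface, and positive outside; the surface is then exactly the zero-level set that NeuS reconstructs. A minimal sketch:

```python
import torch

def sphere_sdf(p, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return torch.linalg.norm(p - center, dim=-1) - radius

# A point on the unit sphere evaluates to zero:
# sphere_sdf(torch.tensor([1.0, 0.0, 0.0]), torch.zeros(3), 1.0) -> tensor(0.)
```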

3.3.2 Enhanced Volume Rendering for Geometry and Quality

The motivation behind enhanced volume rendering techniques such as VolSDF [30] and Neural RGB-D Surface Reconstruction [31] was to overcome the limitations of NeRF in representing sophisticated scene geometry and lighting effects. Traditional NeRF struggles to disentangle shape and appearance, which can lead to inaccuracies in scenes with detailed geometric structures. By applying Laplace's cumulative distribution function to SDFs, as in VolSDF [30], and utilizing both RGB and depth information, as in Neural RGB-D [31], these methods provide a more nuanced understanding of volume density and surface geometry. This leads to higher-quality reconstructions with more precise sampling and improved appearance modeling, essential for multi-view datasets with significant viewpoint changes.

VolSDF (volume rendering of neural implicit surfaces) [30] utilizes Laplace's cumulative distribution function applied to SDFs to model volume density as a function of geometry. This approach disentangles shape and appearance in volumetric rendering, leading to more accurate geometric reconstructions. Neural RGB-D surface reconstruction [31] advances 3D reconstruction by using both RGB and depth data, reconstructing surface geometry more accurately. It refines camera poses and intrinsics through optimization, which enhances the reconstruction quality.
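Concretely, VolSDF [30] derives the density from the signed distance $d_\Omega(x)$ as $\sigma(x) = \alpha \Psi_\beta(-d_\Omega(x))$, where $\alpha, \beta > 0$ are learnable parameters and $\Psi_\beta$ is the cumulative distribution function of a zero-mean Laplace distribution with scale $\beta$, i.e. $\Psi_\beta(s) = \frac{1}{2}\exp(s/\beta)$ for $s \le 0$ and $\Psi_\beta(s) = 1 - \frac{1}{2}\exp(-s/\beta)$ for $s > 0$. The density is thus close to $\alpha$ well inside the surface, decays smoothly across the zero-level set, and approaches zero outside it, tying the volume rendering weights to the underlying geometry. A small sketch of this formula (illustrative only, not the paper's full sampling scheme):

```python
import torch

def laplace_density(sdf, alpha, beta):
    """VolSDF-style density from signed distance values (sketch of the formula above)."""
    s = -sdf                                             # flip sign: inside -> positive
    cdf = torch.where(s <= 0,
                      0.5 * torch.exp(s / beta),
                      1.0 - 0.5 * torch.exp(-s / beta))  # zero-mean Laplace CDF
    return alpha * cdf
```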

3.3.3 Approaches for Single-View Reconstruction

Single-view reconstruction techniques such as Zero-1-to-3 [32] and Make-it-3D [33] were developed in response to NeRF's requirement for multiple views to reconstruct a scene. This is a significant limitation when only a single image is available or when capturing multiple views is impractical. The use of geometric priors and diffusion models allows for the extrapolation of 3D information from a single viewpoint, enabling the generation of new views and detailed reconstructions from limited data. These approaches are crucial for applications where data is scarce and for creating 3D models from images where no prior 3D information is available.

Zero-1-to-3 [32] tackles the challenge of synthesizing new views of a 3D object from a single RGB image, using geometric priors from large-scale diffusion models to generate images from various perspectives and enable 3D reconstruction from minimal data. Make-it-3D [33] aims to create high-quality 3D content from a single image; it first optimizes a neural radiance field and then transforms it into textured point clouds, using a well-trained 2D diffusion model to estimate accurate geometry and generate plausible textures.

3.3.4 Shadow and Light Interaction for Scene Reconstruction

ShadowNeuS's [34] incorporation of shadow ray supervision directly tackles NeRF's difficulties in capturing the intricate interplay between light and geometry, which is essential for reconstructing scenes with realistic lighting conditions. Traditional NeRF can struggle with accurate shadow modeling, which is critical for understanding the spatial relationships within a scene. By using shadow information, ShadowNeuS [34] can optimize sample locations along rays more effectively, resulting in improved SDF representations and more complete reconstructions from single-view images under various lighting conditions.

ShadowNeuS (Neural SDF Reconstruction by Shadow Ray Supervision) [34] integrates shadow ray information to enhance shape reconstruction tasks, optimizing both the samples along each ray and their locations for more effective reconstruction of neural SDF representations.

3.3.5 Speed and Detail in Multi-View Reconstruction

The development of methods such as PermutoSDF [35] is a direct response to the computationally intensive nature of NeRF, which can be a bottleneck for real-time applications and detailed reconstructions. By leveraging permutohedral lattices and hash-based positional encoding, PermutoSDF [35] significantly accelerates the reconstruction process while focusing on recovering fine details. Such improvements in speed and detail are vital for practical applications that require quick processing times and high levels of detail, such as digital content creation for virtual reality or visual effects industries.

PermutoSDF (fast multi-view reconstruction with implicit surfaces using permutohedral lattices) [35] improves multi-view reconstruction speed and detail recovery, combining hash-based positional encoding with density-based methods to recover fine geometric details efficiently.

3.4 Scene Editing

3.4.1 Text and Image-Driven Manipulation

Traditional NeRF methods have limited ability to control scene attributes directly from high-level inputs such as text or reference images. This limitation inspired the development of methods such as CLIP-NeRF [36] and NeRF-Art [37], which integrate language and image understanding models (e.g., CLIP [38]) with NeRF. These methods allow users to manipulate NeRFs using text prompts or images, offering an intuitive interface for editing the shape and appearance of objects within a 3D scene. The necessity for such methods arises from the desire to bridge the gap between human language or visual inputs and the control over digital environments, making the editing process more accessible and creative.

CLIP-NeRF [36] integrates text and image inputs to control the shape and appearance of objects within a NeRF framework. Fig. 10 shows a simplified version of the CLIP-NeRF process. The figure shows that it uses CLIP embeddings to map language and image inputs to a latent space, allowing users to manipulate NeRFs with text prompts or exemplar images. The architecture employs disentangled latent codes for shape and appearance, which are matched with CLIP embeddings for precise control. Shape conditioning is achieved by applying learned deformation fields to positional encoding while color conditioning is deferred until the volumetric rendering stage. To bridge this disentangled latent representation with the CLIP embedding space, two code mappers are designed. These mappers take input from CLIP embeddings and update latent codes accordingly for targeted editing purposes. They are trained with a matching loss based on CLIP embeddings to ensure accurate manipulation results. Additionally, an inverse optimization method is proposed which accurately projects an input image onto latent codes for manipulation purposes even with real-world images as inputs. NeRF-Art [37] offers a method for altering appearance and geometry in pre-trained NeRF models using text descriptions. It introduces a global-local contrastive learning strategy, a directional constraint, and a weight regularization method to control the style transformation process and maintain consistency across views. This allows for stylization changes based on textual prompts without the need for mesh guidance.
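The matching loss described above can be sketched generically as maximizing the CLIP similarity between a rendered view and the editing prompt. The encoders below stand in for a pretrained CLIP model; this is only an illustration of the idea rather than CLIP-NeRF's full pipeline of disentangled codes and mappers.

```python
import torch
import torch.nn.functional as F

def clip_matching_loss(rendered_image, text_prompt, encode_image, encode_text):
    """1 - cosine similarity between CLIP embeddings of a rendered view and a prompt.
    encode_image / encode_text are assumed callables wrapping a pretrained CLIP model."""
    img_emb = F.normalize(encode_image(rendered_image), dim=-1)
    txt_emb = F.normalize(encode_text(text_prompt), dim=-1)
    return 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()
```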

Fig. 10. Simplified image of the CLIP-NeRF process.

../../Resources/ieie/IEIESPC.2025.14.2.191/image10.png

3.4.2 2D and 3D Style Transfer Consistency

Ensuring stylistic consistency across different views in a 3D scene is a challenge that traditional NeRF does not address. StylizedNeRF [39] and 3D Cinemagraphy [40] were introduced to maintain stylistic coherence when applying 2D artistic effects to 3D scenes rendered by NeRF. These advancements are critical for applications in virtual reality, film, and gaming, where artistic style needs to be consistent regardless of viewpoint.

StylizedNeRF [39] proposes a mutual learning framework that fuses 2D image stylization networks with NeRFs, maintaining stylistic consistency across different viewpoints of a 3D scene. It replaces NeRF's color prediction module with a style network and introduces a consistency loss and a mimic loss to ensure spatial coherence between the stylized NeRF outputs and 2D stylization results. 3D Cinemagraphy [40] aims to animate 2D images and create videos with camera motion, combining 2D animation and 3D photography. It addresses the inconsistencies observed when integrating 2D animation with NeRF-generated 3D scenes, providing a more natural and immersive experience.

3.4.3 Interactive Editing and Control Frameworks

NeRF traditionally does not support interactive manipulation or editing of the rendered scenes. NeRFshop [41], Instruct-NeRF2NeRF [42], and CoNeRF [43] provide solutions that allow for deformations, modifications, and text-based scene editing.

NeRFshop [41] presents an interactive method for editing NeRFs, allowing users to perform deformations and modifications with a semi-automatic cage creation and volumetric manipulation. It introduces a volumetric membrane interpolation inspired by Poisson image editing to reduce artifacts from object deformations. Instruct-NeRF2NeRF [42] introduces a text-based editing framework that utilizes an image-conditioned diffusion model alongside NeRF to iteratively edit and optimize 3D scenes based on textual instructions, making scene editing more accessible to users without specialized knowledge. CoNeRF [43] enhances NeRF with the ability to manipulate specific scene attributes with minimal user input. It allows users to control attributes such as facial expressions or object movements within a scene using a small number of annotations. The development of these frameworks is vital for making 3D scene editing user-friendly and for expanding the practical usage of NeRF in creative industries where rapid and intuitive control is required.

4. Conclusion

This paper has comprehensively explored the advancements in neural rendering, particularly post the advent of Neural Radiance Fields (NeRF). We observed significant progress across multiple domains, including image quality enhancement, rendering speed, 3D geometry reconstruction, and neural scene editing. These advancements have notably elevated the standards of photorealism, enabled real-time and interactive applications, and expanded the scope of scene understanding and manipulation.

However, our analysis indicates that current research exhibits a marked proficiency in handling synthetic data as opposed to real-world data. This disparity underscores the need for future research to focus on enhancing the robustness of neural rendering techniques in real-world scenarios. While synthetic datasets offer controlled environments for developing and testing algorithms, the complexity and unpredictability of real-world data present unique challenges that are yet to be fully addressed.

Moreover, we encountered cases in which attempts to replicate the results documented in papers yielded different outcomes, which could be attributed to variations in GPU performance, hardware specifications, or differences in parameter settings. Figs. 11-13 illustrate a common issue in this field. Fig. 11 displays ground-truth RGB images. Fig. 12 depicts a 3D surface reconstruction achieved using Neuralangelo [44], a state-of-the-art neural rendering model, employing the original researchers' hardware configuration; this reconstruction is characterized by its high fidelity. In contrast, Fig. 13 presents the surface reconstruction results obtained with the same model but using an arbitrary commercial hardware setup. Even accounting for imperfections in the data caused by the complexity and unpredictability of real-world capture, the scene's details have undergone noticeable degradation in this instance. These discrepancies in parameter settings, often overlooked, play a crucial role in determining the output of neural rendering experiments. Such variability in outcomes underscores the need for standardization of experimental setups or the development of techniques less sensitive to hardware and parameter variations. This would facilitate more consistent and reliable replication of results across different research setups, advancing the field's overall robustness and reproducibility.

Another area requiring attention is the scalability of neural rendering methods. Current techniques often struggle with large-scale scenes, limiting their applicability in expansive environments. Future research should aim to increase the efficiency and effectiveness of neural rendering in handling such large-scale scenes, possibly through advanced data structures or more efficient rendering algorithms.

The requirement for initial pose estimation remains a foundational necessity in most neural rendering pipelines. This dependency poses limitations in scenarios where obtaining accurate initial pose information is challenging. Overcoming this dependency could significantly broaden the applicability and ease of use of neural rendering technologies.

In conclusion, while the field of neural rendering has made remarkable strides, the effort to achieve universally robust, scalable, and realistic rendering in all types of environments continues. Future directions might include the integration of multi-sensory data, the development of user-friendly scene editing tools, and ensuring the ethical application of these rapidly evolving technologies. As we progress, the field of neural rendering is poised to not only refine its technical capabilities but also broaden its impact across various industries, revolutionizing our interaction with digital content.

Fig. 11. Ground-truth RGB image of meeting room dataset.

../../Resources/ieie/IEIESPC.2025.14.2.191/image11.png

Fig. 12. 3D surface reconstruction of the scene using the original researchers' settings with the meeting room dataset.

../../Resources/ieie/IEIESPC.2025.14.2.191/image12.png

Fig. 13. 3D surface reconstruction of the scene using an arbitrary setting with the meeting room dataset.

../../Resources/ieie/IEIESPC.2025.14.2.191/image13.png

ACKNOWLEDGMENTS

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2022R1C1C1011084).

REFERENCES

1 
H.-Y. Shum, S. B. Kang, and S.-C. Chan, ``Survey of image-based representations and compression techniques,'' IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 11, pp. 1020-1037, Nov. 2003.DOI
2 
R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs, ``Large scale multi-view stereopsis evaluation,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 406-413, Jun. 2014.DOI
3 
Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, and L. Zhou, ``Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 1790-1799, Jun. 2020.DOI
4 
A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, ``Tanks and temples: Benchmarking large-scale scene reconstruction,'' ACM Transactions on Graphics, vol. 36, no. 4, pp. 1-13, Jul. 2017.DOI
5 
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, ``NeRF: Representing scenes as neural radiance fields for view synthesis,'' Communications of the ACM, vol. 65, no. 1, pp. 99-106, Dec. 2021.DOI
6 
S. Choi, Q.-Y. Zhou, S. Miller, and V. Koltun, ``A large dataset of object scans,'' arXiv preprint arXiv:1602.02481, 2016.DOI
7 
A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, et al., ``State of the art on neural rendering,'' Computer Graphics Forum, vol. 39, no. 2, pp. 701-727, Jul. 2020.DOI
8 
L. Liu, J. Gu, K. Z. Lin, T.-S. Chua, and C. Theobalt, ``Neural sparse voxel fields,'' Advances in Neural Information Processing Systems, vol. 33, pp. 15651-15663, 2020.DOI
9 
C. Sun, M. Sun, and H.-T. Chen, ``Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction,'' Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5459-5469, Jun. 2022.DOI
10 
S. Fridovich-Keil, A. Yu, M. Tancik, W. Chen, B. Recht, and A. Kanazawa, ``Plenoxels: Radiance fields without neural networks,'' Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5501-5510, Jun. 2022.DOI
11 
S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, and J. Valentin, ``FastNeRF: High-fidelity neural rendering at 200fps,'' Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14346-14355, Oct. 2021.DOI
12 
C. Reiser, S. Peng, Y. Liao, and A. Geiger, ``KiloNeRF: Speeding up neural radiance fields with thousands of tiny mlps,'' Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14335-14345, Oct. 2021.DOI
13 
A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, ``TensoRF: Tensorial radiance fields,'' Proc. of European Conference on Computer Vision, vol. 13692, pp. 333-350, Nov. 2022.DOI
14 
T. Müller, A. Evans, C. Schied, and A. Keller, ``Instant neural graphics primitives with a multiresolution hash encoding,'' ACM Transactions on Graphics, vol. 41, no. 4, pp. 1-15, Jul. 2022.DOI
15 
A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, ``PlenOctrees for real-time rendering of neural radiance fields,'' Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5752-5761, Oct. 2021.DOI
16 
Z. Chen, T. Funkhouser, P. Hedman, and A. Tagliasacchi, ``MobileNeRF: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 16569-16578, Jun. 2023.DOI
17 
A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, and J. Yu, ``MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo,'' Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14124-14133, Oct. 2021.DOI
18 
P. Wang, Y. Liu, Z. Chen, L. Liu, Z. Liu, and T. Komura, ``F$^2$-NeRF: Fast neural radiance field training with free camera trajectories,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4150-4159, Jun. 2023.DOI
19 
K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, ``Depth-supervised NeRF: Fewer views and faster training for free,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 12882-12891, Jun. 2022.DOI
20 
R. Clark, ``Volumetric bundle adjustment for online photorealistic scene capture,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 6124-6132, Jun. 2022.DOI
21 
B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, ``3D Gaussian splatting for real-time radiance field rendering,'' ACM Transactions on Graphics, vol. 42, no. 4, pp. 1-14, Jul. 2023.DOI
22 
B. Roessle, N. Müller, L. Porzi, S. R. Bulò, P. Kontschieder, and M. Nießner, ``GANeRF: Leveraging discriminators to optimize neural radiance fields,'' arXiv preprint arXiv:2306.06044, 2023.DOI
23 
J. Kulhanek and T. Sattler, ``Tetra-NeRF: Representing neural radiance fields using tetrahedra,'' arXiv preprint arXiv:2304.09987, 2023.DOI
24 
D. Lee, M. Lee, C. Shin, and S. Lee, ``Deblurred neural radiance field with physical scene priors,'' arXiv preprint arXiv:2211.12046, 2022.DOI
25 
F. Warburg, E. Weber, M. Tancik, A. Holynski, and A. Kanazawa, ``Nerfbusters: Removing ghostly artifacts from casually captured NeRFs,'' arXiv preprint arXiv:2304.10532, 2023.DOI
26 
K. Zhou, W. Li, Y. Wang, T. Hu, N. Jiang, and X. Han, ``NeRFLiX: High-quality neural view synthesis by learning a degradation-driven inter-viewpoint MiXer,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 12363-12374, Jun. 2023.DOI
27 
P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, ``Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,'' arXiv preprint arXiv:2106.10689, 2021.DOI
28 
J. Y. Zhang, G. Yang, S. Tulsiani, and D. Ramanan, ``NeRS: Neural reflectance surfaces for sparse-view 3d reconstruction in the wild,'' Advances in Neural Information Processing Systems, vol. 34, pp. 29835-29847, 2021.DOI
29 
T. Takikawa, A. Glassner, and M. McGuire, ``A dataset and explorer for 3D signed distance functions,'' Journal of Computer Graphics Techniques, vol. 11, no. 2, Apr. 2022.URL
30 
L. Yariv, J. Gu, Y. Kasten, and Y. Lipman, ``Volume rendering of neural implicit surfaces,'' Advances in Neural Information Processing Systems, vol. 34, pp. 4805-4815, 2021.DOI
31 
D. Azinovic, R. Martin-Brualla, D. B. Goldman, M. Nießner, and J. Thies, ``Neural RGB-D surface reconstruction,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 6290-6301, Jun. 2022.DOI
32 
R. Liu, R. Wu, B. V. Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick, ``Zero-1-to-3: Zero-shot one image to 3d object,'' Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9298-9309, Oct. 2023.DOI
33 
J. Tang, T. Wang, B. Zhang, T. Zhang, R. Yi, L. Ma, and D. Chen, ``Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior,'' arXiv preprint arXiv:2303.14184, 2023.DOI
34 
J. Ling, Z. Wang, and F. Xu, ``ShadowNeuS: Neural sdf reconstruction by shadow ray supervision,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 175-185, 2023.DOI
35 
R. A. Rosu and S. Behnke, ``PermutoSDF: Fast multi-view reconstruction with implicit surfaces using permutohedral lattices,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 8466-8475, 2023.DOI
36 
C. Wang, M. Chai, M. He, D. Chen, and J. Liao, ``CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3835-3844, 2022.DOI
37 
C. Wang, R. Jiang, M. Chai, M. He, D. Chen, and J. Liao, ``NeRF-Art: Text-driven neural radiance fields stylization,'' IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 8, pp. 4983-4996, 2024.DOI
38 
A. Radford et al., ``Learning transferable visual models from natural language supervision,'' Proc. of the 38th International Conference on Machine Learning, vol. 139, pp. 8748-8763, Jul. 2021.DOI
39 
Y.-H. Huang, Y. He, Y.-J. Yuan, Y.-K. Lai, and L. Gao, ``StylizedNeRF: Consistent 3D scene stylization as stylized NeRF via 2D-3D mutual learning,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 18342-18352, Jun. 2022.DOI
40 
X. Li, Z. Cao, H. Sun, J. Zhang, K. Xian, and G. Lin, ``3D cinemagraphy from a single image,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4595-4605, Jun. 2023.DOI
41 
C. Jambon, B. Kerbl, G. Kopanas, S. Diolatzis, T. Leimkühler, and G. Drettakis, ``NeRFshop: Interactive editing of neural radiance fields,'' Proc. of the ACM on Computer Graphics and Interactive Techniques, vol. 6, no. 1, pp. 1-21, Mar. 2023.DOI
42 
A. Haque, M. Tancik, A. A. Efros, A. Holynski, and A. Kanazawa, ``Instruct-NeRF2NeRF: Editing 3D scenes with instructions,'' arXiv preprint arXiv:2303.12789, 2023.DOI
43 
K. Kania, K. M. Yi, M. Kowalski, T. Trzciński, and A. Tagliasacchi, ``CoNeRF: Controllable neural radiance fields,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 18623-18632, Jun. 2022.DOI
44 
Z. Li, T. Müller, A. Evans, R. H. Taylor, M. Unberath, and M.-Y. Liu, ``Neuralangelo: High-fidelity neural surface reconstruction,'' Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pp. 8456-8465, Jun. 2023.DOI

Author

Cheolsu Kwag
../../Resources/ieie/IEIESPC.2025.14.2.191/author1.png

Cheolsu Kwag received his B.S. degree in artificial intelligence, computer science, and engineering from Handong Global University, Pohang, South Korea, in 2023. He is currently in the course of his M.S. degree in Computer Graphics and Vision Lab at Handong Global University. His research interests include neural rendering, 3D reconstruction, and digital twins.

Sung Soo Hwang
../../Resources/ieie/IEIESPC.2025.14.2.191/author2.png

Sung Soo Hwang received his B.S. degree in computer science and electrical engineering from Handong Global University, Pohang, South Korea, in 2008 and his M.S. and Ph.D. degrees in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2010 and 2015, respectively. He is currently working as an associate professor with the School of Computer Science and Electrical Engineering at Handong Global University, South Korea. His current research interests include visual SLAM and neural rendering-based 3D reconstruction.