
Project 4

CS180/280A: Intro to Computer Vision and Computational Photography

Neural Radiance Field!

Neural Radiance Field animation

Part 0: Calibrating Your Camera and Capturing a 3D Scan

Part 0.1: Calibrating Your Camera

Calibration image samples

Four example frames used for camera calibration.

Part 0.2: Capturing a 3D Object Scan

Object scan samples

Example frames from the object capture session.

Part 0.3 + 0.4: Estimating Camera Pose and Dataset

3D Visualization: Cameras, Rays, and Samples

In the visualization below, the camera-to-world (c2w) frames make it clear where each photograph was taken. The camera centers trace an approximately hemispherical path around the object, covering a range of azimuths and elevations, which is visible in the Viser plot. This gives good angular coverage while keeping the object at a roughly constant distance.

Camera frustums and sampled rays

Sample points along rays for NeRF training. Origin visible at ArUco tag corner.

Part 1: Fit a Neural Field to a 2D Image

In this part, I train a neural field (an MLP with sinusoidal positional encoding) to map 2D pixel coordinates to RGB, fitting an image by minimizing mean squared error. The output quality is measured with Peak Signal-to-Noise Ratio (PSNR), which increases as MSE decreases:

PSNR = 10 · log10( 1 / MSE )

Model Architecture

MLP with positional encoding and a skip connection.
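The sketch below shows the kind of model this corresponds to in PyTorch: sinusoidal positional encoding of the (x, y) pixel coordinates, a stacked MLP with one skip connection, a sigmoid RGB output, and the PSNR metric from above. The layer names and the skip placement (after the fourth hidden layer here) are illustrative rather than a verbatim copy of my training code; the defaults mirror the hyperparameter table below.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_levels):
    # x: (..., D) coordinates normalized to [0, 1]; output dim = D + 2 * D * num_levels.
    feats = [x]
    for i in range(num_levels):
        feats.append(torch.sin((2.0 ** i) * torch.pi * x))
        feats.append(torch.cos((2.0 ** i) * torch.pi * x))
    return torch.cat(feats, dim=-1)

class NeuralField2D(nn.Module):
    def __init__(self, width=512, depth=8, pe_levels=12, skip_layer=4):
        super().__init__()
        self.pe_levels, self.skip_layer = pe_levels, skip_layer
        in_dim = 2 + 2 * 2 * pe_levels                 # raw (x, y) plus sin/cos features
        self.layers = nn.ModuleList(
            nn.Linear((in_dim if i == 0 else width) + (in_dim if i == skip_layer else 0), width)
            for i in range(depth)
        )
        self.out = nn.Linear(width, 3)                 # RGB head

    def forward(self, xy):
        enc = positional_encoding(xy, self.pe_levels)
        h = enc
        for i, layer in enumerate(self.layers):
            if i == self.skip_layer:
                h = torch.cat([h, enc], dim=-1)        # skip connection re-injects the encoding
            h = torch.relu(layer(h))
        return torch.sigmoid(self.out(h))              # colors in [0, 1]

def psnr_from_mse(mse):
    # PSNR = 10 * log10(1 / MSE) for pixel values in [0, 1].
    return 10.0 * torch.log10(1.0 / mse)
```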

Hyperparameters (Example + Yosemite training)

Parameter     Value
iters         2000
batch_size    20000
lr            0.001
width         512
depth         8
pe_levels     12

These hyperparameters were used for both the example image and the Yosemite image training.

Example: Training progression

Resolution: 1024 × 689 pixels.

Input image

Step 1

Step 50

Step 100

Step 200

Step 400

Step 2000

Example: PSNR and MSE curves

PSNR over iterations

MSE over iterations

Yosemite: Training progression

Resolution: 2075 × 1177 pixels.

Input image

Step 1

Step 200

Step 400

Step 600

Step 1400

Step 7000

Yosemite: PSNR and MSE curves

PSNR over iterations

MSE over iterations

Hyperparameter Comparison

  • image: example.jpg
  • sweep_widths: 32, 64
  • sweep_pe_levels: 2, 4

Top-left: w32_L2 • Top-right: w32_L4 • Bottom-left: w64_L2 • Bottom-right: w64_L4

Both the network width W and the number of positional-encoding levels L matter. When either is too small, the model lacks capacity or frequency coverage and the reconstruction comes out overly smooth and blurry. Increasing W adds parameters and raises overall image fidelity, while increasing L lets the network represent higher-frequency detail, reducing blur.

Part 2: Fit a Neural Radiance Field from Multi-view Images

Part 2.1: Create Rays from Cameras

Implemented the full camera/ray geometry pipeline and data loading used throughout NeRF:
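A minimal sketch of the pixel-to-ray conversion, assuming OpenCV-style intrinsics K and 4×4 camera-to-world matrices; conventions such as the pixel-center offset and the camera axis orientation can differ between datasets, so treat this as illustrative rather than my exact implementation.

```python
import torch

def pixel_to_ray(K, c2w, uv):
    """Convert pixel coordinates to world-space rays (illustrative sketch).

    K:   (3, 3) camera intrinsics
    c2w: (4, 4) camera-to-world matrix
    uv:  (N, 2) pixel coordinates (u = column, v = row)
    """
    N = uv.shape[0]
    # Back-project pixel centers through the inverse intrinsics at depth 1 (camera space).
    uv_h = torch.cat([uv.float() + 0.5, torch.ones(N, 1)], dim=-1)   # homogeneous pixels
    dirs_cam = (torch.linalg.inv(K) @ uv_h.T).T                       # (N, 3)
    # Rotate into world space and normalize; the ray origin is the camera center.
    R, t = c2w[:3, :3], c2w[:3, 3]
    dirs_world = (R @ dirs_cam.T).T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = t.expand(N, 3)
    return origins, dirs_world
```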

Part 2.2: Sampling

Sampled both rays and 3D points along each ray:
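A sketch of the point sampling along each ray: uniform bins between near and far, with optional per-bin jitter during training so the network sees a continuous range of depths. Function and argument names are illustrative.

```python
import torch

def sample_points_along_rays(origins, dirs, near=2.0, far=6.0, num_samples=64, perturb=True):
    """Stratified sampling of 3D points along each ray (sketch).

    origins, dirs: (N, 3) ray origins and unit directions
    Returns points (N, num_samples, 3) and depths t (N, num_samples).
    """
    N = origins.shape[0]
    t = torch.linspace(near, far, num_samples).expand(N, num_samples)
    if perturb:
        # Jitter each sample within its bin so training sees a continuous range of depths.
        t = t + torch.rand(N, num_samples) * (far - near) / num_samples
    points = origins[:, None, :] + dirs[:, None, :] * t[..., None]
    return points, t
```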

Part 2.3: Putting the Dataloading All Together

The combined pipeline produces per-batch ray origins/directions and colors, and I added a small visualization script (Viser) to render cameras, rays, and sampled 3D points for sanity checks.
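A sketch of how these pieces combine into a per-batch loader, assuming images of shape (M, H, W, 3) in [0, 1], per-image c2w matrices, and a shared intrinsics matrix K; the class and method names are illustrative, and the ray construction is just a batched form of the pixel-to-ray helper sketched above.

```python
import torch

class RaysData:
    """Per-batch ray sampler (sketch): assumes images (M, H, W, 3) in [0, 1],
    per-image c2w matrices (M, 4, 4), and a shared intrinsics matrix K."""
    def __init__(self, images, c2ws, K):
        self.images, self.c2ws, self.K = images, c2ws, K
        self.M, self.H, self.W = images.shape[:3]

    def sample_batch(self, batch_size):
        # Pick random (image, pixel) pairs across the whole training set.
        img_idx = torch.randint(0, self.M, (batch_size,))
        v = torch.randint(0, self.H, (batch_size,))
        u = torch.randint(0, self.W, (batch_size,))
        colors = self.images[img_idx, v, u]                               # (B, 3) ground truth
        # Batched pixel-to-ray conversion (pixel centers, inverse intrinsics, c2w rotation).
        uv_h = torch.stack([u.float() + 0.5, v.float() + 0.5, torch.ones(batch_size)], dim=-1)
        dirs_cam = (torch.linalg.inv(self.K) @ uv_h.T).T                  # (B, 3) at depth 1
        c2w = self.c2ws[img_idx]                                          # (B, 4, 4)
        dirs = torch.einsum('bij,bj->bi', c2w[:, :3, :3], dirs_cam)
        dirs = dirs / dirs.norm(dim=-1, keepdim=True)
        origins = c2w[:, :3, 3]
        return origins, dirs, colors
```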

Lego dataset: sampled camera rays

Lego dataset: points sampled along rays

Part 2.4: Neural Radiance Field

Implemented a NeRF-style MLP conditioned on positional encodings of 3D points and view directions:

NeRF Model Architecture

Coarse-to-fine MLP with positional encodings, density head (ReLU) and color head (Sigmoid) conditioned on view direction.

For Part 2 I kept the architecture essentially the same as the reference NeRF: a stacked MLP with a skip connection, separate density and color heads, and positional encodings for both 3D coordinates and view directions. My changes focused on hyperparameters only (e.g., network width, coordinate positional-encoding levels L, and view-direction encoding levels Ldir).
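A sketch of that architecture in PyTorch, reusing the positional_encoding helper from the Part 1 sketch; the layer names and head sizes are illustrative, and the defaults match the Lego hyperparameters below.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """NeRF-style MLP (sketch): PE of 3D points feeds a trunk with one skip connection;
    density comes off the trunk through a ReLU, and color comes from a small head
    conditioned on the encoded view direction through a Sigmoid."""
    def __init__(self, width=256, depth=8, skip_layer=4, posenc_L=10, direnc_L=4):
        super().__init__()
        self.posenc_L, self.direnc_L, self.skip_layer = posenc_L, direnc_L, skip_layer
        in_xyz = 3 + 3 * 2 * posenc_L
        in_dir = 3 + 3 * 2 * direnc_L
        self.trunk = nn.ModuleList(
            nn.Linear((in_xyz if i == 0 else width) + (in_xyz if i == skip_layer else 0), width)
            for i in range(depth)
        )
        self.sigma_head = nn.Linear(width, 1)
        self.feature = nn.Linear(width, width)
        self.color_head = nn.Sequential(
            nn.Linear(width + in_dir, width // 2), nn.ReLU(), nn.Linear(width // 2, 3)
        )

    def forward(self, xyz, view_dirs):
        x_enc = positional_encoding(xyz, self.posenc_L)   # PE helper sketched in Part 1
        d_enc = positional_encoding(view_dirs, self.direnc_L)
        h = x_enc
        for i, layer in enumerate(self.trunk):
            if i == self.skip_layer:
                h = torch.cat([h, x_enc], dim=-1)
            h = torch.relu(layer(h))
        sigma = torch.relu(self.sigma_head(h))            # nonnegative volume density
        rgb = torch.sigmoid(self.color_head(torch.cat([self.feature(h), d_enc], dim=-1)))
        return rgb, sigma
```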

Part 2.5: Volume Rendering

Implemented the discrete volumetric rendering equations in PyTorch for color, depth, and opacity:
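A sketch of the renderer: per-sample opacities from the densities, transmittance via an exclusive cumulative product, and weighted sums for color, expected depth, and accumulated opacity. Variable names are illustrative.

```python
import torch

def volume_render(rgb, sigma, t_vals):
    """Discrete volume rendering (sketch): composite per-sample colors into
    per-ray color, expected depth, and accumulated opacity.

    rgb: (N, S, 3), sigma: (N, S, 1), t_vals: (N, S) sample depths per ray.
    """
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)  # (N, S)
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)                        # α_i = 1 − exp(−σ_i Δ_i)
    # Transmittance T_i = Π_{j<i} (1 − α_j), via an exclusive cumulative product.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1
    )[:, :-1]
    weights = trans * alpha                                                     # w_i = T_i · α_i
    color = (weights[..., None] * rgb).sum(dim=1)                               # (N, 3)
    depth = (weights * t_vals).sum(dim=1)                                       # expected depth
    acc = weights.sum(dim=1)                                                    # accumulated opacity
    return color, depth, acc
```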

Hyperparameters (Lego NeRF)

Parameter      Value
steps          10000
width          256
depth          8
skip_layer     4
posenc_L       10
direnc_L       4
lr             0.0005
batch_rays     4096
num_samples    64
near           2.0
far            6.0

I used the standard NeRF model size for Lego (width 256, coordinate positional-encoding L = 10) and trained for the full 10,000 steps; the remaining settings are as listed above.

Lego: Training progression

Step 1

Step 200

Step 400

Step 1400

Step 4000

Step 10000

Lego: PSNR and MSE curves

PSNR over iterations

MSE over iterations

Lego: Spherical rendering (10,000 iterations)

Rendered with provided test camera extrinsics after 10,000 iterations.

Part 2.6: Training with your own data

Training ties all components together: at each step I sample rays and 3D points, run the NeRF, volume-render, compute MSE against the ground-truth pixel colors, and take an Adam step. I log PSNR (= −10 · log10(MSE)), periodically render full images (with optional depth/opacity maps), and save checkpoints.
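A sketch of one such step, wired together from the helper sketches above (RaysData.sample_batch, sample_points_along_rays, NeRFMLP, volume_render); the names are illustrative rather than my exact code.

```python
import torch

def train_step(model, rays_data, optimizer, batch_rays=4096, num_samples=64, near=2.0, far=6.0):
    # Sample a batch of rays with ground-truth colors, then 3D points along each ray.
    origins, dirs, target = rays_data.sample_batch(batch_rays)
    points, t_vals = sample_points_along_rays(origins, dirs, near, far, num_samples)
    view_dirs = dirs[:, None, :].expand_as(points)
    # Query the NeRF, volume-render, and optimize the photometric MSE.
    rgb, sigma = model(points, view_dirs)
    color, depth, acc = volume_render(rgb, sigma, t_vals)
    loss = torch.mean((color - target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), (-10.0 * torch.log10(loss)).item()   # MSE, PSNR
```

The outer loop simply repeats this for the configured number of steps with torch.optim.Adam(model.parameters(), lr=...), logging both curves and rendering full images every few hundred steps.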

I trained a NeRF on my Part 0 dataset, rendered a circling camera GIF, and tracked loss with intermediate renders.

During training, the training MSE continued to decrease while the validation error largely stagnated. This is consistent with the calibration imperfections discussed below: validation rays are evaluated under slightly mismatched intrinsics/poses. Even though the validation curves did not keep improving, the rendered reconstructions still looked progressively better to a human observer, suggesting the model was learning a useful representation of the scene despite these imperfections.

I rescaled images to a width of 800 px and scaled the camera intrinsics accordingly. I chose a higher resolution because my calibration was not perfect: in Viser, back‑projected rays from the camera origin did not intersect exactly at the ArUco tag borders (as they would with ideal intrinsics/poses). My hypothesis was that a higher resolution would reduce pixel‑level quantization error and provide more samples per object area, helping NeRF tolerate small calibration errors. In practice, training at the higher resolution produced better results, which likely supports this rationale (with the trade‑off of increased compute).
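The intrinsics rescaling itself is just a linear scale of the focal lengths and principal point; a small sketch (NumPy, illustrative names):

```python
import numpy as np

def rescale_intrinsics(K, orig_w, orig_h, new_w=800):
    """Scale calibrated intrinsics to a resized image (sketch, illustrative names).
    Focal lengths and principal point scale linearly with the resize factor."""
    scale = new_w / orig_w
    K_scaled = K.copy()
    K_scaled[0, 0] *= scale   # fx
    K_scaled[1, 1] *= scale   # fy
    K_scaled[0, 2] *= scale   # cx
    K_scaled[1, 2] *= scale   # cy
    return K_scaled, (new_w, int(round(orig_h * scale)))
```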

Hyperparameters (Own dataset training)

Parameter      Value
steps          30000
near           0.2
far            0.6
batch_rays     8000
num_samples    64
width          512
depth          8
skip_layer     4
posenc_L       12
direnc_L       4
lr             6e-4

Because I increased the input resolution, rays sampled the scene more densely and the images contained higher spatial frequencies. To model this additional detail, I raised the positional‑encoding levels (posenc_L for coordinates and direnc_L for view directions), and increased the network width to provide more capacity. In short: more pixels → more high‑frequency content → higher PE, and denser 3D sampling → a wider MLP to faithfully represent the added detail.

Own dataset: Training progression

Step 1

Step 200

Step 400

Step 1400

Step 4000

Step 10000

Own dataset: PSNR and MSE curves

PSNR over iterations (train/val)

MSE over iterations (train/val)

Own dataset: Shoe novel views

Shoe reconstruction, 24-frame spin.

Bells & Whistles

Depth Map Rendering with Accumulated Opacity Threshold

Along each camera ray, NeRF samples points with predicted densities σ (and colors). Volumetric rendering assigns each sample an opacity α_i = 1 − exp(−σ_i · Δ_i) over interval Δ_i and a transmittance T_i = exp(−∑_{j<i} σ_j · Δ_j). The per-sample weight is w_i = T_i · α_i, and the depth map is the ray-wise expectation of distance, depth = ∑_i w_i · t_i (optionally normalized for display).

The accumulated opacity (confidence) that a ray intersects a surface is acc = ∑_i w_i, with values near 0 indicating empty space and values near 1 indicating an opaque hit. A threshold acc_thresh masks rays with insufficient confidence (acc < acc_thresh) and normalizes the depths over the remaining pixels, suppressing floaters and background leakage in low-confidence regions.

I tuned acc_thresh empirically. Lower values kept semi‑transparent floaters, while higher values started knocking out thin structures. Settling around acc_thresh ≈ 0.3 consistently produced cleaner, more stable depth maps for my scenes.
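A sketch of the masking and normalization step, with the default set to the value I settled on (function and variable names are illustrative):

```python
import torch

def threshold_depth(depth, acc, acc_thresh=0.3):
    """Mask and normalize a rendered depth map using accumulated opacity (sketch).

    depth: (H, W) expected ray depths; acc: (H, W) accumulated opacity Σ_i w_i.
    Rays with acc < acc_thresh are treated as background; the remaining depths
    are normalized to [0, 1] for display."""
    mask = acc >= acc_thresh
    out = torch.zeros_like(depth)
    if mask.any():
        valid = depth[mask]
        out[mask] = (valid - valid.min()) / (valid.max() - valid.min() + 1e-8)
    return out, mask
```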

Lego: Depth Map Comparison

Without acc_thresh (noisy, shows floaters and artifacts)

With acc_thresh = 0.3 (clean, artifacts removed)

Learnings