CS180/280A: Intro to Computer Vision and Computational Photography
I calibrated the camera intrinsics and distortion coefficients with cv2.calibrateCamera.
Four example frames used for camera calibration.
Example frames from the object capture session.
I then used cv2.solvePnP with the detected corners and the calibrated intrinsics to recover each camera's pose. I initially undistorted the images with cv2.undistort, but later skipped this step, as it yielded better results. After converting the recovered poses to camera-to-world matrices (c2w), I packaged everything into a final .npz dataset for training.
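Roughly, the pose-recovery step looks like the sketch below. This is illustrative, not my exact script: corners_2d, corners_3d, images, K, and dist are placeholders for the detected correspondences, captured frames, and calibration outputs, and the file name is made up.

```python
import numpy as np
import cv2

def pose_to_c2w(rvec, tvec):
    """Convert a solvePnP pose (world -> camera) into a 4x4 camera-to-world matrix."""
    R, _ = cv2.Rodrigues(rvec)          # 3x3 rotation matrix
    w2c = np.eye(4)
    w2c[:3, :3] = R
    w2c[:3, 3] = tvec.ravel()
    return np.linalg.inv(w2c)           # invert world->camera to get camera->world

# corners_2d / corners_3d are hypothetical per-frame correspondences on the tag;
# K and dist come from the earlier cv2.calibrateCamera call.
c2ws = []
for pts2d, pts3d in zip(corners_2d, corners_3d):
    ok, rvec, tvec = cv2.solvePnP(pts3d.astype(np.float64),
                                  pts2d.astype(np.float64), K, dist)
    if ok:
        c2ws.append(pose_to_c2w(rvec, tvec))

np.savez("calibrated_dataset.npz",
         images_train=images, c2ws_train=np.stack(c2ws), focal=K[0, 0])
```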
In the visualization below, the camera-to-world (c2w) frames make it clear where each
photograph was taken from. The camera centers trace an approximately hemispherical path around the
object—covering different azimuths and elevations—so you can see angles and positions spanning a
broad arc. This is visible in the Viser plot and provides good angular coverage while keeping the
object at a roughly constant distance.
Camera frustums and sampled rays
Sample points along rays for NeRF training. Origin visible at ArUco tag corner.
In this part, I train a neural field (an MLP with sinusoidal positional encoding) to map 2D pixel coordinates to RGB, fitting an image by minimizing mean squared error. Output quality is measured with Peak Signal-to-Noise Ratio, which increases as MSE decreases: PSNR = −10 · log10(MSE) for images normalized to [0, 1].
MLP with positional encoding and a skip connection.
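As a rough sketch of this setup (layer sizes follow the table below; the skip connection and other details of my actual module are omitted for brevity):

```python
import torch
import torch.nn as nn

def positional_encoding(x, L):
    """[x, sin(2^0·pi·x), cos(2^0·pi·x), ..., sin(2^(L-1)·pi·x), cos(2^(L-1)·pi·x)]."""
    out = [x]
    for k in range(L):
        freq = (2.0 ** k) * torch.pi
        out += [torch.sin(freq * x), torch.cos(freq * x)]
    return torch.cat(out, dim=-1)

class Pixel2RGB(nn.Module):
    """Maps normalized (x, y) pixel coordinates to RGB."""
    def __init__(self, L=12, width=512, depth=8):
        super().__init__()
        self.L = L
        d = 2 + 2 * 2 * L                      # raw (x, y) plus sin/cos at L frequencies
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, 3))
        self.net = nn.Sequential(*layers)

    def forward(self, xy):                     # xy: (N, 2) in [0, 1]
        return torch.sigmoid(self.net(positional_encoding(xy, self.L)))
```

Training then samples batch_size random pixels per step and minimizes the MSE between predicted and ground-truth colors with Adam at the learning rate listed in the table.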
| Parameter | Value |
|---|---|
| iters | 2000 |
| batch_size | 20000 |
| lr | 0.001 |
| width | 512 |
| depth | 8 |
| pe_levels | 12 |
These hyperparameters were used for both the example image and the Yosemite image training.
Resolution: 1024 × 689 pixels (width × height).
Input image
Step 1
Step 50
Step 100
Step 200
Step 400
Step 2000
PSNR over iterations
MSE over iterations
Resolution: 2075 × 1177 pixels (width × height).
Input image
Step 1
Step 200
Step 400
Step 600
Step 1400
Step 7000
PSNR over iterations
MSE over iterations
Hyperparameter sweep on example.jpg: network widths {32, 64} × positional-encoding levels {2, 4}.
Top-left: w32_L2 • Top-right: w32_L4 • Bottom-left: w64_L2 • Bottom-right: w64_L4
The results show that both the network width W and the positional-encoding level count L matter. When either is too small, the model lacks capacity or frequency coverage, and the reconstruction comes out overly smooth and blurry. Increasing W gives the network more parameters, improving capacity and overall image fidelity, while increasing L lets the network represent higher-frequency detail (less blur).
Implemented the full camera/ray geometry pipeline and data loading used throughout NeRF:
- Loaded the .npz dataset via numpy.load, extracted images_train, c2ws_train, and focal, and normalized images to [0, 1] float32.
- Converted pixels to camera coordinates as K^{-1}[u,v,1]^T (depth scale s applied afterward), precomputing K^{-1} and building the homogeneous [u,v,1] vector per pixel.
- Computed ray origins from the camera center (c2w[:3,3]). Pixels are mapped to world-space points at unit depth, then ray_d = normalize(x_w − ray_o), with everything fully batched.
- Sampled both rays and 3D points along each ray (a condensed code sketch follows this list):
  - Ray sampling returns (ray_o, ray_d, pixel_color): randomly sampled image indices and pixel coordinates, gathered RGB targets, selected per-pixel K and c2w, and called pixel→ray to get batched origins and directions.
  - Point sampling stratifies [near, far] into num_samples bins. With perturbation, uniform noise is added within each interval; the function returns the 3D points, unit directions, and step_size = (far − near)/num_samples.
- The combined pipeline produces per-batch ray origins/directions and colors, and I added a small visualization script (Viser) to render cameras, rays, and sampled 3D points for sanity checks.
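As referenced in the list above, here is a condensed numpy sketch of pixel→ray conversion and stratified point sampling. Names and shapes are illustrative, not my exact implementation:

```python
import numpy as np

def pixel_to_ray(K, c2w, uv):
    """uv: (N, 2) pixel coordinates -> world-space ray origins and unit directions."""
    ones = np.ones((uv.shape[0], 1))
    uv1 = np.concatenate([uv, ones], axis=-1)                   # homogeneous [u, v, 1]
    x_c = uv1 @ np.linalg.inv(K).T                              # camera-space points at unit depth
    x_w = x_c @ c2w[:3, :3].T + c2w[:3, 3]                      # rotate/translate into world space
    ray_o = np.broadcast_to(c2w[:3, 3], x_w.shape)              # all rays start at the camera center
    ray_d = x_w - ray_o
    return ray_o, ray_d / np.linalg.norm(ray_d, axis=-1, keepdims=True)

def sample_along_rays(ray_o, ray_d, near, far, num_samples=64, perturb=True):
    """Stratified samples along each ray; returns points (N, S, 3), distances t, and step size."""
    step = (far - near) / num_samples
    t = near + step * np.arange(num_samples)                    # left edge of each bin
    t = np.tile(t, (ray_o.shape[0], 1))
    if perturb:
        t = t + np.random.rand(*t.shape) * step                 # jitter within each bin
    pts = ray_o[:, None, :] + t[..., None] * ray_d[:, None, :]
    return pts, t, step
```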
Lego dataset: sampled camera rays
Lego dataset: points sampled along rays
Implemented a NeRF-style MLP conditioned on positional encodings of 3D points and view directions:
- Positional encoding: sin/cos at frequencies 2^k π, concatenated with the input; separate levels for positions and view directions.
- Density head: σ via softplus (with a positive bias) for stable, non-negative densities.
- Color head: sigmoid to constrain outputs to [0, 1].
Coarse-to-fine MLP with positional encodings, density head (ReLU) and color head (Sigmoid) conditioned on view direction.
For Part 2 I kept the architecture essentially the same as the reference NeRF: a stacked MLP with a skip connection, separate density and color heads, and positional encodings for both 3D coordinates and view directions. My changes focused on hyperparameters only (e.g., network width, coordinate positional-encoding levels L, and view-direction encoding levels Ldir).
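A trimmed sketch of that architecture (skip connection, density head, view-conditioned color head) is shown below. Layer sizes follow the table that comes next; details such as the exact activation on the density head are simplified, so treat this as a sketch rather than my exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeRFMLP(nn.Module):
    """Positions drive density; view directions condition the color head."""
    def __init__(self, pos_dim, dir_dim, width=256, depth=8, skip=4):
        super().__init__()
        self.skip = skip
        self.layers = nn.ModuleList()
        for i in range(depth):
            in_d = pos_dim if i == 0 else width
            if i == skip:
                in_d += pos_dim                                  # re-inject encoded position
            self.layers.append(nn.Linear(in_d, width))
        self.sigma_head = nn.Linear(width, 1)                    # density head
        self.feature = nn.Linear(width, width)
        self.color_head = nn.Sequential(                         # color conditioned on direction
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(), nn.Linear(width // 2, 3))

    def forward(self, x_enc, d_enc):
        h = x_enc
        for i, layer in enumerate(self.layers):
            if i == self.skip:
                h = torch.cat([h, x_enc], dim=-1)                # skip connection
            h = F.relu(layer(h))
        sigma = F.softplus(self.sigma_head(h))                   # non-negative density
        rgb = torch.sigmoid(self.color_head(torch.cat([self.feature(h), d_enc], dim=-1)))
        return rgb, sigma
```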
Implemented the discrete volumetric rendering equations in PyTorch for color, depth, and opacity: αi = 1 − exp(−σi Δi), Ti = exp(−∑j<i σj Δj), wi = Ti · αi, and the rendered ray color is C = ∑ wi · ci.
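A minimal PyTorch sketch of these equations, assuming per-ray densities sigmas of shape (N, S), colors rgbs of shape (N, S, 3), and a scalar step size; the cumulative-product form of Ti used here is equivalent to the exponential sum above:

```python
import torch

def volrender(sigmas, rgbs, step_size):
    """alpha_i = 1 - exp(-sigma_i * dt); T_i = prod_{j<i} (1 - alpha_j); w_i = T_i * alpha_i."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)                                   # (N, S)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)                             # running transmittance
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)   # T_1 = 1
    weights = trans * alphas                                                        # w_i = T_i * alpha_i
    rgb = (weights[..., None] * rgbs).sum(dim=-2)                                   # expected color per ray
    return rgb, weights
```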
| Parameter | Value |
|---|---|
| steps | 10000 |
| width | 256 |
| depth | 8 |
| skip_layer | 4 |
| posenc_L | 10 |
| direnc_L | 4 |
| lr | 0.0005 |
| batch_rays | 4096 |
| num_samples | 64 |
| near | 2.0 |
| far | 6.0 |
Using the standard NeRF model size for Lego (width 256, positional-encoding L=10); other settings as listed.
Actual Lego training ran for 10,000 steps.
Step 1
Step 200
Step 400
Step 1400
Step 4000
Step 10000
PSNR over iterations
MSE over iterations
Rendered with provided test camera extrinsics after 10,000 iterations.
Training ties all components together: per‑step I sample rays and 3D points, run the NeRF, volume‑render, compute MSE against ground‑truth pixels, and optimize with Adam. I log PSNR (−10 log10(MSE)), periodically render full images (with optional depth/opacity), and save checkpoints.
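One iteration of that loop, condensed into a sketch. Here sample_rays, encode_pos, encode_dir, and model stand in for the components described above (assuming torch versions of the earlier helpers), and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def train_step(batch_rays=4096, near=2.0, far=6.0):
    ray_o, ray_d, target_rgb = sample_rays(batch_rays)                 # random pixels across images
    pts, t, step = sample_along_rays(ray_o, ray_d, near, far)          # stratified 3D samples
    dirs = ray_d[:, None, :].expand_as(pts)                            # one direction per sample
    rgbs, sigmas = model(encode_pos(pts), encode_dir(dirs))            # query the NeRF
    pred_rgb, _ = volrender(sigmas.squeeze(-1), rgbs, step)            # composite along each ray
    loss = F.mse_loss(pred_rgb, target_rgb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    psnr = -10.0 * torch.log10(loss.detach())                          # PSNR = -10 log10(MSE)
    return loss.item(), psnr.item()
```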
I trained a NeRF on my Part 0 dataset, rendered a circling camera GIF, and tracked loss with intermediate renders.
During training, the training MSE continued to decrease while the validation error largely stagnated. This behavior is consistent with the earlier observation that my calibration is not perfectly accurate—validation rays are evaluated under slightly mismatched intrinsics/poses. Even though the validation curves did not keep improving, the rendered reconstructions still looked progressively better to a human observer, suggesting that the model was learning a useful representation of the scene despite these imperfections.
I rescaled images to a width of 800 px and scaled the camera intrinsics accordingly. I chose a higher resolution because my calibration was not perfect: in Viser, back‑projected rays from the camera origin did not intersect exactly at the ArUco tag borders (as they would with ideal intrinsics/poses). My hypothesis was that a higher resolution would reduce pixel‑level quantization error and provide more samples per object area, helping NeRF tolerate small calibration errors. In practice, training at the higher resolution produced better results, which likely supports this rationale (with the trade‑off of increased compute).
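The rescaling itself is mechanical: resizing by a factor s scales the focal lengths and principal point by the same s. A small sketch (cv2.resize is one way to do the resize; the function name is illustrative):

```python
import cv2

def rescale_image_and_intrinsics(image, K, new_width=800):
    """Resize to new_width and scale the intrinsic matrix to match."""
    s = new_width / image.shape[1]
    new_size = (new_width, int(round(image.shape[0] * s)))     # (width, height) for cv2.resize
    resized = cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)
    K_scaled = K.copy()
    K_scaled[0, 0] *= s    # fx
    K_scaled[1, 1] *= s    # fy
    K_scaled[0, 2] *= s    # cx
    K_scaled[1, 2] *= s    # cy
    return resized, K_scaled
```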
| Parameter | Value |
|---|---|
| steps | 30000 |
| near | 0.2 |
| far | 0.6 |
| batch_rays | 8000 |
| num_samples | 64 |
| width | 512 |
| depth | 8 |
| skip_layer | 4 |
| posenc_L | 12 |
| direnc_L | 4 |
| lr | 6e-4 |
Because I increased the input resolution, rays sampled the scene more densely and the images contained
higher spatial frequencies. To model this additional detail, I raised the positional‑encoding levels
(posenc_L for coordinates and direnc_L for view directions), and increased the network
width to provide more capacity. In short: more pixels → more high‑frequency content → higher PE,
and denser 3D sampling → a wider MLP to faithfully represent the added detail.
Step 1
Step 200
Step 400
Step 1400
Step 4000
Step 10000
PSNR over iterations (train/val)
MSE over iterations (train/val)
Shoe reconstruction, 24-frame spin.
Along each camera ray, NeRF samples points with predicted densities σ (and colors). Volumetric rendering assigns each sample an opacity αi = 1 − exp(−σi Δi) over interval Δi and a transmittance Ti = exp(−∑j<i σj Δj). The per‑sample weight is wi = Ti · αi. The depth map is the ray‑wise expectation of distance: Depth = ∑ wi · ti (optionally normalized for display).
The accumulated opacity (confidence) that a ray intersects a surface is
acc = ∑ wi, with values near 0 indicating empty space and near 1 indicating an
opaque hit. A threshold acc_thresh masks rays with insufficient confidence
(acc < acc_thresh) and normalizes depths over the remaining pixels, suppressing floaters and
background leakage in low‑confidence regions.
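Extending the volrender sketch above with depth and accumulated opacity, plus the acc_thresh mask; the normalization scheme here is one simple choice for display, not necessarily exactly what produced the figures:

```python
import torch

def depth_and_opacity(weights, t, acc_thresh=0.3):
    """weights: (N, S) with w_i = T_i * alpha_i; t: (N, S) sample distances along each ray."""
    depth = (weights * t).sum(dim=-1)             # expected distance per ray
    acc = weights.sum(dim=-1)                     # accumulated opacity (hit confidence)
    mask = acc >= acc_thresh                      # discard low-confidence rays (floaters, background)
    if mask.any():                                # normalize surviving depths for display
        valid = depth[mask]
        depth = (depth - valid.min()) / (valid.max() - valid.min() + 1e-8)
    depth = torch.where(mask, depth, torch.zeros_like(depth))   # masked rays rendered as background
    return depth, acc, mask
```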
I tuned acc_thresh empirically. Lower values kept semi‑transparent floaters, while higher values
started knocking out thin structures. Settling around acc_thresh ≈ 0.3 consistently produced cleaner,
more stable depth maps for my scenes.
Without acc_thresh (noisy, shows floaters and artifacts)
With acc_thresh = 0.3 (clean, artifacts removed)