
Project 1

CS180/280A: Intro to Computer Vision and Computational Photography

Images of the Russian Empire: Colorizing the Prokudin-Gorskii photo collection

Unaligned stacked color channels
Aligned color image

Approach (High-Level)

Goal: Reconstruct a clean color image by extracting the three monochrome plates (B, G, R), aligning them precisely, and stacking them into an RGB image with minimal artifacts.

I first inspected the input plates. There is a clear correlation across channels at the same pixel locations: brighter pixels in one channel are usually (though not always) brighter in the others. However, overall illumination and contrast differ between channels, so raw intensities are on different scales.

This suggested using a pixelwise distance metric to quantify alignment quality: if the plates are well aligned, the sum of per-pixel distances should be small. To avoid getting stuck in local minima (which can happen if you only move greedily in the locally best direction), I evaluate the total distance for many candidate shifts and choose the shift with the lowest global score.

I started with an exhaustive search over translations using the L2 distance on raw pixel values. This already produced good alignments for many images. However, some cases, such as emir, still failed because of cross‑channel brightness differences, clearly visible in his very colorful dress.

I first tried per‑channel min–max scaling to normalize dynamic range, but results were unstable since extrema are image‑dependent and sensitive to outliers. I then switched to per‑channel z‑normalization (standardizing to zero mean and unit variance) before scoring so that structures became more comparable across channels despite overall brightness/gain differences.

L2 distance between two patches/channels I and J:

\[ \|I - J\|_2 = \sqrt{\sum_i (I_i - J_i)^2} \]

Z-normalization of a channel X:

\[ z(x) = \frac{x - \mu}{\sigma} \quad \text{where } \mu = \mathrm{mean}(X),\; \sigma = \mathrm{std}(X) \]
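The exhaustive search with z‑normalization and L2 scoring described above can be sketched as follows (a minimal illustration assuming NumPy; function names are illustrative, and `np.roll` wraps pixels around the image edges as a simplification of cropping borders before scoring):

```python
import numpy as np

def znorm(x):
    """Standardize a channel to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

def l2_score(a, b):
    """L2 distance between two equally sized channels."""
    return np.sqrt(np.sum((a - b) ** 2))

def best_shift(ref, mov, radius=15):
    """Exhaustive search: try every translation in [-radius, radius]^2
    and keep the one with the lowest L2 distance on z-normalized pixels."""
    ref, mov = znorm(ref), znorm(mov)
    best, best_dxy = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(mov, (dy, dx), axis=(0, 1))
            s = l2_score(ref, shifted)
            if s < best:
                best, best_dxy = s, (dy, dx)
    return best_dxy
```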

Despite z‑normalization, a few images remained challenging. I therefore added Normalized Cross‑Correlation (NCC) as the matching score. Intuition: NCC compares zero‑mean, normalized patches and is invariant to linear brightness/contrast changes, making it better suited when channels have different gains.

\[ \mathrm{NCC}(I, J) = \frac{\sum_i (I_i - \bar I)(J_i - \bar J)}{\sqrt{\sum_i (I_i - \bar I)^2}\;\sqrt{\sum_i (J_i - \bar J)^2}} \]
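Implemented directly from the formula above, NCC is just the cosine similarity of the zero‑centered patches (a minimal sketch assuming NumPy):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches.
    Zero-center each patch, then take the cosine similarity of the
    flattened vectors; the result lies in [-1, 1] and is invariant
    to linear brightness/contrast changes."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For example, `ncc(x, 2 * x + 3)` equals 1 for any non-constant patch `x`, which is exactly the gain/offset invariance that matters here.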

In practice, I z‑normalize or zero‑center the patches and evaluate NCC over windows obtained by splitting each channel into a small grid of patches; each candidate shift is scored across these windows, and the displacement with the highest overall score is chosen. This made the alignment robust across all images, including difficult cases like the emir.

Low‑resolution results (JPEG, brute force)

cathedral (JPEG brute force)
cathedral (JPEG)
monastery (JPEG brute force)
monastery (JPEG)
tobolsk (JPEG brute force)
tobolsk (JPEG)

However, because high‑resolution TIFFs are also available, a faster approach is needed for those larger images.

Coarse‑to‑Fine Pyramid

To make alignment efficient on large glass plates, I use a multi-scale (coarse‑to‑fine) image pyramid:

Intuition: Computing scores (e.g., L2, NCC) over many shifts on full‑resolution images is expensive. Start at a much smaller resolution to get a cheap, rough alignment; then iteratively double the size and refine around that estimate. This greatly reduces the total number of score evaluations, since only a few refinement steps are needed at full resolution.

Coarse-to-fine image pyramid visualization

Source: IIPImage documentation

  1. Build a pyramid by halving resolution until another downscale would make the smallest dimension < 32 px (min base ≈ 32×32).
  2. At the coarsest level, stack the channels and run an exhaustive search with a ±8 px radius to find the best translation.
  3. Propagate the displacement to the next finer level (scale ×2), then refine within a small local window of ±2 px (top/bottom/left/right).
  4. Repeat to full resolution, optionally re‑normalizing and scoring on interior pixels to avoid border effects.
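The steps above can be sketched as follows (an illustrative version assuming NumPy; `score` is any similarity to maximize, such as NCC, and `np.roll` stands in for proper border handling):

```python
import numpy as np

def downscale2(img):
    """Halve resolution by 2x2 block averaging."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def pyramid_align(ref, mov, score, min_size=32, coarse_r=8, fine_r=2):
    """Coarse-to-fine alignment: exhaustive +/- coarse_r search at the
    coarsest level, then +/- fine_r refinement at each finer level.
    Returns the (dy, dx) displacement to apply to `mov`."""
    # 1. Build pyramids until another downscale would drop below min_size.
    refs, movs = [ref], [mov]
    while min(refs[-1].shape) // 2 >= min_size:
        refs.append(downscale2(refs[-1]))
        movs.append(downscale2(movs[-1]))
    dy = dx = 0
    # 2-4. Search coarsest level, then propagate (x2) and refine.
    for level, (r, m) in enumerate(zip(refs[::-1], movs[::-1])):
        radius = coarse_r if level == 0 else fine_r
        dy, dx = dy * 2, dx * 2  # propagate estimate to the finer level
        best, best_d = -np.inf, (dy, dx)
        for ddy in range(-radius, radius + 1):
            for ddx in range(-radius, radius + 1):
                s = score(r, np.roll(m, (dy + ddy, dx + ddx), axis=(0, 1)))
                if s > best:
                    best, best_d = s, (dy + ddy, dx + ddx)
        dy, dx = best_d
    return dy, dx
```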

NCC windowing: at the smallest level I split the image into a 2×2 grid and evaluated NCC per cell; at the next level (2× bigger) into a 4×4 grid, and so on. This keeps the NCC window size approximately constant in absolute terms across scales.

Using a larger window (±8) only at the coarsest level and a tighter window (±2) at finer levels dramatically reduces the global search space while preserving final alignment accuracy.

Results (Aligned RGB, high‑res TIFF, pyramid)

Outputs produced on the high‑resolution TIFFs using the coarse‑to‑fine alignment and per‑channel min–max rendering.

church (TIFF pyramid)
church (TIFF)
emir (TIFF pyramid)
emir (TIFF)
harvesters (TIFF pyramid)
harvesters (TIFF)
icon (TIFF pyramid)
icon (TIFF)
italil (TIFF pyramid)
italil (TIFF)
lastochikino (TIFF pyramid)
lastochikino (TIFF)
lugano (TIFF pyramid)
lugano (TIFF)
melons (TIFF pyramid)
melons (TIFF)
self_portrait (TIFF pyramid)
self_portrait (TIFF)
siren (TIFF pyramid)
siren (TIFF)
three_generations (TIFF pyramid)
three_generations (TIFF)

Alignment Offsets and Compute Times (Combined L2 + NCC)

| Image | B→G shift | R→G shift | Total time (s) |
| --- | --- | --- | --- |
| cathedral | up 5 px, left 2 px | down 7 px, right 1 px | 0.388 |
| church | up 25 px, left 4 px | down 33 px, left 8 px | 30.243 |
| emir | up 49 px, left 22 px | down 57 px, right 18 px | 30.524 |
| harvesters | up 59 px, left 16 px | down 64 px, left 2 px | 30.019 |
| icon | up 39 px, left 17 px | down 48 px, right 5 px | 30.702 |
| italil | up 38 px, left 21 px | down 39 px, right 15 px | 30.655 |
| lastochikino | down 2 px, right 2 px | down 78 px, left 7 px | 30.435 |
| lugano | up 40 px, right 15 px | down 53 px, left 13 px | 30.780 |
| melons | up 81 px, left 10 px | down 96 px, right 4 px | 31.156 |
| monastery | down 3 px, left 2 px | down 6 px, right 1 px | 0.377 |
| self_portrait | up 78 px, left 29 px | down 98 px, right 8 px | 30.837 |
| siren | up 48 px, right 6 px | down 47 px, left 18 px | 32.307 |
| three_generations | up 50 px, left 15 px | down 58 px, left 4 px | 30.443 |
| tobolsk | down 3 px, left 3 px | down 4 px | 0.387 |

Additional Results — Prokudin‑Gorskii Collection (my selections)

Prokudin-Gorskii extra - city
city (collection)
Prokudin-Gorskii extra - gate
gate (collection)
Prokudin-Gorskii extra - tippie
tippie (collection; the downloaded image was already aligned, and my algorithm recovered the same alignment, as no extra borders are visible)
Prokudin-Gorskii extra - water
water (collection)

Bells & Whistles Results

Below are result summaries for each enhancement, with brief approach notes and before/after comparisons.

Automatic cropping

Intuition: Real borders sit at the edges, look like long straight lines, and contain lots of extreme pixels — very bright (white) and very dark (black). In contrast, real scene content has mixed mid‑tones and broken/curvy edges; full rows or columns in the interior are rarely dominated by extremes, which makes border lines easy to spot.

Method:

As shown below, applying the procedure independently to each channel (B/G/R) produces consistently good crop boxes.

Limitation: in some cases, adjacent strong whites and blacks can blend (e.g., via blur/compression/downsampling) into gray along a line, weakening the extreme‑pixel signal and causing under‑detection of borders. The 4‑of‑5 neighbor voting and per‑channel processing help, but are not perfect; tightening thresholds or the extreme‑fraction can further improve those edge cases.
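The extreme‑pixel border test described above can be sketched as follows (a simplified illustration assuming NumPy and intensities in [0, 1]; the thresholds are illustrative assumptions, and the 4‑of‑5 neighbor voting is omitted for brevity):

```python
import numpy as np

def is_border_line(line, lo=0.1, hi=0.9, frac=0.5):
    """A row/column counts as border if more than `frac` of its pixels
    are extreme: very dark (< lo) or very bright (> hi)."""
    extreme = (line < lo) | (line > hi)
    return extreme.mean() > frac

def crop_box(channel, max_crop=0.1):
    """Walk inward from each edge, discarding border-like lines; never
    crop more than `max_crop` of each dimension. Returns
    (top, bottom, left, right) crop indices."""
    h, w = channel.shape
    top, bottom, left, right = 0, h, 0, w
    limit_y, limit_x = int(h * max_crop), int(w * max_crop)
    while top < limit_y and is_border_line(channel[top]):
        top += 1
    while bottom > h - limit_y and is_border_line(channel[bottom - 1]):
        bottom -= 1
    while left < limit_x and is_border_line(channel[:, left]):
        left += 1
    while right > w - limit_x and is_border_line(channel[:, right - 1]):
        right -= 1
    return top, bottom, left, right
```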

Per-channel crop box (cathedral, B)
Cathedral — B channel (red box = crop)
Per-channel crop box (cathedral, G)
Cathedral — G channel (red box = crop)
Per-channel crop box (cathedral, R)
Cathedral — R channel (red box = crop)

Automatic contrasting

Motivation

Global methods (min–max scaling or vanilla histogram equalization) can either leave low‑contrast regions flat or push highlights/shadows too far. Many Prokudin‑Gorskii images have locally varying illumination; we need a method that enhances texture and structure region‑by‑region without amplifying noise or creating halos.

Approach
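A standard technique fitting this region‑by‑region description is CLAHE (contrast‑limited adaptive histogram equalization); whether it matches my exact implementation is an assumption here. The following self‑contained sketch shows only the core per‑tile equalization idea, omitting CLAHE's histogram clipping and inter‑tile interpolation (a real pipeline would prefer a full implementation such as scikit-image's `exposure.equalize_adapthist`):

```python
import numpy as np

def equalize(tile, nbins=256):
    """Map tile intensities through the tile's own CDF
    (plain histogram equalization, intensities in [0, 1])."""
    hist, _ = np.histogram(tile, bins=nbins, range=(0.0, 1.0))
    cdf = hist.cumsum() / tile.size
    idx = np.clip((tile * nbins).astype(int), 0, nbins - 1)
    return cdf[idx]

def local_equalize(img, grid=8):
    """Equalize each cell of a grid x grid tiling independently.
    (Real CLAHE also clips histograms and interpolates between
    tiles to avoid visible seams and noise amplification.)"""
    out = np.empty_like(img)
    ys = np.linspace(0, img.shape[0], grid + 1, dtype=int)
    xs = np.linspace(0, img.shape[1], grid + 1, dtype=int)
    for y0, y1 in zip(ys[:-1], ys[1:]):
        for x0, x1 in zip(xs[:-1], xs[1:]):
            out[y0:y1, x0:x1] = equalize(img[y0:y1, x0:x1])
    return out
```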

Before & After Results

Original contrast (Lugano)
Enhanced contrast (Lugano)
Original Enhanced
Original contrast (Self Portrait)
Enhanced contrast (Self Portrait)
Original Enhanced

The enhanced versions show noticeably more contrast and clarity, giving a crisper, more lifelike appearance; by comparison, the originals feel slightly soft and subdued.

Automatic white balance

Motivation

Channels often have different gains/illumination, leading to color casts (e.g., blue/green tint). We aim to estimate a neutral illuminant and re-scale channels so neutrals appear gray/white.

Approach

Gray‑World algorithm. Intuition: under a neutral illuminant, the spatial average of an image should be achromatic (gray). If an image has a color cast, the per‑channel means differ; we correct this by scaling each channel so their means match a common gray target.

\[ s_c = \frac{\mu_{\text{target}}}{\mu_c}, \quad X'_c = \operatorname{clip}(s_c \cdot X_c) \]

This neutralizes global color casts while preserving relative contrasts; using interior pixels and clipping avoids bias from borders and outliers.

Implementation note: I used the average-of-means gray target \(\mu_{\text{target}} = (\mu_R+\mu_G+\mu_B)/3\) (not a fixed 0.5), and clipped per-channel gains to a safe range before applying them.
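A minimal sketch of this Gray‑World correction (assuming NumPy and images in [0, 1]; the `gain_range` bounds are an illustrative "safe range"):

```python
import numpy as np

def gray_world(rgb, gain_range=(0.5, 2.0)):
    """Gray-World white balance: scale each channel so its mean moves
    toward the average of the three channel means. Gains are clipped
    to a safe range and the result is clipped back to [0, 1]."""
    means = rgb.reshape(-1, 3).mean(axis=0)
    target = means.mean()                     # average-of-means gray target
    gains = np.clip(target / means, *gain_range)
    return np.clip(rgb * gains, 0.0, 1.0)
```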

Before & After Results

Original white balance (Emir)
White-balanced result (Emir)
Original White-balanced
Original white balance (Church)
White-balanced result (Church)
Original White-balanced

Assuming the Emir’s hat and the church facade are meant to be white, the Gray‑World correction makes these regions appear white in the corrected images, whereas they showed clear color casts in the originals.

Better color mapping

Motivation

Channel gains and spectral responses can differ; simple per‑channel scaling (white balance) may not fully capture cross‑channel interactions. I explored small color transforms to find a mapping that looks most realistic.

Approach

I generated a small set of candidates using two tiny 3×3 matrix families — diagonal channel gains and cross‑channel mixing — then picked the most natural‑looking result.

Note: these mappings are not universal. The most natural‑looking transform is image‑dependent (scene/illuminant) and therefore relative; the grid enables per‑image selection rather than a single fixed mapping.

Chosen mapping (bottom‑right tile)

I picked the bottom‑right tile of the grid as the final mapping. This corresponds to a subtle blue‑from‑green mix:

\[ B' = (1-\varepsilon)\,B + \varepsilon\,G, \quad G' = G, \quad R' = R, \;\; \text{with } \varepsilon \approx 0.25. \]

Intuition: adding a touch of green into blue reduces residual magenta casts and improves foliage/sky balance without over‑saturating reds.
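This mapping can be expressed as a 3×3 color matrix (a sketch assuming NumPy and RGB channel order):

```python
import numpy as np

def apply_color_matrix(rgb, M):
    """Apply a 3x3 color matrix to an H x W x 3 image
    (each output channel is a linear mix of the input channels)."""
    return np.clip(rgb @ M.T, 0.0, 1.0)

eps = 0.25
# Rows produce (R', G', B') from (R, G, B); only blue mixes in green.
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, eps, 1.0 - eps]])
```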

Grid and comparison (tippie)

Color mapping variant grid for tippie
Variant grid (tippie)
Tippie before color mapping
Tippie after chosen color mapping
Original Mapped

The mapped image (right) looks more realistic; the original (left) feels as though a warm filter had been applied. The sand color also appears more natural.

Better features

Gradient-based approach

Intuition: edges are places where neighboring pixels differ strongly. Compute horizontal and vertical finite differences, then combine them into a gradient magnitude and normalize. This captures structure while being less sensitive to absolute brightness.

\[ G_x(x,y) = I(x+1,y) - I(x-1,y), \quad G_y(x,y) = I(x,y+1) - I(x,y-1) \]
\[ |\nabla I|(x,y) = \sqrt{ G_x(x,y)^2 + G_y(x,y)^2 }, \quad \hat G = \frac{|\nabla I| - \mu}{\sigma} \]
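A minimal sketch of this gradient feature (assuming NumPy; the one‑pixel border, where the central difference is undefined, is left at zero):

```python
import numpy as np

def gradient_feature(img):
    """Central-difference gradient magnitude, then z-normalized.
    Invariant to linear brightness changes of the input."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # G_x = I(x+1,y) - I(x-1,y)
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # G_y = I(x,y+1) - I(x,y-1)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return (mag - mag.mean()) / mag.std()
```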

Alignment

For each candidate shift, I evaluate equal‑weight NCC and L2 on the gradient maps and pick the best shift in a coarse‑to‑fine pyramid (same window schedule as before).

Gradient map — Blue channel
Gradient map — Green channel

These gradient maps are used directly for alignment: the algorithm aligns on gradients (not raw intensities) using equal‑weight NCC and L2.

Finding: On this dataset, gradient features with NCC/L2 did not significantly outperform the prior method, but the experiment was useful in confirming that the alignment is robust to brightness changes.