Seam carving is a technique for content-aware resizing, which changes the size of an image while attempting to preserve the interesting content of the image. The basic idea behind the technique is to remove low-energy seams: paths of pixels that are similar to their surroundings. A removed seam can be either vertical or horizontal, to reduce either the width or height of an image.

For example, here’s a panorama I took a few years ago near Spruce Knob, WV.

The image is super wide, but we can carve it down. Here’s an animation of the process, generated by the seam-carver I implemented recently as a parallel benchmark for mpl.

Getting real parallel speedups in this implementation turned out to be an interesting challenge. For typical images, which are fairly small, I found that the most obvious approach for parallelizing the algorithm was not able to extract enough parallelism, because it required too small of a grain size. So, I designed a new way of parallelizing the algorithm which works much better on small images. This post presents some of the details.

## The Seam Carving Algorithm

The idea behind seam-carving is to consider all possible seams in an image, select the one which is least important, and remove it. A seam is a path of pixels that cuts across an image, touching one pixel in each row (or column) along the way. When we remove a seam, we reduce either the width or height of the image by 1. We can repeat this process as many times as desired to shrink the image.

We’ll base the “importance” of seams on a notion of pixel energy, which can be defined in many different ways. For seam-carving, the color difference gradient seems to work well, which measures how different a pixel’s color is in comparison to its surroundings. The energy of a seam is just the cumulative energies of all pixels touched by the seam. With this setup, the seam with minimum energy should be unimportant and safe to remove from an image.

To compute minimum seam energies, we can use a simple dynamic programming algorithm. For simplicity here, let’s focus on vertical seams. Let $$M(i,j)$$ be the minimum energy of any partial seam that ends at row $$i$$ and column $$j$$, and let $$E(i,j)$$ be the energy of the pixel at that position. These minimum seam energies are related by the following equation:

$M(i,j) = E(i,j) + \min(M(i-1,j-1), M(i-1,j), M(i-1,j+1))$

A picture might help. For each position, we look at the minimum partial seams that end above it, pick the smallest, and then add the next pixel.

Once $$M$$ has been fully computed, we can find the minimum overall seam by working backwards, from bottom to top, by walking upwards along the path of minimum seam energies starting at the smallest in the bottommost row.

# Row-major order (doesn’t work)

The dependencies within the equation for $$M$$ appear to offer a lot of parallelism. Perhaps the most immediately obvious approach is to compute $$M$$ in row-major order, where first we do all of row 0, and then all of row 1, etc. With this ordering, the values within a row can be computed entirely in parallel, because each $$M(i, j)$$ only depends on three values $$M(i-1, \ldots)$$ from the previous row.

However, there’s a major problem in practice with row-major order: typical images are only at most a few thousand pixels wide, and each pixel only requires a small amount of work. That’s not enough work (per row) to parallelize effectively! We’ll need a granularity of at least, say, a thousand pixels to amortize the cost of parallelism itself.

This pretty much rules out using row-major order for typical, small images. Is there a better to compute $$M$$ without reducing the grain size?

# Triangular-blocked strips

To extract more parallelism without reducing the grain size, we need a better way to partition up the work. This is something that probably could be done “automatically” (e.g. take a look at Rezaul Chowdhury’s work), but here we’ll just try to design something ourselves.

What are all of the dependencies for a single value $$M(i,j)$$? Visually, the dependencies form a triangle with the pointy-end pointing down:

We can use triangles like these to split up the work in a nice way. First, imagine grouping adjacent rows into strips. Within one strip, cover the top of the strip with a bunch downward-pointing triangles. The leftover space will look like a bunch of upward-pointing triangles, each of which has all of its dependencies already satisfied (this is not immediately obvious but drawing a picture will help). So, we can process one strip in two parallel rounds, where first we do a bunch of downward-pointing triangles in parallel, and then we do a bunch of upward-pointing triangles to fill in the leftover space.

Here’s a picture using triangles that are 6 pixels wide at the base. Note that if we use triangles of even base-width, then these tile naturally with no overlap.

How wide should each triangle be in practice? Recall that we’re shooting for a target granularity of maybe one or two thousand pixels. This corresponds to triangles with base-widths of somewhere between 64 (1056 pixels in the triangle) and 90 (2070 pixels in the triangle).

So, in a smallish image, e.g. 1000 pixels wide, we can fit at least 11 triangles horizontally, suggesting a maximum possible speedup of 11x or more. This is a big improvement over the row-major strategy, which (using a granularity of 1000 pixels) would not be able to extract any parallelism from such an image.

## Implementation and performance

With mpl, I implemented both the row-major and triangular strategies to compare their performance. The code for the triangular strategy is available in the examples/src/seam-carve/ subdirectory of the mpl repo. I’ll leave the row-major code as an exercise to the reader. (It’s not too bad!)

Here are some performance numbers for removing 10 seams from an image of size approximately 2600x600. I did a little bit of tuning, and ended up choosing a row-major granularity of 1000 pixels, and a triangular base-width granularity of 88 (1980 pixels).

Strategy P = 1 P = 10 P = 20 P = 30
row-major 0.66 0.38 0.38 0.38
triangular 0.78 0.17 0.12 0.09

On 1 processor (P = 1), the row-major strategy is faster by about 15%, but on more processors quickly hits a maximum speedup of about 2x, seeing no additional improvement above 10 processors. In contrast, the triangular strategy continues to get faster as the number of processors increases, with self-speedups of about 5x on 10 processors, 6.5x on 20, and 9x on 30.

Despite having a mild amount of overhead on 1 processor, we can see that the triangular strategy provides significant gains. On higher core counts, it is as much as 4x faster than the row-major strategy.

## Discussion

One might wonder: why are the speedups we got so far away from theoretical limits? With a base-width of 88, shouldn’t we be able to get up to 30x speedup on an image of width 2600? Well, unfortunately, there are a large number of barriers to achieving such good speedup in practice: ineffective cache utilization, memory bandwidth bottlenecks, non-optimal scheduling… and many more.

In the case of seam carving, there are two major limiting factors.

1. Seam carving is largely memory-bound, i.e., for each memory access it only does a small amount of work before the next access. To get around this bottleneck, we might need better cache utilization, and so we would have to more carefully lay out the values of $$M(i,j)$$ in memory. (For this implementation, I used a simple flat-array layout, where $$M(i,j)$$ lives at index $$iW + j$$. We could instead try storing these values in a hierarchical structure, indexing first by strip, and then by triangle, to better improve locality.)
2. Seam carving has high span, performing a small amount of work in between each synchronization. For example in the row-major strategy, we use a barrier between each row, and in the triangular strategy, we use two barriers per strip. This can cause the scheduler to perform too many thread migrations. To mitigate this, one thing we could try is to let triangles overlap slightly, to increase the size of each strip. This trades a small amount of duplicated work for fewer barrier synchronizations overall, which could prove to be hugely beneficial.