# Parallel Seam Carving

Seam carving is a technique
for *content-aware resizing*, which changes the size of an image while
attempting to preserve the interesting content of the image. The basic
idea behind the technique is to remove **low-energy seams**: paths of pixels
that are similar to their surroundings. A removed seam can be either vertical or
horizontal, to reduce either the width or height of an image.

For example, here’s a panorama I took a few years ago near Spruce Knob, WV.

The image is super wide, but we can carve it down.
Here’s an animation of the process, generated by the seam-carver I implemented recently as a parallel benchmark for `mpl`.

Getting real parallel speedups in this implementation turned out to be an
interesting challenge. For typical images, which are fairly small,
I found that the most obvious approach for parallelizing the algorithm was not
able to extract enough parallelism, because it required too small of a
*grain size*.
So, I designed a new way of parallelizing the algorithm which works much
better on small images. This post presents some of the details.

## The Seam Carving Algorithm

The idea behind seam-carving is to consider all possible seams in an image,
select the one which is least important, and remove it. A **seam** is a path of
pixels that cuts across an image, touching one pixel in each row (or column)
along the way. When we remove a seam, we reduce either the width or height
of the image by 1. We can repeat this process as many times as desired to
shrink the image.
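To make the removal step concrete, here is a small sketch in Python (for illustration only; the actual benchmark is written in Parallel ML for `mpl`) of removing one vertical seam from an image stored as a list of rows:

```python
# Remove a vertical seam from an image stored as a list of rows.
# `seam[i]` gives the column of the pixel to delete in row i, so the
# result is one pixel narrower than the input.
def remove_vertical_seam(image, seam):
    return [row[:j] + row[j + 1:] for row, j in zip(image, seam)]
```

Repeating this once per seam shrinks the width by one each time; carving horizontal seams is symmetric, operating on columns instead of rows.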

We’ll base the “importance” of seams on a notion of pixel **energy**, which
can be defined in many different ways. For seam-carving, the
color difference gradient
seems to work well, which measures how different a pixel’s color is in
comparison to its surroundings. The **energy of a seam** is just the
cumulative energies of all pixels touched by the seam. With this setup, the
seam with *minimum energy* should be unimportant and safe to remove from an
image.
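The post doesn’t pin down the exact gradient formula, but a common dual-gradient style of color-difference energy can be sketched like this (an assumption on my part, and Python rather than the benchmark’s Parallel ML):

```python
# One way to define pixel energy as a color-difference gradient:
# sum of squared channel differences between horizontal neighbors,
# plus the same between vertical neighbors. Borders wrap around here
# for simplicity; other boundary conventions work too.
def energy(img, i, j):
    h, w = len(img), len(img[0])
    left, right = img[i][(j - 1) % w], img[i][(j + 1) % w]
    up, down = img[(i - 1) % h][j], img[(i + 1) % h][j]
    dx2 = sum((a - b) ** 2 for a, b in zip(left, right))
    dy2 = sum((a - b) ** 2 for a, b in zip(up, down))
    return dx2 + dy2
```

The important property is just that pixels which blend into their surroundings get small energies, so the seams passing through them look safe to remove.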

To compute *minimum seam energies*, we can use a simple
dynamic programming
algorithm. For simplicity here, let’s focus on vertical seams. Let
\(M(i,j)\) be the minimum energy of
any partial seam that ends at row \(i\) and column \(j\), and let
\(E(i,j)\) be the energy of the pixel at that position. These minimum seam
energies are related by the following equation:

\[
M(i,j) = E(i,j) + \min\big( M(i-1,\,j-1),\; M(i-1,\,j),\; M(i-1,\,j+1) \big)
\]

with the base case \(M(0,j) = E(0,j)\), and with out-of-bounds entries treated as infinite.

A picture might help. For each position, we look at the three partial seams that end just above it, pick the one with minimum energy, and then add the current pixel’s energy.

Once \(M\) has been fully computed, we can find the minimum overall seam by working backwards, from bottom to top, by walking upwards along the path of minimum seam energies starting at the smallest in the bottommost row.
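The dynamic program and the backwards walk can be sketched sequentially like this (Python for illustration; the real benchmark is Parallel ML):

```python
# Sequential sketch of the minimum-seam dynamic program.
# `E` is a 2D list of pixel energies; returns the minimum vertical seam
# as a list of column indices, one per row, top to bottom.
def minimum_seam(E):
    h, w = len(E), len(E[0])
    M = [E[0][:]]  # base case: M(0, j) = E(0, j)
    for i in range(1, h):
        prev = M[-1]
        row = []
        for j in range(w):
            # M(i,j) = E(i,j) + min of the (up to) three entries above
            best = min(prev[max(j - 1, 0):min(j + 2, w)])
            row.append(E[i][j] + best)
        M.append(row)
    # Walk back up, starting from the smallest entry in the bottom row.
    j = min(range(w), key=lambda k: M[-1][k])
    seam = [j]
    for i in range(h - 2, -1, -1):
        lo = max(j - 1, 0)
        j = min(range(lo, min(j + 2, w)), key=lambda k: M[i][k])
        seam.append(j)
    return seam[::-1]
```

Note that the backwards walk is cheap (one pixel per row); essentially all of the work, and all of the parallelism we care about, is in filling in \(M\).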

## Finding the minimum seam in parallel

### Row-major order (doesn’t work)

The dependencies within the equation for \(M\) appear to offer a lot of
parallelism. Perhaps the most immediately obvious approach is to compute
\(M\) in **row-major** order,
where first we do all of row 0, and then all of row 1, etc. With this ordering,
the values within a row can be computed entirely in parallel, because each
\(M(i, j)\) only depends on three values \(M(i-1, \ldots)\) from the previous row.

However, there’s a major problem in practice with row-major order: typical images are only at most a few thousand pixels wide, and each pixel only requires a small amount of work. That’s not enough work (per row) to parallelize effectively! We’ll need a granularity of at least, say, a thousand pixels to amortize the cost of parallelism itself.
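The row-major schedule looks like this in sketch form (Python for illustration; the chunk loop marked below is the one that would run in parallel, here written sequentially):

```python
# Row-major sketch: rows go strictly in sequence, but each row is
# split into chunks of ~`grain` pixels that could each run on its
# own core. With images only a few thousand pixels wide and a grain
# of ~1000, there are only a handful of chunks per row.
def row_major_M(E, grain=1000):
    h, w = len(E), len(E[0])
    M = [E[0][:]]
    for i in range(1, h):
        prev, row = M[-1], [0] * w
        # each chunk below is independent of the others (it only reads
        # the previous row), so this loop is the parallel one
        for start in range(0, w, grain):
            for j in range(start, min(start + grain, w)):
                row[j] = E[i][j] + min(prev[max(j - 1, 0):min(j + 2, w)])
        M.append(row)
    return M
```

The problem is visible right in the loop bounds: with `w = 2600` and `grain = 1000`, each row yields only three chunks, capping the speedup at 3x regardless of how many cores are available.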

This pretty much rules out using row-major order for typical, small images. Is there a better way to compute \(M\) without reducing the grain size?

### Triangular-blocked strips

To extract more parallelism without reducing the grain size, we need a
better way to partition up the work. This is something that probably
*could* be done “automatically” (e.g. take a look at
Rezaul Chowdhury’s
work), but here we’ll just try to design something ourselves.

What are all of the dependencies for a single value \(M(i,j)\)? Visually, the dependencies form a triangle with the pointy-end pointing down:

We can use triangles like these to split up the work in a nice way. First,
imagine grouping adjacent rows into **strips**. Within one strip,
cover the top of the strip with a bunch of downward-pointing triangles. The
leftover space will look like a bunch of upward-pointing triangles, each
of which has all of its dependencies already satisfied (this is not immediately
obvious but drawing a picture will help). So, we
can process one strip in two parallel rounds, where first we do a bunch of
downward-pointing triangles in parallel, and then we do a bunch of
upward-pointing triangles to fill in the leftover space.

Here’s a picture using triangles that are 6 pixels wide at the base. Note that if we use triangles of even base-width, then these tile naturally with no overlap.
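Here is a sequential Python sketch of the triangular-strip schedule (illustrative only; the real `mpl` code is Parallel ML, and the two marked loops are the ones that would run in parallel):

```python
# Triangular-strip schedule for computing M. Strips are b//2 rows tall,
# where b is the (even) triangle base width.
#   Phase 1: downward-pointing triangles -- mutually independent.
#   Phase 2: upward-pointing "leftover" triangles -- also independent.
# Both phases' outer loops would be parallel; they are sequential here.
def triangular_M(E, b=6):
    h, w = len(E), len(E[0])
    s = b // 2
    M = [[0] * w for _ in range(h)]
    M[0] = E[0][:]

    def cell(i, j):
        M[i][j] = E[i][j] + min(M[i - 1][max(j - 1, 0):min(j + 2, w)])

    top = 1  # first row of the current strip
    while top < h:
        bot = min(top + s, h)  # the strip is rows [top, bot)
        # Phase 1 (parallel): downward triangles, bases at 0, b, 2b, ...
        # Row r of a triangle only depends on row r-1 of the SAME triangle.
        for c in range(0, w, b):
            for r, i in enumerate(range(top, bot)):
                for j in range(c + r, min(c + b - r, w)):
                    cell(i, j)
        # Phase 2 (parallel): upward triangles centered at the base
        # boundaries x = 0, b, 2b, ... (half-triangles at the edges).
        # All of their dependencies were filled in by phase 1.
        for x in range(0, w + b, b):
            for r, i in enumerate(range(top, bot)):
                for j in range(max(x - r, 0), min(x + r, w)):
                    cell(i, j)
        top = bot
    return M
```

A good sanity check is that, for any triangle width, this fills in exactly the same values of \(M\) as the plain row-major order; only the schedule changes.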

How wide should each triangle be in practice? Recall that we’re shooting for a target granularity of maybe one or two thousand pixels. This corresponds to triangles with base-widths of somewhere between 64 (1056 pixels in the triangle) and 90 (2070 pixels in the triangle).
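The pixel counts above come from summing the row widths of a triangle, which for an even base width \(b\) gives \((b/2)(b/2+1)\):

```python
# Pixels in a downward triangle of even base width b: rows of
# b, b-2, ..., 2 pixels, which sums to (b/2) * (b/2 + 1).
def triangle_pixels(b):
    return (b // 2) * (b // 2 + 1)
```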

So, in a smallish image, e.g. 1000 pixels wide, we can fit at least 11 triangles horizontally, suggesting a maximum possible speedup of 11x or more. This is a big improvement over the row-major strategy, which (using a granularity of 1000 pixels) would not be able to extract any parallelism from such an image.

## Implementation and performance

With `mpl`, I implemented both the row-major and triangular strategies to
compare their performance. The code for the triangular strategy is available
in the `examples/src/seam-carve/` subdirectory of the `mpl` repo. I’ll leave
the row-major code as an exercise to the reader. (It’s not too bad!)

Here are some performance numbers for removing 10 seams from an image of size approximately 2600x600. I did a little bit of tuning, and ended up choosing a row-major granularity of 1000 pixels, and a triangular base-width granularity of 88 (1980 pixels).

| Strategy | P = 1 | P = 10 | P = 20 | P = 30 |
|---|---|---|---|---|
| row-major | 0.66 | 0.38 | 0.38 | 0.38 |
| triangular | 0.78 | 0.17 | 0.12 | 0.09 |

On 1 processor (P = 1), the row-major strategy is faster by about 15%, but on more processors quickly hits a maximum speedup of about 2x, seeing no additional improvement above 10 processors. In contrast, the triangular strategy continues to get faster as the number of processors increases, with self-speedups of about 5x on 10 processors, 6.5x on 20, and 9x on 30.

Despite having a mild amount of overhead on 1 processor, we can see that the triangular strategy provides significant gains. On higher core counts, it is as much as 4x faster than the row-major strategy.

## Discussion

One might wonder: why are the speedups we got so far away from theoretical limits? With a base-width of 88, shouldn’t we be able to get up to 30x speedup on an image of width 2600? Well, unfortunately, there are a large number of barriers to achieving such good speedup in practice: ineffective cache utilization, memory bandwidth bottlenecks, non-optimal scheduling… and many more.

In the case of seam carving, there are two major limiting factors.

- Seam carving is largely memory-bound, i.e., for each memory access it only does a small amount of work before the next access. To get around this bottleneck, we might need better cache utilization, and so we would have to more carefully lay out the values of \(M(i,j)\) in memory. (For this implementation, I used a simple flat-array layout, where \(M(i,j)\) lives at index \(iW + j\). We could instead try storing these values in a hierarchical structure, indexing first by strip, and then by triangle, to improve locality.)
- Seam carving has high span, performing a small amount of work in between each synchronization. For example in the row-major strategy, we use a barrier between each row, and in the triangular strategy, we use two barriers per strip. This can cause the scheduler to perform too many thread migrations. To mitigate this, one thing we could try is to let triangles overlap slightly, to increase the size of each strip. This trades a small amount of duplicated work for fewer barrier synchronizations overall, which could prove to be hugely beneficial.
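To illustrate the layout point, here are two index functions: the flat layout described above, and, as a simpler stand-in for the strip-and-triangle idea, a hypothetical column-banded layout that keeps each \(b\)-wide band of columns contiguous (the banded scheme is my own illustration, not what the benchmark does):

```python
# Flat row-major layout used in the implementation: M(i, j) at i*W + j.
def flat_index(i, j, W):
    return i * W + j

# Hypothetical column-banded layout: each b-wide band of columns is
# stored contiguously, so a triangle's accesses stay within one or two
# bands instead of striding across whole rows. (Assumes b divides W.)
def banded_index(i, j, H, b):
    band, jj = divmod(j, b)
    return band * (H * b) + i * b + jj
```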

## Comments

Clever! It was hard to see how the negative space triangles were computable at first, but it made sense after looking at it for a second.

That was my “aha” moment for this problem! I love the way the triangles just… fit.

Hi Sam!

Hey!! I hope this post brought back some good memories

Great post! It’s good to see a practical analysis of the cost of too-fine-grained parallelism, since this is often hand-waved away when discussing “embarrassingly parallel” problems. As you show, to make some things fast you need an approach that is neither too fine-grained (too much overhead) nor too coarse (or else you can’t use all the existing parallelism). Finding a minimal-overhead approach which scales well across a range of available parallelism is often a bit tricky.

About this:

Shouldn’t “each row” there actually be “each column (within a row)” or “each pixel”?

That is, if I understand what you are suggesting for this approach, it is doing each row serially, and getting per-pixel (or coarser) parallelism within a row.

Minor typos:

I guess should be “imagine grouping”

I’m glad you liked it! And yes, I completely agree: the intricacies of practical parallelism are too often hand-waved away as “implementation details”, despite being really interesting in their own right. I’m hoping these posts will help shed some light on the matter.

Thanks for spotting those typos—fixed!

After a re-read, I probably just mis-parsed “each row can be processed entirely in parallel”. I read it as “across the set of all rows, each row can be processed in parallel (but the work within a row is serial)” (e.g., one row per worker), but of course it could validly mean “the work *within* each row can be processed in parallel (e.g., one pixel per worker), but whole rows go serially (row 1 must be processed after row 0, etc)”. Whew!

Anyway, the way you have it now is unambiguous, I think.