There is a sorting algorithm I’ve always wanted to implement, but never had the
patience to deal with all the fiddly bits of: `TurkSort`. The premise of
`TurkSort` is that some lists are *hard* to sort, where *hard* is a
highly-technical word meaning “no good comparison function is known for two
elements”. `TurkSort` solves this problem by *outsourcing* the work to Amazon
Mechanical Turk.

First, what makes a list *hard* to sort? Numbers are not hard to sort; the
simple function `>` or any of its ilk will efficiently compare two numbers.
Other kinds of input can be murkier, but still tractable. Consider colors:
while there is no obvious sorting algorithm for colors, it is easy to come up
with one for any given color space. In RGB space, we could treat each color as
a six-digit hexadecimal number `0xrrggbb` and then sort numerically. Likewise,
for Y’UV or HSL we can use a simple `tuple` comparison of the color components
(otherwise known as *lexicographical order*).
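A quick sketch of both orderings, using made-up sample colors; packing each triple into `0xrrggbb` and plain tuple comparison agree:

```python
# Sketch: sorting RGB colors two ways. Packing each (r, g, b) triple into
# a single number 0xrrggbb and sorting numerically gives the same order as
# the built-in lexicographic tuple comparison.
colors = [(255, 0, 0), (0, 128, 255), (0, 128, 0)]

by_hex = sorted(colors, key=lambda c: (c[0] << 16) | (c[1] << 8) | c[2])
by_tuple = sorted(colors)  # tuples compare component-by-component

assert by_hex == by_tuple
```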

All of these *easy* sorts have had a small number of *dimensions* - three at
most. *High-dimensional* data, however, poses a challenge. Consider a
photograph represented as a `BMP` file - at the very least we have an `M` by
`N` grid of pixels, and each pixel has an `R`, `G`, and `B` component. A naïve
algorithm would treat each image as an `M * N`-long string of `RGB` values and
sort lexicographically from there. A potentially better one would first sort
all images by size (`M * N`), then by average `R` in the image, then `G`, then
`B`. However, it is easy to imagine that the right comparison metric depends on
the data itself.
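A sketch of that “potentially better” ordering, with images stood in for by flat lists of `(r, g, b)` tuples (no real `BMP` parsing here):

```python
# Sort images by size (M * N), then average R, then G, then B.
# An image is modelled as a flat list of (r, g, b) pixel tuples.
def image_key(pixels):
    n = len(pixels)
    avg = lambda i: sum(p[i] for p in pixels) / n
    return (n, avg(0), avg(1), avg(2))

small_dark = [(0, 0, 0)] * 4       # 2x2, black
small_red  = [(200, 0, 0)] * 4     # 2x2, red
large      = [(10, 10, 10)] * 9    # 3x3, dark grey

images = sorted([large, small_red, small_dark], key=image_key)
# small_dark precedes small_red (same size, lower average R);
# both precede large (fewer pixels).
```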

Consider a set of images in varying sizes, all of the sky. A human may prefer them sorted by cloud cover (which can be approximated by looking at the number of “non-blue” pixels). If all images have roughly equivalent cloud cover, perhaps by time of day. Or by size. The possibilities are endless, and could drive one insane.
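As a toy version of that first heuristic: call a pixel “non-blue” when its blue channel does not dominate, and sort by the fraction of such pixels. The threshold is an arbitrary choice, not a serious cloud detector:

```python
# Approximate cloud cover as the fraction of "non-blue" pixels, where
# "non-blue" means the blue channel does not exceed both others.
def cloud_cover(pixels):
    non_blue = sum(1 for r, g, b in pixels if b <= max(r, g))
    return non_blue / len(pixels)

clear_sky = [(20, 40, 200)] * 10                         # all blue
cloudy    = [(180, 180, 180)] * 6 + [(20, 40, 200)] * 4  # 60% grey cloud

skies = sorted([cloudy, clear_sky], key=cloud_cover)  # clear sky first
```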

`TurkSort` is one such solution to this problem. Using Mechanical Turk, one can
let *real humans* decide on an appropriate metric. Instead of writing a
comparison function, one would create a unit of work that displays the two
elements and asks which is “less than” the other. Again, in the naïve case,
this is not a tractable problem - using `mergesort` requires `O(n log n)`
comparisons to be made. If each comparison requires a real human’s input, and
the elements to compare are based on the result of the previous comparison, the
time required to sort a list quickly outpaces a reasonable amount of time to
wait.
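To see why the comparisons are inherently sequential, here is a sketch of mergesort with the comparison pulled out into a callback - a stand-in for a round-trip to a worker. The `ask_human` function is hypothetical; here it just answers with `<`:

```python
# Mergesort with an external comparator. Each call to less_than depends on
# the answers to earlier calls, so a human-backed comparator must be
# waited on one question at a time.
def mergesort(items, less_than):
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = mergesort(items[:mid], less_than)
    right = mergesort(items[mid:], less_than)
    merged = []
    while left and right:
        # Which pair we ask about next depends on this answer.
        if less_than(right[0], left[0]):
            merged.append(right.pop(0))
        else:
            merged.append(left.pop(0))
    return merged + left + right

questions = []
def ask_human(a, b):
    """Hypothetical Mechanical Turk round-trip; answered locally here."""
    questions.append((a, b))
    return a < b

result = mergesort([3, 1, 4, 1, 5, 9, 2, 6], ask_human)
```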

One solution is to ask a single human to simply *sort the list themselves*.
Browsing Mechanical Turk, the going rate seems to be around $0.01/minute for
smaller tasks and $1/hour for larger tasks. However, a suitably large set may
be infeasible for a single worker to sort - the work is, frankly, boring. I
would gladly sort 10 items for a small fee; 100 is murkier and 1000 is right
out.

For a larger set, we can solve the latency problem in a few ways. The most
obvious is to create a unit of work for each *pair* of elements to sort.
Assuming we offered the minimum for these - $0.01/pair - how much would it
cost? This is simply a combinatorics problem - a *binomial coefficient*, to be
precise. For `N` elements, we have `N choose 2` pairs of items, which is
roughly `O(n^2)` comparisons. For a 500-element list, we have 124,750
comparisons to make - or $1,247.50 to spend. Unless one works for a well-funded
research department, this is nowhere close to feasible.

One way to tackle this is to bundle many pairs into a single task. One would
need to find the largest number of comparisons one can get a person to do for a
penny, and then give each task that many pairs to sort. However, the number of
pairs grows *quadratically*, while the number of comparisons a person is
willing to do is a *constant*. In Big-O speak, that constant factor is
insignificant to the overall cost.
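In numbers: if a worker does `k` comparisons per penny (the `k = 50` below is a guess), the bill is still quadratic in `n` - doubling the list roughly quadruples the cost:

```python
# Bundling k comparisons into each $0.01 task divides the cost by a
# constant but leaves the quadratic growth intact.
import math

def bundled_cost_dollars(n, k=50):
    tasks = math.ceil(math.comb(n, 2) / k)
    return tasks * 0.01

# bundled_cost_dollars(1000) is roughly 4x bundled_cost_dollars(500).
```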

Another solution is to *not* consider every pair. Assuming humans implement a
*strict weak ordering* (a problematic assumption, but a greatly simplifying
one), we know `A < B` and `B < C` implies `A < C`. This is, in fact, how
efficient sorting algorithms work. The trick, then, is to pick a sorting
algorithm whose behaviour is such that all the comparisons we need to make are
known ahead of time. As we saw before, `mergesort` is *not* one such algorithm,
nor are any of the `O(n log n)` algorithms. Only `O(n^2)` algorithms satisfy
this constraint, and this puts us in a bit of a pickle. To make `TurkSort`
efficient, the quadratic growth must be solved. It is the difference between
$1,247.50 and $44.82. That is around 500 days of a Top Ramen diet (assuming 400
calories per $0.50 package) - or roughly 3 failed startups.
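One `O(n^2)` scheme where every question *is* known up front: post all `N choose 2` pairs as a single parallel batch, then sort locally from the collected answers by ranking each element on how many others it beats. A sketch assuming a total order with no ties, with a plain `<` standing in for the workers:

```python
# Ask every pair up front (one big parallel batch), then derive each
# element's rank as the number of elements the answers say it beats.
from itertools import combinations

items = ["banana", "apple", "cherry"]

# In TurkSort these questions would all be posted simultaneously;
# here the answers come from an ordinary comparison.
answers = {(a, b): a < b for a, b in combinations(items, 2)}

def rank(x):
    beaten = 0
    for (a, b), a_less in answers.items():
        if (a == x and not a_less) or (b == x and a_less):
            beaten += 1
    return beaten

batch_sorted = sorted(items, key=rank)
```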

Next week, we’ll look at ways to tackle the issue by relaxing our sorting to be
*approximate* rather than exact - which plays well with the fallibility of
humans.