We’re not used to thinking parallel

(This is a continuation of my previous post, Why is parallel programming difficult?.)

Because we’re not looking for it, we often don’t notice the potential parallelisms in our programs. I’m not talking about obvious parallel problems like matrix multiplication and graphics rendering pipelines. I mean everyday problems. For example, we’re always weaving independent patterns together to construct our procedures, like this:

pattern_a( A, B )
  C1 <- calc_1( A)
  C2 <- calc_2( A, B, C1)
  return calc_3( C1, C2)

pattern_b( X, Y)
  Z <- calc_4( X)
  return calc_5( X, X, Z, Y)

procedure_that_uses_both_patterns( N, M )
  C1 <- calc_1( N)
  Z  <- calc_4( M)
  C2 <- calc_2( N, Z, C1)
  return calc_5( M, M, Z, calc_3( C1, C2))

We usually don’t think much about the fact that the first two lines C1<-calc_1(N) and Z<-calc_4(M) can probably be switched around or performed in parallel.
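To make that parallelism concrete, here's how those two independent lines could run concurrently in Erlang (the language used in the tic-tac-toe posts below). The module name and the calc_1/calc_4 stubs are hypothetical, just so the sketch runs:

```erlang
-module(par_sketch).
-export([procedure/2]).

%% Stand-in stubs for the pseudocode's calc_1 and calc_4 --
%% hypothetical, just so the sketch is self-contained.
calc_1(A) -> A + 1.
calc_4(X) -> X * 2.

%% Run the two independent first lines concurrently.
procedure(N, M) ->
    Parent = self(),
    spawn(fun() -> Parent ! {c1, calc_1(N)} end),
    spawn(fun() -> Parent ! {z, calc_4(M)} end),
    %% Tagged messages let us gather in a fixed order even though
    %% the two processes may finish in either order.
    C1 = receive {c1, V1} -> V1 end,
    Z = receive {z, V2} -> V2 end,
    {C1, Z}.
```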

And why should we? There’s no annotation in most programming languages to capture the parallelism. The compiler doesn’t want to know.

And what about common algorithms like quicksort? It’s usually taught and implemented as linear, but it’s really a bifurcating parallel algorithm. The merge-sort algorithm involves a massive parallel split at the beginning followed by a hierarchical parallel reduction for the merge. Bubble-sort is a great parallel algorithm. It’s a linear-time sort, provided you have a very long chain of processors, each of which can communicate with its two neighbors.
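To make the bubble-sort claim concrete, here's a sequential sketch of the idea, usually called odd-even transposition sort (the module and function names are mine, not standard). Each phase compares disjoint neighbor pairs, so with one processor per element every comparison in a phase can happen simultaneously, and n phases are enough:

```erlang
-module(oddeven).
-export([sort/1]).

%% n phases of alternating even/odd neighbor compare-and-swaps.
%% Done in parallel on a chain of processors, each phase takes
%% constant time, so the whole sort is linear.
sort(List) ->
    N = length(List),
    lists:foldl(fun(Phase, L) -> phase(Phase rem 2, L) end,
                List, lists:seq(0, N - 1)).

%% Compare-and-swap pairs starting at an even (0) or odd (1) offset.
phase(0, [A, B | Rest]) when A =< B -> [A, B | phase(0, Rest)];
phase(0, [A, B | Rest])             -> [B, A | phase(0, Rest)];
phase(1, [A | Rest])                -> [A | phase(0, Rest)];
phase(_, L)                         -> L.
```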

Linearism is often not inherent. There are parallelisms in many common algorithms. But we don’t notice them much because our programming languages encourage linear thought, and don’t give us a way to capture or annotate parallelisms.

Why is parallel programming difficult?

Why is parallel programming so hard? At least it’s supposed to be hard — that’s the myth. Of course there’s also a myth (among non-programmers) that all programming is hard, for everyone except 12-year-olds in movies.

One reason parallel programming is hard is we don’t get any practice. Programmers aren’t used to thinking parallel. Introductory programming is linear programming, and most real-world bread-and-butter programming tasks don’t require a parallel approach. And frankly, it’s easier to understand and express a linear program, and it’s certainly easier to debug.

And besides, parallel hardware has always been exotic, at least up until now. So there hasn’t been much demand for parallelism, and the tools reflect that. Programming languages are (mostly) linear, described as linear text that becomes a linear token stream. Years of parser and compiler research have focused on grammars to translate that linear token stream into linear machine code.

Programs were linear even before text files. Fortran programs used to be stored on stacks of punch cards.

So our tools and habits encourage linear thinking. We introduce unnecessary constraints and dependencies that break parallelism without even thinking about it.

And that’s one reason parallel programming is hard. A big reason.

Tic-tac-toe in Erlang — parallel processing

This is part of an Erlang tutorial built around a tic-tac-toe program. The program is stuffed into one file, called tic.erl and available here. The source file and this tutorial are organized as follows:

Parallel processing

One of the great things about Erlang is how easy it is to experiment with multi-processing. Or multi-threading. In Erlang there’s not much difference because you don’t work with global variables and mutable data structures. So you don’t have to worry about locking. Most of the time.

In Erlang it’s easy and cheap to start a new “process”. Remember an Erlang process is not an operating system process. An Erlang process is much lighter weight. You can have tens of thousands of them running in parallel on your underpowered laptop and not even notice.
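Here's a throwaway experiment along those lines (the swarm module and run function are my names, not from the tutorial): spawn ten thousand processes, have each send one message back, and count the replies. It typically finishes in a fraction of a second:

```erlang
-module(swarm).
-export([run/1]).

%% Spawn N processes, each sending a single message back to the
%% parent, then count the replies as they arrive.
run(N) ->
    Parent = self(),
    [spawn(fun() -> Parent ! done end) || _ <- lists:seq(1, N)],
    collect(N, 0).

collect(0, Count) -> Count;
collect(N, Count) ->
    receive done -> collect(N - 1, Count + 1) end.
```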

It starts to affect your thinking.

And since we’re looking at a tic-tac-toe sample program with next move calculations and predictions, let’s see how we can take the massively inefficient game space search and make it massively parallel too.

Original implementation

Remember, the functions that search the game space chain together like this:

The calc_all_outcomes(..) function is just waiting to be parallelized (is that a word?). It is where the search space branches. With comments and asserts stripped out, the Erlang code looks like this.

calc_all_outcomes( Board, Xo, Countdown, Outcome_start ) ->
  [ given_move_predict_outcome(
        Board, Xo, Pos, Countdown, Outcome_start)
    || Pos <- lists:seq( 1, 9) ].

Or you might find it easier to read if expressed with map instead of as a list comprehension. I like map myself; I've used it so long it's like an old friend. But the two forms are equivalent.

calc_all_outcomes( Board, Xo, Countdown, Outcome_start ) ->
  lists:map(
    fun( Pos ) ->
      given_move_predict_outcome(
        Board, Xo, Pos, Countdown, Outcome_start)
    end,
    lists:seq( 1, 9)).

Parallel implementation

We can implement this as a function that starts up 9 processes, and then waits for them all to finish and send back results. This is a pretty standard transformative pattern: re-interpreting a map as a scatter to start the processes, and then a gather to bring all the results back home.

calc_all_outcomes( Board, Xo, Countdown, Outcome_start ) ->
  Pid_parent = self( ),

  Scatter_fn =
    fun( Pos ) ->
      spawn(
        fun( ) ->
          % Send results back to parent
          Pid_parent !
            { Pos,
              given_move_predict_outcome(
                Board, Xo, Pos, Countdown, Outcome_start)
            }
        end)
    end,

  Gather_fn =
    fun( Pos ) ->
      receive { Pos, Outcome_predicted } ->
        Outcome_predicted
      end
    end,

  % Start 9 processes
  lists:foreach( Scatter_fn, lists:seq( 1, 9)),

  % Gather and order 9 results
  lists:map( Gather_fn, lists:seq( 1, 9)).
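The same scatter-then-gather shape can be factored into a general-purpose parallel map (pmap is my own helper, not part of tic.erl). Tagging each message with a unique reference keeps replies from different calls, or from unrelated messages, from getting crossed:

```erlang
-module(putil).
-export([pmap/2]).

%% Parallel map: apply Fn to each element of List in its own process.
pmap(Fn, List) ->
    Parent = self(),
    %% Scatter: one process per element, each reply tagged with a
    %% unique reference.
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, Fn(X)} end),
                Ref
            end || X <- List],
    %% Gather: receive in list order; results come back ordered
    %% even though the processes finish in any order.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```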

You can also express this as nested list comprehensions. The code is way cool even if it's a little more difficult to understand.

calc_all_outcomes( Board, Xo, Countdown, Outcome_start ) ->
  Pid_parent = self( ),
  [ receive {Pos, Outcome_predicted} ->
      Outcome_predicted
    end
    || Pos <-
       [ Pos
         || Pos <- lists:seq( 1, 9),
            is_pid(
              spawn(
                fun( ) ->
                  Pid_parent !
                    { Pos,
                      given_move_predict_outcome(
                        Board, Xo, Pos,
                        Countdown, Outcome_start)
                    }
                end)) ]].

Not-so parallel implementation

The thing to remember is to get all the processes going before listening for any return messages. When I first implemented this function I forgot that and came up with this wrong implementation.

% Incorrect implementation:
calc_all_outcomes( Board, Xo, Countdown, Outcome_start ) ->
  Pid_parent = self( ),
  lists:map(
    fun( Pos ) ->
      spawn(
        fun( ) ->
          Pid_parent !
            given_move_predict_outcome(
              Board, Xo, Pos, Countdown, Outcome_start)
        end),
      receive Outcome_predicted -> Outcome_predicted end
    end,
    lists:seq( 1, 9)).

This spawns a process, and then waits for its results before spawning the next one. This actually works fine (and is barely slower than the original non-parallel implementation), but it never has more than 9 processes alive (and one running) at a time.

Analysis

So, how did this work? Wellllll, in game space search I mention that given_move_predict_outcome(..) is evaluated 2,653,002 times if you fully predict all outcomes for an empty board. Which means the parallel code above will spawn 2,653,002 Erlang processes, all of which can potentially be running at the same time. Which is a little breathtaking.

And it's a little too much. Erlang on my system can't do it. (It has no problem with the "wrong" implementation given above though. That also spawns 2,653,002 processes, but never more than 9 at a time.)

But the code in tic.erl never predicts an empty board. It only predicts after at least one mark is down. Which means it would never try to spawn more than 308,817 processes.

And, not surprisingly, that's still too many. Erlang chokes.

But you CAN run the above code with two marks (an X and an O) on the tic-tac-toe board. In that case Erlang only has to spawn 30 or 40 thousand processes, all running at the same time. That's still pretty impressive. On my setup I max out at about 80,000 simultaneous processes.
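Where exactly Erlang chokes depends on the VM's process limit, which you can inspect and raise (these are standard calls and flags, though the default limit varies by Erlang release):

```erlang
%% Ask the VM how many simultaneous processes it allows.
Limit = erlang:system_info(process_limit).
%% To raise the ceiling, start the VM with the +P flag, e.g.:
%%   erl +P 1000000
```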

Final thoughts

Working with something like Erlang, where it's easy to spawn thousands of processes, influences how you think and how you experiment. It lets you ask how you can solve a problem using a bunch of communicating agents, or using a fabric of processes interconnected in a specific hardware architecture. It lets you test ideas and gain a clearer understanding of how a parallel solution might work. If I were designing a parallel system that I was eventually going to implement in C and MPI/OpenMP/CUDA, I'd consider starting with prototyping and concept checking on an Erlang testbed. It'd help catch mistakes and highlight problems early on.