Haskell Optimization and the Game of Life

Andrew Rademacher - andrewrademacher@gmail.com

There are numerous examples of performance optimization floating about the Internet, but they usually involve very specific cases pulled off the shelf to demonstrate a specific compiler feature, or they are specifically designed to go from worst possible case to best case.

This example is intended to show what is possible in a real world use case with a non-trivial problem, John Conway's Game of Life.

You can find the full code-base for the application on GitHub (big surprise, I know).

Rules of the game

The Game of Life is a cellular automaton simulation. The cells are arranged in a 2-D grid, and each has 8 neighboring cells. The life cycle of a cell is dictated by a simple set of rules.

  • If a cell is alive and has fewer than 2 neighbors, it dies as if by loneliness.
  • If a cell is alive and has more than 3 neighbors, it dies as if by starvation due to over-crowding.
  • If a cell is alive and has 2 or 3 neighbors, it remains alive.
  • If a cell is dead and has exactly 3 neighbors, it spawns, as if by reproduction.

A further description of the Game of Life can be found on Wikipedia.
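
These rules condense into a tiny pure function. Here is a minimal sketch; the function and argument names are mine, not from the repo:

-- One cell's next state, from its current state and its live-neighbor
-- count (0-8). A direct encoding of the four rules above.
rule :: Bool -> Int -> Bool
rule alive neighbors
    | neighbors == 3          = True   -- reproduction (or survival with 3)
    | neighbors == 2 && alive = True   -- survival with 2
    | otherwise               = False  -- loneliness or over-crowding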

Initial Design Choices

sample game board

The game board is a plane (a 2-D Cartesian coordinate space), which makes it easy to represent as a list in Haskell. Further, each cell has only two states, alive or dead, so the list could hold boolean values. However, this representation comes with two limitations that could be cost-prohibitive.
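
Concretely, the representation the geometry suggests at first glance looks something like this (the type alias is mine, for illustration):

-- A flat list of cell states: alive or dead.
type Board = [Bool]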

Use vectors instead of lists.

When determining whether a cell should remain alive or die in the next generation, we need to observe the state of all 8 of its neighbors. This kind of random access into the previous generation would be extremely costly with traditional Haskell lists. Remember that lists have an access complexity of O(n), which means that the complexity of counting the neighbors of a cell is O(8n).

If we were to use a Data.Vector instead, our access complexity would be O(1), so the complexity of counting the neighbors of a cell is O(8 * 1), or simplified: O(8). The table below compares the neighbor-count cost when counting the neighbors of the middle cell on the board.

Board Size   Total Cells   List Complexity   Vector Complexity
100x100      10,000        40,000            8
200x200      40,000        160,000           8
300x300      90,000        360,000           8

Because of the random access needed by the game of life, a vector is the clear winner.
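
To make that access pattern concrete, a hypothetical lookup helper might flatten 2-D coordinates onto the 1-D vector like this (the name cellAt and the row-major layout are my assumptions, not the repo's code):

import qualified Data.Vector as V

-- (V.!) indexes in constant time; (!!) on a list would have to walk
-- the spine of the list all the way to element (y * width + x).
cellAt :: V.Vector Int -> Int -> Int -> Int -> Int
cellAt board width x y = board V.! (y * width + x)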

Use integers instead of booleans.

While a cell can only be in one of two possible states, suggesting the use of a boolean, cells are almost exclusively counted; the one exception is displaying the board on screen. This means that if we represented cells with boolean values, we would have to convert each one before every addition when counting neighbors.

count = neighbor1 + neighbor2 + neighbor3 ...  
    where neighbor1 = case (board !! 1) of
                        True  -> 1
                        False -> 0
          ...

Instead, we would prefer to just pre-encode each cell's state in an integer. This saves us from having to type-convert from booleans to integers when doing neighbor counts.

count = (board !! 1) + (board !! 2) + (board !! 3)  

First Version

Repo - File

In spite of the choices made about using vectors and integers, the performance of the first version was admittedly terrible. Here is some of the raw data about its time/space performance. The table shows the amount of time and memory used to calculate 100 generations of a given board size, where the first generation is random.

Board Size   100x100   300x300   500x500   700x700   900x900
Time (s)     0.53      5.78      15.73     33.17     55.14
Space (MB)   92        1011      1629      5066      8217

We know from the start that these numbers are unacceptably high. For example, a 900x900 board has 810,000 cells in it. Each cell contains only one integer, and somehow this consumes 8GB of memory (it was at this moment I patted myself on the back for choosing a laptop with 16GB of memory instead of one with long battery life).

For perspective, an HD image (1920x1080) has 2,073,600 pixels, each containing a 24-bit RGB or a 32-bit RGBA value. Such an image would only consume ~8MB of memory, and even if you had 100 of them, that's still only ~800MB (assuming no compression).

For further perspective, a video game has to render images at a rate of at least 30fps (frames per second) in order to appear as motion to the human eye. Even at minimum settings (which would be necessary if you didn't have a GPU... I want to be sure to compare apples to apples), rendering a 3-D environment from scratch is a wildly more complicated process than counting neighbors on a 2-D plane. This application can only squeeze out 1.8fps.

-sstderr

Let's take a look at the performance report from GHC. This can be done by compiling with the -rtsopts flag, then running with +RTS -sstderr appended to the end of your command-line arguments.
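
For reference, the compile step looks roughly like this (the -O2 flag and the Profile.hs file name are my assumptions; the full run command is shown below):

ghc -O2 -rtsopts Profile.hs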

./Profile -w 900 -h 900 -g 100 +RTS -sstderr900S.txt 
  54,151,984,128 bytes allocated in the heap
  17,295,030,608 bytes copied during GC
   4,479,547,912 bytes maximum residency (12 sample(s))
      94,803,192 bytes maximum slop
            8217 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     114417 colls,     0 par    7.80s    7.81s     0.0001s    0.0085s
  Gen  1        12 colls,     0 par    7.39s    7.41s     0.6175s    3.5248s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   39.64s  ( 39.67s elapsed)
  GC      time   15.20s  ( 15.22s elapsed)
  EXIT    time    0.23s  (  0.24s elapsed)
  Total   time   55.07s  ( 55.14s elapsed)

  %GC     time      27.6%  (27.6% elapsed)

  Alloc rate    1,366,125,607 bytes per MUT second

  Productivity  72.4% of total user, 72.3% of total elapsed

Immediately, we can see that the allocation rate is through the roof, and the program spends more than a quarter of its time in garbage collection. This seems really strange since, in theory, we should only have to allocate one object per generation (the vector that represents it). Everything else should fit in stack memory.

If we look at the cost center output from GHC we see that almost half of our allocations occur in the evaluation of nextCell.nc.

COST CENTRE      MODULE    %time %alloc

nextCell.gn      Life       32.5    1.7  
nextCell.nc      Life       26.2   43.7  
fromCoords       Life       10.2   13.7  
nextCell         Life        8.8   13.7  
nextGeneration   Life        5.0   13.7  
nextCell.(...)   Life        3.7    9.4  
nextCell.gn.midx Life        3.6    0.0  
getCoords.x      Life        3.4    1.7  
getCoords.y      Life        2.4    1.7  
getCoords        Life        1.8    0.0  
nextCell.y       Life        1.0    0.0  

NOTE: I generally start with memory problems before working on time problems. This is because garbage collection is an extremely time-consuming task that gets dramatically more expensive as your heap size increases. I don't know the details for GHC, but in other GC environments like V8 or the JVM, garbage collection requires the system to stop while the GC clears out unused objects. This "GC pause" can be extremely costly: for multi-gigabyte heaps, it can be on the order of full seconds. For this reason, early versions of V8 wouldn't allow the heap to grow beyond 1.9GB. Algorithms that would otherwise be fast can be severely hampered by excessive memory turnover.

The reason why nextCell.nc is responsible for so many allocations has to do with its call to Data.List.sum.

nc = L.sum [ (gn neg neg) , (gn zro neg)   , (gn pos neg)  
           , (gn neg zro) , (0)            , (gn pos zro)
           , (gn neg pos) , (gn zro pos)   , (gn pos pos) ]

This requires us to create a list of the values for each neighbor before calculating a sum. Lists in Haskell are singly-linked lists: for each item, there is a cons cell that holds a pointer to the item's value and a pointer to the next item in the list. In this case, that makes 18 objects: 9 representing nodes in the list and 9 representing the values of those nodes. Each of these objects is treated as independent by the garbage collector. This means that for each generation of a 100x100 board, we have to generate ~180,000 objects just to count neighbors. A better way is to use the vanilla addition operator.

nc =   (gn neg neg) + (gn zro neg)   + (gn pos neg)  
     + (gn neg zro) + (0)            + (gn pos zro)
     + (gn neg pos) + (gn zro pos)   + (gn pos pos) 

The sad part is that this is actually more concise than the list summation I used at first. That was a complete miss on my part.


Repo - File

Now that we have corrected the list-creation issue in the neighbor count, our numbers look just a bit better.

./Profile -w 900 -h 900 -g 100 +RTS -sstderr900S.txt 
  13,327,978,048 bytes allocated in the heap
  18,575,178,928 bytes copied during GC
   4,464,629,208 bytes maximum residency (12 sample(s))
      85,868,864 bytes maximum slop
            8229 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     24303 colls,     0 par    8.25s    8.26s     0.0003s    0.0148s
  Gen  1        12 colls,     0 par    7.62s    7.64s     0.6366s    3.6518s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   26.54s  ( 26.56s elapsed)
  GC      time   15.87s  ( 15.90s elapsed)
  EXIT    time    0.23s  (  0.24s elapsed)
  Total   time   42.65s  ( 42.70s elapsed)

  %GC     time      37.2%  (37.2% elapsed)

  Alloc rate    502,145,549 bytes per MUT second

  Productivity  62.8% of total user, 62.7% of total elapsed

The change cut our total allocation to about a quarter of what it was (54.2GB down to 13.3GB) and saved 12.44s in total execution time. However, I'm still quite surprised by the result. In spite of allocating so much less, the program still spends over a third of its time in garbage collection. To be fair, it is spending almost exactly the same amount of time in GC; that just happens to be a greater proportion of the total execution time. The program is also consuming almost exactly the same amount of RAM.

Our problem could be that we are not explicitly using strict evaluation. Haskell is a lazy language, so it tries desperately not to do work it doesn't need to do. For each expression it builds a graph (a thunk) representing what the value should be, but does not evaluate it until the result is actually needed. These graphs require fairly extensive use of memory. Since we know we will need the value of every cell, we can avoid graph creation by forcing eager evaluation of the expressions. Haskell has a simple syntax for this called bang patterns.

The best way to see the change is to look at the Commit itself. By adding a ! in front of function parameters or local variables you are telling Haskell: "this input cannot be an unevaluated expression graph; it must be a fully evaluated value (like an Int, Float or Double)." This allows Haskell to pass variables around as the values they are, instead of as references to what they will be at some point in the future.
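
To illustrate the idea outside the game (this accumulator example is mine, not code from the repo):

{-# LANGUAGE BangPatterns #-}

-- Without the bang, `acc` would build up a chain of unevaluated (+)
-- thunks on the heap; the bang forces it to a plain Int at each step.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
        | i > n     = acc
        | otherwise = go (acc + i) (i + 1)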


Repo - File

Let's look at the effect of adding strictness to our application.

./Profile -w 900 -h 900 -g 100 +RTS -sstderr900S.txt 
  13,327,978,464 bytes allocated in the heap
  18,575,180,576 bytes copied during GC
   4,464,629,208 bytes maximum residency (12 sample(s))
      85,885,144 bytes maximum slop
            8229 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     24303 colls,     0 par    8.39s    8.40s     0.0003s    0.0099s
  Gen  1        12 colls,     0 par    7.66s    7.68s     0.6399s    3.6718s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   27.16s  ( 27.19s elapsed)
  GC      time   16.06s  ( 16.08s elapsed)
  EXIT    time    0.24s  (  0.25s elapsed)
  Total   time   43.47s  ( 43.52s elapsed)

  %GC     time      36.9%  (37.0% elapsed)

  Alloc rate    490,635,717 bytes per MUT second

  Productivity  63.1% of total user, 63.0% of total elapsed

Whoa?! The program actually got slower! Well, that might not be a completely fair assessment of the situation. Some benefits were derived from adding explicit strictness. If we look at the cost centers before and after, we see a few useful changes.

Before
COST CENTRE      MODULE    %time %alloc

nextCell.gn      Life       28.7    3.6  
nextCell         Life       16.6   10.8  
fromCoords       Life       13.9   28.7  
nextGeneration   Life        8.6   28.7  
nextCell.gn.midx Life        8.5    0.0  
nextCell.nc      Life        8.0    0.0  
getCoords.x      Life        5.5    3.6  
getCoords.y      Life        4.4    3.6  
nextCell.(...)   Life        3.2   19.7  
getCoords        Life        1.9    0.0  
randomBoard      Life        0.2    1.0

After
COST CENTRE    MODULE  %time %alloc

nextCell.gn    Life     56.7    7.2  
nextCell       Life     22.1   21.6  
nextGeneration Life      9.9   54.1  
nextCell.(...) Life      5.6    7.2  
getCoords      Life      3.4    7.2  
fromCoords     Life      1.4    0.0  
randomBoard    Life      0.7    2.0  

First, we have reduced the number of hot functions in the application. By enforcing strictness, we have allowed the compiler to do a better job of in-lining the creation of some small variables. Second, we have massively reduced the cost of the fromCoords and getCoords functions, which map the 2-D plane onto the 1-D vector on which the game board is represented. fromCoords was reduced in time by a whole order of magnitude.

However, the application's overall performance was not improved. In this case, strictness was clearly not the limiting factor for performance, and expression graphs didn't represent a meaningful proportion of the application's memory consumption.

Unboxing your data

It was at this point in the process that I decided to take a second look at the vector library. This program should not be all that intensive at a scale of 900x900, and a vector is supposed to give you high-performance random access.

What I found is that the vector package provides several variants, one of which is the unboxed vector found at Data.Vector.Unboxed. It turns out that the default vector in Haskell is boxed, which means my original assertion that access complexity is O(1) isn't exactly true. Access is still constant time, but it takes two jumps instead of one: call it O(2).

Initially you might think that this is a distinction without a difference. After all, O(2) is still constant time, and you would be right: accessing the 50,000th element of a vector takes the same amount of time as accessing the 1st. However, even if the first jump lands in a CPU cache, the second jump is always into the heap. This means that every access to a boxed vector requires an access to RAM, which on modern CPUs can cost tens of clock cycles.

Programmers often talk about how slow IO is: accessing a hard drive is really slow, and accessing the network is even slower. But it is important to remember that while accessing RAM is massively faster than accessing a disk, it is still orders of magnitude slower than accessing a CPU cache or register.

Using unboxed data also reduces storage overhead. With a boxed vector, each element is stored in memory separately from the vector itself, because the array underlying the vector is not an array of values but an array of pointers, each pointing to a heap location where the value is stored.

The table below shows what this means in terms of how many objects have to be created to represent game boards of different sizes.

Board Size   Total Cells   Boxed Object Count   Unboxed Object Count
100x100      10,000        10,000               1
200x200      40,000        40,000               1
300x300      90,000        90,000               1

The Data.Vector.Unboxed API is almost identical to the Data.Vector API, which means we can switch them out with almost no effort; just change the references, as seen in this Commit.
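
In practice the swap can be as small as changing one import. A sketch (the qualified alias V and the example value are my assumptions, not the repo's code):

import qualified Data.Vector.Unboxed as V   -- was: import qualified Data.Vector as V

-- Unboxed vectors require an Unbox instance for the element type
-- (Int, Double, Bool, tuples of these, ...), which our Int cells satisfy.
board :: V.Vector Int
board = V.replicate (900 * 900) 0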


Repo - File

Let's examine what effect unboxing our data has.

./Profile -w 900 -h 900 -g 100 +RTS -sstderr900S.txt 
   9,436,060,504 bytes allocated in the heap
         469,672 bytes copied during GC
      12,997,456 bytes maximum residency (36 sample(s))
       4,227,016 bytes maximum slop
              36 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     16878 colls,     0 par    0.06s    0.06s     0.0000s    0.0001s
  Gen  1        36 colls,     0 par    0.00s    0.00s     0.0001s    0.0022s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   11.56s  ( 11.57s elapsed)
  GC      time    0.07s  (  0.07s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   11.63s  ( 11.64s elapsed)

  %GC     time       0.6%  (0.6% elapsed)

  Alloc rate    816,449,326 bytes per MUT second

  Productivity  99.4% of total user, 99.3% of total elapsed

Unboxing has an utterly shocking result. We have saved 8.193GB of RAM, and the application now spends less than 1% of its time in garbage collection. To boot, the total execution time is a quarter of what it was before unboxing.

This kind of improvement is absolutely fantastic, but it's still not enough. Our execution is still completely single-threaded, for no good reason. The game of life is not a problem that requires cells to be coordinated; each cell can be calculated completely independently. However, the vector library has no facility for using multiple processor cores. In fact, if we compile the application with the -threaded flag and run with +RTS -N, performance drops and the program takes 16.38s to complete.

Compute your data in parallel

After some digging on Hackage I found a library called repa. This library is built on top of the vector library, but it offers two huge advantages.

  1. Repa comes with its own system for mapping N-dimensional spaces to a 1-D vector. This means we can ditch the fromCoords and getCoords functions. If we wanted to we could even map 3-D or 4-D coordinate systems to a vector.
  2. Repa comes with the ability to perform operations in parallel, fully utilizing multi-core machines.

The repa library takes full advantage of Haskell's parallel compute facilities. This means that we can use the regular map style to express the nextCell function (although in this case we use traverse, so that nextCell is given both a cell's value and its location) and the cells will be computed in parallel. Haskell will also automatically balance the computation of cells across the available logical CPU cores (in my case 8). Haskell does not create a full OS thread for each cell, so there is no worry about clogging the scheduler or consuming excess RAM.
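
To make that concrete, here is a minimal sketch of a traverse-based generation step. It is not the repo's actual code: the names step, cell and gn, and the boundary rule (out-of-bounds neighbors count as dead) are my assumptions.

import Data.Array.Repa (Array, DIM2, U, Z(..), (:.)(..))
import qualified Data.Array.Repa as R

-- One generation. R.traverse hands the step function a lookup (lkp)
-- into the old board plus the index being computed, and R.computeP
-- evaluates the delayed result in parallel across the available cores.
step :: Monad m => Array U DIM2 Int -> m (Array U DIM2 Int)
step board = R.computeP (R.traverse board id cell)
  where
    sh = R.extent board
    cell lkp loc@(Z :. x :. y)
        | nc == 3                 = 1  -- reproduction (or survival with 3)
        | nc == 2 && lkp loc == 1 = 1  -- survival with 2
        | otherwise               = 0  -- loneliness or over-crowding
      where
        -- Out-of-bounds neighbors count as dead.
        gn i j = let p = Z :. (x + i) :. (y + j)
                 in if R.inShape sh p then lkp p else 0
        nc =   gn (-1) (-1) + gn 0 (-1) + gn 1 (-1)
             + gn (-1)   0              + gn 1   0
             + gn (-1)   1 + gn 0   1   + gn 1   1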

Unfortunately, switching from unboxed vectors to repa isn't as easy as switching from boxed to unboxed was; the API for repa is substantially different. You can view the full Commit here; the important stuff is in src/Life.hs.


Second Version - Using Repa

Repo - File

Let's see how well repa handles life.

./Profile -w 900 -h 900 -g 100 +RTS -N -sstderr900P.txt 
   4,855,170,984 bytes allocated in the heap
         913,104 bytes copied during GC
      13,011,688 bytes maximum residency (36 sample(s))
       4,309,792 bytes maximum slop
              40 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1854 colls,  1854 par    1.15s    0.17s     0.0001s    0.0102s
  Gen  1        36 colls,    35 par    0.04s    0.01s     0.0002s    0.0018s

  Parallel GC work balance: 19.63% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   17.21s  (  2.74s elapsed)
  GC      time    1.19s  (  0.18s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   18.41s  (  2.92s elapsed)

  Alloc rate    282,054,182 bytes per MUT second

  Productivity  93.5% of total user, 590.1% of total elapsed

gc_alloc_block_sync: 21591  
whitehole_spin: 0  
gen[0].sync: 17  
gen[1].sync: 174  

Another massive improvement indeed. We added an additional 4MB of overhead for managing multiple threads, but in exchange we saved 8.72s of execution time. It's worth mentioning that while the test was running, my computer never hit 100% CPU utilization. It hovered around 80%, and in smaller tests (100x100) it was even lower. My suspicion is that the benefit of a multi-core implementation grows as the data set grows.

Inline functions when possible

While reading the repa documentation I came across a piece of advice about generating performant code. To quote the documentation:

Add INLINE pragmas to all leaf-functions in your code, especially ones that compute numeric results. Non-inlined lazy function calls can cost upwards of 50 cycles each, while each numeric operator only costs one (or less). In-lining leaf functions also ensures they are specialized at the appropriate numeric types.  

When the compiler generates code for a normal function call, it has to generate the overhead of that call: arguments have to be passed in CPU registers or on the stack, a new stack frame and variable scope have to be set up, and so on. If a function is in-lined, the body of the function is transposed into whatever function calls it. This removes all of the overhead associated with the call, but it also means the function's logic is copied throughout the binary instead of being referenced, which can increase the size of the binary considerably. You can read more at Wikipedia.

I assumed that nextCell, nextCell.gn and nextCell.nc would be in-lined automatically: they are only used once, and they are not exported from the module. To check, I ran the profile, and sure enough they had not been in-lined.

COST CENTRE      MODULE    %time %alloc

nextCell.nc      Life       39.7    0.0  
nextCell.gn      Life       33.1   66.3  
nextCell         Life       14.6   26.5  
nextGen          Life       10.2    4.4  
nextCell.gn.nLoc Life        1.1    0.0  
simulate         Main        0.6    2.0  

Each function was treated as independent in the cost analysis, indicating that they existed as full functions in the binary. To fix this, I took the repa documentation's advice and added an INLINE pragma to the nextCell function. Commit

nextCell :: DIM2 -> (DIM2 -> Int) -> DIM2 -> Int
{-# INLINE nextCell #-}
nextCell !aSh !lkp !loc@(Z :. x :. y) | nc < 2 || nc > 3 = 0

Repo - File

The results of in-lining are...

./Profile -w 900 -h 900 -g 100 +RTS -N -sstderr900P.txt 
     967,168,896 bytes allocated in the heap
         810,312 bytes copied during GC
      13,011,664 bytes maximum residency (36 sample(s))
       3,467,056 bytes maximum slop
              40 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       661 colls,   661 par    0.26s    0.05s     0.0001s    0.0176s
  Gen  1        36 colls,    35 par    0.05s    0.01s     0.0002s    0.0015s

  Parallel GC work balance: 4.26% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    9.83s  (  1.84s elapsed)
  GC      time    0.31s  (  0.05s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   10.14s  (  1.90s elapsed)

  Alloc rate    98,400,962 bytes per MUT second

  Productivity  96.9% of total user, 518.6% of total elapsed

gc_alloc_block_sync: 5839  
whitehole_spin: 0  
gen[0].sync: 0  
gen[1].sync: 0  

By adding the INLINE pragma and strictness annotations, we shaved almost 40% off our execution time. Since there were no structural changes to the code, this performance gain was almost free. If we look at the profiling results from GHC, we find only one major cost center.

COST CENTRE MODULE  %time %alloc

nextGen     Life     94.4   61.7  
simulate    Main      4.4   28.5  
main        Main      0.6    8.5  

GHC has successfully in-lined the entire process of calculating generations into one function: nextGen. This lets the compiler specialize on the concrete types involved and avoid all of the function-call overhead.


Results

Each of these optimizations on its own has a compelling effect on the overall performance of the game of life simulation. Together, the effects are truly astounding both in terms of computation time and space.

Calculation Time

Calculation Space

The combined effect of these optimizations reduces execution time by 96.5%, and memory consumption by 99.5%. This happens with NO algorithmic changes to the application.

But wait! There's more!

What I believe to be the most important result of these kinds of optimizations is not just that the application completes faster, but that it can operate at an entirely new scale. Throughout the optimization process our grid size has always been 900x900. With the optimizations in place we are no longer limited to board sizes in the hundreds, but in the thousands. We can now handle roughly 44x the amount of cell data in about the same amount of time.

Version           Grid Size   Cell Count   Time      Space
Base Version      900x900     810,000      55.14 s   8217 MB
Fully Optimized   6000x6000   36,000,000   58.35 s   555 MB

Disclaimer: I had to make a change in the profiler to get these numbers. The profiler calculates the sum of cells alive in the last generation so that the function has a result that can be printed. At the larger scale this sum takes a non-negligible amount of time, so I switched the serial sum for a parallel one. Commit
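
In repa terms that swap is a single-function change. A sketch, assuming the board is an unboxed repa array (the helper names are mine):

import Data.Array.Repa (Array, DIM2, U, sumAllS, sumAllP)

-- Serial reduction: one core walks the entire array.
countAliveS :: Array U DIM2 Int -> Int
countAliveS = sumAllS

-- Parallel reduction: the fold is split across the available cores.
countAliveP :: Monad m => Array U DIM2 Int -> m Int
countAliveP = sumAllP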

You can still go further.

You could absolutely call it a day after all this, but there is still massive room for improvement. The current implementation doesn't take advantage of any SIMD functionality a CPU might have. This can be addressed in a few ways, two of which stand out to me.

  1. There is a lot of talk about GHC adding support for SIMD in version 7.8. If this feature is added, it will add SIMD support to the vector and dph libraries. Since repa is built on top of vector, the application should be able to take advantage of this improvement automatically. Yes, I am a fan of free optimization.

  2. If you don't want to wait and you have an Nvidia graphics card handy, you could switch the repa library for accelerate. This would allow you to push execution to any CUDA-enabled device and take full advantage of its SIMD nature. However, this is not for the faint of heart, as it would mean a third rewrite from the base version. That said, the game of life should be well suited to a GPU, and it is likely that you would see another order-of-magnitude gain in performance.