Monday, September 14, 2015

Making sequence/mapM for IO take O(1) stack

Summary: We have a version of mapM for IO that takes O(1) stack and is faster than the standard Haskell/GHC one for long lists.

The standard Haskell/GHC base library sequence function in IO takes O(n) stack space. However, working with Tom Ellis, we came up with a version that takes O(1) stack space. Our version is slower at reasonable sizes, but faster at huge sizes (100,000+ elements). The standard definition of sequence (when specialised for both IO and []) is equivalent to:

sequence :: [IO a] -> IO [a]
sequence [] = return []
sequence (x:xs) = do y <- x; ys <- sequence xs; return (y:ys)

Or, when rewritten inlining the monadic bind and opening up the internals of GHC's IO type, becomes:

sequence :: [IO a] -> IO [a]
sequence [] = IO $ \r -> (# r, () #)
sequence (y:ys) = IO $ \r -> case unIO y r of
    (# r, y #) -> case unIO (sequence xs) r of
        (# r, ys #) -> (# r, y:ys #)

For those not familiar with IO, it is internally defined as:

newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #)

Each IO action takes a RealWorld token and returns a RealWorld token, which ensures that IO actions run in order. See here for a full tutorial.

Our observation was that this version requires O(n) stack space, as each recursive call is performed inside a case. The algorithm proceeds in two phases:

  • First, it traverses the input list, evaluating each action and pushing y on the stack.
  • After reaching the [] at the end of the list, it traverses the stack constructing the output list.

By constructing the list directly on the heap we can avoid the extra copy and use O(1) stack. Our version is:

sequenceIO :: [IO a] -> IO [a]
sequenceIO xs = do
        ys <- IO $ \r -> (# r, apply r xs #)
        evaluate $ demand ys
        return ys
    where
        apply r [] = []
        apply r (IO x:xs) = case x r of
            (# r, y #) -> y : apply r xs

        demand [] = ()
        demand (x:xs) = demand xs

Here the two traversals are explicit:

  • First, we traverse the list using apply. Note that we pass the RealWorld token down the list (ensuring the items happen in the right order), but we do not pass it back.
  • To ensure all the IO actions performed during apply happen before we return any of the list, we then demand the list, ensuring the [] element has been forced.

Both these traversals use O(1) stack. The first runs the actions and constructs the list. The second ensures evaluation has happened before we continue. The trick in this algorithm is:

ys <- IO $ \r -> (# r, apply r xs #)

Here we cheat by duplicating r, which is potentially unsafe. This line does not evaluate apply, merely returns a thunk which when evaluated will force apply, and cause the IO to happen during evaluation, at somewhat unspecified times. To regain well-defined evaluation order we force the result of apply on the next line using demand.

Benchmarks

We benchmarked using GHC 7.10.2, comparing the default sequence (which has identical performance to the specialised monomorphic variant at the top of this post), and our sequenceIO. We benchmarked at different lengths of lists. Our sequenceIO is twice as slow at short lists, draws even around 10,000-100,000 elements, and goes 40% faster by 1,000,000 elements.

Our algorithm saves allocating stack, at the cost of iterating through the list twice. It is likely that by tuning the stack allocation flags the standard algorithm would be faster everywhere.

Using sequence at large sizes

Despite improved performance at large size, we would not encourage people to use sequence or mapM at such large sizes - these functions still require O(n) memory. Instead:

  • If you are iterating over the elements, consider fusing this stage with subsequence stages, so that each element is processed one-by-one. The conduit and pipes libraries both help with that approach.
  • If you are reducing the elements, e.g. performing a sum, consider fusing the mapM and sum using something like foldM.

For common operations, such a concat after a mapM, an obvious definition of concatMapM is:

concatMapM :: Monad m => (a -> m [b]) -> [a] -> m [b]
concatMapM f = liftM concat . mapM f

But that if the result of the argument is regularly [] then a more efficient version is:

concatMapM op = foldr f (return [])
    where f x xs = do x <- op x
                      if null x then xs else liftM (x ++) xs

For lists of 10,000 elements long, using the function const (return []), this definition is about 4x faster. Version 1.4.2 of the extra library uses this approach for both concatMapM and mapMaybeM.

Update: there are now real benchmarks of this technique and some others.

8 comments:

Edward Kmett said...

This is (more or less) the instance mentioned back when I talked about why `mapM` needed to remain in Traversable for now.

There we just constructed the output list in reverse order on the heap and then reversed it at the end. This uses more memory than the chicanery you use here, but it works for more types. You can actually use that form of `mapM` with any monad walking [].

However for other monads it usually isn't a pure win, in fact its more or less just IO, ST s, and STM-based monads where this wins. =( Many other (non-IO) monads that can productively produce the result list inside of their respectively monads.

Neil Mitchell said...

I did try the reversing list, but it benchmarked quite a lot slower - I suspect two traversals and two allocations is too expensive. Note that even for IO it's only a win at fairly large sizes, I don't think it's ever worth using.

nomeata said...

Some related thoughts can be found on https://www.joachim-breitner.de/blog/620-Constructing_a_list_in_a_Monad

Neil Mitchell said...

Thanks nomeata, very interesting, I guess this technique could be another way to do what you were doing.

Anonymous said...

I have included your’s and Twern’s ideas in my benchmarks, see http://www.joachim-breitner.de/blog/684-Constructing_a_list_in_a_monad_revisited for the result.

Neil Mitchell said...

Thanks nomeata - I've updated the post to link to your post.

Dan said...

I was also able to use DList is a go-through to have O(1) stack space, using foldM. Thus, the variant of mapM that uses DList for all monads depends on Foldable instead of Traversable, and declared as such:

dlistMapM :: (Monad m, Foldable l) => (a -> m b) -> l a -> m [b]
dlistMapM f = fmap DList.toList . foldM item DList.empty
where item acc x = f x >>= return . (acc `DList.snoc`)

Tom Ellis said...

How about this? I think it's equivalent but written using a standard unsafe primitive rather than hacking around with (# #) and RealWorld.

import Data.List
import System.IO.Unsafe

mapMUnsafe :: (a -> IO b) -> [a] -> IO [b]
mapMUnsafe f xs = forceAll ys `seq` return ys
where ys = map (unsafePerformIO . f) xs

forceAll :: [a] -> ()
forceAll [] = ()
forceAll (a:as) = a `seq` forceAll as