Tuesday, December 13, 2016

space leak

I noticed that after editing a score for 10 minutes or so, the UI would start getting laggy. Usually that means a memory leak and too much GC, and sure enough ekg showed that after each derivation memory usage would jump up, and never go back down again.

The first thing I blame is the cache, because it's the only thing that remains after each derivation, besides the note data itself. If the cache itself somehow holds a reference to the previous cache, then no derivation will ever be freed. It's like the joke where all I wanted was the banana, but I got the banana, the monkey holding the banana, and the jungle the monkey lives in. Only in this case I also get all the previous generations of monkeys.

I tried to debug by stripping out various fields in the cache, and got mysterious results. Dropping the cache entirely would fix the leak, but replacing all of its entries with Invalid tokens would add to it. Then I discovered that Data.Map's fmap is always lazy (of course) and so the test itself was insufficiently strict, and Data.Map.Strict.map lead to more consistent results.

I discovered an intentionally lazy field, with a potentially complicated thunk lurking inside. That's the root cause. In this case, the mechanism to get neighbor note pitches sticks the evaluation in a lazy field, with the idea that if you don't need a neighbor pitch (the common case), then you don't have to pay for the evaluation. I don't need that field once the computation is done, so I stripped it out on return. This still didn't solve the problem, because it was going into another intentionally lazy field, so of course the stripping didn't happen. I bang-patterned the value before putting it in the record, and the leak was gone!

As an aside, I discovered that I don't even need to use the value, e.g.: make x = Record (f x) where !unused = f x is already enough. Of course that's perfectly normal in a strict language, but in haskell I'm used to freely deleting unused bindings.

So the leak was gone, but now the UI had a hitch. The reason the field was intentionally lazy was to avoid doing that work in the event loop, so it could be passed to another thread and forced over there. So removed the bang and now the hitch is gone, but the leak is back! But shouldn't the other thread forcing have cleaned up the thunk in the first place? Then I discovered another fun bug:

force_performance perf = perf_logs perf `deepseq` perf_events perf
    `deepseq` perf_warps perf `deepseq` perf_track_dynamic
    `deepseq` ()

Not too obvious, right? It turns out perf_track_dynamic is exactly the field I needed to force, and yes functions are in NFData, so no type error for that.

So I fixed that and... still the leak. Actually, it seems like the leak is gone in the application, but still there in the test. I did all sorts of messing about trying really ensure that field is forced in the test and no luck.

Finally I somewhat accidentally fixed it, by refactoring force_performance to use less error-prone pattern matching instead of accessor functions, and added another field to the deepseq chain while I was at it. It turns out that other field, which has nothing to do with the guilty perf_track_dynamic one, still somehow had a pointer to it in its thunk. Since it's built in the same function that builds the other record, maybe it has a pointer to that whole function, and hence everything that function mentions. And of course one of the core principles of hunting space leaks is that you have to kill them all at once. It's like a hydra, where you have to cut off all the heads at once to have any effect.

The morale is be really careful about intentionally lazy fields. Of course that includes anything wrapped in any standard type like Maybe or (,)... so put in regression tests for both memory usage growing too much (too lazy) and functions taking longer than expected on large input (too strict).

No comments:

Post a Comment