Hardest Game On Earth Solutions

September 08, 2015

The Hardest Logic Puzzle Ever is a logic puzzle so called by American philosopher and logician George Boolos and published in The Harvard Review of. The solution.

The hardest program I’ve ever written, once you strip out the whitespace, is3,835 lines long. That handful of code took me almost a year to write. Granted,that doesn’t take into account the code that didn’t make it. The commithistory shows that I deleted 20,704 lines of code over that time. Everysurviving line has about three fallen comrades.

If it took that much thrashing to get it right, you’d expect it to do somethingpretty deep right? Maybe a low-level hardware interface or some wicked graphicsdemo with tons of math and pumping early-90s-style techno? A likely-to-turn-evilmachine learning AI Skynet thing?

Nope. It reads in a string and writes out a string. The only difference betweenthe input and output strings is that it modifies some of the whitespacecharacters. I’m talking, of course, about an automated codeformatter.

Introducing dartfmt

I work on the Dart programming language. Part of my job is helping make moreDart code, readable, idiomatic, and consistent, which is why I ended up writingour style guide. That was a good first step, but any style guide written inEnglish is either so brief that it’s ambiguous, or so long that no one reads it.

Go’s “gofmt” tool showed a better solution: automatically formateverything. Code is easier to read and contribute to because it’s already in thestyle you’re used to. Even if the output of the formatter isn’t great, it endsthose interminable soul-crushing arguments on code reviews about formatting.

Of course, I still have to sell users on running the formatter in the firstplace. For that, having great output really does matter. Also, I’m prettypicky with the formatting in my own code, and I didn’t want to tell users to usea tool that I didn’t use myself.

Getting that kind of quality means applying pretty sophisticated formattingrules. That in turn makes performance difficult. I knew balancing quality andspeed would be hard, but I didn’t realize just how deep the rabbit hole went.

I have finally emerged back into the sun, and I’m pleased with what I broughtback. I like the output, and the performance is solid. On my laptop, it can blowthrough over two million lines of code in about 45 seconds, using a single core.

Why is formatting hard?

At this point, you’re probably thinking, “Wait. What’s so hard aboutformatting?” After you’ve parsed, can’t you just walk the AST andpretty-print it with some whitespace?

If every statement fit within the column limit of the page, yup. It’s a piece ofcake. (I think that’s what gofmt does.) But our formatter also keeps your codewithin the line length limit. That means adding line breaks (or “splits” as theformatter calls them), and determining the best place to add those is famouslyhard.

Check out this guy:

There are thirteen places where a line break is possible here according to ourstyle rules. That’s 8,192 different combinations if we brute force them all 1. The search space we have to cover isexponentially large, and even ranking different solutions is a subtle problem.Is it better to split before the .any()? Why or why not?

What is up with the skulls?

I had two goals with this article: to explain how dartfmt works, and to show a realistic picture of how a real programmer solves a difficult problem with all of the messiness that entails. Alas, the first is more than long enough to try your patience, so I shunted all of the dead ends and failed attempts to footnotes. Click the skulls to laugh at my misfortune.

In Dart, we made things harder on ourselves. We have anonymous functions, lotsof higher-order functions, and—until we added async andawait—used futures for concurrency. That means lots ofcallbacks and lots of long method chains. Some Dart users really dig afunctional style and appear to be playing a game where whoever crams the mostwork before a single semicolon wins.

Here’s real code from an amateur player:

Yeah, that’s four nested functions. 1,048,576 ways to split that one. Here’s oneof the best that I’ve found. This is what a pro player brings to the game:

(The funny names are because this was sanitized from internal code.) That’s asingle statement, all 565 characters of it. There are about 549 billion wayswe could line break it.

Ultimately, this is what the formatter does. It applies some fairlysophisticated ranking rules to find the best set of line breaks from anexponential solution space. Note that “best” is a property of the entirestatement being formatted. A line break changes the indentation of theremainder of the statement, which in turn affects which other line breaks areneeded. Sorry, Knuth. No dynamic programming this time 2.

I think the formatter does a good job, but how it does it is a mystery tousers. People get spooked when robots surprise them, so I thought I would tracethe inner workings of its metal mind. And maybe try to justify to myself why ittook me a year to write a program whose behavior in many ways isindistinguishable from cat.

How the formatter sees your code

As you’d expect from a program that works on source code, the formatter isstructured much like a compiler. It has a front end that parses your codeand converts that to an intermediate representation3. It does some optimization and clean up onthat 4, and then the IR goes to aback end 5 that produces the finaloutput. The main objects here are chunks, rules, and spans.

Chunks

A chunk is an atomic unit of formatting. It’s a contiguous region ofcharacters that we know will not contain any line breaks. Given this code:

We break it into these chunks: format/* comment */this;.

Chunks are similar to a token in a conventional compiler, but they tend tobe, well, chunkier. Often, the text for several tokens ends up in the samechunk, like this and ; here. If a line break can never occur between twotokens, they end up in the same chunk 6.

Chunks are mostly linear. For example, given an expression like:

We chunk it to the flat list: some(nested,function(ca +ll)).

We could treat an entire source file like a single flat sequence of chunks, butit would take forever and a day to line break the whole thing 7. With things like long chains of asynchronouscode, a single “statement” may be hundreds of lines of code containing severalnested functions or collections that each contain their own piles of code.

We can’t treat those nested functions or collection literals entirelyindependently because the surrounding expression affects how they are indented.That in turn affects how long their lines are. Indent a function body two morespaces and now its statements have two fewer spaces before they hit the end ofthe line.

Instead, we treat nested block bodies as a separate little list of chunks to beformatted mostly on their own but subordinate to where they appear. The chunkthat begins one of these literals, like the { preceding a function or map,contains a list of child block chunks for the contained block. In other words,chunks do form a tree, but one that only reflects block nesting, notexpressions.

The end of a chunk marks the point where a split may occur in the final output,and the chunk has some data describing it 8. It keeps track of whether a blank line should be addedbetween the chunks (likebetween two class definitions), how much the next line should be indented, andthe expression nesting depth at that point in the code.

The most important bit of data about the split is the rule that controls it 9.

Rules

Each potential split in the program is owned by a rule. A single rule mayown the splits of several chunks. For example, a series of binary operators ofthe same kind like a + b + c + d uses a single rule for the splits after each+ operator.

A rule controls which of its splits break and which don’t. It determines thisbased on the state that the rule is in, which it calls its value. You canthink of a rule like a dial and the value is what you’ve turned it to. Given avalue, the rule will tell you which of its chunks get split.

The simplest rule is a “hard split” rule. It says that its chunkalways splits, so it only has one value: 0. This is useful for things likeline comments where you always need to split after it, even in the middle of anexpression 10.

Then there is a “simple” split rule. It allows two values: 0 meansnone of its chunks split and 1 means they all do. Since most splits areindependent of the others, this gets used for most of the splits in the program.

Beyond that, there are a handful of special-case rules. These are usedin places where we want to more precisely control the configuration of a set ofsplits. For example, the positional argument list in a function list iscontrolled by a single rule. A function call like:

Will have splits after function(, first,, second,, and third). They areall owned by a single rule that only allows the following configurations:

Having a single rule for this instead of individual rules for each argument letsus prohibit things like:

Constraints

Grouping a range of splits under a single rule helps us prevent splitconfigurations we want to avoid like this, but it’s not enough. There are morecomplex constraints we want to enforce like: “if a split occurs inside a listelement, the list should split too”. That avoids output like this:

Here, the list and the + expression have their own rules, but those rules needto interact. If the + takes value 1, the list rule needs to as well. Tosupport this, rules can constrain each other. Any rule can limit the valuesanother rule is allowed to take based on its own value. Typically, this is usedto make a rule inside a nested expression force the rules surrounding itself tosplit when it does 11.

Finally, each rule has a cost. This is a numeric penalty that applies when anyof that rule’s chunks are split. This helps us determine which sets of splitsare better or worse than others 12.

Rule costs are only part of how overall fitness is calculated. Most of the costcalculation comes from spans.

Spans

A span marks a series of contiguous chunks that we want to avoid splitting.I picture it like a rubber band stretching around them. If a split happens inany of those chunks, the span is broken. When that happens, the solution ispenalized based on the cost of the span.

Spans can nest arbitrarily deeply. In an expression like:

There will be spans around a, b and c, d to try to keep those argument listsfrom splitting, but also another span around first(a, b), second(c, d) to keepthe outer argument list from splitting.

If a split occurs between a, and b, the a, b span splits, but so does thefirst(a, b), second(c, d) one. However, if a split occurs after first(a, b),then the a, b span is still fine. In this way, spans teach the formatter toprefer splitting at a higher level of nesting when possible since it breaksfewer nested spans.

Parsing source to chunks

Converting your raw source code to this representation is fairlystraightforward. The formatter uses the wonderful analyzer package to parseyour code to an AST. This gives us a tree structure that represents everysingle byte of your program. Unlike many ASTs, it even includes comments.

Once we have that, the formatter does a top-down traversal of thetree. As it walks, it writes out chunks, rules, and spansfor the various grammar productions. This is where the formatting “style” isdetermined.

There’s no rocket science here, but there are a lot of hairycornercases. Comments can appear in weird places. We haveto handle weird things like:

Here, we normally would have a split after the first argument owned by anargument list rule. But the line comment adheres to the , and has a hard splitafter it, so we need to make sure the argument list rule handles that.

Whitespace is only implicitly tracked by the AST so we have to reconstituteit in the few places where your original whitespace affects the output.Having a detailed test suite really helps here.

Once we’ve visited the entire tree, the AST has been converted to a tree ofchunks and a bunch of spans wrapped around pieces of it.

Formatting chunks

We’ve got ourselves a big tree of chunks owned by a slew of rules. Earlier, Isaid a rule is like a knob. Now we get to dial them in.

Doing this na誰vely is infeasible. Even a small source file contains hundreds ofindividual rules and the set of possible solutions is exponential in the numberof rules.

The first thing we do is divide the chunk list into regions we knowcan’t interfere with each other. These are roughly “lines” of code. So with:

We know that how we split the first statement has no effect on the second one.So we run through the list of chunks and break them into shorter lists wheneverwe hit a hard split that isn’t nested inside an expression.

Each of these shorter chunk lists is fed to the line splitter. Its job is topick the best set of values for all the rules used by the chunks in the line. Inmost cases, this is trivial: if the whole line fits on the page, every rule getsset to zero—no splits—and we’re done.

When a line doesn’t fit, the splitter has to figure out which combination ofrule values produces the best result. That is:

The one with the fewest characters that go over the column limit.
The one with the lowest cost, based on which rules and spans were split.

Calculating the cost for a set of rule values is pretty easy, but there arestill way too many permutations to brute force it. If we can’t brute force it,how do we do it?

How line splitting works

I’m a college dropout so my knowledge of algorithms was fairly, um, rudimentary.So before I interviewed at Google, I spent two days in a hotel room cramming asmany of them—mostly graph traversal—in my head as I could. At thetime, I thought graphs would never come up in the interviews…

Then I had multiple interview questions that reduced down to doing the rightkind of traversal over a graph. At the time, I thought this stuff would never berelevant to my actual job…

Then I spent the past few years at Google discovering that damn near everyprogram I have to write can be reduced down to some kind of graph search. Iwrote a package manager where dependencies are a transitive closure andversion constraint solving is graph based. My hobby roguelike usesgraphs for pathfinding. Graphs out the wazoo. I can do BFS in my sleep now.

Naturally, after several other failed approaches, I found that line splittingcan be handled like a graph search problem 13. Each node in the graph represents asolution—a set of values for each rule. Solutions can bepartial: some rules may be left with their values unbound.

From a given partial solution (including the initial “no rules bound” one),there are edge to new partial solutions. Each binds one additional rule to avalue. By starting from an empty solution and walking this graph, we eventuallyreach complete solutions where all of the rules have been bound to values.

Graph search is great if you know where your destination is and you’re trying tofind the best path. But we don’t actually know that. We don’t know what the bestcomplete solution is. (If we did, we’ve be done already!)

Given this, no textbook graph search algorithm is sufficient. We need to applysome domain knowledge—we need to take advantage of rules and conditionsimplicit in the specific problem we’re solving.

After a dozen dead ends, I found three (sort of four) that are enough to get itfinding the right solution quickly:

Bailing early

We are trying to minimize two soft constraints at the same time:

We want to minimize the number of characters that overflow the line lengthlimit. We can’t make this a hard constraint that there is no overflowbecause it’s possible for a long identifier or string literal to overflow inevery solution. In that case, we still need to find the one that’s closestto fitting.
We want to find the lowest cost—the fewest split rules and brokenspans.

The first constraint dominates the second: we’ll prefer a solution with any costif it fits one more character in. In practice, there is almost always a solutionthat does fit, so it usually comes down to picking the lowest cost solution 14.

We don’t know a priori what the cost of the winning solution will be, but wedo know one useful piece of information: forcing a rule to split alwaysincreases the cost.

If we treat any unbound rule as being implicitly unsplit 15, that means the starting solution witheverything unbound always has the lowest cost (zero). We can then exploreoutward from there in order of increasing cost by adding one rule at a time.

This is a basic best-first search: we keep a running queue of all ofthe partial solutions we’ve haven’t explored yet, sorted from lowest cost tohighest. Each iteration, we pop a solution off.

If the solution completely fits in the page width, then we know we’ve won theoverflow constraint. Since we’re exploring in order of increasing cost, we alsoknow it’s the lowest cost. So, ta-da!, we found the winner and can stopexploring. Otherwise, if it has any unbound rules, we enqueue newsolutions, each of which binds one of those to a value.

We basically explore the entire solution space in order of increasing cost. Assoon as we find a solution that fits in the page, we stop.

Avoiding dead ends

The above sounds pretty promising, but it turns out that there can be animperial ton of “low-cost but overflowing” solutions. When you’re trying toformat a really long line, there are plenty of ways it can not fit, and thisalgorithm will try basically all of them. After all, they’re low cost since theydon’t have many splits.

We need to avoid wasting time tweaking rules that aren’t part of the problem.For example, say we’re looking at a partial solution like this:

There are a bunch of ways we can split the arguments to firstCall(), but wedon’t need to. Its line already fits. The only line we need to worry about isthe secondCall() one.

So, when we are expanding a partial solution, we only bind rules that havechunks on overflowing lines. If all of a rule’s chunks are on linesthat already fit, we don’t mess with it. In fact, we don’t even worry aboutrules on any overflowing line but the first. Since tweaking the first line willaffect the others, there’s no reason to worry about them yet 16.

This dramatically cuts down “branchiness” of the graph. Even though a partialsolution may have dozens of unbound rules, usually only a couple are on longlines and only those get explored.

Pruning redundant branches

This gets us pretty far, but the splitter can still go off the deep end in somecases. The problem is that within large statements, you still run into caseswhere how you format part of the statement is mostly independent of laterparts.

Take something like:

Each of those named arguments can be split in a few different ways. And, sincethose are less nested—which means fewer split spans—than that nastyliveAnalysis: line, it will try every combination of all of them before itfinally gets down to the business of splitting that check() call.

The best way to split the liveAnalysis: line is the best way to split itregardless of how we split assertions: or annotations:. In other words,there are big branches of the solution space that initially differ in irrelevantways, but eventually reconvene to roughly the same solution. We traverse everysingle one of them.

What we need is a way to prune entire branches of the solution space. Given twopartial solutions A and B, if we could say not just “A is better than B” but“every solution we can get to from A will be better than every solution we canget to from B” then we can discard B and the entire branch of solutionsstemming from it.

It took some work, but I finally figured out that you can do this in manycases. Given two partial solutions, if one has a lower cost than the other and:

They have the same set of unbound rules (but their bound rules have differentvalues, obviously).
None of their bound rules are on the same line as an unbound rule.
None of their bound rules place constraints on an unbound rule.

If all of those are true, then the one with a lower cost will always lead tosolutions that also have a lower cost. Its entire branch wins. We can discardthe other solution and everything that it leads to. Once I got thisworking, the formatter could line split damn near anything in record time 17.

An escape hatch

Alas, that “damn near” is significant. There are still a few cases where theformatter takes a long time. I’ve only ever seen this on machine generated code.Stuff like:

Yeah, welcome to my waking nightmare. Unsurprisingly, code like this bogs downthe formatter. I want dartfmt to be usable in things like presubmit scriptswhere it will have a ton of weird code thrown at it and it must complete in areliable amount of time.

So there is one final escape hatch. If the line splitter tries, like, 5,000solutions and still hasn’t found a winner yet, it just picks the best it foundso far and bails.

In practice, I only see it hit this case on generated code. Thank God.

Finally, output

Once the line splitter has picked values for all of the rules, the rest is easy.The formatter walks the tree of chunks, printing their text. When arule forces a chunk to split, it outputs a newline (or two), updates theindentation appropriately and keeps trucking.

The end result is a string of (I hope!) beautifully formatted Dart code. So muchwork just to add or remove a few spaces!

Footnotes

1 Yes, I really did brute forceall of the combinations at first. It let me focus on getting the output correctbefore I worried about performance. Speed was fine for most statements. Theother few wouldn’t finish until after the heat death of the universe.

2 For most of the time, theformatter did use dynamic programming and memoization. I felt like a wizardwhen I first figured out how to do it. It worked fairly well, but was anightmare to debug.

It was highly recursive, and ensuring that the keys to the memoization tablewere precise enough to not cause bugs but not so precise that the cachelookups always fail was a very delicate balancing act. Over time, the amount ofdata needed to uniquely identify the state of a subproblem grew, includingthings like the entire expression nesting stack at a point in the line, and thememoization table performed worse and worse.

3 The IR evolved constantly.Spans and rules were later additions. Even the way chunks tracked indentationchanged frequently. Indentation used to be stored in levels, where each levelwas two spaces. Then directly in spaces. Expression nesting went through anumber of representations.

In all of this, the IR’s job is to balance being easy for the front-end toproduce while being efficient for the back end to consume. The back endreally drives this. The IR is structured to be the right data structure for thealgorithm the back end wants to use.

4 Comments were the one of thebiggest challenges. The formatter initially assumed there would be no newlinesin some places. Who would expect a newline, say, between the keywords inabstract class? Alas, there’s nothing preventing a user from doing:

So I had to do a ton of work to make it resilient in the face of comments andnewlines appearing in all sorts of weird places. There’s no single cleansolution for this, just lots of edge cases and special handling.

5 The back end is where all ofthe performance challenges come from, and it went through two almost completerewrites before it ended up where it is today.

6 I started from a simplerformatter written by a teammate that treated text, whitespace, and splits all asseparate chunks. I unified those so that each chunk included non-whitespacetext, line split information, and whitespace information if it didn’t split.That simplified a lot.

7 When I added support forbetter indentation of nested functions, that broke the code that split sourceinto separately splittable regions. For a while, a single top-level statementwould be split as a single unit, even if it contained nested functions withhundreds of lines of code. It was… not fast.

8 Ideally, the split informationin a chunk would describe the split before the chunk’s text. This would avoidthe pointless split information on the last chunk, and also solve annoyingspecial-case handling of the indentation before the very first chunk.

I’ve tried to correct this mistake a number of times, but it causes anear-infinite number of off-by-one bugs and I just haven’t had the time to pushit all the way through and fix everything.

9 Rules are a relatively recentaddition. Originally each chunk’s split was handled independently. You couldspecify some relations between them like “if this chunk splits then this otherone has to as well”, but you could not express things like “only one of thesethree chunks may split”

Eventually, I realized the latter is what I really needed to get argument listsformatting well, so I conceived of rules as a separate concept and rewrote thefront and line splitter to work using those.

10 At first, I thought hardsplits weren’t needed. Any place a mandatory newline appears (like between twostatements) is a place where you could just break the list of chunks in two andline split each half independently. From the line splitter’s perspective, therewould be no hard splits.

Which would work… except for line comments:

This has to be split as a single unit to get the expression nesting andindentation correct, but it also contains a mandatory newline after the linecomment.

11 There used to be a separateclass for a “multisplit” to directly handle forcing outer expressions to splitwhen inner ones did. Once rules came along, they also needed to expressconstraints between them, and eventually those constraints were expressiveenough to be able to handle the multisplit behavior directly and multisplitswere removed.

12 I spent a lot of timetuning costs for different grammar productions to control how tightly bounddifferent expressions were. The goal was to allow splits at the places where thereader thought code was “loosest”, so stuff like higher precedence expressionswould have higher costs.

Tuning these costs was a nightmare. It was like a hanging mobile where tweakingone cost would unbalance all of the others. On more than one occasion, I foundmyself considering making them floating point instead of integers, a sure signof madness.

It turns out spans are what you really want in order to express looseness.Nested infix operators then fall out naturally because you have more spansaround the deeper nested operands. The parse tree gives it to you for free.

These days, almost every chunk and span has a cost of 1, and it’s the quantityof nested spans and contained chunks that determine where it splits.

13 I had known thatclang-format worked this way for a long time, but I could never wrap my headaround how to apply it to dartfmt’s richer chunk/rule/span system.

I took a lot of walks along the bike trail next to work trying to think througha way to get graph search working when the two numbers being optimized (overflowcharacters and cost) are in direct opposition, and we don’t even know what thegoal state looks like. It took a long time before it clicked. Even then, itdidn’t work at all until I figured out the right heuristics to use to optimizeit.

14 For a long time, overflowand cost were treated as a single fitness function. Every overflow characterjust added a very high value to the cost to make the splitter strongly want toavoid them.

Splitting overflow out as a separate metric turned out to be key to getting thegraph search to work because it let us order the solutions by cost independentlyof overflow characters.

15 I went back and forth onhow an unbound rule should implicitly behave. Treating it as implicitly splitgives you solutions with fewer overflow characters sooner. Treating it asunsplit gives you lower costs.

16 Oh, God. I tried a milliondifferent ways to reduce the branchiness before I hit on only looking at rulesin the first long line. I’m still amazed that it works.

I could also talk about how controlling branchiness lets us avoid reaching thesame state from multiple different paths. After all, it’s a graph, buteverything I’ve described talks about it like it’s a tree. By carefullycontrolling how we extend partial solutions, we ensure we only take a singlepath to any given complete solution.

Before I got that working, I had to keep a “visited” set to make sure we didn’texplore the same regions twice, but just maintaining that set was a bigperformance sink.

17 Discarding overlappingbranches is the last macro-optimization I did and its behavior is very subtle.Correctly detecting when two partial solutions overlap took a lot ofiteration. Every time I thought I had it, one random weird test would fail whereit accidentally collapsed two branches that would eventually diverge.

That bullet list was paid for in blood, sweat, and tears. I honestly don’t thinkI could have figured them out at all until late in the project when I had acomprehensive test suite.

Please enable JavaScript to view the comments powered by Disqus.

pdffasr