Binary Search in Ruby, or, "Picking the Right Number, as Quickly as Possible"
So you’re writing a parser in C that parses the lines of a file. The line you’re parsing is made up of a 40 character key and any number of ip addresses after, space-separated. You need to know a max line length to read (because C is mean like that), but you’re not sure how many ip’s you can fit on a line for a given key.
Such was my case yesterday and decided to write a mini script in ruby to figure it out. My first stab was to iterate from 1 to 100 and checking the line lengths by literally building a line with x number of ip elements on the line. While the code was correct and produced the necessary information for the given inputs, it was horribly inefficient and so I decided to rewrite it to be smarter. Enter the Binary search algorithm.
Using the binary search algorithm, we take a lower and an upper bound of possible elements and try to quickly guess which number is the highest possible without exceeding the line limit. So here’s the concrete data we know. The line format (as described above) will look something like this:
… with theoretically unlimited ips per line. The first value is a key we’ll use to store the ips against in a lookup table, but don’t worry about that right now. The key is generated using sha1 digesting, so we know it will always be 40 characters. The max length for any given ip address is 15 assuming all 4 blocks are at least valued at 100 (e.g. 100.100.100.100). Space-separating the key and x number of ips and your line length calculation is f(x) = kl + (el*x) + x
where x
is line length, kl
is key length, and el
is element length (ip address length). In other words, if we’re testing 50 elements on the line, the line length would be 40 + (15*50) + 50
which equals 840
.
Now that we can arbitrarily calculate the length of a line based on the number of ip elements we want to test, we can start “guessing”. This isn’t guessing at all, we just split our possible range in half and use the middle ground to test the possible length. In other words, if my initial range is 1..100 (read as “anywhere from 1 ip element to 100 ip elements”), then our first test value for x
above would be 50, which if you remember produces a line length of 840. I assumed that I’d be okay with a max line length of 1000 characters, and so we assert that if len
is less than the max, then we can use the upper half of the range boundary, or 50..100
. If len
was more than our max of 1000, we’d take the bottom half, or 1..50
.
Using this technique recursively we can whittle down to the exact number of ip elements that can be inserted on a line before we go over the limit of 1000 characters on the line, which happens to be 60. You know you’re done checking when your range is only one element apart, in this case 60..61. With my first solution to iterate up from 1 to 100, this meant we had to check 61 times before we knew we were over the limit. With this new range, we actually only needed 8 iterations! Very cool how “guessing” can solve the problem quite nicely.
Running the above script will produce the following output:
I’m not really sure if the efficiency part makes sense, but you get a sense that it’s a LOT faster, not only because we’re calculating the line length per test, but also because we’re recursing a fraction of calls that the brute force method performs. It’s also fun to inflate/deflate the max line len or the starting range values to see how it affects the number of recursions needed to find the number. For instance, set the max line len to 100000 and see how many extra calls have to be made. Also, what happens if your range isn’t big enough? What if the range is off (e.g. 75..100)?
Algorithms are nifty.