It’s faster because it’s async!

Some Javascript developers, when presented with an API, think, “I know, I’ll make it async!” Now their users have a giant mess of a state machine implemented with callbacks and recursion, in a GC runtime that doesn’t eliminate tail calls, and two orders of magnitude worse performance. Normally, I wouldn’t care, because you get what you ask for when you write something in Javascript, but this time it happened to a project that I occasionally maintain for $work. And so I was sad.

So, being far from an expert in the ways of JS, I looked for a way out of this mire and stumbled across task.js, which is what the kids are doing these days. It is a neat little hack, although it only works if your JS engine supports generators (most do not). Still, it seemed like a reasonable approach for my problem, so I decided to understand how task.js works by deriving my own.

I should note that if one’s JS engine doesn’t have generators, one can kind of emulate them, in the same sense that one can emulate goto with switch statements. Consider this simple example:


var square_gen = function() {
    var obj = {
        state: 0,

        next: function() {
            var ret = this.state * this.state;
            this.state++;
            return ret;
        },

        send: function(v) {
            this.state = v;
            return this.next();
        },
    };
    return obj;
};

var gen = square_gen();
for (var i=0; i < 4; i++) {
    console.log("The next square is: " + gen.next());
}
/*
Outputs:
The next square is: 0
The next square is: 1
The next square is: 4
The next square is: 9
*/

In other words, square_gen() returns an object that has a next() method which returns the next value, based on some internal state. The next() method could be a full-fledged state machine instead of a simple variable.
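
For instance, here is a sketch (my own invention, not from task.js) of a hand-rolled "generator" that yields 1, 2, and then 3 forever, with the state machine spelled out as a switch:


var count_gen = function() {
    return {
        pc: 0,

        /* switching on an explicit "program counter" is the same
         * trick as emulating goto with switch statements */
        next: function() {
            switch (this.pc) {
            case 0: this.pc = 1; return 1;
            case 1: this.pc = 2; return 2;
            default: return 3;
            }
        },
    };
};

var g = count_gen();
console.log(g.next(), g.next(), g.next(), g.next()); /* 1 2 3 3 */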

Certainly this is less tidy than the equivalent with language support:


var square_gen = function*() {
    var state = 0;
    while (1) {
        yield state * state;
        state++;
    }
};

(I'll assume the existence of generators in examples henceforth.)

The send() function is interesting -- it allows you to set the state from which the generator continues. (Standard ES6 generators spell this differently: the value passed to next(v) becomes the result of the pending yield.)


gen.send(16); /* => 256 */

The key idea is that generators will only do one iteration of a possibly large computation. When yield is reached, the function exits, and the computation is started up again after the yield statement when the caller performs another next() or a send().

Now, suppose you need to do something that takes a while, without waiting, like fetching a resource over the network. When the resource is available, you need to continue with that data. In the meantime you should try to do something else, like allow the UI to render itself.

The normal way one does this is by writing one's code in the (horrible) continuation-passing style, which is fancy talk for passing callbacks to all functions. The task.js way is better: write functions that can block (generators) and switch to other tasks when they do block. No, I'm not making this up -- it is normal for a program in JS land to include a half-baked cooperative multitasker.

You can turn callbacks into promises, as in "I promise to give you value X when Y is done." Promise objects are getting baked into various runtimes, but rolling your own is also easy:


function MyPromise() {
    this.todo = function(){};
}

MyPromise.prototype.then = function(f) {
    this.todo = f;
};

MyPromise.prototype.resolve = function(value) {
    this.todo(value);
};

That is: register a callback with then(), to be called when a value is known. The value is set by calling resolve(). (Note this toy version holds a single callback and assumes then() runs before resolve(); real promises queue callbacks and remember values resolved early.)

Now, suppose you have an async library function that takes a callback:


function example(call_when_done) {
    /* do something asynchronously, then call back with the
     * result (a made-up value here) */
    setTimeout(function() { call_when_done(42); }, 100);
}

Instead of calling it like this:


    example(function(value) { /* my continuation */ });

...you can give it a callback that resolves a promise, which, when resolved, calls your continuation:


    var p = new MyPromise();
    example(function(value) { p.resolve(value); });
    p.then(function(value) { /* my continuation */ });

This is just obfuscation so far. The usefulness comes when you rewrite your functions to return promises, so that:


    function my_fn(params, callback) { /* pyramid of doom */ }

becomes:


    function my_fn(params) { /* ... */ return promise; }

Now you can have generators yield promises, which suspends them until another next() or send() call. Then you can arrange to call send() with the promise's value when it is resolved, which gives you blocking functionality.


    /* An async function that returns a promise */
    function async() {
        var p = new MyPromise();
        example(function(v) { p.resolve(v); });
        return p;
    }

    /* A generator which blocks */
    var gen = function*() {

        /* does some async call and receives a promise */
        var p = async();

        /* blocks until async call is done */
        var v = yield p;

        /* now free to do something with v... */
    }();

    /*
     * Run one iteration of the generator.  If the
     * generator yields a promise, set up another
     * iteration when that promise is resolved.
     */
    function iterate(gen, state) {
        var p = gen.send(state);
        p.then(function(v) {
            iterate(gen, v);
        });
    }

    /* Start a task */
    var p = gen.next();
    p.then(function(v) {
        iterate(gen, v);
    });

    /*
     * You can do something else here, like start another
     * generator.  However, control needs to eventually
     * return to runtime main loop so that async() can
     * make progress.
     */

A lot of details are omitted here, like StopIteration and error handling, but that's the gist. Task.js generalizes a lot of this and has a decent interface, so, use that.
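
For the record, under standard ES6 generator semantics there is no send(); instead, next(v) passes a value in and returns a {value, done} pair. A driver for that world might look like the following -- a minimal sketch, not task.js's actual interface:


function run(gen_fn) {
    var gen = gen_fn();

    function step(result) {
        if (result.done)
            return;
        /* result.value is the yielded promise; resume the
         * generator with the promise's value once it resolves */
        result.value.then(function(v) {
            step(gen.next(v));
        });
    }

    step(gen.next());
}

/* e.g., spawn the blocking task from above */
run(function*() {
    var v = yield async();
    /* ... now free to do something with v ... */
});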

In summary, Javascript is terrible (only slightly less so without c-p-s), and I'm glad I only have to use it occasionally.

AWS precursor

Let’s say it’s 1975, and you have a mountain of data (2 megs) to process, and a heap of cash. Whom do you call to run your Cobol? Your local airline, of course!

Excerpt:

IBM 360/195 for only 50 cents a SECOND

Guaranteed Turnaround! 2meg; 2314’s – 3330’s – 3420’s
OS/MVT
HASP/RJE
MPSX-GPSS-PMS-SSP-CSMP
Ans Cobol, Fortran G, G1, H, Assembler F & H, PL/1 F and PL/1 Optimizing and Checkout Compilers.

Our typical customer is knowledgeable in OS; has good working knowledge of JCL, Utilities and the functions of the compilers/assemblers he uses.
[…]
Call or Write
UNITED AIRLINES

Courtesy of a yellowed copy of Computerworld my dad found among his things. I noticed Google has scanned a lot of old issues but I couldn’t find this one in their archive.

No, I don’t remember those days.

Grepping 300 gigs

It’s fun when a problem is simple, yet the program to solve it takes long enough to run that you can write a second, or third, much faster version while waiting on the first to complete. Today I had a 1.8 billion line, 300 GB sorted input file of the form “key<TAB>value”, and another file of 20k interesting keys. I wanted the lines matching those keys sent to a third file; the same key might appear multiple times in the input.

What doesn’t work: any version of grep -f, which I believe is something like O(N*M).

What could have worked: a one-off program to load the keys into a hash, and output the data in a single pass. Actually, I had written this one-off program before for a different task. In this case, the input file doesn’t need to be sorted, but the key set must be small and well-defined. It is O(N+M).
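
For what it's worth, that single-pass hash version can be as little as one line of awk (file names invented for illustration):


# load the interesting keys into a hash, then print input lines
# whose first tab-separated field is one of them -- O(N+M)
awk -F'\t' 'NR==FNR { keys[$1]; next } $1 in keys' \
    keys.txt input.txt > matches.txt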

What worked: a series of sorted grep invocations (sgrep, but not the one in ubuntu). That is O(M * lg N), which can be a win if M is small compared to N. See also comm(1) and join(1), but I don’t know off the top of my head whether they start with a binary search. I had fairly low selectivity in my input set, so the binary search helps immensely.
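
Lacking an sgrep, look(1) can pull the same binary-search trick on a sorted file; a rough sketch, assuming the input was sorted under LC_ALL=C so that the sort order matches look's byte-wise comparison:


# one binary search per key; the trailing tab keeps a key from
# matching longer keys that it happens to be a prefix of
while IFS= read -r key; do
    look "$key$(printf '\t')" input.txt
done < keys.txt > matches.txt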

What also could have worked: a streaming map-reduce or gnu parallel job (the input file was map-reduce output, but my cluster was busy at the time I was processing this, and I am lazy). That would still be O(N+M), but distributing across P machines in the cluster would cut it by a factor of around P. P would need to be large here to be competitive with the previous solution from a complexity standpoint.

Making up numbers, these take about 0, 5, 1, and 20 minutes to write, respectively.

Of course, algorithmic complexity isn’t the whole picture. The last option, or otherwise splitting the input/output across disks on a single machine, would have gone a long way to address this problem:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              10.50     0.00  361.00    9.75 83682.00  4992.00   478.35   139.42  174.20    6.43 6385.95   2.70 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

You can’t go any faster when blocked on I/O.

Note the similarities with join algorithms in databases — variations on nested loop, hash, and sort-merge joins are all present above.

Cooking with gnuplot

Over the winter holidays I was put in charge of cooking one (of several) of the family dinners. At my house, a Christmas dinner can mean only one thing: prime rib is on the menu. The local grocery store had a great deal on rib roasts, so I bought a whole one. All seven ribs, 25 pounds of it. When it came time to cook this beast, I did plenty of reading, and settled on this seriouseats recipe. I guessed at about six hours to slow-roast the behemoth. But after a few hours of roasting, I decided it would be nice to know whether it would finish in time for the guests, or whether we would have to invent some pre-meal activities to stall.

Linear regression to the rescue! I had a leave-in meat thermometer plugged into the slab of cow, the type with a cable that runs outside the oven so that you can read the temperature without opening the oven door. It was then a simple matter to record the temperature every fifteen minutes and plot it to see how it was going. My uninformed guess was that the temperature curve is really sigmoid-shaped, but linearity is probably close enough around the target range.

Gnuplot can do linear regression for you:

# model temperature as a linear function of time
f(x) = a*x + b
# fit a and b to the recorded data (column 1: time, column 2: temperature)
fit f(x) 'temp.dat' u 1:2 via a, b
set xrange [-5:160]
plot 'temp.dat' u 1:2 w linespoints, f(x)

This produces a graph like the image below, which shows that after 3 hours of cooking, the meat would be around 128 degrees (I started keeping track about three hours in).
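
Gnuplot can also print the prediction directly once fit has filled in a and b -- for instance, the minute mark at which the roast should hit a 128-degree target:


# invert f(x) = 128 to get the predicted time
print (128.0 - b) / a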

In the end, I turned up the oven a bit in the last hour to speed things along. The meat turned out great, although I didn’t have too much luck with the in-between rest that the recipe promotes: there were still plenty of juices all over the place at carving time. Next time, I believe I’ll just turn the oven up to 500 deg. F when the interior target temperature is reached, and then do a normal rest afterwards. Another lesson learned: a full rib roast is perhaps twice as much as needed for eight people, but I am not one to complain about having prime rib leftovers for a week.

Mavened.

So, I’m applying a one line patch to a Java package, and oh yes, it needs maven. Hooray!

$ apt-cache depends --recurse maven | grep Depends | 
     grep -v '<' | egrep 'lib.*java' | sort | uniq | wc -l
203
[... start build and watch maven download another pile ...]
$ find ~/.m2 -name '*pom' | wc -l
851

I find it hard to believe that there are that many useful libraries in the world.

Olio

I neglected to write up anything on the blog in November despite it being the penultimate of months, so here’s a meandering catch-up post to atone. My apologies for the gratuitous self-linking that is about to ensue.

On Halloween: our 2-year-old went as Kung Fu Panda (his favorite movie, and yes, shame on his parents for letting him watch movies) for Halloween this year. He was quite excited to learn that you can just go ask strangers for candy and they will give it to you. He has mastered enough language to say “Pumpkin Scary,” which he did at every opportunity upon seeing the skull I had carved on the pumpkin on the right. The other pumpkin is supposed to be McQueen from the Disney Cars property; Alex called it “Pumpkin Car.” They are now composting. So it goes.

On Thanksgiving: being a half-US, half-Canadian family, we get to celebrate both Thanksgivings: the fake one and the real one. And so we did. It was great spending time with the family and meeting up with some old friends in Atlanta, though we learned painful lessons about air travel with young children.

On Canada: I’ve now been in Canada for a little over a year. Among the observations I had detailed previously, I can now add these:

  • We do have a few days here in the summer that qualify as hot, but no, it does not get Georgia-in-August-hot.
  • Canadian Coca-cola is superior to non-KFP (kosher-for-Passover) USA Coke, and despite what the PR machine in Atlanta will tell you, you can easily tell the difference between sugar and HFCS when you’re used to one or the other.
  • Canadian TV is even more of a wasteland when there is no hockey.

On Bash Goto: per my last post, I considered how hard it would be to write an x86 emulator in bash. Conclusion: despite the potential good fun in simulating %eip using ‘nl’ and sed, I’ll leave this task to someone else. However, I did improve my actual implementation of this somewhat. One easy win is to put the jump labels inside comments so that an ordinary run of the script won’t barf. And so that is what I did.

On Work: while I’ve been a contractor in name for the last year, I have now taken on some other contracts and thereby made this status more official. It was a tough decision to go this route versus, say, working on a salaried basis with some large, hypothetical mobile chipmaker with a Canadian presence, but so far I am happy with the choice. Most recently, I have been doing some Linux mesh networking stuff with Cozybit. It may be a while before any of it finds its way upstream (there are NDAs involved) but I claim that it is cool stuff. In the meantime, I get to continue slaying big data dragons at LP/Xmarks.

On HBase: speaking of HBase, two things have recently come to my attention. First, there is Hannibal, a cluster monitoring tool which was inspired by my post about beating gnuplot over the head with perl. I had nothing to do with its implementation otherwise, but it looks pretty cool. Secondly, I recently had an enquiry about my cache-oblivious code from some HBase folks. I’m not working on that either, but I am hopeful that something comes of it since it would be great if these ideas (not my own) percolate out into mainstream practice.

goto in bash

If you are as old and nerdy as I, you may have spent your grade school days hacking in the BASIC computer language. One of the (mostly hated) features of the (mostly hated) language was that any statement required a line number; this provided both the ability to edit individual lines of the program without a screen editor, as well as de facto labels for the (mostly hated) GOTO and GOSUB commands. But you could also use line numbers to run your program starting from any random point: “RUN 250” might start in the middle of a program, typically after line 250 exited with some syntax error and was subsequently fixed.

Today, in bash, we have no such facility. Why on earth would anyone want it, with the presence of actual flow control constructs? Who knows, but asking Google about “bash goto” shows that I am not the first.

For my part, at $work, I have a particular script which takes several days to run, each part of which may take many hours, and, due to moon phases, may fail haphazardly. If a command fails, the state up to that point is preserved, so I just need to continue where that left off. Each major part of the job is already factored into individual scripts, so I could cut-and-paste commands from the failure point onward, but I’m lazy.

Thus, I present bash goto. It runs sed on itself to strip out any parts of the script that shouldn’t run, and then evals it all. Prepare to cringe.


#!/bin/bash
# include this boilerplate
function jumpto
{
    label=$1
    cmd=$(sed -n "/$label:/{:a;n;p;ba};" "$0" | grep -v ':$')
    eval "$cmd"
    exit
}

start=${1:-"start"}

jumpto "$start"

start:
# your script goes here...
x=100
jumpto foo

mid:
x=101
echo "This is not printed!"

foo:
x=${x:-10}
echo x is $x

results in:


$ ./test.sh
x is 100
$ ./test.sh foo
x is 10
$ ./test.sh mid
This is not printed!
x is 101

My quest to make bash look like assembly language draws ever nearer to completion.

Update 2019/05/21: A reader pointed out that executing one of the labels results in bash complaining “command not found” and suggested putting the labels in a comment, which works just fine without any other changes (but if you like, you can drop the “grep -v” in that case). You might also be interested in the takes of folks on reddit, or my own take of being discussed on reddit.
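
For illustration, the comment-label form looks like this (with the same jumpto boilerplate as above; the sed pattern matches “foo:” anywhere in a line, so the labels still work):


# start:
x=100
jumpto foo

# foo:
echo x is $x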

Update 2023/08/04: Greetings, and my apologies, to Hacker News readers all these years later. Hi!

Amped

Some progress has been made on my guitar chip amp. Readers may remember that at the time of my last post, I had tinkered with KiCad and created the basic layout for my PCB. Since then, I finalized the design and had it fabbed through the great OSH Park batch prototyping service. This is a great deal: it wound up being about $10 a board (you get three at $5/sq. in) and the boards are of excellent quality. It took about 3 weeks from the time of design submission for the finished boards to arrive in Canada from the States.


[Images: Practice amp PCB - top; Practice amp PCB - bottom]

Above are the top and bottom of one of the unpopulated boards. The solder mask is the OSH Park signature purple, and the through-holes are gold-plated. The bottom side silkscreen says “This silkscreen intentionally left blank” — a last-minute addition because the service’s design checks fail if any layers are missing. I should get more creative with that next time.


[Image: Speaker box practice amp]

Although I do plan to someday build a wooden speaker cabinet to house everything, for now I just shoved all the electronics in the box that the speaker came in, as it happens to be the right size. Plug the instrument cable in the front, flip on the illuminated switch, and one is ready to rock.

I had all the parts waiting for the boards, so when they arrived I quickly populated one and hooked it up, plugged in my Les Paul and strummed an E chord. The resulting sound was something short of musical: loud, ringing, distorted noises were heard instead of the clear tones of my prototype. I feared that my board design was flawed, and without a scope handy, it would be tough to track down. But on the plus side, I knew that I was about to learn something.

I spent an hour or so following all of the traces on the board and comparing them to the schematic, making sure the caps were all connected the right way around and so forth. I also compared the finished board to my protoboard, which happened to work fine. Everything was the same, except where I had used two 1 ohm resistors in series to invent a 2 ohm resistor (the datasheet called for a 2.2 which I didn’t have at prototyping time).

Then I looked at said resistors a little more closely. On the outside edge, I could just make out a faint yellow band where I previously thought there was none. And that brown band was just the tiniest bit in the purple spectrum. Oops! My 1 ohm resistors were actually 47 ohms, making the gain of my prototype a measly three instead of the 101 I thought it was. It turns out that a gain of 101 is way too high without a volume knob. Also, one of the 47-ohm resistors was in a stabilizing RC circuit, likely causing the ringing oscillations I had heard in my PCB version.


[Image: Practice amp, populated board]

I fixed the RC circuit when building board number two, but stuck with the 2.2 ohm resistor and accompanying large gain. I might use this version if I decide to put a volume knob in front of the amplifier, but I do find it’s a little too noisy overall for my tastes. I thus went back to board number one, desoldered the 47 ohm and 2.2 ohm resistors and replaced them with 1 ohm and 22 ohm resistors respectively, lowering the gain of that board to a modest eleven.
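
As a sanity check on those gain figures: they are consistent with the usual non-inverting gain of 1 + Rf/Rg, assuming a 220 ohm feedback resistor (an inference from the numbers above, not a value stated anywhere here):

gain = 1 + Rf/Rg
1 + 220/2.2 = 101   (the design)
1 + 220/94  ≈ 3.3   (the "1 ohm" prototype resistors, actually 2 x 47 ohm)
1 + 220/22  = 11    (the reworked board)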


[Images: Practice amp prototype; practice amp PCB guts]

Pictured above are the before and after of prototype and PCB versions. There are certainly a few things I would change if I ever make another revision of the PCB. For one, the annular rings on the heat sink and connector footprints were too small, so soldering them was difficult with so little of the pad exposed. I might also do away with or reduce the size of the ground plane; even with the thermal reliefs, soldering and desoldering the ground connections was no easy task. But on the whole, making the PCB was an interesting experience and I look forward to having another excuse to do so again.

I’ll put the design files up on github one of these days.

KiCad Level 0 Achieved

When we moved to Canada, I sold any guitar gear that wouldn’t fit in a backpack. This has left me without the ability to plug in my electric guitar for the last ten months, a situation that must be rectified (ha, ha). I just need a small practice amp, and those can be found on an IC these days. One need only do a little soldering, attach a speaker, repeat as necessary for whatever mistakes one makes, and ta-da: a small custom amp that one could have bought mass-produced from China for 1/50th of the price of building one’s own. But of course there is value in the making, or so we tell ourselves.

Anyway, I’ve built my circuit (more or less the same one as on the op-amp’s datasheet) on a breadboard and it sounds fine except for the complete lack of shielding. Since it’s now possible to get one-off PCBs without spending a fortune, and using a real board is more fun than protoboards, let’s do that!

Back in the dark ages, I used to work on PCB software that sold for $30k per license (I thought that was ridiculous then, too). While I seem to have stuffed some arcane knowledge about keepouts and autorouters somewhere in the back of my brain, most of the PCB black magic did not rub off on me. After a 15-year hiatus from that domain, I can firmly label myself a “PCB novice.”

Eagle Light (free as in beer) is the choice of many a hobbyist. I have poked at Eagle before, but never liked it enough to do anything substantive. These days, getting it to run at all means finding library versions that aren’t used in any current Linux distribution, building them, doing some LD_LIBRARY_PATH hacks, setting up a 32-bit runtime, and so on. Yes, I also did that this weekend. Eagle has a nice big standard library, but I feel learning it at all is a bit of a dead end when there are usable, truly free alternatives. And KiCad looks like it just may fit the bill. I apt-get installed it and was off to the races.

Overall, KiCad is a nice piece of FOSS software that can easily do whatever one would do in Eagle. Unfortunately, it still reminds me of what I hate about PCB software in general. I find designing a PCB to already be extremely fiddly and unfun; as a non-expert, I just want to build a schematic and then place and route things with sensible defaults. Instead, what I wound up doing was learning to use the symbol editor, and the module editor, gathering various facts about drill and pad sizes, and picking up KiCad’s various quirks such as how you have to constantly reload newly created libraries, and how the UI frequently makes no sense. Of course, UIs never make any sense in EDA tools, so you sometimes just have to go with it. Doing librarian work is ultra-boring, especially for tiny boards like this.

Still, in the course of a few hours, I built the basic schematic (including adding three new symbols), passed ERCs, created three missing footprints, and did an initial routing of the board. Not too bad.

In the last decade component libraries have grown 3-D models, which is kind of neat. On the one hand, it’s not terribly useful since the canned components are unlikely to closely match reality, and editing 3-D models of components is even less interesting to me than editing their footprint. But, it does make for decent screenshots like the one above.