Tablet guts

In case you want to know what the inside of a Samsung Galaxy Tab 3.0 looks like, I propose the following experiment: unplug one, and let it sit unused and un-powered for two months. You will then find it in the state where disconnecting and reconnecting the battery is required to get it to charge again. I am not making this up. Oh, and opening it is pretty tricky even with the proper plastic tools. Since I have two of them, that’s how I lost thirty minutes of my life on Saturday.

As for why I have two: this particular model has a Marvell SoC inside, and the wireless SD8787 peripheral can be used with the upstream cfg80211-based mwifiex driver. For cozybit, I helped write an alternative mac80211 driver that can run mesh. It was a little slower and more power hungry than mwifiex, but in addition to being mesh/ibss/AP capable, had some nice-for-development features like SDIO tracing and an nl80211-testmode interface that could run firmware commands from userspace, upon which we built some test scaffolding. You can get the code for that driver today, but development is pretty much at a dead end because we needed to extend the firmware for host-based operation, and there will probably never be a redistributable firmware at this point.

I’d love to have an open firmware for the device, but as I have seen and touched the NDA encumbered firmware, it’s unlikely that I can have any hand in bringing that about.

Phone bugs

If you want to experience what it is like to be the first person to ever do something, might I suggest turning on all the debug knobs on an Android vendor kernel? Some speculative fixes over here, but there is still at least one other RCU bug on boot, and this use-after-free in the usb controller driver:

[   34.520080] msm_hsic_host msm_hsic_host: remove, state 1
[   34.525329] usb usb1: USB disconnect, device number 1
[   34.529602] usb 1-1: USB disconnect, device number 2
[   34.637023] msm_hsic_host msm_hsic_host: USB bus 1 deregistered
[   34.668945] Unable to handle kernel paging request at virtual address aaaaaae6
[   34.675201] pgd = c0004000
[   34.678497] [aaaaaae6] *pgd=00000000
[   34.681762] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[   34.686737] Modules linked in: wcn36xx_msm(O) wcn36xx(O) mac80211(O) cfg80211(O) compat(O)
[   34.694976] CPU: 1    Tainted: G        W  O  (3.4.0-g4a73a1d-00005-g311eaee-dirty #2)
[   34.702972] PC is at __gpio_get_value+0x28/0x1cc
[   34.707489] LR is at do_restart+0x24/0xd8
[   34.711486] pc : []    lr : []    psr: 60000013
[   34.711486] sp : ebcd9ed8  ip : c032a858  fp : ebcd9ef4
[   34.722930] r10: 00000000  r9 : c04dd67c  r8 : 00000000
[   34.728210] r7 : c4d81f00  r6 : 6b6b6b6b  r5 : c10099cc  r4 : aaaaaaaa
[   34.734680] r3 : 09090904  r2 : c1672f80  r1 : 00000010  r0 : 6b6b6b6b
                                                                 ^^^^^^^^ whoops
[   34.741241] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[   34.748504] Control: 10c5787d  Table: acfb006a  DAC: 00000015

I guess I should try to find out where to report bugs for these (predictably) not upstream drivers, but that seems like a total pain.

Footprint, Part 2

My recent posting of ASCII art was intentionally subtle, perhaps overly so. If you haven’t figured it out by now, it is a C program that announces the birth of our second child. When compiled and executed, it prints:

Ian Yit-Sing Copeland
Born 5 August, 2013 at 02:49
Weight 6 lbs 11 oz

In this post, I will explain how I created it.

Like many a C practitioner, I’ve always found International Obfuscated C Code Contest entries to be at once entertaining and mystifying. While I don’t consider this effort to be near the level of IOCCC entries, I thought it might be fun to try my own IOCCC-lite ASCII art program.

I knew I wanted the program to simply print an announcement string, in a non-obvious fashion. It also had to be easy to modify the program with the actual details on the day of the delivery. Indeed, I wrote this program several weeks in advance and modified the output in only a few minutes.

Thanks to the Can You Crack It challenge, I already had an RC4-like algorithm sitting on my disk. The plan then was simple: embed the key and ciphertext in the program, and just decrypt and print it. Of course, the ciphertext would be all binary, which is hard to encode in a compact manner in the source. Thus, I decided to store the ciphertext in base32 encoding.

I actually didn’t put much effort into obfuscating the code; the main issue was getting the size of the source code down to around 600 characters, and doing that already involved using typedefs and one-letter variable names. By the time that was done, I didn’t change much. Apart from variable naming, the main obfuscation step was to change the base32 mapping table to consist only of numbers and symbols so that the embedded string would itself look like code. The code otherwise is pretty straightforward.

My starting point, which base32-decoded and decrypted some placeholder text, looked like this:

#include <string.h>
#include <stdio.h>

char str[] = "43(:7&!#3&@>$%^|]:&6;<7*-$}9{;!*!$5{<;-=^8]#:5<@#%#&!5!@40207#9($3&)7<$1";

int b32dec(char *in, char *out, int len)
{
    int i;
    char alpha[] = "3:5[9&2^]{7}*<-8@=4(6#!>|)0+;1$%";
    for (i=0; i < len; i++)
    {
        unsigned char bits = strchr(alpha, in[i])-alpha;
        int j = (i / 8) * 5;
        switch (i % 8) {
            case 0:
                out[j + 0] |= bits << 3;
                break;
            case 1:
                out[j + 0] |= bits >> 2;
                out[j + 1] |= bits << 6;
                break;
            case 2:
                out[j + 1] |= bits << 1;
                break;
            case 3:
                out[j + 1] |= bits >> 4;
                out[j + 2] |= bits << 4;
                break;
            case 4:
                out[j + 2] |= bits >> 1;
                out[j + 3] |= bits << 7;
                break;
            case 5:
                out[j + 3] |= bits << 2;
                break;
            case 6:
                out[j + 3] |= bits >> 3;
                out[j + 4] |= bits << 5;
                break;
            case 7:
                out[j + 4] |= bits;
                break;
        }
    }
    return i/8 * 5;
}


void keysched(unsigned char *x, unsigned char *key, int keylen)
{
    int i;
    unsigned char tmp, a = 0;

    for (i=0; i < 256; i++)
        x[i] = i;

    for (i=0; i < 256; i++)
    {
        a += x[i] + key[i % keylen];
        tmp = x[i];
        x[i] = x[a];
        x[a] = tmp;
    }
}

void crypt(unsigned char *x, unsigned char *y, int len)
{
    unsigned char a;
    unsigned char b = 0;
    unsigned char tmp;
    int i;

    for (i=0; i < len; i++)
    {
        a = i+1;
        b += x[a];
        tmp = x[a];
        x[a] = x[b];
        x[b] = tmp;
        y[i] ^= x[(x[a] + x[b]) & 0xff];
    }
}

int main()
{
    unsigned char x[256];
    unsigned char key[] = "abcd";
    unsigned char crypt_text[sizeof(str)] = {};
    int len;

    len = b32dec(str, crypt_text, strlen(str));
    keysched(x, key, strlen(key));
    crypt(x, crypt_text, len);
    printf("%sn", crypt_text);
}

Micro-optimizing for source code size is unusual, and in some ways the reverse of optimizing generated code. For example, this change saved a couple of characters by doing the opposite of what you would normally do for speed:

-            *o++ = d[0] << 3 | d[1] >> 2;
+            *o++ = d[0] * 8 | d[1] / 4;

Similarly, I found that combining unrelated functions was useful in eliminating the character waste of function definitions. There are likely a good deal more space-savers to be found; I quit when I got it small enough.

I experimented with a few different formats for the code. It turns out that I'm not terribly good at drawing ASCII art. I abandoned a baby bottle as unrecognizable and went with a footprint after seeing this motif on some birth announcements. I hand-drew it, scanned it in, loaded it as a background in an HTML document, and then put dollar signs in the right places. Yes, cheating, but I am no artist.

                                          $$$$
                                        $$$$$$$$
                              $$$      $$$$$$$$$
                        $$   $$$$$    $$$$$$$$$
                       $$$$   $$$     $$$$$$$$$
                       $$$$             $$$$$$
                  $$   $$
                 $$$$            $$$$$$
             $  $$$$$       $$$$$$$$$$$$$
           $$$$  $$$$    $$$$$$$$$$$$$$$$$
           $$$$       $$$$$$$$$$$$$$$$$$$$$
           $$$$     $$$$$$$$$$$$$$$$$$$$$$$
            $$    $$$$$$$$$$$$$$$$$$$$$$$$$
                $$$$$$$$$$$$$$$$$$$$$$$$$$$
               $$$$$$$$$$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$$
              $$$$$$$$$$$$$$$
               $$$$$$$$$$$$$
               $$$$$$$$$$$$
                 $$$$$$$$$
                    $$$$

I wrote a python script to encrypt and encode the string to be embedded in the code. A second script tokenized the original C source and replaced dollar signs from the text template with code. The latter script is not perfect, but worked well enough that I could hand-edit the output in a few seconds to get something reasonable.
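
For the encode side, the essentials fit in a few lines. The following is a rough sketch of the idea rather than the actual script from the repository: RC4-encrypt the announcement (matching keysched/crypt above, with the placeholder key from the listing) and base32-encode the result with the custom symbol alphabet that b32dec expects.

ALPHA = "3:5[9&2^]{7}*<-8@=4(6#!>|)0+;1$%"   # 32 symbols, 5 bits each

def rc4_stream(key, n):
    # Standard RC4 key schedule, then n keystream bytes.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xff
        s[i], s[j] = s[j], s[i]
    out, i, j = [], 0, 0
    for _ in range(n):
        i = (i + 1) & 0xff
        j = (j + s[i]) & 0xff
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) & 0xff])
    return out

def b32enc(data):
    # Pack bytes big-endian into 5-bit groups and map them through ALPHA,
    # mirroring the bit layout that b32dec unpacks.
    bits = "".join(format(b, "08b") for b in data)
    bits += "0" * (-len(bits) % 5)
    return "".join(ALPHA[int(bits[i:i+5], 2)] for i in range(0, len(bits), 5))

message = "Ian Yit-Sing Copeland\nBorn 5 August, 2013 at 02:49\nWeight 6 lbs 11 oz"
message += " " * (-len(message) % 5)   # pad to a multiple of 5 bytes: b32dec's
                                       # return value counts whole 8-symbol groups
key = "abcd"                           # placeholder key, as in the listing above
data = [ord(c) for c in message]
stream = rc4_stream([ord(c) for c in key], len(data))
print(b32enc([p ^ k for p, k in zip(data, stream)]))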

The gory details are available at github.

Finally, on delivery day, I discovered that WordPress mangled C code included in posts. So I simply took a screenshot of the output and posted that instead, with a link to the original text. A picture of the newborn was hidden as a hyperlink from the footprint image.

Overall, the entire process took a couple of evenings from concept to completion, and I'm quite happy with the way it turned out. And, of course, I'm ecstatic about the inspiration: our newly arrived baby boy. We are very fortunate to have both Alex and now Ian in our lives, and silly C programs and fake man pages cannot come close to capturing our joy.

It’s faster because it’s async!

Some Javascript developers, when presented with an API, think, “I know, I’ll make it async!” Now their users have a giant mess of a state machine implemented with callbacks and recursion, in a GC runtime that doesn’t eliminate tail calls, and two orders of magnitude worse performance. Normally, I wouldn’t care, because you get what you ask for when you write something in Javascript, but this time it happened to a project that I occasionally maintain for $work. And so I was sad.

So, far from expert in the ways of JS, I looked for a way out of this mire, and stumbled across task.js, which is what the kids are doing these days. It is a neat little hack, although it only works if your JS engine supports generators (most do not). Still, it seemed like a reasonable approach for my problem, so I decided to understand how task.js works by deriving my own.

I should note that if one’s JS engine doesn’t have generators, one can kind of emulate them, in the same sense that one can emulate goto with switch statements. Consider this simple example:


var square_gen = function() {
    var obj = {
        state: 0,

        next: function() {
            var ret = this.state * this.state;
            this.state++;
            return ret;
        },

        send: function(v) {
            this.state = v;
            return this.next();
        },
    };
    return obj;
};

var gen = square_gen();
for (var i=0; i < 4; i++) {
    console.log("The next square is: " + gen.next());
}
/*
Outputs:
The next square is: 0
The next square is: 1
The next square is: 4
The next square is: 9
*/

In other words, square_gen() returns an object that has a next() method which returns the next value, based on some internal state. The next() method could be a full-fledged state machine instead of a simple variable.

Certainly this is less tidy than the equivalent with language support:


var square_gen = function*() {
    var state = 0;
    while (1) {
        yield state * state;
        state++;
    }
};

(I'll assume the existence of generators in examples henceforth.)

The send() function is interesting -- it allows you to set the state from which the generator continues.


gen.send(16); /* => 256 */

The key idea is that generators will only do one iteration of a possibly large computation. When yield is reached, the function exits, and the computation is started up again after the yield statement when the caller performs another next() or a send().

Now, suppose you need to do something that takes a while, without waiting, like fetching a resource over the network. When the resource is available, you need to continue with that data. In the meantime you should try to do something else, like allow the UI to render itself.

The normal way one does this is by writing one's code in the (horrible) continuation-passing style, which is fancy talk for passing callbacks to all functions. The task.js way is better: write functions that can block (generators) and switch to other tasks when they do block. No, I'm not making this up -- it is normal for a program in JS land to include a half-baked cooperative multitasker.

You can turn callbacks into promises, as in "I promise to give you value X when Y is done." Promise objects are getting baked into various runtimes, but rolling your own is also easy:


function MyPromise() {
    this.todo = function(){};
}

MyPromise.prototype.then = function(f) {
    this.todo = f;
};

MyPromise.prototype.resolve = function(value) {
    this.todo(value);
};

That is: register a callback with then(), to be called when a value is known. The value is set by calling resolve().

Now, suppose you have an async library function that takes a callback:


function example(call_when_done) {
    /* do something asynchronously */
    call_when_done(value);
}

Instead of calling it like this:


    example(function(value) { /* my continuation */ });

...you can give it a callback that resolves a promise, which, when resolved, calls your continuation:


    var p = new MyPromise();
    example(function(value) { p.resolve(value); });
    p.then(function(value) { /* my continuation */ });

This is just obfuscation so far. The usefulness comes when you rewrite your functions to return promises, so that:


    function my_fn(params, callback) { /* pyramid of doom */ }

becomes:


    function my_fn(params) { /* ... */ return promise; }

Now you can have generators yield promises, which will make them exit until another next() or send() call. And then you can arrange to call send() with the promise's value when it is resolved, which gives you blocking functionality.


    /* An async function that returns a promise */
    function async() {
        var p = new MyPromise();
        example(function(v) { p.resolve(v); });
        return p;
    }

    /* A generator which blocks */
    var gen = function*() {

        /* does some async call and receives a promise */
        var p = async();

        /* blocks until async call is done */
        var v = yield p;

        /* now free to do something with v... */
    }();

    /*
     * Run one iteration of the generator.  If the
     * generator yields a promise, set up another
     * iteration when that promise is resolved.
     */
    function iterate(gen, state) {
        var p = gen.send(state);
        p.then(function(v) {
            iterate(gen, v);
        });
    }

    /* Start a task */
    var p = gen.next();
    p.then(function(v) {
        iterate(gen, v);
    });

    /*
     * You can do something else here, like start another
     * generator.  However, control needs to eventually
     * return to runtime main loop so that async() can
     * make progress.
     */

A lot of details are omitted here like StopIteration and error handling, but that's the gist. Task.js generalizes a lot of this and has a decent interface, so, do that.

In summary, Javascript is terrible (only slightly less so without c-p-s), and I'm glad I only have to use it occasionally.

Grepping 300 gigs

It’s fun when a problem is simple, yet the program to solve it takes long enough to run that you can write a second, or third, much faster version while waiting on the first to complete. Today I had a 1.8 billion line, 300 GB sorted input file of the form “key\tvalue”, and another file of 20k interesting keys. I wanted the lines matching those keys written to a third file; the same key might appear multiple times in the input.

What doesn’t work: any version of grep -f, which I believe is something like O(N*M).

What could have worked: a one-off program to load the keys into a hash, and output the data in a single pass. Actually, I had written this one-off program before for a different task. In this case, the input file doesn’t need to be sorted, but the key set must be small and well-defined. It is O(N+M).
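
That kind of one-off is only a few lines of Python; a rough sketch (file names invented for illustration, not the actual script):

# O(N+M): hold the ~20k interesting keys in a set, stream the big file once.
keys = set(line.rstrip("\n") for line in open("interesting_keys"))

with open("matches", "w") as out:
    for line in open("big_input.tsv"):        # "key\tvalue" lines
        if line.split("\t", 1)[0] in keys:
            out.write(line)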

What worked: a series of sorted grep invocations (sgrep, but not the one in ubuntu). That is O(lgN * M), which can be a win if M is small compared to N. See also comm(1) and join(1), but I don’t know off the top of my head whether they start with a binary search. I had a fairly low selectivity in my input set so the binary search helps immensely.
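
The heart of the sorted-grep approach is a lower-bound binary search over byte offsets, realigning to a line boundary after each seek. A rough sketch of that idea (not sgrep's actual code) for a single key:

import os
import sys

def sorted_grep(path, key):
    # Print every "key\tvalue" line for one key from a sorted file.
    key = key.encode()
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lo, hi = 0, size
        while lo < hi:                        # binary search on byte offsets
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid:
                f.readline()                  # skip the partial line we landed in
            line = f.readline()
            if not line or line.split(b"\t", 1)[0] >= key:
                hi = mid
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()                      # realign to the next line boundary
        for line in f:                        # linear scan over the matching run
            k = line.split(b"\t", 1)[0]
            if k > key:
                break
            if k == key:
                sys.stdout.write(line.decode())

Run that once per interesting key and the total cost is O(M lg N) seeks plus the matching lines.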

What also could have worked: a streaming map-reduce or gnu parallel job (the input file was map-reduce output, but my cluster was busy at the time I was processing this, and I am lazy). That would still be O(N+M), but distributing across P machines in the cluster would reduce it by around the factor P. P would need to be large here to be competitive with the previous solution from a complexity standpoint.

Making up numbers, these take about 0, 5, 1, and 20 minutes to write, respectively.

Of course, algorithmic complexity isn’t the whole picture. The last option, or otherwise splitting the input/output across disks on a single machine, would have gone a long way to address this problem:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              10.50     0.00  361.00    9.75 83682.00  4992.00   478.35   139.42  174.20    6.43 6385.95   2.70 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

You can’t go any faster when blocked on I/O.

Note the similarities with join algorithms in databases — variations on nested loop, hash, and sort-merge joins are all present above.

Mavened.

So, I’m applying a one-line patch to a Java package, and oh yes, it needs maven. Hooray!

$ apt-cache depends --recurse maven | grep Depends | 
     grep -v '<' | egrep lib.*java | sort | uniq | wc -l
203
[... start build and watch maven download another pile ...]
$ find ~/.m2 -name *pom | wc -l
851

I find it hard to believe that there are that many useful libraries in the world.

goto in bash

If you are as old and nerdy as I, you may have spent your grade school days hacking in the BASIC computer language. One of the (mostly hated) features of the (mostly hated) language was that any statement required a line number; this provided both the ability to edit individual lines of the program without a screen editor and de facto labels for the (mostly hated) GOTO and GOSUB commands. But you could also use line numbers to run your program starting from any random point: “RUN 250” might start in the middle of a program, typically after line 250 exited with some syntax error and was subsequently fixed.

Today, in bash, we have no such facility. Why on earth would anyone want it, with the presence of actual flow control constructs? Who knows, but asking Google about “bash goto” shows that I am not the first.

For my part, at $work, I have a particular script which takes several days to run, each part of which may take many hours, and, due to moon phases, may fail haphazardly. If a command fails, the state up to that point is preserved, so I just need to continue where that left off. Each major part of the job is already factored into individual scripts, so I could cut-and-paste commands from the failure point onward, but I’m lazy.

Thus, I present bash goto. It runs sed on itself to strip out any parts of the script that shouldn’t run, and then evals it all. Prepare to cringe.


#!/bin/bash
# include this boilerplate
function jumpto
{
    label=$1
    cmd=$(sed -n "/$label:/{:a;n;p;ba};" $0 | grep -v ':$')
    eval "$cmd"
    exit
}

start=${1:-"start"}

jumpto $start

start:
# your script goes here...
x=100
jumpto foo

mid:
x=101
echo "This is not printed!"

foo:
x=${x:-10}
echo x is $x

results in:


$ ./test.sh
x is 100
$ ./test.sh foo
x is 10
$ ./test.sh mid
This is not printed!
x is 101

My quest to make bash look like assembly language draws ever nearer to completion.

Update 2019/05/21: A reader pointed out that executing one of the labels results in bash complaining “command not found” and suggested putting the labels in a comment, which works just fine without any other changes (but if you like, you can drop the “grep -v” in that case). You might also be interested in the takes of folks on reddit, or my own take of being discussed on reddit.

Update 2023/08/04: Greetings, and my apologies, to Hacker News readers all these years later. Hi!

Packet Timing

Continuing with the charting theme, I wrote a quick python script yesterday to generate timing diagrams from pcap files while learning matplotlib. It’s just at the proof-of-concept stage but could be useful to debug things like power-saving in wireless. Surely something like this exists somewhere already, right? I couldn’t find it in the Google.
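
As a rough sketch of the idea (not the actual script): pull per-packet timestamps and captured lengths out of a classic microsecond-format pcap by hand, and draw one tick per packet with matplotlib. The file name and plot styling here are invented.

import struct
import matplotlib.pyplot as plt

def pcap_times(path):
    # Classic pcap: 24-byte global header, then 16-byte record headers
    # (ts_sec, ts_usec, incl_len, orig_len) followed by the packet bytes.
    times, lengths = [], []
    with open(path, "rb") as f:
        magic = struct.unpack("<I", f.read(24)[:4])[0]
        endian = "<" if magic == 0xa1b2c3d4 else ">"
        while True:
            rec = f.read(16)
            if len(rec) < 16:
                break
            ts_sec, ts_usec, incl_len, _ = struct.unpack(endian + "IIII", rec)
            times.append(ts_sec + ts_usec / 1e6)
            lengths.append(incl_len)
            f.seek(incl_len, 1)               # skip the packet body
    return times, lengths

times, lengths = pcap_times("capture.pcap")
t0 = times[0]
plt.vlines([t - t0 for t in times], 0, lengths)
plt.xlabel("seconds since first packet")
plt.ylabel("captured length (bytes)")
plt.show()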

Along the way I had to write a python radiotap parser; that might actually be useful to others so I’ll try to put it on github at some point.

Parsing HRegionInfo in Python

I’ve been doing a fair amount of HBase work lately at $work, not least of which is pybase, a python module that encapsulates Thrift and puts it under an API that looks more or less like the Cassandra wrapper pycassa (which we also use).

When running an HBase cluster, one must very quickly learn the stack from top to bottom and be ready to fix the metadata when catastrophe strikes. Most of the necessary information about HBase regions is stored in the .META. table; unfortunately some of the values therein are serialized HBase Writables. One usually uses JRuby and directly loads Java classes to deal with the deserialization, but we’re a Python shop and doing it all over thrift would be ideal.

Thus, here’s a quick module to parse out HRegionInfo along with a few generic helpers for Writables. I haven’t decided yet whether this kind of thing belongs in pybase.

I’m curious whether there is an idiomatic way to do advancing-pointer type operations in python without returning an index everywhere. Perhaps converting an array to a file-like object? (A rough sketch of that idea follows the module below.)

#!/usr/bin/python
import struct

def vint_size(byte):
    if byte >= -112:
        return 1

    if byte <= -120:
        return -119 - byte

    return -111 - byte

def vint_neg(byte):
    return byte < -120 or -112 <= byte < 0
        
def read_byte(data, ofs):
    return (ord(data[ofs]), ofs + 1)

def read_long(data, ofs):
    val = struct.unpack_from(">q", data, offset=ofs)[0]
    return (val, ofs + 8)

def read_vint(data, ofs):
    firstbyte, ofs = read_byte(data, ofs)
    if firstbyte > 127:
        # vint_size()/vint_neg() expect Java's signed byte range
        firstbyte -= 256

    sz = vint_size(firstbyte)
    if sz == 1:
        return (firstbyte, ofs)

    # sz counts the prefix byte, so sz - 1 data bytes follow
    val = 0
    for i in xrange(0, sz - 1):
        (nextb, ofs) = read_byte(data, ofs)
        val = (val << 8) | nextb

    if vint_neg(firstbyte):
        val = ~val

    return (val, ofs)

def read_bool(data, ofs):
    byte, ofs = read_byte(data, ofs)
    return (byte != 0, ofs)

def read_array(data, ofs):
    sz, ofs = read_vint(data, ofs)
    val = data[ofs:ofs+sz]
    ofs += sz
    return (val, ofs)

def parse_regioninfo(data, ofs):
    end_key, ofs = read_array(data, ofs)
    offline, ofs = read_bool(data, ofs)
    region_id, ofs = read_long(data, ofs)
    region_name, ofs = read_array(data, ofs)
    split, ofs = read_bool(data, ofs)
    start_key, ofs = read_array(data, ofs)
    # tabledesc: not about to parse this
    # hashcode: int

    result = {
        'end_key' : end_key,
        'offline' : offline,
        'region_id' : region_id,
        'region_name' : region_name,
        'split' : split,
        'start_key' : start_key,
    }
    return result
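
On the advancing-pointer question above, wrapping the bytes in a file-like object does read pretty naturally. A rough sketch (a hypothetical Reader class, not part of pybase or the module above) reusing vint_size() and vint_neg():

import io
import struct

class Reader(object):
    # Wrap the byte string in a BytesIO so each read_* call advances an
    # internal cursor instead of threading an offset through every call.
    def __init__(self, data):
        self.buf = io.BytesIO(data)

    def read_byte(self):
        return struct.unpack("b", self.buf.read(1))[0]   # signed, like Java

    def read_long(self):
        return struct.unpack(">q", self.buf.read(8))[0]

    def read_bool(self):
        return self.buf.read(1) != b"\x00"

    def read_vint(self):
        first = self.read_byte()
        sz = vint_size(first)
        if sz == 1:
            return first
        val = 0
        for _ in range(sz - 1):
            val = (val << 8) | (self.read_byte() & 0xff)
        return ~val if vint_neg(first) else val

    def read_array(self):
        return self.buf.read(self.read_vint())

parse_regioninfo() would then become a straight sequence of r.read_array(), r.read_bool(), and so on, with no offsets in sight.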

How I nearly cracked it

Here’s my methodology for part 1 of the Can You Crack It puzzle.

(Spoilers below)

eb 04 af c2 bf a3 81 ec  00 01 00 00 31 c9 88 0c
0c fe c1 75 f9 31 c0 ba  ef be ad de 02 04 0c 00
d0 c1 ca 08 8a 1c 0c 8a  3c 04 88 1c 04 88 3c 0c
fe c1 75 e8 e9 5c 00 00  00 89 e3 81 c3 04 00 00
00 5c 58 3d 41 41 41 41  75 43 58 3d 42 42 42 42
75 3b 5a 89 d1 89 e6 89  df 29 cf f3 a4 89 de 89
d1 89 df 29 cf 31 c0 31  db 31 d2 fe c0 02 1c 06
8a 14 06 8a 34 1e 88 34  06 88 14 1e 00 f2 30 f6
8a 1c 16 8a 17 30 da 88  17 47 49 75 de 31 db 89
d8 fe c0 cd 80 90 90 e8  9d ff ff ff 41 41 41 41

Anyone who has stared long and hard at x86 hexdumps before will immediately think “I know this, this is an Intel system!” The value 0xdeadbeef in little-endian format is a dead giveaway, as are the 0x90 (NOP) instructions. I know a couple of ways to go from a block of machine code to the corresponding assembly. One way, like scripts/decodecode in the kernel, is to make a .S file with .byte directives, assemble it, and run objdump over the object file.

Here’s another way, and what I did at first: create a .c file like so, and compile it:

$ cat foo.c

unsigned char x[] = { /* list of hexes here */ };
int main()
{
}

$ gcc -g -o foo foo.c

Then, load it up in gdb and disassemble as if it were a function:

$ gdb foo

gdb> disas x

That procedure yielded some interesting bits of hand-coded assembly, but at this point I had no idea what it did. I cleaned up the code a bit and added labels to arrive at a listing like the following:

    jmp l1
    scas %es:(%edi), %eax
    ret $0xa3bf

    l1:
    sub $0x100, %esp
    xor %ecx, %ecx

    loop1:
    mov %cl,(%esp,%ecx,1)
    inc %cl
    jne loop1

    xor %eax, %eax
    mov $0xdeadbeef, %edx

    loop2:
    add (%esp,%ecx,1),%al
    add %dl,%al
    ror $0x8, %edx
    mov (%esp,%ecx,1),%bl
    mov (%esp,%eax,1),%bh
    mov %bl,(%esp,%eax,1)
    mov %bh,(%esp,%ecx,1)
    inc %cl
    jne loop2
    jmp l2

    func1:
    mov %esp, %ebx
    add $0x4, %ebx
    pop %esp
    pop %eax
    cmp $0x41414141,%eax
    jne quit
    pop %eax
    cmp $0x42424242,%eax
    jne quit
    pop %edx
    mov %edx,%ecx
    mov %esp,%esi
    mov %ebx,%edi
    sub %ecx,%edi
    rep movsb %ds:(%esi),%es:(%edi)
    mov %ebx,%esi
    mov %edx,%ecx
    mov %ebx,%edi
    sub %ecx,%edi
    xor %eax,%eax
    xor %ebx,%ebx
    xor %edx,%edx

    loop3:
    inc %al
    add (%esi,%eax,1),%bl
    mov (%esi,%eax,1),%dl
    mov (%esi,%ebx,1),%dh
    mov %dh,(%esi,%eax,1)
    mov %dl,(%esi,%ebx,1)
    add %dh, %dl
    xor %dh,%dh
    mov (%esi,%edx,1),%bl
    mov (%edi),%dl
    xor %bl,%dl
    mov %dl,(%edi)
    inc %edi
    dec %ecx
    jne loop3

    quit:
    xor %ebx,%ebx
    mov %ebx,%eax
    inc %al
    int $0x80

    l2:
    nop
    nop
    call func1

    .byte 0x41, 0x41, 0x41, 0x41

The first thing I looked at was the int 0x80 call. This is how (well, one way) you make a system call on Linux. The %ebx register is always zero, and %eax is one. This handy website shows that such a syscall results in a call to sys_exit(). Thus I added the label quit to this bit of code.

At a high level, we then have a loop at loop1 that initializes some data on the stack; loop2, which performs some unknown calculation on that array; func1, a function which itself performs another loop. A close inspection reveals that func1 uses the output of loop2 as an input, along with data at the end of the program, beginning with 0x41414141.

Deciphering loop2 is an interesting exercise. The initializer creates a 256-byte character array with sequential values from 0 to 255. Then it uses the value 0xdeadbeef to generate an index in %eax, which is used here:

    mov (%esp,%ecx,1),%bl
    mov (%esp,%eax,1),%bh
    mov %bl,(%esp,%eax,1)
    mov %bh,(%esp,%ecx,1)

This is a swap operation, so after 256 iterations, the array will have been permuted. This looked to me like some kind of random shuffle, with 0xdeadbeef as a seed, but I was unfamiliar with its actual purpose. I wrote a C version just to make it clearer:

int firstpart(unsigned char *x, size_t len)
{
    int i;
    unsigned char tmp, a = 0;
    uint32_t d = 0xdeadbeef;

    for (i=0; i < len; i++)
        x[i] = i;

    for (i=0; i < len; i++)
    {
        a += x[i] + d;
        d = (d >> 8) | (d << 24);
        tmp = x[i];
        x[i] = x[a];
        x[a] = tmp;
    }
}

Likewise, func1 was doing some kind of shuffle, and xor-ing the result with another block of data. This screams crypto to me, but I still didn't know exactly what func1 was doing. I wrote a C version:

int second_part(unsigned char *x, unsigned char *y, size_t len)
{
    unsigned char a = 0;
    unsigned char b = 0;
    unsigned char tmp;
    int i;

    for (i=0; i < len; i++)
    {
        a = i+1;
        b += x[a];
        tmp = x[a];
        x[a] = x[b];
        x[b] = tmp;
        b = x[(x[a] + x[b]) & 0xff];
        y[i] ^= b;
    }
}

On the crypto hunch, I started reading random wikipedia pages, until I stumbled across the pseudocode on the RC4 page. Aha! This is RC4 with a key of 0xdeadbeef. I never guessed RC4 was so simple.
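
For reference, textbook RC4 really is only a few lines; here is a quick Python sketch of my own, not part of the challenge. The key schedule above is the RC4 KSA with key bytes ef be ad de (0xdeadbeef as it sits in memory, little-endian), while the output loop as transcribed above departs slightly from the textbook PRGA by feeding the keystream byte back in as the next swap index.

def rc4(key, data):
    # Key schedule (KSA)
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xff
        s[i], s[j] = s[j], s[i]
    # Keystream generation (PRGA), XORed over the data
    out, i, j = [], 0, 0
    for c in data:
        i = (i + 1) & 0xff
        j = (j + s[i]) & 0xff
        s[i], s[j] = s[j], s[i]
        out.append(c ^ s[(s[i] + s[j]) & 0xff])
    return out

# e.g. rc4([0xef, 0xbe, 0xad, 0xde], list_of_cipher_bytes) -- for comparison
# only; the puzzle's own variant (second_part above) is what actually
# processes its ciphertext.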

At this point, I had this whole block of code figured out, could run it fully in my C variant, but knew I needed ciphertext to go at the end of the program and didn't know where to find it. Asking the internet gave me the hint to look inside the image file with the code dump, and the rest was easy to figure out.

Solving the puzzle yields a link to a javascript page where you are to write a virtual machine and run it to reveal the next stage. I implemented the machine in Python but it still needs a bit of debugging to give up its secret.