Graphing HBase Splits

I was asked for a blog post on this topic. I also do birthdays!

HBase achieves balance by splitting regions when they reach a certain size, and by evenly distributing the number of regions among cluster machines. However, the balancer will not run in some cases (e.g. if there are regions stuck in transition), and balancing the number of regions alone may not help if region sizes are not mostly the same size. If a region server is hosting more regions than the others, requests to that server experience higher latency, and batch (map-reduce) jobs take longer to complete due to parallel skew.

At $work we graph these data hourly, and here’s how we do it.

First, we run the following JRuby script from cron. [Note: I've been advised (thanks ntelford!) that HServerInfo is gone in newer releases and you now need to get HServerLoad via ClusterStatus.getLoad(server_name).]

# This ruby script dumps a text file of region sizes and the servers
# they are on, for determining balance/split effectiveness.
#
# Usage: hbase org.jruby.Main region_hist.rb
#
include Java
import org.apache.hadoop.hbase.ClusterStatus
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HServerInfo
import org.apache.hadoop.hbase.HServerLoad
import org.apache.hadoop.hbase.HTableDescriptor
import org.apache.hadoop.hbase.client.HBaseAdmin

def main()
    conf = HBaseConfiguration.new()
    client = HBaseAdmin.new(conf)

    status = client.clusterStatus
    status.serverInfo.each do |server|
        server_name = server.serverName
        printed_server = false

        load = server.load
        rload = load.regionsLoad
        rload.each do |region|
            region_name = region.nameAsString
            size = region.storefileSizeMB
            puts "#{server_name}\t#{region_name}\t#{size}"
            printed_server = true
        end
        if !printed_server then
            puts "#{server_name}\tNONE\t0"
        end
    end
end

main()

The script generates a flat file with server name, region name, and region
store file size fields:

c1.example.com,60020,1333483861849  .META.,,1   2
c2.example.com,60020,1333484982245  edges,175192748,176293002,1331824785017.fc03e947e571dfbcf65aa16dfd073804    1723
...

We then process this through a pile of Perl (I’ll spare the details) to generate several other data files. First, there’s a flat file with a sum of region sizes and region count per table:

server  -ROOT-  -ROOT-  .META.  .META.  edges   edges   ids ids verts   verts   userdata    userdata    types   types   stats   stats   topics  topics  topics_meta topics_meta maps    maps
c1  0   0   0   0   51041   41  0   0   27198   12  0   0   0   0   585 2   0   0   0   0   0   0
c2  0   0   0   0   49260   40  3501    1   20090   10  0   0   0   0   772 3   0   0   0   0   0   0
...

Next, for each table, we generate a file showing the total size of each region:

1   230
2   510
3   1200

The first column is just a line numbering. The region sizes are sorted to make the final chart easier to read.

From there we use gnuplot to generate a histogram of the regions by numbers and by size, and then a per-table chart of the region size distribution. The gnuplot file looks like this:

set terminal png
set key invert reverse Left outside
set key autotitle columnheader
set boxwidth 0.75
set xtics nomirror rotate by -45 font ",8"
set key noenhanced
set yrange[0:*]
set style histogram rowstacked gap 1 title offset 2, 0.25
set style data histogram
set style fill solid border -1
set output 'hbase-fig1.png'
plot 'load.dat' using 3:xtic(1), for [i=2:11] '' using 2*i+1

set output 'hbase-fig2.png'
plot 'load.dat' using 2:xtic(1), for [i=2:11] '' using 2*i

set ylabel "Store Size (MB)"
set xlabel "Store"
unset xtics
set output 'hbase-fig3.png'
plot 'splits_edges.dat' using 2:xtic(1) title 'edges'

Here’s the end result (I changed the server and table names but this is otherwise real data):

In the above figure, we can see there’s a good balance of the number of regions across the region servers. We can also easily see which servers are hosting which regions, such as the important, but small, -ROOT- and .META. tables. So far, so good.

In this image, we see that the total size is not very well balanced: server c13 has a lot more data than the others. Taken together, these graphs indicate that our regions are not all the same size. The next image shows this more dramatically.

Here we see that around 60% of our regions for this table are smaller than 1 Gig, and the remaining 40% are split between 1-2G and 2-4G sizes. We would rather see a baseline at 2G (half the max region size), and the midpoint around 3G assuming evenly distributed splits. In our case, we had increased the region size of our largest table late in the game, so there are a ton of small regions here that we should try to merge.

Seeing the regions at a glance has been a useful tool. In one case, we got a factor of 8 speedup in a map-reduce job by re-splitting and manually moving regions to ensure that all the regions were evenly distributed across the cluster — the difference between running a job once a week vs. running it once a day.

Orders of magnitude

I had a Hadoop map-reduce job that kept timing out, which led to this interesting discovery:

$ time ./json-parser-test.py

real	0m0.205s
user	0m0.152s
sys	0m0.032s

$ time ./json-parser-test-no-speedups.py

real	0m2.069s
user	0m2.044s
sys	0m0.024s

$ time jython ./json-parser-test-no-speedups.py

real	79m59.785s
user	80m23.709s
sys	0m14.441s

Moral: use Java-based JSON libraries if you have to use Jython and JSON. Also, Java sucks.

Parsing HRegionInfo in Python

I’ve been doing a fair amount of HBase work lately at $work, not least of which is pybase, a python module that encapsulates Thrift and puts it under an API that looks more or less like the Cassandra wrapper pycassa (which we also use).

When running an HBase cluster, one must very quickly learn the stack from top to bottom and be ready to fix the metadata when catastrophe strikes. Most of the necessary information about HBase regions is stored in the .META. table; unfortunately some of the values therein are serialized HBase Writables. One usually uses JRuby and directly loads Java classes to deal with the deserialization, but we’re a Python shop and doing it all over thrift would be ideal.

Thus, here’s a quick module to parse out HRegionInfo along with a few generic helpers for Writables. I haven’t decided yet whether this kind of thing belongs in pybase.

I’m curious whether there is an idiomatic way to do advancing pointer type operations in python without returning an index everywhere. Perhaps converting an array to a file-like object?

#!/usr/bin/python
import struct

def vint_size(byte):
    if byte >= -112:
        return 1

    if byte <= -120:
        return -119 - byte

    return -111 - byte

def vint_neg(byte):
    return byte < -120 or -112 <= byte < 0
        
def read_byte(data, ofs):
    return (ord(data[ofs]), ofs + 1)

def read_long(data, ofs):
    val = struct.unpack_from(">q", data, offset=ofs)[0]
    return (val, ofs + 8)

def read_vint(data, ofs):
    firstbyte, ofs = read_byte(data, ofs)

    sz = vint_size(firstbyte)
    if sz == 1:
        return (firstbyte, ofs)

    for i in xrange(0, sz):
        (nextb, ofs) = read_byte(data, ofs)
        val = (val << 8) | nextb

    if vint_neg(firstbyte):
        val = ~val

    return (val, ofs)

def read_bool(data, ofs):
    byte, ofs = read_byte(data, ofs)
    return (byte != 0, ofs)

def read_array(data, ofs):
    sz, ofs = read_vint(data, ofs)
    val = data[ofs:ofs+sz]
    ofs += sz
    return (val, ofs)

def parse_regioninfo(data, ofs):
    end_key, ofs = read_array(data, ofs)
    offline, ofs = read_bool(data, ofs)
    region_id, ofs = read_long(data, ofs)
    region_name, ofs = read_array(data, ofs)
    split, ofs = read_bool(data, ofs)
    start_key, ofs = read_array(data, ofs)
    # tabledesc: not about to parse this
    # hashcode: int

    result = {
        'end_key' : end_key,
        'offline' : offline,
        'region_id' : region_id,
        'region_name' : region_name,
        'split' : split,
        'start_key' : start_key,
    }
    return result

How I nearly cracked it

Here’s my methodology for part 1 of the Can You Crack It puzzle.

(Spoilers below)

eb 04 af c2 bf a3 81 ec  00 01 00 00 31 c9 88 0c
0c fe c1 75 f9 31 c0 ba  ef be ad de 02 04 0c 00
d0 c1 ca 08 8a 1c 0c 8a  3c 04 88 1c 04 88 3c 0c
fe c1 75 e8 e9 5c 00 00  00 89 e3 81 c3 04 00 00
00 5c 58 3d 41 41 41 41  75 43 58 3d 42 42 42 42
75 3b 5a 89 d1 89 e6 89  df 29 cf f3 a4 89 de 89
d1 89 df 29 cf 31 c0 31  db 31 d2 fe c0 02 1c 06
8a 14 06 8a 34 1e 88 34  06 88 14 1e 00 f2 30 f6
8a 1c 16 8a 17 30 da 88  17 47 49 75 de 31 db 89
d8 fe c0 cd 80 90 90 e8  9d ff ff ff 41 41 41 41

Anyone who has stared long and hard at x86 hexdumps before will immediately think “I know this, this is an intel system!” The value 0xdeadbeef in little-endian format is a dead-giveaway, as are the 0×90 (NOP) instructions. I know a couple of ways to go from a block of machine code to the corresponding code. One way, like scripts/decodecode in the kernel, is to make a .S file with .byte directives, assemble it, and run objdump over the object file.

Here’s another way, and what I did at first: create a .c file like so, and compile it:

$ cat foo.c

unsigned char x[] = { /* list of hexes here */ };
int main()
{
}

$ gcc -g -o foo foo.c

Then, load it up in gdb and disassemble as if it were a function:

$ gdb foo

gdb> disas x

That procedure yielded some interesting bits of hand-coded assembly, but at this point I had no idea what it did. I cleaned up the code a bit and added labels to arrive at a listing like the following:

    jmp l1
    scas %es:(%edi), %eax
    ret $0xa3bf

    l1:
    sub $0x100, %esp
    xor %ecx, %ecx

    loop1:
    mov %cl,(%esp,%ecx,1)
    inc %cl
    jne loop1

    xor %eax, %eax
    mov $0xdeadbeef, %edx

    loop2:
    add (%esp,%ecx,1),%al
    add %dl,%al
    ror $0x8, %edx
    mov (%esp,%ecx,1),%bl
    mov (%esp,%eax,1),%bh
    mov %bl,(%esp,%eax,1)
    mov %bh,(%esp,%ecx,1)
    inc %cl
    jne loop2
    jmp l2

    func1:
    mov %esp, %ebx
    add $0x4, %ebx
    pop %esp
    pop %eax
    cmp $0x41414141,%eax
    jne quit
    pop %eax
    cmp $0x42424242,%eax
    jne quit
    pop %edx
    mov %edx,%ecx
    mov %esp,%esi
    mov %ebx,%edi
    sub %ecx,%edi
    rep movsb %ds:(%esi),%es:(%edi)
    mov %ebx,%esi
    mov %edx,%ecx
    mov %ebx,%edi
    sub %ecx,%edi
    xor %eax,%eax
    xor %ebx,%ebx
    xor %edx,%edx

    loop3:
    inc %al
    add (%esi,%eax,1),%bl
    mov (%esi,%eax,1),%dl
    mov (%esi,%ebx,1),%dh
    mov %dh,(%esi,%eax,1)
    mov %dl,(%esi,%ebx,1)
    add %dh, %dl
    xor %dh,%dh
    mov (%esi,%edx,1),%bl
    mov (%edi),%dl
    xor %bl,%dl
    mov %dl,(%edi)
    inc %edi
    dec %ecx
    jne loop3

    quit:
    xor %ebx,%ebx
    mov %ebx,%eax
    inc %al
    int $0x80

    l2:
    nop
    nop
    call func1

    .byte 0x41, 0x41, 0x41, 0x41

The first thing I looked at was the int 0×80 call. This is how (well, one way) you make a system call on Linux. The %ebx register is always zero, and %eax is one. This handy website shows that such a syscall results in a call to sys_exit(). Thus I added the label quit to this bit of code.

At a high level, we then have a loop at loop1 that initializes some data on the stack; loop2, which performs some unknown calculation on that array; func1, a function which itself performs another loop. A close inspection reveals that func1 uses the output of loop2 as an input, along with data at the end of the program, beginning with 0×41414141.

Deciphering loop2 is an interesting exercise. The initializer creates a 256-byte character array with sequential values from 0 to 255. Then it uses the value 0xdeadbeef to generate an index in %eax, which is used here:

    mov (%esp,%ecx,1),%bl
    mov (%esp,%eax,1),%bh
    mov %bl,(%esp,%eax,1)
    mov %bh,(%esp,%ecx,1)

This is a swap operation, so after 256 iterations, the array will have been permuted. This looked to me like some kind of random shuffle, with 0xdeadbeef as a seed, but I was unfamiliar with its actual purpose. I wrote a C version just to make it clearer:

int firstpart(unsigned char *x, size_t len)
{
    int i;
    unsigned char tmp, a = 0;
    uint32_t d = 0xdeadbeef;

    for (i=0; i < len; i++)
        x[i] = i;

    for (i=0; i < len; i++)
    {
        a += x[i] + d;
        d = (d >> 8) | (d << 24);
        tmp = x[i];
        x[i] = x[a];
        x[a] = tmp;
    }
}

Likewise, the func1 was doing some kind of shuffle, and xor-ing the result with another block of data. This screams crypto to me, but I still didn't know exactly what func1 was doing. I wrote a C version:

int second_part(unsigned char *x, unsigned char *y, size_t len)
{
    unsigned char a = 0;
    unsigned char b = 0;
    unsigned char tmp;
    int i;

    for (i=0; i < len; i++)
    {
        a = i+1;
        b += x[a];
        tmp = x[a];
        x[a] = x[b];
        x[b] = tmp;
        b = x[(x[a] + x[b]) & 0xff];
        y[i] ^= b;
    }
}

On the crypto hunch, I started reading random wikipedia pages, until I stumbled across the pseudocode on the RC4 page. Aha! This is RC4 with a key of 0xdeadbeef. I never guessed RC4 was so simple.

At this point, I had this whole block of code figured out, could run it fully in my C variant, but knew I needed ciphertext to go at the end of the program and didn't know where to find it. Asking the internet gave me the hint to look inside the image file with the code dump, and the rest was easy to figure out.

Solving the puzzle yields a link to a javascript page where you are to write a virtual machine and run it to reveal the next stage. I implemented the machine in Python but it still needs a bit of debugging to give up its secret.

I can nearly crack it

The GCHQ (that’s British for NSA) has been running a marketing gimmick to find new people to read your tweets. On this website, you will find an enigmatic hexdump, and a prompt for a keyword. Supposedly if you get it correct, then you get forwarded to their job site. I don’t care about the job aspect, but I do enjoy a good puzzle.

I’ll add my details and methodology in a follow-up post after the clock runs out, so as not to spoil anything for casual readers who want to take their own look. There are a few good clues in the hexdump if you have previously spent any time looking at real ones.

I did (almost) crack it: I figured out all there was to know about the hexdump. It became clear at that point that the hexes on the website aren’t everything you need for the puzzle, so I cheated to find the missing piece (somewhat obvious, in retrospect). A neat exercise, and I learned something about cryptography.

Now I’m on the second stage, which is a bit more straightforward in that the problem statement tells you what type of answer is expected.

I’m curious to know if this type of recruitment tool is actually useful in finding qualified applicants, or just in generating buzz. Certainly several companies I’ve worked at have had their share of recruiting woes, and dreaming up a set of screening puzzles has to be more fun than dealing with headhunters.

Bare-bones time tracker

Lifehacker is always suggesting super-complicated GUI applications for keeping track of one’s billing hours or making todo lists. Here’s my 10-minute solution that I’ve been using for a while — ~/bin/wl:

#! /bin/bash
if [[ -z $1 ]]; then
    cat ~/docs/worklog.`date +%Y-%m`
else
    (d=`date +"%Y-%m-%d %H:%M:%S"`; echo "$d" $@)>> ~/docs/worklog.`date +%Y-%m`
fi

I use it like this:

$ wl something goes here
$ wl tkt 192323 - remove n**2 loop from the frobnicator
$ wl
2011-11-04 09:55:56 something goes here
2011-11-04 09:56:07 tkt 192323 - remove n**2 loop from the frobnicator

I also sometimes start a log message with a single upper-case or symbol character to indicate something else; e.g., “^” means start of a block of time, “$” means end, “T” means todo, etc. It isn’t perfect — sometimes you have to escape shell meta-characters, but it beats using a mouse, and the backing format is plain old text so grep/vi/etc still work.

New GPG key

My previous key isn’t yet two years old, but given the recent kernel.org reboot, I created a new keypair today. The fingerprint is:

11FB F4D3 92A6 F32B CB4E  2E07 8411 1007 23B2 B915

The old key will be valid for a little while longer. As for korg, perhaps it’s time I used github anyway. My OMFS tree is over there now, lonely and waiting for me to feed him some patches.

My Pal SCSI

My son Alex turned one a couple of weeks ago. (If you are reading this, Alex, happy birthday, and congratulations for learning to read at such an early age!) He picked up quite a stash of loot as a result.

Moore’s law means that the processing power of his toy box very likely exceeds that of my first computer by a large factor. As one example, he received a My Pal Scout talking stuffed animal as a gift. A toy that comes with a USB cable — this is progress! You can customize the toy to say and spell your child’s name, and pick different tunes for it to play.

My inner geek has been wondering what’s inside ever since, but of course I cannot take apart or otherwise ruin my kid’s toy in the name of science. I did, however, plug it in to my Linux box while he was napping. A quick dmesg showed the device implements USB storage, but always responds with ‘Medium Not Present’ when accessed. I guessed (incorrectly) that some extra magic might make the internal flash appear as a disk and then files are just copied to a FAT filesystem stored therein. The toy is relatively inexpensive and coming up with too much special sauce is likely to be prohibitively costly.

USB storage is a successful example of taking an existing protocol (SCSI command set) and wholesale wrapping it in a different wire protocol. Each USB storage transfer is initiated by the host sending a Command Block Wrapper (CBW) — a 31-byte USB packet starting with ‘USBC’, typically containing a SCSI command as a payload. Next, a block of data is transferred if this command represents a read or write. Finally, the device completes the transaction by sending a Command Status Wrapper (CSW), a 13-byte packet beginning with the string ‘USBS’.

One can get a feel for the flavor of the protocol by using usbmon. Much like an ethernet sniffer, usbmon provides a simple mechanism under Linux to capture USB traffic. A simple session might look like:

    # cat /sys/kernel/debug/usb/usbmon/4u > usbmon.txt

One might even potentially run usbmon on a host OS while some other OS is running as a guest in a virtual machine with USB pass-through.

The upshot of the layered approach to USB storage is that Linux creates a generic SCSI device (/dev/sgX) for any USB storage device. Using the generic device, one can directly send SCSI commands to the USB device, and the kernel will take care of wrapping it in USB commands. I believe something similar is possible in Windows land.

As it turns out, Scout is even simpler than I imagined. The internal flash has no controller or filesystem; instead it appears to be a raw NAND flash written a page at a time. It is a simple matter to read the flash using the Linux sg device. One merely opens the device file, and then issues a vendor-specific SCSI command on the file descriptor:

static int read_page(int fd, u32 addr)
{
    u8 cmdblk[] = {
        0xfd,               /* access flash */
        0x28,               /* read it (0x20 = write) */
        0, 0, 0, 0,
        0x06, 0, 0x08, 0,   /* no idea what the rest is */
        0, 0, 0, 0,
        0x47, 0x50
    };
    u8 response_buf[4096];
    u8 sense_buffer[32];

    cmdblk[2] = (addr >> 24) & 0xff;
    cmdblk[3] = (addr >> 16) & 0xff;
    cmdblk[4] = (addr >> 8) & 0xff;
    cmdblk[5] = addr & 0xff;

    sg_io_hdr_t io_hdr = {
        .interface_id = 'S',
        .cmd_len = sizeof(cmdblk),
        .mx_sb_len = sizeof(sense_buffer),
        .dxfer_direction = SG_DXFER_FROM_DEV,
        .dxfer_len = sizeof(response_buf),
        .dxferp = response_buf,
        .cmdp = cmdblk,
        .sbp = sense_buffer,
        .timeout = 20000
    };

    ioctl(fd, SG_IO, &io_hdr);
    /* do something with response_buf here */
    return 0;
}

Reading addresses 0×01000 through 0×10000, 4k at a time, seems to yield the customizable data on the device. The flash is tiny: this is just 64k, yet you can upload a digital audio file of your child’s name plus ten songs.

The data format is rather simple: there is a 30-byte header starting at address 0×1000, containing 16-bit, little-endian pointers for the customized files. Address 0×1008 holds a pointer to the spelling of your child’s name, address 0x100e holds a pointer to the audio file pronouncing your child’s name, and so on. Armed with the Leapfrog software and a packet sniffer, one can verify that these files do indeed match the individual binary files that the software downloads over HTTP when syncing the puppy.

I believe the digital audio files are some flavor of raw 8 kHz PCM, but I could not find the right combination of parameters to sox to make sense out of them. The song files are all apparently compressed with TTComp, some compression program from 1995. Running ttdecomp.exe from within dosemu did successfully decompress them. My guess is these files are some sort of sequencer format rather than sampled audio, given their tiny size.

This is, I think, as far as I wish to investigate the toy. Obvious exercises for the interested reader are to discern the individual file formats, and have the toy play Metallica. It’s pretty incredible how much technology can be cheaply packed into a child’s plaything today. But now I have my eye on dissecting that (non-electronic) toy inchworm — is there a spring in there or what?

Fast… for a modem

This question is for someone slightly more motivated: why is qemu-img so slow to convert from a raw disk image to a VMware (vmdk) image? Apparently it does some compression or null-block elimination, as the resulting file is about 30% smaller, but I was not expecting it to take an entire day to convert a 100 gig disk. To get the answer to a first approximation without actually reading the source, I did a quick run with blktrace and seekwatcher:

So we get a nice sequential read pattern, but excessive think time between the reads. Probably, reading bigger blocks would help, or something like posix_fadvise(POSIX_FADV_SEQUENTIAL). Iostat showed the disk was reading exactly 2k sectors (1MB) per second, when it should be able to do 50-100 times that.

I just decided to wait a day rather than learn and hack the source.

My drive is now solid state

I always learn something new when doing some misguided thing such as “let’s copy our OS onto a new disk using tar like they did back in the day!” The impetus was a shiny new OCZ Vertex 3 to replace the spinning rust in my ancient Macbook. As a co-worker says, it’s like putting a gold steering wheel in a 1970 Pinto. My plan was roughly to:

  • Stuff the new drive in the desktop
  • Make a tar backup of the old laptop drive across the network to the desktop
  • Partition new drive, aligning to flash block sizes
  • Mount the new partition somewhere, and untar the backup
  • Chroot in the new directory, and do a grub-install

Simple. All of that went great, I thought.

The first hurdle when swapping the new drive in was a physical one: the laptop disk bay uses Torx, smaller ones than any adapters I have. This obstacle was cleared by creative use of needle-nosed pliers, and I was on to my first boot attempt, and subsequent failure.

Among the exciting new discoveries:

  • You really, really need to remember to mark /boot as bootable. Otherwise, the dumb Mac firmware will give you the blinking question mark despite all of your previous care.

  • If your user/groups don’t match between the two systems, you probably just borked up a bunch of uid/gids. Luckily there are the -nouser and -nogroup args to find, and almost everything outside of /home is root:root.
  • Once again, Macs won’t boot a rescue USB flash drive unless it has EFI crud laying around in the root directory.
  • UUID-based fstab and grub.conf are not happy when you have an entirely new drive
  • Debian tar doesn’t understand xattrs, that’s a RedHat feature (although I caught this one in time and compiled RedHat’s tar on the Debian system).

On the whole though, it wasn’t too bad. The install took maybe twice as long as installing from media and restoring my usual backup despite the mistakes, and this way I don’t have to reinstall huge numbers of packages to get back to where I was. And so far the drive seems quite speedy, even with my sad 1.5 Mbps interface. Plus it doesn’t hurt that the installation did a giant defrag of the whole OS.