I’ve been doing a fair amount of HBase work lately at $work, not least of which is pybase, a Python module that wraps the Thrift interface in an API that looks more or less like the Cassandra wrapper pycassa (which we also use).
When running an HBase cluster, one must very quickly learn the stack from top to bottom and be ready to fix the metadata when catastrophe strikes. Most of the necessary information about HBase regions is stored in the .META. table; unfortunately, some of the values in it are serialized HBase Writables. The usual approach is to load the Java classes directly from JRuby to handle the deserialization, but we’re a Python shop, and doing it all over Thrift would be ideal.
Thus, here’s a quick module to parse out HRegionInfo along with a few generic helpers for Writables. I haven’t decided yet whether this kind of thing belongs in pybase.
I’m curious whether there is an idiomatic way to do advancing-pointer-style operations in Python without returning an index everywhere. Perhaps converting the array to a file-like object?
#!/usr/bin/python
import struct

def vint_size(byte):
    # Total encoded length, first byte included (byte is signed).
    if byte >= -112:
        return 1
    if byte < -120:
        return -119 - byte
    return -111 - byte

def vint_neg(byte):
    return byte < -120 or -112 <= byte < 0

def read_byte(data, ofs):
    # Read as signed, like Java's byte; vint_size() depends on this.
    val = struct.unpack_from(">b", data, ofs)[0]
    return (val, ofs + 1)

def read_long(data, ofs):
    val = struct.unpack_from(">q", data, ofs)[0]
    return (val, ofs + 8)

def read_vint(data, ofs):
    firstbyte, ofs = read_byte(data, ofs)
    sz = vint_size(firstbyte)
    if sz == 1:
        return (firstbyte, ofs)
    val = 0
    # sz counts the first byte, so sz - 1 payload bytes remain.
    for i in range(sz - 1):
        nextb, ofs = read_byte(data, ofs)
        val = (val << 8) | (nextb & 0xff)
    if vint_neg(firstbyte):
        val = ~val
    return (val, ofs)

def read_bool(data, ofs):
    byte, ofs = read_byte(data, ofs)
    return (byte != 0, ofs)

def read_array(data, ofs):
    # A byte array is a vint length followed by that many bytes.
    sz, ofs = read_vint(data, ofs)
    val = data[ofs:ofs+sz]
    ofs += sz
    return (val, ofs)

def parse_regioninfo(data, ofs):
    end_key, ofs = read_array(data, ofs)
    offline, ofs = read_bool(data, ofs)
    region_id, ofs = read_long(data, ofs)
    region_name, ofs = read_array(data, ofs)
    split, ofs = read_bool(data, ofs)
    start_key, ofs = read_array(data, ofs)
    # tabledesc: not about to parse this
    # hashcode: int
    result = {
        'end_key' : end_key,
        'offline' : offline,
        'region_id' : region_id,
        'region_name' : region_name,
        'split' : split,
        'start_key' : start_key,
    }
    return result
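On the file-like-object question, here is one possible sketch: wrap the bytes in `io.BytesIO` and let the stream track the position, so nothing has to return an offset. The names `WritableReader`, `parse_regioninfo_stream` and `_short_array` are made up for illustration and are not part of pybase or HBase. The sample bytes are hand-built to match the field order above, not pulled from a real .META. row, and I haven’t run this against a live cluster.

```python
import io
import struct


class WritableReader(object):
    """Hypothetical helper: the BytesIO stream remembers the read position."""

    def __init__(self, data):
        self.stream = io.BytesIO(data)

    def _read_exact(self, size):
        raw = self.stream.read(size)
        if len(raw) != size:
            raise EOFError("wanted %d bytes, got %d" % (size, len(raw)))
        return raw

    def read_byte(self):
        # Signed, to match Java's byte.
        return struct.unpack(">b", self._read_exact(1))[0]

    def read_bool(self):
        return self.read_byte() != 0

    def read_long(self):
        return struct.unpack(">q", self._read_exact(8))[0]

    def read_vint(self):
        first = self.read_byte()
        if first >= -112:
            return first  # small values fit in the first byte
        # Otherwise the first byte encodes the sign and payload length.
        negative = first < -120
        nbytes = (-120 - first) if negative else (-112 - first)
        val = 0
        for _ in range(nbytes):
            val = (val << 8) | (self.read_byte() & 0xFF)
        return ~val if negative else val

    def read_array(self):
        return self._read_exact(self.read_vint())


def parse_regioninfo_stream(data):
    # Same field order as parse_regioninfo; trailing fields are ignored.
    r = WritableReader(data)
    info = {}
    info['end_key'] = r.read_array()
    info['offline'] = r.read_bool()
    info['region_id'] = r.read_long()
    info['region_name'] = r.read_array()
    info['split'] = r.read_bool()
    info['start_key'] = r.read_array()
    return info


def _short_array(raw):
    # Test helper (assumption): only valid for lengths that fit in one byte.
    return struct.pack(">b", len(raw)) + raw


# Hand-built sample bytes, not real .META. contents.
sample = (_short_array(b"row-m") + b"\x00" +
          struct.pack(">q", 1291234567890) +
          _short_array(b"t1,,1291234567890") + b"\x00" +
          _short_array(b""))
info = parse_regioninfo_stream(sample)
```

The trade-off is an extra object per value parsed, in exchange for call sites that read like the Java `DataInput` code they mirror.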