Tag Archives: python

Parsing HRegionInfo in Python

I’ve been doing a fair amount of HBase work lately at $work, not least of which is pybase, a python module that encapsulates Thrift and puts it under an API that looks more or less like the Cassandra wrapper pycassa (which we also use).

When running an HBase cluster, one must very quickly learn the stack from top to bottom and be ready to fix the metadata when catastrophe strikes. Most of the necessary information about HBase regions is stored in the .META. table; unfortunately, some of the values therein are serialized HBase Writables. One usually uses JRuby and loads the Java classes directly to deal with the deserialization, but we’re a Python shop, and doing it all over Thrift would be ideal.

Thus, here’s a quick module to parse out HRegionInfo along with a few generic helpers for Writables. I haven’t decided yet whether this kind of thing belongs in pybase.

I’m curious whether there is an idiomatic way to do advancing-pointer operations in Python without returning an index everywhere. Perhaps converting the array to a file-like object?

#!/usr/bin/python
import struct

def vint_size(byte):
    # Total encoded length (marker byte included), given the signed first byte.
    if byte >= -112:
        return 1

    if byte < -120:
        return -119 - byte

    return -111 - byte

def vint_neg(byte):
    return byte < -120 or -112 <= byte < 0

def read_byte(data, ofs):
    # Unsigned byte; unpack_from works on Python 2 str and Python 3 bytes alike.
    val = struct.unpack_from(">B", data, ofs)[0]
    return (val, ofs + 1)

def read_long(data, ofs):
    val = struct.unpack_from(">q", data, ofs)[0]
    return (val, ofs + 8)

def read_vint(data, ofs):
    firstbyte, ofs = read_byte(data, ofs)
    # The vint marker tests expect a signed byte, as in Java.
    if firstbyte > 127:
        firstbyte -= 256

    sz = vint_size(firstbyte)
    if sz == 1:
        return (firstbyte, ofs)

    val = 0
    # sz counts the marker byte, so sz - 1 payload bytes follow.
    for i in range(sz - 1):
        (nextb, ofs) = read_byte(data, ofs)
        val = (val << 8) | nextb

    if vint_neg(firstbyte):
        val = ~val

    return (val, ofs)

def read_bool(data, ofs):
    byte, ofs = read_byte(data, ofs)
    return (byte != 0, ofs)

def read_array(data, ofs):
    # Byte arrays are a vint length followed by that many bytes.
    sz, ofs = read_vint(data, ofs)
    val = data[ofs:ofs+sz]
    ofs += sz
    return (val, ofs)

def parse_regioninfo(data, ofs):
    end_key, ofs = read_array(data, ofs)
    offline, ofs = read_bool(data, ofs)
    region_id, ofs = read_long(data, ofs)
    region_name, ofs = read_array(data, ofs)
    split, ofs = read_bool(data, ofs)
    start_key, ofs = read_array(data, ofs)
    # tabledesc: not about to parse this
    # hashcode: int

    result = {
        'end_key' : end_key,
        'offline' : offline,
        'region_id' : region_id,
        'region_name' : region_name,
        'split' : split,
        'start_key' : start_key,
    }
    return result
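On the file-like object question above, here is one possible shape for it: a minimal sketch that wraps the bytes in io.BytesIO so that every read advances the position by itself. The WritableReader class and its method names are made up for illustration (they are not part of pybase or HBase), and the sketch is written for Python 3.

```python
import io
import struct


class WritableReader(object):
    """Hypothetical helper: the stream position replaces the explicit offset."""

    def __init__(self, data):
        self.stream = io.BytesIO(data)

    def _unpack(self, fmt):
        # Read exactly as many bytes as the struct format needs.
        size = struct.calcsize(fmt)
        raw = self.stream.read(size)
        if len(raw) != size:
            raise EOFError("wanted %d bytes, got %d" % (size, len(raw)))
        return struct.unpack(fmt, raw)[0]

    def read_byte(self):
        return self._unpack(">b")  # signed, like a Java byte

    def read_bool(self):
        return self._unpack(">b") != 0

    def read_long(self):
        return self._unpack(">q")

    def read_vint(self):
        first = self.read_byte()
        if first >= -112:
            return first  # small values fit in the marker byte itself
        negative = first < -120
        # Number of payload bytes that follow the marker byte.
        count = (-120 - first) if negative else (-112 - first)
        val = 0
        for _ in range(count):
            val = (val << 8) | self._unpack(">B")
        return ~val if negative else val

    def read_array(self):
        size = self.read_vint()
        return self.stream.read(size)


# A made-up buffer: a 3-byte array, a boolean, a long, then a 2-byte vint.
reader = WritableReader(b"\x03abc\x01" + struct.pack(">q", 42) + b"\x8f\x80")
print(reader.read_array())
print(reader.read_bool())
print(reader.read_long())
print(reader.read_vint())
```

The trade-off is that the reader is stateful, so peeking or re-reading means calling seek() on the underlying stream rather than just reusing an old offset.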

My anaconda don’t want none…

I decided to take a look at this Python language to see what all the hubbub is about. It looks alright. Maybe someday I’ll find an excuse to use it, but my heart still belongs to Perl.

My first Python program had to be a quine (that is, a program that prints its own source code). It is a tricky problem at first, but most quines boil down to an exercise in quoting: how can you quote a string and embed the string itself in that string? Suppose you put the text of the program in a string ‘s’ and printed it:

string s="string s=XYZ; print s"; print s

This is the basic structure of at least the classic version. The problem lies in putting the value of ‘s’ in place of the XYZ, including the enclosing quotes (the escaping of which turns out to be a roadblock). The classic C program uses printf in combination with the character codes for quotes to get by. My image-producing perl quine similarly used the chr() function (I used an eval in the perl version, although it would certainly be possible to just duplicate all of the code outside of the string).
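For comparison, the character-code trick can be sketched in Python as well, with the % operator standing in for printf and 39 being the character code of a single quote. This is only an illustration in Python 3 syntax, not the classic C program; QUINE and run_and_capture are scaffolding names invented here to check that the one-liner really reproduces itself.

```python
import contextlib
import io

# The one-line quine, held as text so it can be executed and compared.
QUINE = r"""s='s=%c%s%c;print(s%%(39,s,39))';print(s%(39,s,39))
"""


def run_and_capture(source):
    """Execute source code and return everything it printed."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(source, {})
    return buf.getvalue()


# True when the program's output is exactly its own source.
print(run_and_capture(QUINE) == QUINE)
```

The two %c conversions supply the enclosing quotes, so the string never has to contain an escaped quote of its own.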

Perhaps a more straightforward approach to setting XYZ to ‘s’ is to use a single regular expression substitution.

Perl has, in my opinion, better semantics for quoting than Python. Or maybe I am just used to it. In Perl, a single quote means everything until the close quote is literal. In Python, it really doesn’t make a difference: '\n' and "\n" are both newlines. So in Python they added letters before the quotes to get different behaviors; in particular, the letter r indicates “raw” mode, analogous to Perl’s single quoting. My first impression is that this is syntactically a bunch of garbage, but I guess it maps to Perl’s “q” operator, which I do use a lot.
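To make the quoting difference concrete, here is a small sketch (the variable names are arbitrary) comparing the two ordinary quote styles with a raw string:

```python
single = '\n'   # escape is processed: one newline character
double = "\n"   # double quotes behave exactly the same way
raw = r'\n'     # raw mode: a backslash followed by the letter n

print(len(single), len(double), len(raw))
print(single == double)
print(raw == '\\n')
```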

Anyway, mix raw quoting with regex substitutions and you get (imagine this is all on one line):


import re;s=r'import re;s=rC;print re.sub("C","\x27"+s+"\x27",s,1)';print re.sub("C","\x27"+s+"\x27",s,1)
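The one-liner above relies on the Python 2 print statement and on re.sub leaving an unknown escape such as \x27 in the replacement string untouched. Under Python 3 neither holds, so a port has to change a little. The following is a sketch of one way to do it, not the original program: the replacement is passed as a function (which skips escape processing in the replacement), and QUINE and run_and_capture are names invented here to verify the result.

```python
import contextlib
import io

# Python 3 variant of the substitution quine, held as text for checking.
QUINE = r"""import re;s=r'import re;s=rC;print(re.sub("C",lambda m:"\x27"+s+"\x27",s,count=1))';print(re.sub("C",lambda m:"\x27"+s+"\x27",s,count=1))
"""


def run_and_capture(source):
    """Execute source code and return everything it printed."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(source, {})
    return buf.getvalue()


# True when the program's output is exactly its own source.
print(run_and_capture(QUINE) == QUINE)
```

As in the original, the raw string keeps \x27 as four literal characters inside s, while the copy outside the string is an ordinary literal that evaluates to a single quote.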