JNIs and ABIs

For a recent task at work, I had a compelling reason to break out of the Java jail and do some “real programming” in C. Unfortunately, said C-language bits were to be run on the Windows platform. I had a stripped DLL (no source) for manipulating USB cameras and an API document to work with, and a plan to duct tape this all together with JNI.

I’m not about to touch MSVC if I can help it, so I grabbed mingw packages for my platform and wrote up a small wrapper DLL to do the necessary JNI bridge. Here’s my attempt to archive the working recipe for the Google bots.
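The wrapper itself is nothing fancy: each native method on the Java side gets a Java_<package>_<class>_<method> entry point that forwards to the vendor DLL. Here is a minimal sketch of the shape of it (the Java class name, the native method, and the UsbCamOpen() call are made-up placeholders, not the real camera API):

#include <jni.h>

#include "usbcam.h"   /* vendor API header (name is a placeholder) */

JNIEXPORT jint JNICALL
Java_com_example_UsbCam_open(JNIEnv *env, jobject obj, jint index)
{
    (void)env;   /* unused here; needed for string/array conversions */
    (void)obj;
    return UsbCamOpen(index);   /* hypothetical vendor call */
}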

Compiling this wrapper library was quite easy with mingw:

$ cat Makefile
# This makefile builds usbcamjni.dll to wrap usbcam.dll
# Use mingw32-make to build
CPPFLAGS=-I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
LIBS=-L . -l usbcam

usbcamjni.dll: usbcamjni.o
        $(CC) -shared -o $@ $< $(LIBS)

clean: .FORCE
        $(RM) usbcamjni.dll usbcamjni.o

.FORCE:

$ mingw32-make

It turns out on Windows there are a few different calling conventions that you can choose from at compile time. Garden-variety code uses __cdecl, which is how the world should operate. Symbols are named just as they are in the C code, with a leading underscore, and the caller is responsible for cleaning up his own stack mess.

In the Windows world, there is __stdcall, where the symbol names (by default) have trailing decorations for the size of parameters on the stack (e.g. _write@4) and the callee cleans up the stack. The benefit is somewhat smaller code, but you can't have variadic functions.

So I was testing my JNI wrapper and getting JVM crashes. Looking at the register dump showed that the program counter was not pointing to a normal text virtual address, which is symptomatic of stack confusion. Once I re-acquainted myself with the above arcana, I guessed correctly that I was using the wrong calling convention for the DLL: together the function epilogue and caller were popping off too many levels of stack. However, in this case the DLL was using undecorated symbol names (e.g. _write) but still expecting __stdcall semantics. As it happens, this is how Win32 API works.

The problem is that annotating the prototypes in header files with "__stdcall" to get correct calling conventions also makes the dependent object file expect the decorated function name. Linking against the DLL results in lots of unsatisfied symbol errors.
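To make this concrete, the annotation looks something like the sketch below; the function name and signature are invented for illustration, not taken from the real camera API. The __stdcall keyword fixes the calling convention, but now the object file wants the decorated symbol while the DLL only exports the plain one.

/* usbcam.h (sketch) */
#ifdef _WIN32
#define USBCAM_API __stdcall
#else
#define USBCAM_API
#endif

/* object code now references _UsbCamOpen@4, but the DLL exports _UsbCamOpen */
int USBCAM_API UsbCamOpen(int index);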

The solution took some digging and doesn't make a whole lot of sense to me, but the way out of this using mingw tools is to build an interface library. First, you create a def file like so:

LIBRARY MYLIB.DLL
EXPORTS
Symbol1@0
Symbol2@4
Symbol3@8

Note that symbols in this file include the decorations. Then you run dlltool from mingw with -k to remove the decorations from the generated library. As I understand it, this option actually creates aliases from the decorated to undecorated name so that subsequent linking can work. The manpage says "-A" is actually for this, but before I read it, I found a mailing list post advocating "-k", and, well, it worked, so I'm joining that particular cargo cult.

$ dlltool -k -d mylib.def -l mylib.a

Then when building the JNI library wrapper, link against mylib.a instead of the original DLL. Easyish!

Kernel updates

During baby naps I actually did some Rio Karma work this week. First, I found and fixed a recent bug in usb storage that makes the Karma not work at all in 2.6.35. Then I finally set up my omfs git tree on kernel.org, with a few patches I wrote almost two years ago, plus a recently found memory leak. I’ll try to get Linus to pull it in 2.6.36. Then, I made a new release of omfsprogs rewritten to use libomfs from the FUSE version of omfs (I also did this work two years ago). Now all is right in the Karma world again; time to ignore it for another 6 months.

Then, on to ath5k. One nagging issue that the driver has always had is that there’s pretty much a total lack of synchronization among the myriad threads. There’s an ad-hoc spin lock here and there, but for example the reset procedure, which reloads hundreds of registers, can be called concurrently with itself and other driver ops, and can be invoked while trying to transmit and receive packets. So one idea we’ve been hashing out on the ML is to serialize reset with a mutex by putting it into process context via a workqueue, and disabling tasklets while it runs. I hacked that together and it’s now available in wireless-testing. Hopefully this will fix some of those weird DMA-into-freed-skb errors that can potentially be caused by untimely modification of descriptor pointers.
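In rough terms the scheme looks like the sketch below. This is only an illustration of the idea, not the actual wireless-testing patch, and the structure and function names are invented:

#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

/* illustrative per-device state; the real driver keeps this in its softc */
struct sketch_softc {
    struct mutex lock;              /* serializes reset against other ops */
    struct work_struct reset_work;
    struct tasklet_struct rxtq, txtq;
};

static void sketch_reset_work(struct work_struct *work)
{
    struct sketch_softc *sc =
        container_of(work, struct sketch_softc, reset_work);

    mutex_lock(&sc->lock);          /* we can sleep: process context */
    tasklet_disable(&sc->rxtq);     /* keep RX/TX away from the registers */
    tasklet_disable(&sc->txtq);

    /* ... reload the hundreds of registers ... */

    tasklet_enable(&sc->txtq);
    tasklet_enable(&sc->rxtq);
    mutex_unlock(&sc->lock);
}

/* error/interrupt paths queue the work instead of resetting inline:
 *     schedule_work(&sc->reset_work);
 */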

Finally: tracing. For the last year or so, Linux has grown its own NIH answer to DTrace in the form of ftrace and tracepoints. The wireless subsystem already has a few tracepoints here and there, and I like them better than debug printks because they have smaller overhead in the compiled-but-not-enabled case (a branch compared to an out-of-line function call). You can pick and choose individual tracepoints to sample, or combine them with a general tracing facility such as ftrace. I have some patches baking to add various tracepoints to ath5k [1, 2, 3].
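For the curious, a tracepoint definition is just a TRACE_EVENT() block in a trace header. The sketch below leaves out the usual header boilerplate (TRACE_SYSTEM, include guards, CREATE_TRACE_POINTS), and the event name and fields are invented, not taken from my actual patches:

#include <linux/tracepoint.h>

TRACE_EVENT(ath5k_tx_sketch,
    TP_PROTO(unsigned int qnum, unsigned int len),
    TP_ARGS(qnum, len),
    TP_STRUCT__entry(
        __field(unsigned int, qnum)
        __field(unsigned int, len)
    ),
    TP_fast_assign(
        __entry->qnum = qnum;
        __entry->len = len;
    ),
    TP_printk("queue %u len %u", __entry->qnum, __entry->len)
);

/* in the hot path, trace_ath5k_tx_sketch(qnum, len) compiles down to a
 * cheap branch when the tracepoint is disabled */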

One really neat idea that I appropriated from Johannes Berg’s work on iwlwifi is to store entire packets in the tracing ring buffer. Then you can have a trace-cmd plugin to write these out as tcpdump pcap files. This is really nice: you get a packet dump that works with your existing debug infrastructure, plus you have the tracing information so you know what the code is trying to do, at whatever granularity you chose when setting up the trace. Hopefully, this will be a good one-stop debug gathering tool, not just for developers, but for users, as well.

On S-Expressions

I suppose I’m old enough now, and the rate of change in computing is fast enough, that I can regale young software developers with I-remember-when stories. You know, the ones that start off like “all of my punchcards were rubber-banded together, but I had neglected to number them…” I’ll spare you the tale of how we used to screen-scrape the mainframes at LargeAirline.com for today; instead you get to read my ramblings on languages-older-than-Go.

Today’s topic is inspired by my reading on LWN about the fact that Emacs is still around, which led me to David Wheeler’s argument that parentheses are why no one uses Lisp and that we can do better. Perhaps so, but I found his plugging of SKILL interesting, since once, long ago, when I was but a fresh young hacker, I spent a rather significant bit of time writing software in this language.

Since you have likely never heard of it, SKILL is a Lisp implementation that is (or was?) embedded in the Cadence suite of Electronic Design Automation products. As I recall, the biggest warts were:

  • the fact that you could use C-like prefixing — f(x y) was equivalent to (f x y) — but other constructs, like using parentheses for grouping, behaved as in Lisp (i.e. errored).
  • the runtime library used C-like functions, such as printf, fread, etc.
  • the developer’s manual encouraged imperative-style programming (loops and the like); indeed, using recursion instead would usually lead to stack overflows

I typically wrote in the Lisp style and just ignored the C-like features as much as possible. The experience soured me on functional languages forever, which isn’t really fair, because SKILL was particularly bad.

But I’ve often wondered why it was so horrible, and how it came to be that Cadence, a large successful company, selected it as its primary extension language. Of course, almost all EDA software is horrible so the bar is pretty low, but I wondered if this was a case of Cadence not realizing how bad it was, or if it was just historical baggage. Or did a bored programmer just decide one day, “hey, let’s write our own crappy Lisp!” (cf. much of Android).

Luckily, there’s a paper on it to shed some light: “SKILL: A Lisp-based extension language” by E. Petrus, 1993. I snagged a version from behind the ACM paywall, so I won’t post it, but I’ll quote a few interesting bits here.

SKILL was originally conceived as a base language for developing simulation and test description languages… Lisp was chosen as a base language because it lent itself well to an interpreter based implementation and Lisp is a natural choice for code generation and as a meta language. The first implementation was an interpreter based on the Franz Lisp developed at Cadence Design Systems in the mid-1980s.

Fair enough. Mostly historical baggage, partly not-invented-here.

The popular syntax for SKILL is a C-like alternative to Lisp. When SKILL was first introduced, the connection with Lisp was not promoted in electronic engineering circles.

Too bad. Maybe if the same engineers had compared it to an allegedly decent Lisp, they might have demanded better.

Because of the large installed SKILL base, it has been difficult to move SKILL [in the direction of Common Lisp].

Oops.

For the most part, SKILL developers come from a EE background and lack general familiarity with functional languages. As a result, company wide code inspections revealed the need for a high level code analysis tool.

Even Cadence internal staff couldn’t use SKILL.

The paper goes on to describe why extension languages are useful as a component of software reuse. While I won’t disagree, I find the characterizations of EDA software more an indictment of the state of software development in the industry circa the mid-90s. The products I used from Mentor Graphics, Cadence, and so on were just horrendous. Each application was typically a 40 meg, crashy, statically linked executable, sharing a lot of functionality with a few other executables that mostly did the same thing but had subtly different missing features (probably in the name of their byzantine licensing strategies).

  • ECAD software is too large and diverse to be contained in a single executable
  • Need “fire-walls” between products. Separating major products into separate executables reduces the chance of one set of code corrupting data in another.
  • Change control is easier when the software is built separately.
  • Independent release of products. Products linked together in the same executable must be shipped together. In separate executables, products can be shipped whenever necessary.

(Shared libraries anyone?)

I think I wouldn’t mind SKILL if it had been a decent Lisp, but unfortunately Cadence got stuck with maintaining their own one-off fork and the software ecosystem that grew up around it. At the EDA firm at which I worked, we were lucky to be a late-comer, which allowed us to use a reasonable extension language: perl. I haven’t kept up with the industry, but I hope others have followed suit.

Search Trees Revisited

I have been investigating search trees to see if binary search trees’ memory layout can, in practice, make a large difference in performance. Now that I’ve fixed the bugs in my test application, I can say: yes! The theory here is in so-called “cache-oblivious” algorithms. The idea is that you can use divide-and-conquer algorithms and data structures to exploit locality and ensure that at some point you will be working with the granularity of your memory transfer size (such as the size of a cache line or disk block).

How well does it work? Take a look:

This machine has a 2 meg L2 cache and 32 K L1 d-cache. The size of nodes in this test was 32 bytes, so the L1 cache is blown at 2**10 items, and the L2 cache at 2**16.

With the exception of “base,” the trees are all laid out in an array; “vEB” is the cache-oblivious version in which the tree uses a van Emde Boas layout: the top half of the tree is stored in the array, followed by each of the subtrees in left-to-right order, recursively. Unfortunately, constructing the tree dynamically is rather complicated.
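For a complete tree stored by BFS index (children of node i at 2i and 2i+1), the layout itself is a short recursion. Here is a minimal sketch with a hypothetical helper, not the code from my test program:

/* Emit the BFS indices of a complete binary tree of height h, rooted at
 * BFS index `root`, in van Emde Boas order.  Results go into out[]
 * starting at position pos; returns the next free position. */
static int veb_layout(int root, int h, int *out, int pos)
{
    if (h == 1) {
        out[pos++] = root;
        return pos;
    }

    int top = h / 2;            /* height of the top tree */
    int bottom = h - top;       /* height of each bottom subtree */

    /* lay out the top tree first */
    pos = veb_layout(root, top, out, pos);

    /* then each bottom subtree, left to right; their roots are the
     * 2^top descendants of `root` at depth `top` in BFS numbering */
    int nsub = 1 << top;
    for (int i = 0; i < nsub; i++)
        pos = veb_layout(root * nsub + i, bottom, out, pos);

    return pos;
}

For a height-4 tree this produces 1,2,3,4,8,9,5,10,11,6,12,13,7,14,15: the root and its children first, then each grandchild’s subtree in turn.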

So far I’m just exploring ground that’s already been covered, but I do think this is an interesting class of data structures that hasn’t trickled out into the real world yet.

A tale of two search trees

As a warm up for a research topic I’m doing this semester, I am looking at static search tree performance. Consider the following tree fragment where nodes are numbered according to BFS order:

            1
      2            3
   4     5      6      7
  8 9  10 11  12 13  14 15

The nodes are located in memory wherever malloc() put them. An interesting question is whether a different in-memory layout has better search performance. One possible layout is in an array with the following order: 1,2,3,4,8,9,5,10,11,6,12,13,7,14,15 (this pattern is not original to me).

So far, the answer, surprisingly to me, is: not with this layout. With a benchmark of 10 million searches of a randomly-inserted (otherwise unbalanced) tree, I get:

Original Tree: 77s
Reordered Tree: 87s

The tree searches are exactly the same — only the memory layout changed. I can reduce these numbers quite a bit by using -O2 and reducing the size of tree nodes, but there’s still a consistent net loss to the array ordering.

I suspect my cache-emptying routines don’t actually empty the cache, but if that’s not it, it will be interesting to see where I made the stupid mistake / mental lapse.

Android rebuilt

While stuck inside for days due to snow, I once again downloaded the Android source trees from the repos. I plan to update my Android wifi page with better, step-by-step instructions to go from nothing to a working set of wireless utilities on the phone. In the meantime, I’ve re-familiarized myself with the current Android build environment (for various values of current — this uses donut since as far as I know, eclair doesn’t have a definition for the Dream (G1/ADP1) phone yet).

I do have to say that things in this area have improved a lot. While it still took some google-fu to figure out which branch to check out to get the correct software (tag donut-plus-aosp), building the entire image from scratch was rather straightforward (. build/envsetup.sh; lunch aosp_dream_us-eng; make). My main complaint is that the build system is far too slow to use for day-to-day work. Is Google still not eating their dogfood here?

So far I have libnl, iw, and packetspammer now integrated properly via Android.mk makefiles, and a custom kernel plus the support modules for wl1251. Unfortunately, space is really tight on the system partition, so the TI driver had to go to make room for everything. I’m still waiting for someone to port the Android runtime to emdebian or maemo with /usr on the sdcard, but perhaps I’ll mess around with bind mounts until then.

It seems wl1251 SDIO doesn’t like the new power-saving code, so that is something else I’ll look into soon. For now, one can disable that in the driver or possibly via “iwconfig wlan0 power off”.

wl1251: cmd set ps mode
wl1251: cmd configure
mmc0: Data timeout
wl1251: ERROR sdio write failed (-110)
...
wl1251: ERROR elp wakeup timeout

Storage Menagerie

I recently found myself needing to become familiar with the disk layouts of a few filesystems for a future research project. At the same time, I wanted to experience the fail whale that is github.com (my other git trees are hosted on repo.or.cz, which is spare and of tenuous funding, but works just fine). So over the weekend I threw together this little git tree.

Not much there, just ext2 and yaffs2 so far. I chose the latter because I wanted to look at a log-structured filesystem, while simultaneously producing something useful for Android work. I plan to add btrfs shortly and maybe some other log FS. This is just junk code: if you really want to use these filesystems in FUSE, there are much better implementations. Also there’s a great hack out there to run the kernel FS implementations in userspace via UML.

Still, it’s fun reinventing the wheel, and it’s a nice way to get a handle on this stuff.

JBoss Hash

Here’s how to figure out the hashes of method names used by JBoss’ RMI invoker from the shell. The first two numbers in the method signature form the length of the string. Maybe there’s a better way to do the backrefs in sed, but you do need to swap them around.

$ echo $[`printf "\00\31ping(Ljava/lang/String;)V" |
    sha1sum |
    sed -e "s/\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\)\(..\).*/0x\8\7\6\5\4\3\2\1/g"`]

Don’t ask me why I figured that out.

Numerology

I’m technically on break from school now, although I’m still putting the final polish on one last project. But things have slowed down enough that I can bore you with technical crap again.

There’s an interesting section of TPOP wherein the authors discuss studying the “numerology” of a bug; that is, look for patterns in the data instead of just looking at the code. I’ve had a lot of practice with this idea lately when trying to decipher kernel crashes with nothing but an Oops message to go on (these being much easier than infrequent lockups with no output at all!). It’s quite useful for userspace as well, if you are using a language where memory corruption is a fact of life. To show why, here’s a lengthy anecdote about how I recently used the power of man ascii to find and fix a memory corruption bug in about five minutes while racing to meet a project deadline.

First, some background: in the kernel, linked lists are implemented by embedding the next and previous pointers directly into the linked object, by way of the struct list_node. For example:

struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* one item in a list */
struct item {
    int id;
    struct list_node list;
};

/* some other object that has a list of items */
struct bar {
    struct list_node item_list_head;
};

This is a curious construction, but once you get used to it, it’s quite handy — you can embed an object in multiple simultaneous lists without having to worry about the lifetime of void pointers, and without having to malloc container structures all the time. The list code is object-agnostic, operating only on the list_node structures, but of course you need a way to get from a struct list_node back to the object that contains it. For that, there is the container_of macro and friends. These are just C macros that, given the address of the embedded struct list_node, the name of that member, and the type of the structure it is embedded in, find the start of the containing structure with a little pointer arithmetic.
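Stripped of the kernel’s type-checking tricks, container_of boils down to a little offsetof arithmetic, roughly:

#include <stddef.h>

/* step back from a pointer to an embedded member to the start of the
 * structure containing it (simplified; the kernel version adds type checks) */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* e.g., given struct list_node *n embedded in a struct item:
 *     struct item *it = container_of(n, struct item, list);
 */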

I had implemented a similar set of list routines for a userspace program and started getting weird crashes, so I fired up gdb to see something like the following:

gdb > p *item
$1 = {id = 134269966, item_list = {next = 0x804b008, prev = 0x804b008},
      global_list = {next = 0x72657375, prev = 0x40313030},
      src = "machine.org", '\000' <repeats 69 times>}

Had I taken a minute to look at this in detail, I would have noticed that the value of id was quite a bit higher than it should have been, and that the leading part of src was truncated. But what jumped out at me immediately was that global_list.next and global_list.prev were obviously bogus pointer values. In Linux on x86, heap addresses usually start just above 0x0804b000, and stack addresses sit just below 0xbfffc000. Clearly, the global_list pointers fall in neither range. However, the item_list.next and item_list.prev addresses looked fine. So either the global list pointers weren't being initialized, or they were being corrupted somehow. That kind of corruption sometimes happens if the list pointers aren't updated properly, causing the code to follow a next pointer into random memory, but then it would be unlikely that the item_list pointers and src had meaningful values. Plus, I had unit-tested the list routines and was pretty confident they were working correctly.

My initial theory was that something was overwriting the global_list struct, and the byte values looked to be in the ascii range. So I turned to the ascii(7) man page, which gives you a nice little table of hex ordinal to character value mappings. Looking at the bytes in turn gives: 0x72 = ‘r’, 0x65 = ‘e’, 0x73 = ‘s’, 0x75 = ‘u’, 0x40 = ‘@’, 0x31 = ‘1’, 0x30 = ‘0’, 0x30 = ‘0’. Since x86 is little-endian, this works out to the string “user001@”, which should have been the first part of src. The bug must be where the code filled in the src string. However, that code looked fine. I then decided that the pointer to the entire structure must be offset somehow. Sure enough, when I looked at the surrounding code, I saw something like this:

    struct list_node *n;
    struct item *item;

    for (n = list_start(&item_list_head); n;
         n = list_next(&item_list_head, n))
    {
        item = container_of(n, struct item, global_list);
        /* strncpy and other stuff ... */
    }

Oops. I had passed the wrong list name to container_of(), with the effect that the item pointer was actually 8 bytes prior to the real start of the structure. As a result, the string copy overwrote the pointer values of just one of the list structures, a bug which didn’t actually have a negative effect until later.
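The fix was simply to name the member the loop actually iterates (field names as in the gdb dump above):

    item = container_of(n, struct item, item_list);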

There are morals a-plenty, including “macros are evil” and “this is why templates were invented.” Yet, I’ve seen Linus use this technique more than once when others were stuck on a kernel memory corruption bug, so I’ll go with “Kernighan and Pike are always right[1].”

[1] Except, I’m still not sold on using capital letters in Go. Perhaps in time I can be convinced.

Ath5k AP Mode

People seem to keep asking about this, despite there being a quality page on kernel.org on how to run an AP with any mac80211 driver. So for what it’s worth, here’s how my setup works. Note that if you are seeing flakiness with certain clients (e.g. it works fine with a computer but not with your cell phone), it is likely there is some bug with the power-save handling. I’m currently working on a few such issues, so it may be fixed soon enough.

To get started, you need the following:

  • some sort of network uplink, like a wired ethernet port
  • a kernel with ap mode support for ath5k (2.6.31+) and netfilter (for NAT)
  • dnsmasq
  • hostapd

You have two basic options for interfacing the wireless network with your wired network. One is to bridge the wireless interface directly onto the wired network. The other is to use NAT, so the wireless network sits on its own subnet. The former is more typical of an embedded device, but I prefer the latter on my home LAN, so that’s what I’ll describe.

Turn off any wireless daemon (such as NetworkManager) while you experiment with hostapd to ensure that nothing else has the device open.

The hostapd.conf is large but self-explanatory. One can get away with a rather small config file if using the defaults. This example sets up an AP on wlan0 with the SSID “hostapd_ath5k”: it has a wlan-facing IP address of 192.168.10.1, sits on channel 11, supports 802.11g, and uses the WPA pre-shared key “my_password”. I have also set up EAP; it works, but requires making a lot of certs and such, so Google is your friend here.

hostapd.conf:

interface=wlan0
driver=nl80211
ssid=hostapd_ath5k
hw_mode=g
channel=11
auth_algs=3
own_ip_addr=192.168.10.1
wpa=1
wpa_passphrase=my_password
wpa_key_mgmt=WPA-PSK
wpa_pairwise=CCMP

For DHCP and DNS forwarding, I use dnsmasq. This couldn’t be easier. Just put something like the following in /etc/dnsmasq.conf:

   dhcp-range=192.168.10.50,192.168.10.150,12h

This will hand out addresses in the 192.168.10.X subnet. Then I use a small script to enable IP masquerading before launching hostapd (note, it will flush all iptables rules, which may not be what you want, so use with caution).

run.sh:

#!/bin/sh

DEV=wlan0
GW_DEV=eth0

# set IPs
ifconfig $DEV down
ifconfig $DEV up
ifconfig $DEV 192.168.10.1

# setup ip masq
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -F
iptables -t nat -F
iptables -A FORWARD -i $DEV -j ACCEPT
iptables -A FORWARD -o $DEV -j ACCEPT
iptables -t nat -A POSTROUTING -o $GW_DEV -j MASQUERADE

./hostapd hostapd.conf

Running the script will launch hostapd, and if all goes well, you’ll see it show up in scans from other computers.

If anything goes wrong, make sure:

  • you’ve started dnsmasq
  • you have a valid backhaul connection
  • you aren’t using power save mode on the client (iwconfig wlan0 power off — see previous comment about PS bugs)
  • you haven’t found a bug