Is there a patch for bureaucracy?

August 24th, 2010

So I’ve now had experience fixing other peoples’ software in several different domains. Here’s how it plays out:

Open-source project:

Thanks! Applied.

Proprietary application:

Thank you for your submission. We have taken your request under consideration, and assigned it ticket #17231. In our review for the next release, scheduled to be released in Q4 next year, we will consider each resolved ticket for possible inclusion in our product.

Government project:
[paraphrasing an angry letter to division heads]

Please find attached a 9-page document describing why we cannot include any changes in our software, even 80-line changes that fix crashes such as the ones your agency has experienced. We suggest your agency maintain your own copy of the software with the change applied.

P.S. We doubt seriously it’s really our problem even though you gave us an explanation and a reasonable code fix. Please waste some more time trying to prove it to us.

JNIs and ABIs

August 10th, 2010

For a recent task at work, I had a compelling reason to break out of the Java jail and do some “real programming” in C. Unfortunately, said C-language bits were to be run on the Windows platform. I had a stripped DLL (no source) for manipulating USB cameras and an API document to work with, and a plan to duct tape this all together with JNI.

I’m not about to touch MSVC if I can help it, so I grabbed mingw packages for my platform and wrote up a small wrapper DLL to do the necessary JNI bridge. Here’s my attempt to archive the working recipe for the Google bots.

Compiling this wrapper library was quite easy with mingw:

$ cat Makefile
# This makefile builds usbcamjni.dll to wrap usbcam.dll
# Use mingw32-make to build
CPPFLAGS=-I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
LIBS=-L . -l usbcam

usbcamjni.dll: usbcamjni.o
        $(CC) -shared -o $@ $< $(LIBS)

clean: .FORCE
        $(RM) usbcamjni.dll usbcamjni.o

.FORCE:

$ mingw32-make

It turns out on Windows there are a few different calling conventions that you can choose from at compile time. Garden-variety code uses __cdecl, which is how the world should operate. The symbols are named just as they are named in the C code with a leading underscore, and the caller is responsible for cleaning up his own stack mess.

In the Windows world, there is __stdcall, where the symbol names (by default) have trailing decorations for the size of parameters on the stack (e.g. _write@4) and the callee cleans up the stack. The benefit is somewhat smaller code, but you can't have variadic functions.

So I was testing my JNI wrapper and getting JVM crashes. Looking at the register dump showed that the program counter was not pointing to a normal text virtual address, which is symptomatic of stack confusion. Once I re-acquainted myself with the above arcana, I guessed correctly that I was using the wrong calling convention for the DLL: together the function epilogue and caller were popping off too many levels of stack. However, in this case the DLL was using undecorated symbol names (e.g. _write) but still expecting __stdcall semantics. As it happens, this is how Win32 API works.

The problem is that annotating the prototypes in header files with "__stdcall" to get correct calling conventions also makes the dependent object file expect the decorated function name. Linking against the DLL results in lots of unsatisfied symbol errors.

The solution took some digging and doesn't make a whole lot of sense to me, but the way out of this using mingw tools is to build an interface library. First, you create a def file like so:

LIBRARY MYLIB.DLL
EXPORTS
Symbol1@0
Symbol2@4
Symbol3@8

Note that symbols in this file include the decorations. Then you run dlltool from mingw with -k to remove the decorations from the generated library. As I understand it, this option actually creates aliases from the decorated to undecorated name so that subsequent linking can work. The manpage says "-A" is actually for this, but before I read it, I found a mailing list post advocating "-k", and, well, it worked, so I'm joining that particular cargo cult.

$ dlltool -k -d mylib.def -l mylib.a

Then when building the JNI library wrapper, link against mylib.a instead of the original DLL. Easyish!

Kernel updates

July 17th, 2010

During baby naps I actually did some Rio Karma work this week. First, I found and fixed a recent bug in usb storage that makes the Karma not work at all in 2.6.35. Then I finally set up my omfs git tree on kernel.org, with a few patches I wrote almost two years ago, plus a recently found memory leak. I’ll try to get Linus to pull it in 2.6.36. Then, I made a new release of omfsprogs rewritten to use libomfs from the FUSE version of omfs (I also did this work two years ago). Now all is right in the Karma world again; time to ignore it for another 6 months.

Then, on to ath5k. One nagging issue that the driver has always had is that there’s pretty much a total lack of synchronization among the myriad threads. There’s an ad-hoc spin lock here and there, but for example the reset procedure, which reloads hundreds of registers, can be called concurrently with itself and other driver ops, and can be invoked while trying to transmit and receive packets. So one idea we’ve been hashing out on the ML is to serialize reset with a mutex by putting it into process context via a workqueue, and disabling tasklets while it runs. I hacked that together and it’s now available in wireless-testing. Hopefully this will fix some of those weird DMA-into-freed-skb errors that can potentially be caused by untimely modification of descriptor pointers.

Finally: tracing. For the last year or so, Linux has grown its own NIH answer to DTrace in the form of ftrace and tracepoints. The wireless subsystem already has a few tracepoints here and there, and I really like it better than debug printks because they have smaller overhead in the compiled-but-not-enabled case (a branch compared to an out-of-line function call). You can pick and choose individual tracepoints to sample, or combine them with a general tracing facility such as ftrace. I have some patches baking to add various tracepoints to ath5k [1, 2, 3].

One really neat idea that I appropriated from Johannes Berg’s work on iwlwifi is to store entire packets in the tracing ring buffer. Then you can have a trace-cmd plugin to write these out as tcpdump pcap files. This is really nice: you get a packet dump that works with your existing debug infrastructure, plus you have the tracing information so you know what the code is trying to do, at whatever granularity you chose when setting up the trace. Hopefully, this will be a good one-stop debug gathering tool, not just for developers, but for users, as well.

New syscall proposal

June 26th, 2010
CLONE3(2)                   System Programmer's Manual                 CLONE3(2)

NAME
       clone3 - create a child process

SYNOPSIS
       #include <clone.h>

       int clone3(pid_t parent, pid_t co_parent,
                 int (*fn)(void *), void *child_stack,
                 int flags, void *arg, ... );

DESCRIPTION
       clone3()  creates  a new process, in a manner similar to fork(2).

       Unlike fork(2), these calls allow the child process to share  parts  of
       its  execution  context  with  the  parent and co-parent processes,
       such as their home, food, and approximately half of their genetic
       material.

       The main use of clone3() is to implement offspring: sentient beings
       that run concurrently with the parents in a shared environment.

       When  the  child  process  is  created  with  clone3(),  it executes the
       function  application  fn(arg).   (This  differs  from  fork(2),  where
       execution  continues  in the child from the point of the fork(2) call.)
       The fn argument is a pointer to a function that is called by the  child
       process  at the beginning of its execution.  The arg argument is passed
       to the fn function.  Although the parent processes may supply a suitable
       fn function, the child process has a mind of its own and may deviate
       from its execution at will.

       Executing clone3 will randomly select half of the genes from each
       parent, resulting in a selection of phenotypes according to
       Mendelian inheritance.  By design this process increases the
       number of bits in the global entropy pool, making recovery of the
       random state impractical.

       The  child_stack  argument  specifies the location of the pile of
       diapers used by the child process.  After the child process leaves
       the EMBRYO state and until it acquires the CAP_POTTY capability,
       it may frequently dump core.  The calling process must therefore
       set up a large store of waste disposal units and pass a pointer to
       this store to clone3().  After a successful core dump it is necessary
       for a parent to handle the condition as soon as possible.

       The flags parameter may  be  bitwise-or'ed  with zero or more of the
       following constants, in order to specify  the behavior of the system
       call:

        CLONE_XY
              Setting CLONE_XY indicates to the system that a child of the
              male gender is requested.  This is only a hint to the system
              and may not be honored.

        CLONE_XX
              Setting CLONE_XX indicates to the system that a child of the
              female gender is requested.  This is only a hint to the system
              and may not be honored.

              If both CLONE_XY and CLONE_XX are set, results are undefined.

RETURN VALUE
       On  completion,  the  thread  ID  of  the child process is returned in
       both parents' thread of execution.  During execution, the child process
       may signal an error condition by raising any of the following signals.

SIGNALS
       HUNGRY  The stomach monitor process has detected an empty condition.
               The parent may correct the condition by supplying food.

       TIRED   The process has executed too many cycles and must enter
               sleep(3).

       GRUMPY  The process is raising signals at full volume for unknown
               reasons.  The parent process should attach pacifier(2) to
               the child process and attempt to cause it to enter sleep(3).

       HAPPY   The process is in a suitable state for taking pictures for
               grandparents.

CONFORMING TO
       Biology 101.

BUGS
       Child process may spit-up on your new shirt; a burping cloth is
       recommended.

EXAMPLE
       The following psuedo-code illustrates the intended usage:

       #include <clone.h>

       int main(int argc, char *argv[])
       {
            pid_t ange = /* ... */
            pid_t bob =  /* ... */
            pid_t alex;

            alex = clone3(ange, bob, fn, stack, CLONE_XY, NULL);
            sleep(86400 * 30 * 9);

            take_snapshot(alex);
       }

       This yields the following:

       
       Alexander Yit-Keung Copeland
       Born 26 June, 2010
       Weight 7 lbs 12 oz

SEE ALSO
       fork(2),    futex(2),    getpid(2),    gettid(2),   set_thread_area(2),
       set_tid_address(2),  tkill(2),  unshare(2),  wait(2),  capabilities(7),
       pthreads(7)

CLONE3(2)                         2010-06-26                          CLONE3(2)

Blowtorch Cuisine

June 21st, 2010

Ever mindful of fire safety, I’ve had a full propane canister (the small blowtorch size) rattling around in the back of my trunk for two and a half years. Luckily our car never decided to explode, but just in case, I’ve decided to finally use up the fuel.

Since I plan to never do any sort of plumbing again, the natural application for my blowtorch flame is on food, specifically crème brûlée. So I baked up some custard in my soup bowls, tossed some sugar on top, and burnt that bad boy.

burn

The problem is these torches really only work well upright, so you wind up having to bring the food to the flame rather than the other way around. I did at least have my fire extinguisher at the ready in case a wall decided to ignite. The smoke detector only went off once.

burnt

Our tasters were split on the outcome. Ange really liked it. I thought the custard was a bit too eggy, but that could be the recipe or personal taste. As I’m not a big CB eater, I don’t really have much to compare it to. It looked good, though.

Next up, fire grilled cheese?

Moved

June 4th, 2010

Apologies, dear reader, for not having updates lately: I have been quite busy.

Angeline and I just “completed” moving from one side of the street to the other side (the lie-quotes indicating our plan to live out of boxes for the near future). If you have our old mailing address or home phone number, it’s probably a good idea to drop me an email to get the new data.

The new place seems nice so far except for the water being shut off the day we moved in, and the towing of my car from the space they told us to park in (evidently they were just kidding when they said that).

Here’s the view from our window:


Moving on up

Proof (as if you needed any) that Ange and I are inveterate nerds — 8 of the dozens of boxes crammed full of books:


Books

On S-Expressions

April 7th, 2010

I suppose I’m old enough now, and the rate of change in computing is fast enough, that I can regale young software developers with I-remember-when stories. You know, the ones that start off like “all of my punchcards were rubber-banded together, but I had neglected to number them…” I’ll spare you the tale of how we used to screen-scrape the mainframes at LargeAirline.com for today; instead you get to read my ramblings on languages-older-than-Go.

Today’s topic is inspired by my reading on LWN about the fact that Emacs is still around, which led me to David Wheeler’s argument that parentheses are why no one uses Lisp and that we can do better. Perhaps so, but I found his plugging of SKILL interesting, since once, long ago, when I was but a fresh young hacker, I spent a rather significant bit of time writing software in this language.

Since you have likely never heard of it, SKILL is a Lisp implementation that is (or was?) embedded in the Cadence suite of Electronic Design Automation products. As I recall, the biggest warts were:

  • the fact that you could use C-like prefixing — f(x y) was equivalent to (f x y) — but other constructs, like using parentheses for grouping, behaved as in Lisp (i.e. errored).
  • the runtime library used C-like functions, such as printf, fread, etc
  • the developer’s manual encouraged using imperative style programming like loops; indeed using recursion instead would usually lead to stack overflows

I typically wrote in the Lisp style and just ignored the C-like features as much as possible. The experience soured me on functional languages forever, which isn’t really fair, because SKILL was particularly bad.

But I’ve often wondered why it was so horrible, and how it came to be that Cadence, a large successful company, selected it as its primary extension language. Of course, almost all EDA software is horrible so the bar is pretty low, but I wondered if this was a case of Cadence not realizing how bad it was, or if it was just historical baggage. Or did a bored programmer just decide one day, “hey, let’s write our own crappy Lisp!” (cf. much of Android).

Luckily, there’s a paper on it to shed some light: “SKILL: A Lisp-based extension language” by E. Petrus, 1993. I snagged a version from behind the ACM paywall, so I won’t post it, but I’ll quote a few interesting bits here.

SKILL was originally conceived as a base language for developing simulation and test description languages… Lisp was chosen as a base language because it lent itself well to an interpreter based implementation and Lisp is a natural choice for code generation and as a meta language. The first implementation was an interpreter based on the Franz Lisp developed at Cadence Design Systems in the mid-1980s.

Fair enough. Mostly historical baggage, partly not-invented-here.

The popular syntax for SKILL is a C-like alternative to Lisp. When SKILL was first introduced, the connection with Lisp was not promoted in electronic engineering circles.

Too bad. Maybe if the same engineers had compared it to an allegedly decent Lisp, they might have demanded better.

Because of the large installed SKILL base, it has been difficult to move SKILL [in the direction of Common Lisp].

Oops.

For the most part, SKILL developers come from a EE background and lack general familiarity with functional languages. As a result, company wide code inspections revealed the need for a high level code analysis tool.

Even Cadence internal staff couldn’t use SKILL.

The paper goes on to describe why extension languages are useful as a component of software-reuse. While I won’t disagree, I find the characterizations of EDA software more an indictment on the state of software development in the industry circa mid-90s. The products I used from Mentor Graphics, Cadence, and so on were just horrendous. Each application was typically a 40 meg, crashy, statically linked executable, in which a lot of functionality was shared between a few other executables that mostly did the same thing but had subtly different missing features (probably in the name of their byzantine licensing strategies).

  • ECAD software is too large and diverse to be contained in a single executable
  • Need “fire-walls” between products. Separating major products into separate executables reduces the chance of one set of code from corrupting data in another.
  • Change control is easier when the software is built separately.
  • Independent release of products. Products linked together in the same executable must be shipped together. In separate executables, products can be shipped whenever necessary.

(Shared libraries anyone?)

I think I wouldn’t mind SKILL if it had been a decent Lisp, but unfortunately Cadence got stuck with maintaining their own one-off fork and the software ecosystem that grew up around it. At the EDA firm at which I worked, we were lucky to be a late-comer, which allowed us to use a reasonable extension language: perl. I haven’t kept up with the industry, but I hope others have followed suit.

Pork

April 2nd, 2010

Inspired by the ‘tinga’ (Mexican roast pork tostadas) recipe in this month’s Cook’s Illustrated, I bought a $12 picnic shoulder at the grocery store on our last trip out. The recipe recommends a boneless boston butt instead, which is probably a good idea given all of the tendons in the lower cut. But my grocer only had the shoulder and it’s cheap so what the heck.

I spent last Saturday morning carving all of the meat off of the bone, at which point I realized just how much pig we’re going to be eating for the next few weeks. As Ange and I try to subscribe to the ‘use everything but the squeal’ philosophy, I portioned the slab for various future meals: two pounds of meat for the aforementioned tinga, a couple of pounds cut into thin slabs for char siu, about another pound or so of trimmings for barbecue or pork tacos, and the bones went into the freezer for congee.

Which left a big hunk of skin. I tried making cracklings out of this, but the porcine gods were not having it: it was a big sticky mess. I ultimately gave up after a one sizzling piece of skin and lard hopped out of the pan only to land on my face about a centimeter away from my right eye. Even fried pig skin isn’t worth blindness.

The tostadas were pretty awesome though.

Search Trees Revisted

March 17th, 2010

I have been investigating search trees to see if binary search trees’ memory layout can, in practice, make a large difference in performance. Now that I’ve fixed the bugs in my test application, I can say: yes! The theory here is in so-called “cache-oblivious” algorithms. The idea is that you can use divide-and-conquer algorithms and data structures to exploit locality and ensure that at some point you will be working with the granularity of your memory transfer size (such as the size of a cache line or disk block).

How well does it work? Take a look:

This machine has a 2 meg L2 cache and 32 K L1 d-cache. The size of nodes in this test was 32 bytes, so the L1 cache is blown at 2**10 items, and the L2 cache at 2**16.

With the exception of “base,” the trees are all laid out in an array; “vEB” is the cache-oblivious version in which the tree uses a van Emde Boas layout: the top half of the tree is stored in the array, followed by each of the subtrees in left-to-right order, recursively. Unfortunately, constructing the tree dynamically is rather complicated.

So far I’m just exploring ground that’s already been covered, but I do think this is an interesting class of data structures that hasn’t trickled out into the real world yet.

A tale of two search trees

March 11th, 2010

As a warm up for a research topic I’m doing this semester, I am looking at static search tree performance. Consider the following tree fragment where nodes are numbered according to BFS order:

            1
      2            3
   4     5      6      7
  8 9  10 11  12 13  14 15

The nodes are located in memory wherever malloc() put them. An interesting question is whether a different in-memory layout has better search performance. One possible layout is in an array with the following order: 1,2,3,4,8,9,5,10,11,6,12,13,7,14,15 (this pattern is not original to me).

So far, the answer, surprisingly to me, is: not with this layout. With a benchmark of 10 million searches of a randomly-inserted (otherwise unbalanced) tree, I get:

Original Tree: 77s
Reordered Tree: 87s

The tree searches are exactly the same — only the memory layout changed. I can reduce these numbers quite a bit by using -O2 and reducing the size of tree nodes, but there’s still a consistent net loss to the array ordering.

I suspect my cache-emptying routines don’t, but if not, it will be interesting to see where I made the stupid mistake / mental lapse.