Serial offender

There comes a time in every budding kernel developer’s life when he has to debug a mysterious lockup, and nothing will do but a serial console. For my own future reference, here’s how to set it up:

  1. Get out your handy pl2303-based USB-to-serial adapter, because chances are good your laptop doesn’t have a serial port
  2. Build your kernel with CONFIG_USB_SERIAL_CONSOLE=y (the USB serial core and the pl2303 driver have to be built in, not modules, for this to work)
  3. Add to your kernel command line: console=ttyUSB0,115200 console=tty0
  4. Hook your computer up to the other computer via a null-modem cable (man, these are pricey these days, $30 for something no one still uses?)
  5. On the other computer, set up minicom to use its serial port, say ttyS0, at 115200 baud, 8N1, and turn off all the modem init strings
  6. Don’t bother futzing with getty; you only need it if you also want to allow logins over serial. For logging, it’s unnecessary
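If minicom isn't handy on the capturing machine, the same 115200 8N1 setup can be done with plain termios. This is only a rough sketch: the function names set_115200_8n1 and log_serial are made up for illustration, and the port path (something like /dev/ttyS0) is an assumption about your hardware.

```c
#define _DEFAULT_SOURCE /* for cfmakeraw() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: set *tio to raw mode, 115200 baud, 8 data bits, no parity, 1 stop bit. */
void set_115200_8n1(struct termios *tio)
{
    assert(tio != NULL);
    cfmakeraw(tio);                             /* no echo, no line editing */
    cfsetispeed(tio, B115200);
    cfsetospeed(tio, B115200);
    tio->c_cflag &= ~(PARENB | CSTOPB | CSIZE); /* N and 1 */
    tio->c_cflag |= CS8 | CLOCAL | CREAD;       /* 8, and ignore modem control lines */
}

/* Hypothetical helper: open the port, configure it, and copy everything received to stdout. */
int log_serial(const char *path)
{
    struct termios tio;
    char buf[256];
    ssize_t n;
    int fd = open(path, O_RDONLY | O_NOCTTY);

    if (fd < 0)
        return -1;
    if (tcgetattr(fd, &tio) < 0) {
        close(fd);
        return -1;
    }
    set_115200_8n1(&tio);
    if (tcsetattr(fd, TCSANOW, &tio) < 0) {
        close(fd);
        return -1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        fwrite(buf, 1, (size_t)n, stdout);
        fflush(stdout);                         /* show the oops as soon as it arrives */
    }
    close(fd);
    return 0;
}
```

Calling log_serial("/dev/ttyS0") from a small main() and redirecting stdout to a file gives a capture log without any modem init strings to worry about.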


Now, start minicom on computer 2 and reboot your computer under test. If all goes well, you’ll capture a panic on the serial console. If all goes poorly (my case), you’ll have a lockup with no oops. The usual thing to try in this case is adding “nmi_watchdog=1” to the command line, which will use the non-maskable interrupt to break into any frozen code. Also, if you have CONFIG_DETECT_SOFTLOCKUP set, hopefully after 60 seconds or so you’ll get a soft lockup warning.

In my case, I still have a hard lock with no output. Ho hum.

Prebama

So it’s inauguration eve in DC. You can tell because all of the subway ads have some ‘Welcome to DC’ theme, and there are portable toilets spread all throughout the city. However, the most obvious sign of the new administration is all the utter crap you can buy with the First Family’s picture on it: key chains, mugs, postcards, buttons, playing cards, shirts, hats, knit caps, underwear and neckties. Radio Shack is even advertising: “Get your inauguration supplies here!” (you know, in case you need some speaker wire for the weekend). I hope he gets royalties somehow. Angeline and I are planning on braving the crowds tomorrow to hang with the groundlings in the non-ticketed section. We’ll see how that goes.

In hacking news, I have the following patches queued so far for 2.6.29:

Bob Copeland (12):
mac80211: fix a few typos in mac80211 kernel doc
ath9k: remove useless conditional
ath5k: fix keytable type buglet in ath5k_hw_reset_key
ath5k: enable hardware encryption for WEP
ath5k: update keycache to support TKIP handling
ath5k: set mac address in add_interface
ath5k: preserve higher order bits when setting mac address
ath5k: clean up ath5k_hw_set_key
ath5k: enable combined michael mic in key cache
ath5k: fix endianness of bitwise ops when installing mic
ath5k: correct packet length in tx descriptors
ath5k: fix return values from ath5k_tx

Basically, hardware crypto support, nothing else notable. In my unbaked tree, I have the mac80211 suspend/resume support patches (pushed today), some fixes for mixed b/g networking, and some silly LED patches. Most of that is 2.6.30 material.

In other news, glibc finally has endianness functions. I can’t say that I’m crazy about the names, and it has a bit of unnecessary Not Invented Here, but at least it gives an alternative to always using my own or using glib.
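For reference, the new functions live in <endian.h> and follow the pattern htobe32(), be64toh(), htole16() and so on. Here's a small sketch of how they might be used to pull values out of a byte buffer; get_be64 and put_le32 are names made up for this example, not part of glibc.

```c
#define _DEFAULT_SOURCE /* expose the <endian.h> conversion functions on glibc */
#include <assert.h>
#include <endian.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a big-endian 64-bit value from a (possibly unaligned) buffer. */
uint64_t get_be64(const unsigned char *p)
{
    uint64_t raw;

    assert(p != NULL);
    memcpy(&raw, p, sizeof(raw)); /* sidesteps alignment and aliasing trouble */
    return be64toh(raw);
}

/* Hypothetical helper: store a 32-bit value into a buffer in little-endian order. */
void put_le32(unsigned char *p, uint32_t v)
{
    uint32_t raw = htole32(v);

    assert(p != NULL);
    memcpy(p, &raw, sizeof(raw));
}
```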

Old code

The projects section of my webpage got a few tweaks last night. Namely, I resurrected the rigid body simulator back to more-or-less compiling state (what a pile of crap code!) and put it back on the internets. The i-collide library may need a few Makefile tweaks to run on anything newer than RedHat 4. I ran it last night, then I realized GL-over-remote-X wasn’t working on Windows. So much for that. It’s super fast on modern hardware though.

Hacking, the good kind

I could write about the election here, but citizen905 already summed it up pretty well. So instead, here’s what I’ve been breaking in the Linux kernel lately:

  • My final patch count for 2.6.27 was 14, I think. Enough, anyway, that I can stop counting and just deal with all the work I’ve created for myself.
  • I added myself to MAINTAINERS for ath5k, which felt like a pretty ridiculous notoriety grab, but Nick asked me to do so twice, so there.
  • I have some fixes for ath5k for 2.6.28, nothing major but an oops should be fixed, and a WARN_ON removed. The oops fix, incidentally, had an obvious bug despite 3 sign-offs. I suck.
  • Also committed but to-be-reverted for suckiness is a patch to remove beaconing in STA mode. Turns out ath9k, from which I stole this idea, was just busted. The new plan is to use the beacon miss interrupt; until then, your wireless card has to wake up the CPU about 100 times a second.
  • For 2.6.29, I have added hardware encryption to ath5k and hopefully will get some time to hack on the suspend/resume support for mac80211. Then I have some omfs patches I’ve been sitting on for months.

SYSRQ on MacBook

Lately I’ve really needed SysRq in situations where /proc/sysrq-trigger just doesn’t do the job, and my MacBook is missing lots of crusty old XT-era keys. Finally, I know how to do this!

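#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/input.h>

#define USAGE_CODE 0x070044 /* USB HID usage code for F11 */

int main(void)
{
    int codes[2];
    int fd = open("/dev/input/by-id/usb-Apple_Computer_Apple_"
                  "Internal_Keyboard_._Trackpad-event-kbd",
                  O_RDONLY | O_NONBLOCK);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    codes[0] = USAGE_CODE; /* scancode to remap */
    codes[1] = KEY_SYSRQ;  /* from linux/input.h */
    if (ioctl(fd, EVIOCSKEYCODE, codes) < 0) {
        perror("EVIOCSKEYCODE");
        return 1;
    }
    return 0;
}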

Awesome. Supposedly, a tool called keyfuzz is also efficacious.

OSS, I has it

I just sat in on a conference call as a representative (by default, since no one else called in) of the Linux ath5k community, with Atheros, makers of my MacBook’s wireless ethernet card. Atheros have really done a 180 for supporting the community, first by releasing ath9k, then by releasing the source to their previously-closed HAL last week. Thanks to that, 6 patches have already gone out fixing various problems. BTW, conference calls are just as pointless in the OSS community as they are in real life. But at least I did learn that it is pronounced “uh-THERE-ose”, not “ATH-er-ose.”

Buy laptops with Atheros wireless cards!

Oops

I am finally getting the hang of debugging kernel crashes. None too soon, as I got my first OOPS report from the -rc kernel with OMFS, from a gentleman who is intentionally corrupting his FS (“fuzzing” in the infosec lingo). After a frustrating weekend in which I had inadvertently fixed the bug but didn’t realize it because I was testing the wrong module, I can now claim success. One down, several more to go.

Detective work after the jump if you care for the nerdy stuff.
Oops report:


BUG: unable to handle kernel paging request at c978e004
IP: [<c032298e>] omfs_readdir+0x18e/0x32f
Oops: 0000 [#1] PREEMPT DEBUG_PAGEALLOC
[...]
EIP: 0060:[<c032298e>] EFLAGS: 00010287 CPU: 0
EIP is at omfs_readdir+0x18e/0x32f
EAX: c978d000 EBX: 00000000 ECX: cbfcfaf8 EDX: cb2cf100
ESI: 00001000 EDI: 00000800 EBP: cb2d3f68 ESP: cb2d3f0c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[...]
[<c018a820>] ? filldir64+0x0/0xcd
[<c018a9f2>] ? vfs_readdir+0x56/0x82
[<c018a820>] ? filldir64+0x0/0xcd
[<c018aa7c>] ? sys_getdents64+0x5e/0xa0
[<c01038bd>] ? sysenter_do_call+0x12/0x31
=======================
Code: 00 89 f0 89 f3 0f ac f8 14 81 e3 ff ff 0f 00 48 8d
14 c5 b8 01 00 00 89 45 cc 89 55 f0 e9 8c 01 00 00 8b 4d c8 8b 75 f0 8b
41 18 <8b> 54 30 04 8b 04 30 31 f6 89 5d dc 89 d1 8b 55 b8 0f c8 0f c9

First step is to look at the faulting instruction. Running the “Code:” part through ~/linux/scripts/decodecode yields the disassembly:


8b 4d c8             	mov    -0x38(%ebp),%ecx
8b 75 f0             	mov    -0x10(%ebp),%esi
8b 41 18             	mov    0x18(%ecx),%eax
8b 54 30 04          	mov    0x4(%eax,%esi,1),%edx <=== here
8b 04 30             	mov    (%eax,%esi,1),%eax
31 f6                	xor    %esi,%esi

So the instruction is dereferencing the address eax + esi*1 + 4. From the register dump, EAX=c978d000, which looks like a pointer. ESI is 00001000, which is probably an index into an array. 0x1000 happens to be PAGE_SIZE, so the access lands just past the end of the page that EAX points to: c978d000 + 1000 + 4 = c978e004, exactly the address in the “kernel paging request” line at the top of the oops.
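The arithmetic is easy to double-check in a few lines of C. This is just a sketch: effective_address, outside_page and OOPS_PAGE_SIZE are names invented here, and it assumes the buffer at EAX occupies the single 4 KiB page that starts there.

```c
#include <assert.h>
#include <stdint.h>

#define OOPS_PAGE_SIZE 0x1000u /* assumed 4 KiB pages, as on 32-bit x86 */

/* Effective address of an AT&T-syntax operand disp(base,index,scale). */
uint32_t effective_address(uint32_t base, uint32_t index, uint32_t scale, uint32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + disp;
}

/* Nonzero if addr falls outside the page that starts at page_base. */
int outside_page(uint32_t page_base, uint32_t addr)
{
    return addr - page_base >= OOPS_PAGE_SIZE;
}
```

Plugging in the register dump, effective_address(0xc978d000, 0x1000, 1, 4) comes out to 0xc978e004, which outside_page() reports as being beyond the page at 0xc978d000.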

Next, let’s look at the C code. There are two ways:


$ gdb omfs.ko
(gdb) l *(omfs_readdir+0x18e)

Or (and I find this a little more obvious since it has mixed C and assembly):


$ objdump -S omfs.ko > foo.S
# now look for instruction opcodes in foo.S: "8b 54 30 04"

From the output of either command, it’s apparent that the +4 displacement in the instruction comes from be64_to_cpu(): on 32-bit x86, a 64-bit big-endian value is loaded as two 32-bit halves (at offsets 0 and 4), and each half is then byte-swapped. We do that when reading directory pointers in omfs_readdir, specifically:


fsblock = be64_to_cpu(*((__be64 *) &bh->b_data[offset]));
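To see where the two loads and the trailing bswap instructions (0f c8, 0f c9) come from, here is roughly what that conversion boils down to on a little-endian 32-bit machine. It's a userspace illustration, not the kernel's actual implementation; swap32 and be64_from_halves are made-up names, and the code assumes a little-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-swap one 32-bit word, which is what the bswap instruction does. */
uint32_t swap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
           ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Decode a big-endian 64-bit value using two native 32-bit loads (little-endian host assumed). */
uint64_t be64_from_halves(const unsigned char *p)
{
    uint32_t at0, at4;

    assert(p != NULL);
    memcpy(&at0, p, 4);     /* like: mov (%eax,%esi,1),%eax    */
    memcpy(&at4, p + 4, 4); /* like: mov 0x4(%eax,%esi,1),%edx */
    /* The word at offset 0 holds the high half, the word at offset 4 the low half. */
    return ((uint64_t)swap32(at0) << 32) | swap32(at4);
}
```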

EAX is bh->b_data so ESI must be offset. I happen to know it should never be above 2048, but it is 4096 in the register dump. Since the range is ultimately controlled by the directory inode size, I immediately suspected that that size got corrupted. For some reason I chased a bunch of other dead ends until I finally did look at the disk image and saw that the directory size was all wrong. Rule one of debugging: go with your gut.
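The general shape of the defense is to validate the offset against the block size before touching the buffer. The sketch below is only an illustration of that idea in userspace C, not the actual omfs fix; read_entry_checked and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fetch the big-endian 64-bit entry at `offset`, but only
 * if all eight bytes lie inside the `size`-byte block.
 * Returns 0 on success, -1 if the access would run off the end.
 */
int read_entry_checked(const unsigned char *data, size_t size, size_t offset, uint64_t *out)
{
    uint64_t v = 0;
    size_t i;

    assert(data != NULL && out != NULL);
    if (size < 8 || offset > size - 8)
        return -1; /* corrupt size or offset: refuse to dereference */
    for (i = 0; i < 8; i++)
        v = (v << 8) | data[offset + i]; /* most significant byte first */
    *out = v;
    return 0;
}
```

With a check like this, an offset of 4096 into a 2048-byte table gets rejected instead of walking off the end of the page.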

Oh well. I guess all that assembly coding from years ago was useful after all.

Merged


$ git log --author="Bob Copeland" v2.6.26..master | git shortlog

Bob Copeland (10):
ath5k: Fix loop variable initializations
ath5k: convert LED code to use mac80211 triggers
omfs: add filesystem documentation
omfs: define filesystem structures
omfs: add inode routines
omfs: add directory routines
omfs: add file routines
omfs: add bitmap routines
omfs: update kbuild to include OMFS
omfs: add MAINTAINERS entry

Woot! I had an 11th patch, for ath5k, but the maintainer fixed it independently. Very nice to finally get omfs in and not have to maintain that sucker out of tree.

New Banshee Plugin

Thanks to a few hours of hacking during the holiday weekend, I have a new version of the Karma plugin for Banshee (and a newer version of omfs too). I’m too lazy to write my own release notes so read someone else’s! Thanks, Ben.