This article documents the steps taken with Ubuntu 8.04 and LTSP5 to do kernel debugging using a null-modem serial connection.
We have been experiencing random lock-ups on some ltsp thin clients There have been posts about random lock-ups on ltsp mailing lists and on #ltsp irc channel. As it became evident that our setups are not the only ones affected, we decided to dig deeper in to the issue.
Investigating these lock-ups has not been trivial as most of the time the thin clients run fine for weeks. And when they hang everything freezes – display, mouse, keyboard, network. Nothing responds to anything and nothing is sent to syslog. There has also been no pattern to the freezes. Sometimes it’s Evolution, sometimes Firefox. Sometimes it’s enough to click mouse once to freeze everything. Only thing in common seemed to be that if a particular java applet was running, freezes would happen more frequently.
At first all regular debugging methods were tried – remote syslog, running top through ssh, xrestop, etc – but nothing seemed to be wrong. There was plenty of free memory at all times, no errors logged on syslog and not even X was hogging memory. Updating to Ubuntu 8.10 kernel did not help either. After it became clear that syslog and friends would not help here, it was time to turn to kernel serial debugging. With no prior experience in this art, various howtos around the net were helpful. Thanks also to the helpful people on #ltsp on encouraging words.
The howto’s on the net are comprehensive, so I won’t go through all the possible options here. Ubuntu’s kernel has all the required options compiled in, so no kernel compiling is needed. For more information check for example the Remote Serial Console HOWTO at tldp.org.
So after equipping myself with a basic Belkin usb-serial adapter and a null-modem cable, I was ready for some kernel fun. As the kernel needs the console= parameter to define the serial port where everything is sent, the boot parameters under /var/lib/tftpboot/ltsp/i386/pxelinux.cfg/default needed some changes.
Normally the file looks something like this:
DEFAULT vmlinuz ro initrd=initrd.img quiet splash IPAPPEND 2
After adding the required parameters, it became like this:
DEFAULT vmlinuz ro initrd=initrd.img console=tty console=ttyS0,115200n8 IPAPPEND 2
Now when the thin client boots, pxelinux passes the console= parameters to the kernel and the kernel sends the output to /dev/ttyS0. There needs to be something on the other end listening to the thin clients kernel. This time I connected the usb-serial adapter to my laptop that recognized the adapter as /dev/ttyUSB0. As I remembered from the old days how minicom works, I turned to it. Starting it from command line to setup mode is done with command:
minicom -o -s
The serial port setting need to match the settings passed as kernel parameters:
+-----------------------------------------------------------------------+ | A - Serial Device : /dev/ttyUSB0 | | B - Lockfile Location : /var/lock | | C - Callin Program : | | D - Callout Program : | | E - Bps/Par/Bits : 115200 8N1 | | F - Hardware Flow Control : No | | G - Software Flow Control : No | | | | Change which setting? | +-----------------------------------------------------------------------+
After the settings are correct, the kernel on the other end should start answering to SysRq key commands. Turning up log level is good (ctrl-a f sends break command from minicom) so that the dump shows up automatically when the kernel crashes:
ctrl-a f 9
Now I was ready to see what happens when the terminal locks, but of course it didn’t. Being frustrated over too stable thin client didn’t feel right either. So finally after stressing the thin client in all possible ways, it froze and things started happening on the minicom screen:
[35774.932022] Modules linked in: ipv6 wmi video output sbs sbshc pci_slot container battery ac lp padlock_aes crypto_blkcipher aes_generic sg snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_mpu401_uart snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event evdev snd_seq snd_timer snd_seq_device snd pata_acpi ata_generic serio_raw parport_pc pata_via button parport snd_page_alloc i2c_viapro libata soundcore psmouse i2c_core shpchp scsi_mod pci_hotplug via_agp dock floppy agpgart pcspkr af_packet usbhid hid uhci_hcd ehci_hcd via_rhine mii usbcore r8169 thermal processor fan unionfs squashfs nbd fbcon tileblit font bitblit softcursor fuse
[35774.932022]
[35774.932022] Pid: 0, comm: swapper Not tainted (2.6.27-7-generic #1)
[35774.932022] EIP: 0060:[<c017708b>] EFLAGS: 00000246 CPU: 0
[35774.932022] EIP is at handle_IRQ_event+0x1b/0x80
[35774.932022] EAX: 0000000b EBX: de9ac8c0 ECX: de9ac8c0 EDX: de9ac8c0
[35774.932022] ESI: 0000000b EDI: c049a330 EBP: c04a7dec ESP: c04a7ddc
[35774.932022] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[35774.932022] CR0: 8005003b CR2: b6e44000 CR3: 1d961000 CR4: 00000690
[35774.932022] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[35774.932022] DR6: ffff0ff0 DR7: 00000400
[35774.932022] [<c0178d43>] handle_level_irq+0x83/0x100
[35774.932022] [<c037e61d>] ? _spin_lock+0xd/0x10
[35774.932022] [<c0106c15>] do_IRQ+0x45/0x80
[35774.932022] [<c0105003>] common_interrupt+0x23/0x30
[35774.932022] [<df865561>] ? rtl8169_rx_interrupt+0x11/0x4e0 [r8169]
[35774.932022] [<df868530>] rtl8169_poll+0x30/0xc0 [r8169]
[35774.932022] [<c02f4825>] net_rx_action+0xf5/0x230
[35774.932022] [<c0137682>] __do_softirq+0x92/0x120
[35774.932022] [<c013776d>] do_softirq+0x5d/0x60
[35774.932022] [<c01378e5>] irq_exit+0x55/0x90
[35774.932022] [<c0106c1a>] do_IRQ+0x4a/0x80
[35774.932022] [<c0179c5f>] ? rcu_needs_cpu+0x3f/0x50
[35774.932022] [<c0105003>] common_interrupt+0x23/0x30
[35774.932022] [<c011aba5>] ? native_safe_halt+0x5/0x10
[35774.932022] [<c010aaed>] default_idle+0x5d/0x60
[35774.932022] [<c010288d>] cpu_idle+0x7d/0x140
[35774.932022] [<c036edc3>] rest_init+0x53/0x60
[35774.932022] =======================
Next task is to figure out what the message from kernel actually means. Googling and searching Launchpad’s and kernel.org’s bug reports quickly revealed quite a few r8169 related bugs, but the exact cause for this particular freeze is still under investigation. Various patches touch the code in question and the dumps from lock-ups look different with newer kernel versions, but the bug is still there in Ubuntu’s kernel 2.6.27.18. We managed to nail some other freezes with the same method already, so this certainly seems to be an effective way of finding out why some ltsp thin clients are freezing.
Happy debugging!
Veli-Matti Lintu
Thanks for the useful article. I used to use minicom for serial connections too but always found it a bit of a pain. Then I found that GNU Screen (which I use all the time anyway) is a really handy serial console program. screen also has a configurable searchable history and can be easily told to log everything to command line, etc. etc.
Though in fairness, I suspect so can minicom, I just prefer screen for its other abilities 😉