Update on the Debian kernel bug.

It might not be caused by the kernel, but by the Xen hypervisor.  What I have done up to now:

  • I installed the problematic kernel in a virtual machine (DomU) while the host (Dom0) was running Jessie and thus a different kernel.  Within that environment, no problem occurred.
  • I reinstalled Wheezy on the machine, but this time I did not install Xen, and I ran exactly the same dd command.  The problem did not arise.  (I also simplified the disk setup and upgraded my BIOS to the latest version, for good measure.  It shouldn’t make much of a difference; I was surprised there was a new BIOS in the first place.)
  • Suspecting by then that the disk setup had been the cause (originally I had a 4-disk raid6 with one spare, and now I simply have a 2-disk raid1 with no spares), I installed Xen and rebooted.  When I tried the dd command, I got my oops again.

Conclusion:  the error only seems to occur on a Dom0 while using Xen.  It can be avoided by upgrading to Jessie.
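
Because the oops only shows up in a Dom0, any script that repeats the dd test should probably record where it is running.  A minimal sketch, assuming the usual Xen convention that /proc/xen/capabilities contains control_d in a Dom0 and is empty in a DomU; the function name xen_role and its optional path argument are made up for illustration:

```shell
# Print "dom0", "domU" or "none" depending on the Xen role of this system.
# Assumption: xenfs is mounted on /proc/xen and the capabilities file
# contains "control_d" only in a Dom0.
# The optional argument (hypothetical, handy for testing) overrides the path.
xen_role() {
    caps="${1:-/proc/xen/capabilities}"
    if [ ! -e "$caps" ]; then
        # No capabilities file at all: bare metal, or xenfs is not mounted.
        echo "none"
    elif grep -q control_d "$caps"; then
        echo "dom0"
    else
        echo "domU"
    fi
}
```

Calling xen_role without arguments checks the running system.  A missing file is reported as none, which also covers a guest that simply has no xenfs mounted, so that answer should be treated with some care.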

While that is good news for the new setup, it still implies that under no circumstances can I reboot hammerhead before mako is ready.  It might of course be linked to AMD-specific code, but I’m really not willing to take that risk.

I wonder if I should file a bug with the Debian kernel team.

Recently, I decided to reinstall my old self-built rack server (AMD A6-3650, 16GB RAM, Asus F1A75-V PRO).  It wasn’t really being used, and since I want to reconfigure my Dell R210-II, I decided the AMD should, at least temporarily, take over the Dell’s tasks.  Yes, I know it’s not real server hardware, and yes, I am thinking of buying another R2xx when I’ve got a bit of money to waste, which is not now.

So, I installed Debian Wheezy and the Xen hypervisor on it, as always using PXE, which means you end up with an installation that is fully up-to-date, unlike my other machines, which tend to have older kernels because I rarely see a reason to reboot.

Then, one of the first things I tried was to clone a disk over the network (dd if=/dev/vg0/vm-root | ssh root@mako "dd of=/dev/vg0/vm-root").  I have done this before, and I know it works.  It’s not the quickest way, but I had my reasons for doing it this way.  Thing is: I got a kernel oops.  While it is only an “oops”, it does make the system unstable, so a reboot is strongly recommended.

I thought it might be a fluke, so I tried again… Same thing.  So I removed the networking component and tried a simple dd if=/dev/zero of=/dev/vg0/big-lv bs=1073741824 and, yes, again a kernel oops.  It looks something like this:

Feb 11 23:00:08 mako kernel: [ 8450.177200] BUG: unable to handle kernel paging request at ffff88013f800000
Feb 11 23:00:08 mako kernel: [ 8450.177222] IP: [<ffffffff811b3e27>] clear_page_c+0x7/0x10
Feb 11 23:00:08 mako kernel: [ 8450.177237] PGD 1606067 PUD bdd89067 PMD bdf86067 PTE 0
Feb 11 23:00:08 mako kernel: [ 8450.177256] Oops: 0002 [#1] SMP 
Feb 11 23:00:08 mako kernel: [ 8450.177268] CPU 2 
Feb 11 23:00:08 mako kernel: [ 8450.177272] Modules linked in: fuse btrfs crc32c libcrc32c zlib_deflate ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext3 jbd ext2 efivars xen_gntdev xen_evtchn xenfs nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bridge stp loop radeon ttm drm_kms_helper eeepc_wmi snd_hda_codec_hdmi psmouse asus_wmi sparse_keymap snd_hda_intel snd_hda_codec rfkill drm snd_hwdep snd_pcm powernow_k8 mperf pl2303 snd_page_alloc power_supply serio_raw pcspkr i2c_piix4 evdev snd_timer k10temp wmi snd usbserial soundcore button processor thermal_sys ext4 crc16 jbd2 mbcache dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq md_mod usbhid hid sg sd_mod crc_t10dif r8169 mii ohci_hcd ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit i2c_core dca scsi_mod usbcore usb_common [last unloaded: scsi_wait_scan]
Feb 11 23:00:08 mako kernel: [ 8450.177604] 
Feb 11 23:00:08 mako kernel: [ 8450.177610] Pid: 4221, comm: sshd Not tainted 3.2.0-4-amd64 #1 Debian 3.2.65-1+deb7u1 System manufacturer System Product Name/F1A75-V PRO
Feb 11 23:00:08 mako kernel: [ 8450.177630] RIP: e030:[<ffffffff811b3e27>]  [<ffffffff811b3e27>] clear_page_c+0x7/0x10
Feb 11 23:00:08 mako kernel: [ 8450.177647] RSP: e02b:ffff8801f1f17b30  EFLAGS: 00010246
Feb 11 23:00:08 mako kernel: [ 8450.177658] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000200
Feb 11 23:00:08 mako kernel: [ 8450.177669] RDX: ffffea00045e4000 RSI: 0000000000000000 RDI: ffff88013f800000
Feb 11 23:00:08 mako kernel: [ 8450.177681] RBP: ffffea00045e4000 R08: 0000000000000000 R09: 00000000000401d7
Feb 11 23:00:08 mako kernel: [ 8450.177693] R10: 0000000000000002 R11: 0000000000000fc4 R12: 0000000000000000
Feb 11 23:00:08 mako kernel: [ 8450.177704] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8801f1f16000
Feb 11 23:00:08 mako kernel: [ 8450.177718] FS:  00007f7904fce7c0(0000) GS:ffff8803cb500000(0000) knlGS:0000000000000000
Feb 11 23:00:08 mako kernel: [ 8450.177734] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 11 23:00:08 mako kernel: [ 8450.177745] CR2: ffff88013f800000 CR3: 000000014239e000 CR4: 0000000000000660
Feb 11 23:00:08 mako kernel: [ 8450.177757] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 11 23:00:08 mako kernel: [ 8450.177769] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 11 23:00:08 mako kernel: [ 8450.177781] Process sshd (pid: 4221, threadinfo ffff8801f1f16000, task ffff8803b356e100)
Feb 11 23:00:08 mako kernel: [ 8450.177796] Stack:
Feb 11 23:00:08 mako kernel: [ 8450.177804]  ffffffff810bb8cd ffff8803cb515628 ffffea00045e4000 0000000000000000
Feb 11 23:00:08 mako kernel: [ 8450.177830]  00000001000280da ffffffff00000041 00000003caf73025 ffff8803cb72ac08
Feb 11 23:00:08 mako kernel: [ 8450.177856]  ffff8803cb72ac00 0000000081004f2f 0000000000000030 ffff8803cb72ac08
Feb 11 23:00:08 mako kernel: [ 8450.177882] Call Trace:
Feb 11 23:00:08 mako kernel: [ 8450.177894]  [<ffffffff810bb8cd>] ? get_page_from_freelist+0x57a/0x665
Feb 11 23:00:08 mako kernel: [ 8450.177907]  [<ffffffff810bbb3e>] ? __alloc_pages_nodemask+0x186/0x7ab
Feb 11 23:00:08 mako kernel: [ 8450.177921]  [<ffffffff810d1a97>] ? handle_pte_fault+0x298/0x79f
Feb 11 23:00:08 mako kernel: [ 8450.177933]  [<ffffffff81004e44>] ? pte_pfn_to_mfn+0x26/0x77
Feb 11 23:00:08 mako kernel: [ 8450.177945]  [<ffffffff8100569f>] ? __xen_set_pte+0x11/0x51
Feb 11 23:00:08 mako kernel: [ 8450.177957]  [<ffffffff810e6ee9>] ? alloc_pages_vma+0x12d/0x136
Feb 11 23:00:08 mako kernel: [ 8450.177969]  [<ffffffff810d1964>] ? handle_pte_fault+0x165/0x79f
Feb 11 23:00:08 mako kernel: [ 8450.177981]  [<ffffffff810cefaf>] ? pmd_val+0x7/0x8
Feb 11 23:00:08 mako kernel: [ 8450.177992]  [<ffffffff810cf02d>] ? pte_offset_kernel+0x16/0x35
Feb 11 23:00:08 mako kernel: [ 8450.178005]  [<ffffffff81353e74>] ? do_page_fault+0x320/0x345
Feb 11 23:00:08 mako kernel: [ 8450.178018]  [<ffffffff81095461>] ? arch_local_irq_save+0x11/0x15
Feb 11 23:00:08 mako kernel: [ 8450.178029]  [<ffffffff81095e17>] ? __call_rcu+0x21/0x12c
Feb 11 23:00:08 mako kernel: [ 8450.178041]  [<ffffffff8110b26f>] ? dput+0x27/0xee
Feb 11 23:00:08 mako kernel: [ 8450.178052]  [<ffffffff810fc21e>] ? fput+0x17a/0x1a1
Feb 11 23:00:08 mako kernel: [ 8450.178063]  [<ffffffff810eb3fb>] ? arch_local_irq_restore+0x7/0x8
Feb 11 23:00:08 mako kernel: [ 8450.178074]  [<ffffffff81351415>] ? page_fault+0x25/0x30
Feb 11 23:00:08 mako kernel: [ 8450.178084] Code: 20 4c 89 4c 24 48 c7 44 24 08 10 00 00 00 48 89 44 24 18 e8 8c f9 ff ff 48 83 c4 58 c3 90 90 90 90 90 90 90 b9 00 02 00 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 b9 00 10 00 00 31 c0 f3 aa c3 66 0f 
Feb 11 23:00:08 mako kernel: [ 8450.178270] RIP  [<ffffffff811b3e27>] clear_page_c+0x7/0x10
Feb 11 23:00:08 mako kernel: [ 8450.178283]  RSP <ffff8801f1f17b30>
Feb 11 23:00:08 mako kernel: [ 8450.178291] CR2: ffff88013f800000
Feb 11 23:00:08 mako kernel: [ 8450.178436] ---[ end trace c0e1c75d9283be10 ]---
Feb 11 23:00:08 mako kernel: [ 8450.178466] note: sshd[4221] exited with preempt_count 1
Feb 11 23:00:08 mako kernel: [ 8450.178971] BUG: scheduling while atomic: sshd/4221/0x10000001
Feb 11 23:00:08 mako kernel: [ 8450.179002] Modules linked in: fuse btrfs crc32c libcrc32c zlib_deflate ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext3 jbd ext2 efivars xen_gntdev xen_evtchn xenfs nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bridge stp loop radeon ttm drm_kms_helper eeepc_wmi snd_hda_codec_hdmi psmouse asus_wmi sparse_keymap snd_hda_intel snd_hda_codec rfkill drm snd_hwdep snd_pcm powernow_k8 mperf pl2303 snd_page_alloc power_supply serio_raw pcspkr i2c_piix4 evdev snd_timer k10temp wmi snd usbserial soundcore button processor thermal_sys ext4 crc16 jbd2 mbcache dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq md_mod usbhid hid sg sd_mod crc_t10dif r8169 mii ohci_hcd ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit i2c_core dca scsi_mod usbcore usb_common [last unloaded: scsi_wait_scan]
Feb 11 23:00:08 mako kernel: [ 8450.181496] Pid: 4221, comm: sshd Tainted: G      D      3.2.0-4-amd64 #1 Debian 3.2.65-1+deb7u1
Feb 11 23:00:08 mako kernel: [ 8450.181530] Call Trace:
Feb 11 23:00:08 mako kernel: [ 8450.181560]  [<ffffffff8134a2be>] ? __schedule_bug+0x3e/0x52
Feb 11 23:00:08 mako kernel: [ 8450.181591]  [<ffffffff8134f4a5>] ? __schedule+0x85/0x610
Feb 11 23:00:08 mako kernel: [ 8450.181621]  [<ffffffff8110b26f>] ? dput+0x27/0xee
Feb 11 23:00:08 mako kernel: [ 8450.181652]  [<ffffffff81042090>] ? __cond_resched+0x1d/0x26
Feb 11 23:00:08 mako kernel: [ 8450.181682]  [<ffffffff8134fa7f>] ? _cond_resched+0x12/0x1c
Feb 11 23:00:08 mako kernel: [ 8450.181713]  [<ffffffff81049a2a>] ? put_files_struct+0x65/0xad
Feb 11 23:00:08 mako kernel: [ 8450.181743]  [<ffffffff8104a02c>] ? do_exit+0x292/0x713
Feb 11 23:00:08 mako kernel: [ 8450.181774]  [<ffffffff8107130f>] ? arch_local_irq_disable+0x7/0x8
Feb 11 23:00:08 mako kernel: [ 8450.184239]  [<ffffffff81071307>] ? arch_local_irq_restore+0x7/0x8
Feb 11 23:00:08 mako kernel: [ 8450.184271]  [<ffffffff81350e3f>] ? _raw_spin_unlock_irqrestore+0xe/0xf
Feb 11 23:00:08 mako kernel: [ 8450.184304]  [<ffffffff81048345>] ? kmsg_dump+0x52/0xdd
Feb 11 23:00:08 mako kernel: [ 8450.184336]  [<ffffffff81350e3f>] ? _raw_spin_unlock_irqrestore+0xe/0xf
Feb 11 23:00:08 mako kernel: [ 8450.184368]  [<ffffffff81351d14>] ? oops_end+0xb1/0xb6
Feb 11 23:00:08 mako kernel: [ 8450.184399]  [<ffffffff81349d8b>] ? no_context+0x1ff/0x20e
Feb 11 23:00:08 mako kernel: [ 8450.184430]  [<ffffffff81349619>] ? pmd_val+0x7/0x8
Feb 11 23:00:08 mako kernel: [ 8450.184460]  [<ffffffff81349638>] ? pte_offset_kernel+0x16/0x35
Feb 11 23:00:08 mako kernel: [ 8450.184491]  [<ffffffff81353d0a>] ? do_page_fault+0x1b6/0x345
Feb 11 23:00:08 mako kernel: [ 8450.184522]  [<ffffffff81004e44>] ? pte_pfn_to_mfn+0x26/0x77
Feb 11 23:00:08 mako kernel: [ 8450.184553]  [<ffffffff81004375>] ? __raw_callee_save_xen_make_pte+0x11/0x1e
Feb 11 23:00:08 mako kernel: [ 8450.184584]  [<ffffffff81351415>] ? page_fault+0x25/0x30
Feb 11 23:00:08 mako kernel: [ 8450.184615]  [<ffffffff811b3e27>] ? clear_page_c+0x7/0x10
Feb 11 23:00:08 mako kernel: [ 8450.184646]  [<ffffffff810bb8cd>] ? get_page_from_freelist+0x57a/0x665
Feb 11 23:00:08 mako kernel: [ 8450.184677]  [<ffffffff810bbb3e>] ? __alloc_pages_nodemask+0x186/0x7ab
Feb 11 23:00:08 mako kernel: [ 8450.184709]  [<ffffffff810d1a97>] ? handle_pte_fault+0x298/0x79f
Feb 11 23:00:08 mako kernel: [ 8450.184739]  [<ffffffff81004e44>] ? pte_pfn_to_mfn+0x26/0x77
Feb 11 23:00:08 mako kernel: [ 8450.184770]  [<ffffffff8100569f>] ? __xen_set_pte+0x11/0x51
Feb 11 23:00:08 mako kernel: [ 8450.184800]  [<ffffffff810e6ee9>] ? alloc_pages_vma+0x12d/0x136
Feb 11 23:00:08 mako kernel: [ 8450.184831]  [<ffffffff810d1964>] ? handle_pte_fault+0x165/0x79f
Feb 11 23:00:08 mako kernel: [ 8450.184862]  [<ffffffff810cefaf>] ? pmd_val+0x7/0x8
Feb 11 23:00:08 mako kernel: [ 8450.184892]  [<ffffffff810cf02d>] ? pte_offset_kernel+0x16/0x35
Feb 11 23:00:08 mako kernel: [ 8450.184922]  [<ffffffff81353e74>] ? do_page_fault+0x320/0x345
Feb 11 23:00:08 mako kernel: [ 8450.184954]  [<ffffffff81095461>] ? arch_local_irq_save+0x11/0x15
Feb 11 23:00:08 mako kernel: [ 8450.184984]  [<ffffffff81095e17>] ? __call_rcu+0x21/0x12c
Feb 11 23:00:08 mako kernel: [ 8450.185014]  [<ffffffff8110b26f>] ? dput+0x27/0xee
Feb 11 23:00:08 mako kernel: [ 8450.185044]  [<ffffffff810fc21e>] ? fput+0x17a/0x1a1
Feb 11 23:00:08 mako kernel: [ 8450.185074]  [<ffffffff810eb3fb>] ? arch_local_irq_restore+0x7/0x8
Feb 11 23:00:08 mako kernel: [ 8450.185105]  [<ffffffff81351415>] ? page_fault+0x25/0x30
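
The parts of such a report that matter most when comparing runs are the BUG: headline, which carries the faulting address (the same value as CR2), the Oops: line, and the RIP: line, which names the function that faulted (here clear_page_c).  A small sketch for pulling those out of a saved log; oops_summary is a hypothetical helper name, and the log path mentioned afterwards is an assumption about where the kernel messages end up:

```shell
# Print the headline lines of kernel oops reports found in a syslog-style file.
# oops_summary is a hypothetical helper; $1 is the path of the log file.
oops_summary() {
    # Keep only the "BUG:", "Oops:" and "RIP:" lines of each report ...
    grep -E 'BUG: |Oops: | RIP: ' "$1" |
        # ... and strip the "date host kernel: [ uptime]" prefix from them.
        sed -E 's/^.*kernel: \[ *[0-9.]+\] //'
}
```

For example, oops_summary /var/log/kern.log would print the matching lines without the syslog prefix, which makes it easier to see whether two crashes hit the same function.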

Okay, a Debian stable kernel causing kernel oopses?  Nah, can’t be…  Damn, the memory is probably broken.  So an overnight memtest86+ run is scheduled, and in the morning it tells me everything is just fine.

At this point, I worry that it truly is a kernel bug.  I check my other machines: none of them is running 3.2.65-1+deb7u1, but all of them already have it installed.  Unlike Ubuntu, Debian doesn’t seem to amass old kernels in /boot.  I would have tried using an older kernel, but somehow I didn’t find the magic apt invocation to do so.
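
For reference, apt can install a specific version with the package=version syntax, but only if a configured source still carries that version.  On Wheezy the kernel package keeps the name linux-image-3.2.0-4-amd64 across 3.2.x updates, which is why a newer version replaces the older one instead of sitting next to it in /boot.  A hedged sketch, not tested on Wheezy: kernel_downgrade_cmds is a hypothetical helper that only prints the commands for review, and OLDER_VERSION in the usage note is a placeholder, not a real package version:

```shell
# Print, without running them, the apt commands for installing a specific
# version of a package and keeping it from being upgraded again.
# kernel_downgrade_cmds is a hypothetical helper; $1 = package, $2 = version.
kernel_downgrade_cmds() {
    pkg="$1"
    ver="$2"
    # Show which versions apt can currently see for the package.
    echo "apt-cache policy $pkg"
    # Install exactly the requested version (package=version syntax).
    echo "apt-get install $pkg=$ver"
    # Put the package on hold so the next upgrade leaves it alone.
    echo "apt-mark hold $pkg"
}
```

Running kernel_downgrade_cmds linux-image-3.2.0-4-amd64 OLDER_VERSION prints the three commands, which can then be run as root one by one.  If apt-cache policy lists only the current version, the older package has to come from somewhere else, for instance the snapshot.debian.org archive.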

I still wanted to verify whether it’s the kernel causing this, so I upgraded the AMD machine to Jessie.  After that, I ran the same tests as on the original install, and everything worked exactly as expected.  No more oops.

I also realize that my other machines are a reboot away from instability!  Scary thought.  Now, I’ll probably just trash the system, try Wheezy again and see whether the problem comes back.  If so, it must be a kernel bug.  The question is whether I should report it to the Debian kernel team.  I’m not sure I can really help them; I could also just ignore it and go with Jessie (getting rid of systemd isn’t all that hard on a server, as I found out today).