Sometimes XenServer can be a real pain in the neck.
Here's what I ran into a little earlier:
I upgrade my VM's kernel and reboot. Wait for a while, but the VM doesn't appear to come up. So I go to my Logs and lo and behold:
Ugh, wat?
OK, let's see if we can start it manually. Worst case, the command line gives us more info, right?
[root@xen]# xe vm-start vm="My VM's Name"
The bootloader returned an error
vm: .... (My VM's Name)
msg: Unable to find partition containing kernel
Oh, how informative. Luckily XenServer provides us with an off-line VM boot editor. That should do the trick:
[root@xen]# xe-edit-bootloader -n "My VM's Name" -p 1
Plugging VBD: Creating dom0 VBD: ...
add map ...1 (252:5): 0 1048576 linear /dev/sm/backend/.../... 2048
add map ...2 (252:6): 0 33554432 linear /dev/sm/backend/.../... 1050624
add map ...3 (252:7): 0 1886386176 linear /dev/sm/backend/.../... 34605056
Waiting for /dev/mapper/...1: .....
Device /dev/mapper/...1 not found.
You must specify the correct partition number with -p
Unplugging VBD: . done
This should have worked. It's usually as good as logging into the VM and editing the GRUB menu.lst yourself. What on earth happened here? Well, I'm pretty sure I know my systems and my partitions so I'm sure the /boot partition is the first one!
What now? Well, for sure we'll need to boot from a rescue disk. So, make sure your NFS ISO library is online, or that you already have a storage repository (SR) handy with a rescue CD .iso on it. No? We don't? Well, no reason to panic. We'll just create a directory named "/rescue/ISOs", download a Debian Live CD there and create an SR with the name label "RESCUE" that points to that directory:
[root@xen]# mkdir -p /rescue/ISOs
[root@xen]# cd /rescue/ISOs
[root@xen]# wget http://cdimage.debian.org/debian-cd/current-live/amd64/iso-hybrid/debian-live-7.5.0-amd64-rescue.iso
[root@xen]# xe sr-create name-label=RESCUE type=iso device-config:legacy_mode=true device-config:location=/rescue/ISOs content-type=iso
303bd588-c675-2bc0-8982-8a691141226a
Please note that you need to create this SR on a partition with plenty of disk space. The Dom0's default partition is just too small, so I do not recommend keeping any rescue ISOs on it.
Cool. Now let's select that ISO as our DVD disk on our VM and finally boot off of it. The latter task can be done by going to VM -> Start/Shut Down -> Start in Recovery Mode.
Now that everything's started up, let's see what's going on. As I already mentioned, my /boot partition is my first partition so let's mount it and inspect the grub.conf file:
root@debian:~# cat /proc/partitions
major minor  #blocks  name
 202     0  960495616 xvda
 202     1     524288 xvda1
 202     2   16777216 xvda2
 202     3  943193088 xvda3
  11     0     696320 sr0
   7     0     548992 loop0
root@debian:~# mkdir /a
root@debian:~# mount /dev/xvda1 /a
root@debian:~# vi /a/grub/grub.conf
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE: You have a /boot partition. This means that
# all kernel and initrd paths are relative to /boot/, eg.
# root (hd0,0)
# kernel /vmlinuz-version ro root=/dev/xvda3
# initrd /initrd-[generic-]version.img
#boot=/dev/xvda
default=1
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Oracle Linux Server (3.8.13-35.1.2.el6uek.x86_64)
root (hd0,0)
kernel /vmlinuz-3.8.13-35.1.2.el6uek.x86_64 ro root=UUID=e9b3edd9-15e0-4cfd-a8fa-7dc24f6aeefa rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD console=hvc0 KEYTABLE=us SYSFONT=latarcyrheb-sun16 rd_NO_LVM rd_NO_DM rhgb quiet
initrd /initramfs-3.8.13-35.1.2.el6uek.x86_64.img
Now, we have two options really:
a) Solving this through XenServer. We would have to issue:
xe vm-param-set uuid=vm-uuid PV-bootloader-args="--kernel=/vmlinuz-3.8.13-35.1.2.el6uek.x86_64 --ramdisk=/initramfs-3.8.13-35.1.2.el6uek.x86_64.img"
xe vm-param-set uuid=vm-uuid PV-args="root=root-device ro quiet"
Since I would have to change this every time I upgrade the kernel, and I really want to find out what went wrong, I'll pass on this option for now.
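If you do go down this road, the vm-uuid placeholder has to come from somewhere. Here's a rough sketch that looks the UUID up by name label and then issues the same two commands as above. set_pv_boot is a name I made up for illustration, and the sketch assumes the name label matches exactly one VM (with --minimal, several matches come back as a comma-separated list, which is not handled here).

```shell
#!/bin/sh
# set_pv_boot VM_NAME KERNEL RAMDISK ROOT_DEVICE
# Hypothetical helper (not part of XenServer): wraps the two
# "xe vm-param-set" calls from option a) around a UUID lookup.
set_pv_boot() {
    name=$1
    kernel=$2
    ramdisk=$3
    rootdev=$4

    # Resolve the VM's UUID from its name label.
    uuid=$(xe vm-list name-label="$name" params=uuid --minimal)
    if [ -z "$uuid" ]; then
        echo "no VM found with name label: $name" >&2
        return 1
    fi

    # Tell the bootloader which kernel and ramdisk to load...
    xe vm-param-set uuid="$uuid" \
        PV-bootloader-args="--kernel=$kernel --ramdisk=$ramdisk"
    # ...and pass the kernel its root device.
    xe vm-param-set uuid="$uuid" PV-args="root=$rootdev ro quiet"
}
```

On the XenServer host this would be called as set_pv_boot "My VM's Name" /vmlinuz-3.8.13-35.1.2.el6uek.x86_64 /initramfs-3.8.13-35.1.2.el6uek.x86_64.img followed by your root device. It still has to be re-run after every kernel upgrade, which is exactly why I'm not keen on it.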
b) Trying to debug what's wrong with grub.conf.
Since the kernel is recent enough that it certainly supports running under XenServer, it should just be a matter of patience to debug it.
Here's what's usually wrong with grub.conf/menu.lst:
i ) The root (hdx,y) line is wrong:
x should point to the number of the hard drive that holds our boot partition;
y should point to the partition number of our boot partition.
Both are counted from 0, so in this case root (hd0,0), the first partition of the first disk, is correct.
ii ) The paths of vmlinuz-... and/or initramfs-... are wrong.
The paths should be relative to the root directory of the partition that holds them. So if the boot directory lives on a dedicated partition, they should be /vmlinuz-... and /initramfs-... but if the boot directory is on the same partition as the linux root (/) directory, they should be /boot/vmlinuz-... and /boot/initramfs-...
iii) The root directive is wrong.
Here I mean the root directive that defines the linux root (/) directory, and not the boot partition which has been already declared with the root(hdx,y) statement. It could be root=/dev/xvda3 for instance. In my case it is root=UUID=e9b3edd9-15e0-4cfd-a8fa-7dc24f6aeefa.
The three cases above can be easily examined with a simple ls command on the /a directory and a blkid /dev/xvda3 or ls -l /dev/disk/by-uuid to find if the UUID of the device that hosts our root (/) directory is the correct one.
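If you'd rather not eyeball it, the path and root checks can be sketched as a couple of small shell functions to run from the rescue system. check_grub_paths and grub_root_args are names I made up for illustration, and the sketch assumes kernel and initrd paths are written without a (hdx,y) device prefix, as they are in my grub.conf.

```shell
#!/bin/sh
# check_grub_paths MOUNTPOINT CONF
# Hypothetical helper: checks that every kernel/initrd path named in
# CONF exists under MOUNTPOINT (where the boot partition is mounted).
check_grub_paths() {
    mnt=$1
    conf=$2
    status=0
    # The second field of each "kernel" or "initrd" line is the path
    # GRUB will try to load. Commented-out lines start with "#", so
    # their first field never matches.
    for path in $(awk '$1 == "kernel" || $1 == "initrd" { print $2 }' "$conf"); do
        if [ -f "$mnt$path" ]; then
            echo "found:   $path"
        else
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# grub_root_args CONF
# Hypothetical helper: prints each distinct root=... argument found on
# the kernel lines, to compare by hand against blkid output.
grub_root_args() {
    awk '$1 == "kernel" {
        for (i = 3; i <= NF; i++) if ($i ~ /^root=/) print $i
    }' "$1" | sort -u
}
```

With the boot partition mounted on /a, that would be check_grub_paths /a /a/grub/grub.conf, then grub_root_args /a/grub/grub.conf and a look at blkid /dev/xvda3 to see whether the UUIDs agree.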
Another possibility is that the boot partition itself has been corrupted, in which case we just umount /a and run fsck /dev/xvda1.
In my case, as you can see, the error was in the default setting. Default specifies which menu item boots once the user-interaction timeout expires, and the count starts from 0, not from 1. The grub.conf above has a single title entry, which is entry 0, so default=1 pointed at an entry that doesn't exist and my system couldn't boot. Changing it to 0 did the trick. What happened is that the OS provider had decided to change the default kernel order, making a mess out of my server.
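This kind of off-by-one is easy to catch mechanically. Here's a minimal sketch that counts the title entries and checks that the default index actually points at one of them. check_grub_default is a made-up name, and it only understands numeric values (something like "default saved" is treated as if no default were set).

```shell
#!/bin/sh
# check_grub_default CONF
# Hypothetical helper: warns when the "default" index in a GRUB legacy
# config has no matching "title" entry. Entries are counted from 0.
check_grub_default() {
    conf=$1
    # Number of menu entries = number of lines starting with "title".
    entries=$(grep -c '^[[:space:]]*title[[:space:]]' "$conf")
    # Last numeric "default=N" (or "default N") line in the file.
    default=$(sed -n 's/^[[:space:]]*default[[:space:]=]\{1,\}\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    # GRUB legacy uses entry 0 when no default is given.
    default=${default:-0}

    if [ "$default" -lt "$entries" ]; then
        echo "OK: default=$default, $entries entries"
    else
        echo "BAD: default=$default but only $entries entries (valid: 0..$((entries - 1)))"
        return 1
    fi
}
```

From the rescue system, check_grub_default /a/grub/grub.conf would have flagged my file straight away, since it has one entry and default=1.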
Save changes, shut down, reboot. Should be fine.