Saturday, March 21, 2015

Introduction to Parallel Computing Part 1f - Creating a Hadoop cluster (the easy way -- Cloudera)

Supported Operating Systems

Cloudera Manager supports the following operating systems:
  • RHEL-compatible
    • Red Hat Enterprise Linux and CentOS
      • 5.7, 64-bit
      • 6.4, 64-bit
      • 6.4 in SE Linux mode
      • 6.5, 64-bit
    • Oracle Enterprise Linux with default kernel and Unbreakable Enterprise Kernel, 64-bit
      • 5.6 (UEK R2)
      • 6.4 (UEK R2)
      • 6.5 (UEK R2, UEK R3)
  • SLES - SUSE Linux Enterprise Server 11, 64-bit. Service Pack 2 or later is required.
  • Debian - Wheezy (7.0 and 7.1), Squeeze (6.0) (deprecated), 64-bit
  • Ubuntu - Trusty (14.04), Precise (12.04), Lucid (10.04) (deprecated), 64-bit
I'm going to use RHEL 6.6 for this.

Unfortunately, the menial tasks that involve system configuration cannot be avoided, so let's press on:
First of all, let's update everything (This should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y update
[root@hadoop1 ~]# yum -y install wget

We'll also need the SSH client tools (this should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y install openssh-clients.x86_64

Note: It's always best to actually have your node FQDNs on your DNS server and skip the next two steps (editing the /etc/hosts and the /etc/host.conf files). 

Now, let's edit our /etc/hosts to reflect our cluster (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/hosts
192.168.0.101   hadoop1
192.168.0.102   hadoop2
192.168.0.103   hadoop3
192.168.0.104   hadoop4
192.168.0.105   hadoop5
192.168.0.106   hadoop6
192.168.0.107   hadoop7
192.168.0.108   hadoop8
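
Throughout the rest of the setup we'll drive per-node loops off this file by extracting the hostnames with grep and awk. Here's a quick dry run of that pipeline against a sample file (the /tmp path is just for illustration):

```shell
# Build a throwaway copy of the hosts entries and extract the second
# column (the hostname) from every line that mentions hadoop:
cat > /tmp/hosts.sample <<'EOF'
192.168.0.101   hadoop1
192.168.0.102   hadoop2
EOF
grep hadoop /tmp/hosts.sample | awk '{print $2}'
# prints hadoop1 and hadoop2, one per line
```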

We should also check our /etc/host.conf and our /etc/nsswitch.conf, to make sure the entries in our hosts file are actually consulted when resolving hostnames:

[hadoop@hadoop1 ~]$ vi /etc/host.conf
multi on
order hosts bind
[hadoop@hadoop1 ~]$ vi /etc/nsswitch.conf
....
#hosts:     db files nisplus nis dns
hosts:      files dns
....
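
With `hosts: files dns`, libc consults /etc/hosts before DNS. `getent hosts` resolves names through the same nsswitch mechanism libc uses, so it's a handy way to confirm the change took effect (shown here against localhost):

```shell
# Unlike nslookup/dig, which talk straight to DNS, getent honours the
# nsswitch.conf ordering, so entries from /etc/hosts show up here:
getent hosts localhost
```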

We'll need a large number of file descriptors (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/security/limits.conf
....
* soft nofile 65536
* hard nofile 65536
....
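
The new limits only apply to sessions opened after the edit (pam_limits applies them at login). After logging back in, check with ulimit:

```shell
# ulimit -n reports the open-files limit for the current shell;
# -S and -H select the soft and hard values respectively:
ulimit -Sn
ulimit -Hn
```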

We should make sure that our network interface comes up automatically:

[root@hadoop1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
....
ONBOOT="yes"
....

And of course, make sure our other networking settings, such as our hostname, are correct:

[root@hadoop1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop1
GATEWAY=192.168.0.1
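
The HOSTNAME line only takes effect at boot; to check (and, if needed, set) the hostname of the running system:

```shell
# Print the hostname the kernel currently reports:
hostname
# To change it immediately without a reboot (the /etc/sysconfig/network
# edit is what makes it stick across boots):
#   hostname hadoop1
```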

We'll need to log in using SSH as root, so for the time being let's allow root logins. We might want to turn that off after we're done, as this is as insecure as they come:

[root@hadoop1 ~]# vi /etc/ssh/sshd_config
....
PermitRootLogin yes
....
[root@hadoop1 ~]# service sshd restart

NTP should be installed on every server in our cluster. Now that we've edited our hosts file things are much easier though:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "exec yum -y install ntp ntpdate"; done 
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig ntpd on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpdate pool.ntp.org; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" service ntpd start; done
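
All four loops share the same shape, so if you're unsure about one, it's worth dry-running the pattern against a sample file first. The helper and file names below (run_on_cluster, /tmp/hosts.demo) are hypothetical, not part of the post:

```shell
# Echo instead of executing; swap the inner echo for ssh "$host" "$@"
# once you're happy with the host list it produces.
run_on_cluster() {
    local hosts_file=$1; shift
    for host in $(grep hadoop "$hosts_file" | awk '{print $2}'); do
        echo "== $host =="
        echo "would run: $*"   # on the real cluster: ssh "$host" "$@"
    done
}

printf '192.168.0.101 hadoop1\n192.168.0.102 hadoop2\n' > /tmp/hosts.demo
run_on_cluster /tmp/hosts.demo service ntpd status
```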

Set up passwordless SSH authentication (note that this will be configured automatically during the actual installation, so it is not strictly necessary; it is useful though, since it saves us a lot of typing):

[root@hadoop1 ~]# ssh-keygen -t rsa 
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" mkdir -p .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 700 .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 640 .ssh/authorized_keys; done
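
The chmod values matter: with StrictModes (the sshd default), key authentication is refused if .ssh or authorized_keys are group- or world-writable. A quick local sketch (the /tmp paths are just for illustration) of verifying the octal modes:

```shell
# Recreate the expected layout in a scratch directory and print the
# octal permission bits with stat:
mkdir -p /tmp/demo_ssh
chmod 700 /tmp/demo_ssh
touch /tmp/demo_ssh/authorized_keys
chmod 640 /tmp/demo_ssh/authorized_keys
stat -c '%a %n' /tmp/demo_ssh /tmp/demo_ssh/authorized_keys
# prints: 700 /tmp/demo_ssh
#         640 /tmp/demo_ssh/authorized_keys
```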

Time to tone down our security a bit so that our cluster runs without problems. My PC's IP is 192.168.0.55 so I will allow that as well:

[root@hadoop1 ~]# iptables -F
[root@hadoop1 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.101,192.168.0.102,192.168.0.103,192.168.0.104,192.168.0.105,192.168.0.106,192.168.0.107,192.168.0.108 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.55 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -j DROP
[root@hadoop1 ~]# iptables -A FORWARD -j DROP
[root@hadoop1 ~]# iptables -A OUTPUT -j ACCEPT
[root@hadoop1 ~]# iptables-save > /etc/sysconfig/iptables
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/iptables "$host":/etc/sysconfig/iptables; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "iptables-restore < /etc/sysconfig/iptables"; done

Let's disable SELinux:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" setenforce 0; done
[root@hadoop1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/selinux "$host":/etc/sysconfig/selinux; done

Turn down swappiness. Cloudera actually recommend turning swappiness all the way down to 0; I prefer 1:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "echo vm.swappiness = 1 >> /etc/sysctl.conf"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" sysctl -p; done
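
sysctl -p applies the setting immediately, and /proc reflects the live value, so each node should now report 1 (a stock install typically defaults to 60):

```shell
# The kernel's current swappiness value:
cat /proc/sys/vm/swappiness
```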

We've made quite a few changes, including kernel updates. Let's reboot and pick this up later.

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" reboot; done      
[root@hadoop1 ~]# reboot

Now, let's start installing Hadoop by downloading and running the Cloudera Manager installer:

[root@hadoop1 ~]# wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin     
[root@hadoop1 ~]# chmod u+x cloudera-manager-installer.bin
[root@hadoop1 ~]# ./cloudera-manager-installer.bin

The installation process is very straightforward, to say the least: you only have to accept a few licence agreements and click "Next" a few times.



After a while, you'll need to point your web browser to the host where you installed Cloudera Manager, on port 7180. In my case that's 192.168.0.101:7180.

Just in case, take a look at the logs before actually logging in. If there doesn't seem to be a Cloudera Manager service listening at that address and port yet, wait a bit until you see the relevant startup message:

[root@hadoop1 ~]# netstat -ntpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0      0 0.0.0.0:7432                0.0.0.0:*                   LISTEN      1807/postgres       
tcp        0      0 0.0.0.0:7182                0.0.0.0:*                   LISTEN      2419/java           
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      920/sshd            
tcp        0      0 127.0.0.1:25                0.0.0.0:*                   LISTEN      1045/master         
tcp        0      0 :::7432                     :::*                        LISTEN      1807/postgres       
tcp        0      0 :::22                       :::*                        LISTEN      920/sshd            
tcp        0      0 ::1:25                      :::*                        LISTEN      1045/master         
[root@hadoop1 ~]# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log 
....
2015-03-20 05:54:54,092 INFO WebServerImpl:org.mortbay.log: jetty-6.1.26.cloudera.4
2015-03-20 05:54:54,140 INFO WebServerImpl:org.mortbay.log: Started SelectChannelConnector@0.0.0.0:7180
2015-03-20 05:54:54,140 INFO WebServerImpl:com.cloudera.server.cmf.WebServerImpl: Started Jetty server.
2015-03-20 05:54:54,844 INFO SearchRepositoryManager-0:com.cloudera.server.web.cmf.search.components.SearchRepositoryManager: Finished constructing repo:2015-03-20T03:54:54.844Z


The default username and password are admin and admin.


As before, it is an extremely straightforward process. The only point where you might be uncertain about what to choose is when Cloudera asks whether it should install using traditional packages (.rpm or .deb) or Cloudera's parcels method.


According to Cloudera, among other benefits, parcels provide a mechanism for upgrading the packages installed on a cluster from within the Cloudera Manager Admin Console with minimal disruption. So let's proceed using parcels.

Once it starts the cluster installation, it needs a fair bit of time to complete, so sit back and relax. Note that if the installation on a particular node appears to get stuck at "Acquiring Installation lock", just log on there, remove the lock:

[root@hadoop1 ~]# rm -f /tmp/.scm_prepare_node.lock

and abort and retry.


Once it's finished, it will give you a warning that Cloudera recommend turning down swappiness to 0, but other than that, everything should be fine.



After that, you need to pick which Hadoop services should run on which server, create your databases (at which point you should also note down the usernames, passwords and database names for future reference), and review the base directory locations. We're going to do a pretty basic vanilla installation here, so we choose the custom option.



 
After that, we'll need to wait a tad for the manager to start all the services, and then we'll be good to go.

And that's what you get for installing a Hadoop cluster on tiny VMs!
All right, time to do something with our new cluster. Let's go to our Hue server.

To do that, go to your Cloudera Manager UI, select "Hue" and click on "Hue Web UI".


Just choose the username and password that you will use for Hue. As soon as you're in, it will run a few automatic checks and ask you whether you want to create new users.

This means we have everything up and running, and we can actually use our Hadoop cluster through a web browser instead of going through everything manually!

References: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-6-2/Cloudera-Manager-Managing-Clusters/cmmc_parcel_upgrade.html

