As in our previous "Creating a Hadoop Cluster" post, we'll need one of the following 64-bit operating systems (I'm going to use RHEL 6.6 for this):
- Red Hat Enterprise Linux (RHEL) v6.x
- Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
- CentOS v6.x
- CentOS v5.x (deprecated)
- Oracle Linux v6.x
- Oracle Linux v5.x (deprecated)
- SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
- Ubuntu Precise v12.04
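If you want to make sure the box you're about to use really is a supported 64-bit release, a quick check on each node doesn't hurt (the second command should report x86_64):
[root@hadoop1 ~]# cat /etc/redhat-release
[root@hadoop1 ~]# uname -m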
Let's press on with the environment setup. For this, I will have one namenode (hadoop1), two secondary servers (hadoop2 and hadoop3), and five datanodes (hadoop4, hadoop5, hadoop6, hadoop7 and hadoop8):
Node Type and Number | Node Name | IP |
---|---|---|
Namenode | hadoop1 | 192.168.0.101 |
Secondary Namenode | hadoop2 | 192.168.0.102 |
Tertiary Services | hadoop3 | 192.168.0.103 |
Datanode #1 | hadoop4 | 192.168.0.104 |
Datanode #2 | hadoop5 | 192.168.0.105 |
Datanode #3 | hadoop6 | 192.168.0.106 |
Datanode #4 | hadoop7 | 192.168.0.107 |
Datanode #5 | hadoop8 | 192.168.0.108 |
Unfortunately, the menial tasks that involve system configuration cannot be avoided, so let's press on:
First of all, let's update everything (This should be issued on every server in our cluster):
[root@hadoop1 ~]# yum -y update
[root@hadoop1 ~]# yum -y install wget
We'll also need the SSH client tools (again, this should be issued on every server in our cluster):
[root@hadoop1 ~]# yum -y install openssh-clients.x86_64
Note: It's always best to actually have your node FQDNs on your DNS server and skip the next two steps (editing the /etc/hosts and the /etc/host.conf files).
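If you're not sure whether your DNS already resolves the node names, a quick test like this (run before touching /etc/hosts, so only DNS is consulted) will tell you whether you can skip ahead; I'm using the short hostnames here, so adjust to your actual FQDNs:
[root@hadoop1 ~]# for host in hadoop{1..8}; do getent hosts "$host" || echo "$host does not resolve"; done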
Now, let's edit our /etc/hosts to reflect our cluster (This should be issued on every server in our cluster):
[root@hadoop1 ~]# vi /etc/hosts
192.168.0.101 hadoop1
192.168.0.102 hadoop2
192.168.0.103 hadoop3
192.168.0.104 hadoop4
192.168.0.105 hadoop5
192.168.0.106 hadoop6
192.168.0.107 hadoop7
192.168.0.108 hadoop8
We should also check our /etc/host.conf and our /etc/nsswitch.conf to make sure the hostnames we just added actually resolve through the hosts file:
[root@hadoop1 ~]# vi /etc/host.conf
multi on
order hosts bind
[root@hadoop1 ~]# vi /etc/nsswitch.conf
....
#hosts: db files nisplus nis dns
hosts: files dns
....
We'll need a large number of file descriptors (This should be issued on every server in our cluster):
[root@hadoop1 ~]# vi /etc/security/limits.conf
....
* soft nofile 65536
* hard nofile 65536
....
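These limits only apply to new login sessions, so after logging out and back in we can verify that they took effect:
[root@hadoop1 ~]# ulimit -Sn
[root@hadoop1 ~]# ulimit -Hn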
We should make sure that our network interface comes up automatically:
[root@hadoop1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
....
ONBOOT="yes"
....
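ONBOOT only matters at boot time; if the interface isn't already up, it can be brought up right away (assuming it really is eth0, as in my case):
[root@hadoop1 ~]# ifup eth0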
And of course make sure our other networking settings, such as our hostname, are correct:
[root@hadoop1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop1
GATEWAY=192.168.0.1
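The HOSTNAME setting is also only read at boot, so to have the name take effect immediately on each node we can set it by hand as well (hadoop1 shown here as an example):
[root@hadoop1 ~]# hostname hadoop1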
We'll need to log in using SSH as root, so for the time being let's allow root logins. We might want to turn that off after we're done, as this is as insecure as they come:
[root@hadoop1 ~]# vi /etc/ssh/sshd_config
....
PermitRootLogin yes
....
[root@hadoop1 ~]# service sshd restart
NTP should be installed on every server in our cluster. Now that we've edited our hosts file, things are much easier, though we'll still be typing each node's root password until we set up passwordless SSH in the next step:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "exec yum -y install ntp ntpdate"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig ntpd on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpdate pool.ntp.org; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" service ntpd start; done
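If you'd like to confirm that every node is actually talking to its time sources, ntpq ships with the ntp package, and a quick peer listing per node does the trick (it may take a few minutes after starting ntpd before a peer is selected):
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpq -p; done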
Set up passwordless SSH authentication (note that this will be configured automatically during the actual installation, so this is not strictly necessary; it is useful though, since it saves us a lot of typing):
[root@hadoop1 ~]# ssh-keygen -t rsa
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" mkdir -p .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 700 .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 640 .ssh/authorized_keys; done
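A quick way to confirm that the passwordless logins really work end to end is to run a trivial command on every node with BatchMode enabled, which fails instead of prompting for a password:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh -o BatchMode=yes "$host" hostname; done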
Time to tone down our security a bit so that our cluster runs without problems. My PC's IP is 192.168.0.55 so I will allow that as well:
[root@hadoop1 ~]# iptables -F
[root@hadoop1 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.101,192.168.0.102,192.168.0.103,192.168.0.104,192.168.0.105,192.168.0.106,192.168.0.107,192.168.0.108 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.55 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -j DROP
[root@hadoop1 ~]# iptables -A FORWARD -j DROP
[root@hadoop1 ~]# iptables -A OUTPUT -j ACCEPT
[root@hadoop1 ~]# iptables-save > /etc/sysconfig/iptables
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/iptables "$host":/etc/sysconfig/iptables; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "iptables-restore < /etc/sysconfig/iptables"; done
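Since we saved the rules to /etc/sysconfig/iptables, it's also worth making sure the iptables service is enabled everywhere so they are loaded again after the reboot below, and taking a quick look at what each node ended up with:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig iptables on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" iptables -L INPUT -n; done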
Let's disable SELinux:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" setenforce 0; done
[root@hadoop1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/selinux "$host":/etc/sysconfig/selinux; done
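To double-check that every node is now permissive (it will be fully disabled after the reboot), getenforce is enough:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" getenforce; done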
Turn down swappiness:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "echo vm.swappiness = 1 >> /etc/sysctl.conf"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" sysctl -p; done
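And to verify that the new value took effect everywhere:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" cat /proc/sys/vm/swappiness; done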
We've made quite a few changes, including kernel updates. Let's reboot and pick this up later.
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" reboot; done
[root@hadoop1 ~]# reboot
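Once the namenode is back, a little loop like this (just a sketch, tune the timeout and sleep to taste) will tell us when the rest of the nodes are reachable again:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do until ssh -o ConnectTimeout=5 "$host" true; do sleep 5; done; echo "$host is up"; done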
Now, let's start installing Hadoop by downloading the Ambari repo and installing ambari-server.
[root@hadoop1 ~]# wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.7.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
[root@hadoop1 ~]# yum -y install ambari-server
[root@hadoop1 ~]# ambari-server setup
At this point, we'll be prompted for the following:
- If you have not temporarily disabled SELinux, you may get a warning. Accept the default (y), and continue.
- By default, Ambari Server runs under root. Accept the default (n) at the Customize user account for ambari-server daemon prompt, to proceed as root.
If you want to create a different user to run the Ambari Server, or to assign a previously created user, select y at the Customize user account for ambari-server daemon prompt, then provide a user name.
- If you have not temporarily disabled iptables you may get a warning. Enter y to continue.
- Select a JDK version to download. Enter 1 to download Oracle JDK 1.7.
- Accept the Oracle JDK license when prompted. You must accept this license to download the necessary JDK from Oracle. The JDK is installed during the deploy phase.
- Select n at Enter advanced database configuration to use the default, embedded PostgreSQL database for Ambari. The default PostgreSQL database name is ambari. The default user name and password are ambari/bigdata.
Otherwise, to use an existing PostgreSQL, MySQL or Oracle database with Ambari, select y:
- To use an existing Oracle 11g r2 instance, and select your own database name, user name, and password for that database, enter 2. Select the database you want to use and provide any information requested at the prompts, including host name, port, Service Name or SID, user name, and password.
- To use an existing MySQL 5.x database, and select your own database name, user name, and password for that database, enter 3. Select the database you want to use and provide any information requested at the prompts, including host name, port, database name, user name, and password.
- To use an existing PostgreSQL 9.x database, and select your own database name, user name, and password for that database, enter 4. Select the database you want to use and provide any information requested at the prompts, including host name, port, database name, user name, and password.
- At Proceed with configuring remote database connection properties [y/n], choose y.
- Setup completes.
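As an aside, if you're happy with all the defaults above, ambari-server setup can also be run non-interactively with its silent flag (-s), which simply accepts the defaults; I prefer walking through the prompts the first time:
[root@hadoop1 ~]# ambari-server setup -s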
[root@hadoop1 ~]# ambari-server start
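If everything went well, the server should now report itself as running and be listening on port 8080; both are easy to confirm:
[root@hadoop1 ~]# ambari-server status
[root@hadoop1 ~]# netstat -tnlp | grep 8080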
We can now point our web browser at our namenode's IP on port 8080 (http://192.168.0.101:8080 in my case) and continue the configuration and installation steps from there.
The default credentials are:
username: admin
password: admin
Then, we just need to select "Create a Cluster" and HDP takes us by the hand and pretty much does everything needed. And when I say everything, I mean it.
The only issue that you may encounter if you followed this guide is that Ambari will detect that iptables is running. We've made sure it allows everything we need, so we can safely ignore this warning.
You can install just about any Hadoop module you want with the click of a button, saving hours and in some cases maybe days of work. Amazing.
From there on, it's just a matter of waiting for the installation process to finish. This is what we are greeted with:
It even installs and configures Nagios and Ganglia for you!
That's a whole lot of work done with the press of a button right here!
And all this is for free. Wow.
References: http://docs.hortonworks.com/HDPDocuments/Ambari-1.7.0.0/Ambari_Install_v170/Ambari_Install_v170.pdf