Before we start deploying, we need to plan a few things ahead. For instance, is your system supported by Hadoop?
In theory, pretty much any system that can run Java can run Hadoop. In practice, it's best to use one of the following 64-bit operating systems:
- Red Hat Enterprise Linux (RHEL) v6.x
- Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
- CentOS v6.x
- CentOS v5.x (deprecated)
- Oracle Linux v6.x
- Oracle Linux v5.x (deprecated)
- SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
- Ubuntu Precise v12.04
If you intend to deploy on one of these distros, then great. I'm going to use RHEL 6.6.
Let's press on with the environment setup. For this, I will have just one namenode (hadoop1) and 5 datanodes (hadoop4, hadoop5, hadoop6, hadoop7 and hadoop8).
Node Type and Number | Node Name | IP |
---|---|---|
Namenode #1 | hadoop1 | 192.168.0.101 |
Datanode #1 | hadoop4 | 192.168.0.104 |
Datanode #2 | hadoop5 | 192.168.0.105 |
Datanode #3 | hadoop6 | 192.168.0.106 |
Datanode #4 | hadoop7 | 192.168.0.107 |
Datanode #5 | hadoop8 | 192.168.0.108 |
First of all, let's update everything (This should be issued on every server in our cluster):
[root@hadoop1 ~]# yum -y update
We'll need these (This should be issued on every server in our cluster.):
[root@hadoop1 ~]# yum -y install openssh-clients.x86_64
[root@hadoop1 ~]# yum -y install wget
Note: It's always best to actually have your node FQDNs on your DNS server and skip the next two steps (editing the /etc/hosts and the /etc/host.conf files).
Now, let's edit our /etc/hosts to reflect our cluster (This should be issued on every server in our cluster):
[root@hadoop1 ~]# vi /etc/hosts
192.168.0.101 hadoop1
192.168.0.104 hadoop4
192.168.0.105 hadoop5
192.168.0.106 hadoop6
192.168.0.107 hadoop7
192.168.0.108 hadoop8
We should also check our /etc/host.conf and our /etc/nsswitch.conf, so that our hostnames resolve through /etc/hosts:
[hadoop@hadoop1 ~]$ vi /etc/host.conf
multi on
order hosts bind
[hadoop@hadoop1 ~]$ vi /etc/nsswitch.conf
....
#hosts: db files nisplus nis dns
hosts: files dns
....
We'll need a large number of file descriptors (This should be issued on every server in our cluster):
[root@hadoop1 ~]# vi /etc/security/limits.conf
....
* soft nofile 65536
* hard nofile 65536
....
We should make sure that our network interface comes up automatically:
[root@hadoop1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
....
ONBOOT="yes"
....
And of course make sure our other networking functions, such as our hostname are correct:
[root@hadoop1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop1
GATEWAY=192.168.0.1
We'll need to log in using SSH as root, so for the time being let's allow root logins. We might want to turn that off after we're done, as this is as insecure as they come:
[root@hadoop1 ~]# vi /etc/ssh/sshd_config
....
PermitRootLogin yes
....
[root@hadoop1 ~]# service sshd restart
NTP should be installed on every server in our cluster. Now that we've edited our hosts file, this is easy to do from a single node:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "exec yum -y install ntp ntpdate"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig ntpd on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpdate pool.ntp.org; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" service ntpd start; done
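Optionally, we can do a quick sanity check that the clocks are now roughly in sync by printing the date on every node:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" date; done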
Let's set up our hadoop user:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" groupadd hadoop; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" useradd hadoop -g hadoop -m -s /bin/bash; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" passwd hadoop; done
Set up passwordless ssh authentication:
[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$
ssh-keygen -t rsa
[hadoop@hadoop1 ~]$
for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" mkdir -p .ssh; done
[hadoop@hadoop1 ~]$
for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"; done
[hadoop@hadoop1 ~]$
for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 700 .ssh; done
[hadoop@hadoop1 ~]$
for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 640 .ssh/authorized_keys; done
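Before moving on, it's worth verifying that passwordless authentication actually works; each node should print its hostname without asking for a password:
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" hostname; done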
Time to tone down our security a bit so that our cluster runs without problems. My PC's IP is 192.168.0.55 so I will allow that as well:
[root@hadoop1 ~]# iptables -F
[root@hadoop1 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.101,192.168.0.104,192.168.0.105,192.168.0.106,192.168.0.107,192.168.0.108 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.55 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -j DROP
[root@hadoop1 ~]# iptables -A FORWARD -j DROP
[root@hadoop1 ~]# iptables -A OUTPUT -j ACCEPT
[root@hadoop1 ~]# iptables-save > /etc/sysconfig/iptables
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/iptables "$host":/etc/sysconfig/iptables; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "iptables-restore < /etc/sysconfig/iptables"; done
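If you want to double-check that every node ended up with the same rules, you can list them like so:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" 'iptables -L INPUT -n'; done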
Let's disable SELinux:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" setenforce 0; done
[root@hadoop1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/selinux "$host":/etc/sysconfig/selinux; done
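A quick way to confirm the change is getenforce, which should report Permissive now (and Disabled after the next reboot):
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" getenforce; done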
Turn down swappiness:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "echo vm.swappiness = 1 >> /etc/sysctl.conf"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" sysctl -p; done
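To verify that the new value is active on every node:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" cat /proc/sys/vm/swappiness; done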
Hadoop needs the Java SDK installed. Although it will probably work if you just pick the latest one, it's best to check which Java version is recommended for your Hadoop release. OpenJDK 1.7.0 was recommended for my version (2.6.0).
Download the latest Java SDK:
a) Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html
b) Select the proper Java SDK version (make sure it's the x64 version)
c) Accept the licence
d) Select the .tar.gz archive that matches your system's architecture
e) Copy its link location
f) And:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" 'wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jdk-7u75-linux-x64.tar.gz'; done
I'm going to install Java SDK in /opt/jdk:
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" mkdir /opt/jdk; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" 'tar -zxf jdk-7u75-linux-x64.tar.gz -C /opt/jdk'; done
[root@hadoop1 ~]# /opt/jdk/jdk1.7.0_75/bin/java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
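Optionally, confirm that the same JDK ended up on every node (java -version prints to stderr, which ssh will still display):
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" '/opt/jdk/jdk1.7.0_75/bin/java -version'; done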
Let's prepare our system environment variables. Here I'm going to install Hadoop in /opt/hadoop.
[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ vi ~/.bashrc
....
export HADOOP_HOME=/opt/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop/lib/native
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/opt/jdk/jdk1.7.0_75
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HIVE_HOME=/opt/hive
export PATH=$PATH:/opt/jdk/jdk1.7.0_75/bin:/opt/hadoop/bin:/opt/hadoop/sbin:/opt/pig/bin:/opt/hive/bin:/opt/hive/hcatalog/bin:/opt/hive/hcatalog/sbin
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp ~/.bashrc "$host":~/.bashrc; done
[hadoop@hadoop1 ~]$ source ~/.bashrc
Time to download and extract the latest Hadoop:
Go to:
http://hadoop.apache.org/releases.html
select "Download a release now", select a mirror and copy its link location
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh hadoop@"$host" 'wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz'; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" mkdir /opt/hadoop; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" 'tar -zxf /home/hadoop/hadoop-2.6.0.tar.gz -C /opt/hadoop/ --strip-components=1'; done
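Optionally, a quick check that the tarball extracted where we expect it (this assumes the environment variables from ~/.bashrc are already loaded in the current shell; if not, source ~/.bashrc first):
[hadoop@hadoop1 ~]$ /opt/hadoop/bin/hadoop version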
[hadoop@hadoop1 ~]$ exit
Let's edit our config files:
core-site.xml
[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop1:9000/</value>
<description>Hadoop Filesystem</description>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Notice that I have put my intended namenode in the fs.defaultFS value (the successor of the now-deprecated fs.default.name property).
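If you want to confirm which filesystem URI Hadoop will actually use, the configuration can be queried as the hadoop user; this only reads the config files, so it works before the cluster is up:
[hadoop@hadoop1 ~]$ hdfs getconf -confKey fs.defaultFS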
hdfs-site.xml
[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/hdfs/datanode</value>
<description>DataNode directory for storing data chunks.</description>
<final>true</final>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/hdfs/namenode</value>
<description>NameNode directory for namespace and transaction logs storage.</description>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Level of replication for each chunk.</description>
</property>
</configuration>
The replication factor is set in the HDFS configuration file and controls the default number of copies kept for each block across the entire cluster.
For each block stored in HDFS, there will be n-1 duplicated blocks distributed across the cluster.
For example, with a replication factor of 3 (the HDFS default) there would be one original block and two replicas.
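Note that dfs.replication is only the cluster-wide default; once the cluster is up, the replication of an individual file can be changed at any time. A minimal example, using a hypothetical path purely for illustration:
[hadoop@hadoop1 ~]$ hdfs dfs -setrep -w 3 /path/to/somefile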
mapred-site.xml
[root@hadoop1 ~]# cp /opt/hadoop/etc/hadoop/mapred-site.xml.template /opt/hadoop/etc/hadoop/mapred-site.xml
[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>hadoop1:9001</value>
</property>
</configuration>
This is the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.
yarn-site.xml
[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop1</value>
<description>The hostname of the ResourceManager</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service for MapReduce</description>
</property>
</configuration>
hadoop-env.sh
[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/hadoop-env.sh
....
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/opt/jdk/jdk1.7.0_75
....
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
....
Let's create the namenode and datanode directories that we set in hdfs-site.xml and do the rest of the menial configuration tasks:
[root@hadoop1 ~]# mkdir -p /opt/hadoop/hdfs/namenode
[root@hadoop1 ~]# mkdir -p /opt/hadoop/hdfs/datanode
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp -r /opt/hadoop/etc/* "$host":"/opt/hadoop/etc/."; done
[root@hadoop1 ~]# echo "hadoop1" > /opt/hadoop/etc/hadoop/masters
[root@hadoop1 ~]# echo $'hadoop4\nhadoop5\nhadoop6\nhadoop7\nhadoop8' > /opt/hadoop/etc/hadoop/slaves
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" chown -R hadoop:hadoop /opt/hadoop/; done
Okay, let's start it all up and see if it works:
[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ hdfs namenode -format
This will format our HDFS filesystem. And:
[hadoop@hadoop1 ~]$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/03/10 17:43:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop1]
hadoop1: starting namenode, logging to /opt/hadoop/logs/hadoop-hadoop-namenode-hadoop1.out
hadoop4: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop4.out
hadoop7: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop7.out
hadoop5: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop5.out
hadoop6: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop6.out
hadoop8: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop8.out
Looks like we're up and running, let's make sure:
[hadoop@hadoop1 ~]$ jps
13247 ResourceManager
12922 NameNode
13620 Jps
13113 SecondaryNameNode
Yup, how about on the other nodes?
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh hadoop@"$host" jps; done
1897 DataNode
1990 NodeManager
2159 Jps
2164 Jps
1902 DataNode
1995 NodeManager
2146 Jps
1884 DataNode
1977 NodeManager
2146 Jps
1884 DataNode
1977 NodeManager
12094 DataNode
12795 Jps
12217 NodeManager
[hadoop@hadoop1 ~]$ exit
Great. Now you can navigate to your namenode's IP on port 50070 (here, http://192.168.0.101:50070) with your web browser and you should get an overview of your cluster.
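If you prefer the command line, the same information is available there too, and you can give the cluster a small job to chew on. The example jar path below assumes the stock Hadoop 2.6.0 tarball layout:
[hadoop@hadoop1 ~]$ hdfs dfsadmin -report
[hadoop@hadoop1 ~]$ yarn node -list
[hadoop@hadoop1 ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 5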
To finish this off, let's install Pig and Hive. WebHCat and HCatalog are installed with Hive, starting with Hive release 0.11.0.
Go to http://pig.apache.org/releases.html, choose a version, choose a mirror and download:
[root@hadoop1 ~]# wget http://www.eu.apache.org/dist/pig/latest/pig-0.14.0.tar.gz
[root@hadoop1 ~]# mkdir /opt/pig
[root@hadoop1 ~]# tar -zxf pig-0.14.0.tar.gz -C /opt/pig --strip-components=1
[root@hadoop1 ~]# chown -R hadoop:hadoop /opt/pig
I've already taken care of the environment variables earlier (you need to add /opt/pig/bin to the PATH), so all I need to do now is:
[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ pig -x mapreduce
grunt> QUIT;
[hadoop@hadoop1 ~]$ exit
Great. Now go to https://hive.apache.org/downloads.html, choose download, choose mirror:
[root@hadoop1 ~]# wget http://www.eu.apache.org/dist/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
[root@hadoop1 ~]# mkdir /opt/hive
[root@hadoop1 ~]# tar -zxf apache-hive-1.1.0-bin.tar.gz -C /opt/hive --strip-components=1
[root@hadoop1 ~]# chown -R hadoop:hadoop /opt/hive
I've already taken care of the environment variables earlier (you need to add /opt/hive/bin to the PATH and export HIVE_HOME=/opt/hive), so all I need to do now is:
[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ hadoop fs -mkdir /tmp
[hadoop@hadoop1 ~]$ hadoop fs -mkdir -p /user/hive/warehouse
[hadoop@hadoop1 ~]$ hadoop fs -chmod g+w /tmp
[hadoop@hadoop1 ~]$ hadoop fs -chmod g+w /user/hive/warehouse
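Before trying hcat, you can optionally give the Hive CLI itself a quick spin; with the default embedded Derby metastore, this will create a metastore_db directory in whatever directory you run it from:
[hadoop@hadoop1 ~]$ hive -e 'SHOW DATABASES;'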
[hadoop@hadoop1 ~]$ hcat
usage: hcat { -e "<query>" | -f "<filepath>" } [ -g "<group>" ] [ -p "<perms>" ] [ -D"<name>=<value>" ]
-D use hadoop value for given property
-e hcat command given from command line
-f hcat commands in file
-g group for the db/table specified in CREATE statement
-h,--help Print help information
-p permissions for the db/table specified in CREATE statement
Yup, looks like it all works.