Saturday, March 21, 2015

Introduction to Parallel Computing Part 1f - Creating a Hadoop cluster (the easy way -- Cloudera)

Supported Operating Systems

Cloudera Manager supports the following operating systems:
  • RHEL-compatible
    • Red Hat Enterprise Linux and CentOS
      • 5.7, 64-bit
      • 6.4, 64-bit
      • 6.4 in SE Linux mode
      • 6.5, 64-bit
    • Oracle Enterprise Linux with default kernel and Unbreakable Enterprise Kernel, 64-bit
      • 5.6 (UEK R2)
      • 6.4 (UEK R2)
      • 6.5 (UEK R2, UEK R3)
  • SLES - SUSE Linux Enterprise Server 11, 64-bit. Service Pack 2 or later is required.
  • Debian - Wheezy (7.0 and 7.1), Squeeze (6.0) (deprecated), 64-bit
  • Ubuntu - Trusty (14.04), Precise (12.04), Lucid (10.04) (deprecated), 64-bit
I'm going to use RHEL 6.6 for this.

Unfortunately, the menial tasks that involve system configuration cannot be avoided, so let's press on:
First of all, let's update everything (This should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y update
[root@hadoop1 ~]# yum -y install wget

We'll also need the SSH client tools, since we'll be managing the other nodes from hadoop1 (this too should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y install openssh-clients.x86_64

Note: It's always best to actually have your node FQDNs on your DNS server and skip the next two steps (editing the /etc/hosts and the /etc/host.conf files). 

Now, let's edit our /etc/hosts to reflect our cluster (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/hosts
192.168.0.101   hadoop1
192.168.0.102   hadoop2
192.168.0.103   hadoop3
192.168.0.104   hadoop4
192.168.0.105   hadoop5
192.168.0.106   hadoop6
192.168.0.107   hadoop7
192.168.0.108   hadoop8
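
As a quick sanity check that the names now resolve locally (hadoop3 here is just one of the entries above; any of them will do):

[root@hadoop1 ~]# getent hosts hadoop3
192.168.0.103   hadoop3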

We should also check our /etc/host.conf and our /etc/nsswitch.conf, to make sure our hostnames actually resolve through /etc/hosts:

[hadoop@hadoop1 ~]$ vi /etc/host.conf
multi on
order hosts bind
[hadoop@hadoop1 ~]$ vi /etc/nsswitch.conf
....
#hosts:     db files nisplus nis dns
hosts:      files dns
....

We'll need a large number of file descriptors (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/security/limits.conf
....
* soft nofile 65536
* hard nofile 65536
....
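
If you'd rather not edit limits.conf by hand on every node, you can push it out from hadoop1 with the same loop pattern we use later on (you'll be prompted for passwords until the passwordless SSH below is set up, and the new limits only take effect on your next login, at which point ulimit -n should report 65536):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/security/limits.conf "$host":/etc/security/limits.conf; done
[root@hadoop1 ~]# ulimit -n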

We should make sure that our network interface comes up automatically:

[root@hadoop1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
....
ONBOOT="yes"
....

And of course make sure our other network settings, such as our hostname and gateway, are correct:

[root@hadoop1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop1
GATEWAY=192.168.0.1

We'll need to log in using SSH as root, so for the time being let's allow root logins. We might want to turn that off after we're done, as this is as insecure as they come:

[root@hadoop1 ~]# vi /etc/ssh/sshd_config
....
PermitRootLogin yes
....
[root@hadoop1 ~]# service sshd restart

NTP should be installed on every server in our cluster. Now that we've edited our hosts file, this is much easier, since we can simply loop over the node names:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "exec yum -y install ntp ntpdate"; done 
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig ntpd on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpdate pool.ntp.org; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" service ntpd start; done

Set up passwordless SSH authentication (note that this will be configured automatically during the actual installation, so it is not strictly necessary; it is useful though, since it saves us a lot of typing):

[root@hadoop1 ~]# ssh-keygen -t rsa 
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" mkdir -p .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 700 .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 640 .ssh/authorized_keys; done
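
A quick way to confirm the keys are in place is to run a command on one of the other nodes and make sure no password prompt appears:

[root@hadoop1 ~]# ssh hadoop4 hostname
hadoop4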

Time to tone down our security a bit so that our cluster runs without problems. My PC's IP is 192.168.0.55 so I will allow that as well:

[root@hadoop1 ~]# iptables -F
[root@hadoop1 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.101,192.168.0.102,192.168.0.103,192.168.0.104,192.168.0.105,192.168.0.106,192.168.0.107,192.168.0.108 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.55 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -j DROP
[root@hadoop1 ~]# iptables -A FORWARD -j DROP
[root@hadoop1 ~]# iptables -A OUTPUT -j ACCEPT
[root@hadoop1 ~]# iptables-save > /etc/sysconfig/iptables
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/iptables "$host":/etc/sysconfig/iptables; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "iptables-restore < /etc/sysconfig/iptables"; done
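
If you want to double-check that the rules really made it everywhere, you can list the INPUT chain on every node; you should see the ACCEPT rules for our addresses followed by the final DROP:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" iptables -nL INPUT; done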

Let's disable SELinux:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" setenforce 0; done
[root@hadoop1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/selinux "$host":/etc/sysconfig/selinux; done

Turn down swappiness. Cloudera actually recommends setting swappiness to 0, but I prefer 1:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "echo vm.swappiness = 1 >> /etc/sysctl.conf"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" sysctl -p; done
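
And a quick check that the new value is active on every node (each line should print 1):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" cat /proc/sys/vm/swappiness; done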

We've made quite a few changes, including kernel updates. Let's reboot and pick this up later.

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" reboot; done      
[root@hadoop1 ~]# reboot

Now, let's start installing Hadoop by downloading and running the Cloudera manager and installation script.

[root@hadoop1 ~]# wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin     
[root@hadoop1 ~]# chmod u+x cloudera-manager-installer.bin
[root@hadoop1 ~]# ./cloudera-manager-installer.bin

The installation process is very straightforward, to say the least. You only have to read a few license agreements and click "Next" a few times.



After a while, you'll need to point your web browser to the IP of the system you installed Cloudera Manager on, port 7180. In my case that's 192.168.0.101:7180.

Just in case, take a look at the logs before actually logging in. If the Cloudera Manager service doesn't seem to be listening on that address and port yet, wait a bit until you see the relevant message:

[root@hadoop1 ~]# netstat -ntpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0      0 0.0.0.0:7432                0.0.0.0:*                   LISTEN      1807/postgres       
tcp        0      0 0.0.0.0:7182                0.0.0.0:*                   LISTEN      2419/java           
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      920/sshd            
tcp        0      0 127.0.0.1:25                0.0.0.0:*                   LISTEN      1045/master         
tcp        0      0 :::7432                     :::*                        LISTEN      1807/postgres       
tcp        0      0 :::22                       :::*                        LISTEN      920/sshd            
tcp        0      0 ::1:25                      :::*                        LISTEN      1045/master         
[root@hadoop1 ~]# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log 
....
2015-03-20 05:54:54,092 INFO WebServerImpl:org.mortbay.log: jetty-6.1.26.cloudera.4
2015-03-20 05:54:54,140 INFO WebServerImpl:org.mortbay.log: Started SelectChannelConnector@0.0.0.0:7180
2015-03-20 05:54:54,140 INFO WebServerImpl:com.cloudera.server.cmf.WebServerImpl: Started Jetty server.
2015-03-20 05:54:54,844 INFO SearchRepositoryManager-0:com.cloudera.server.web.cmf.search.components.SearchRepositoryManager: Finished constructing repo:2015-03-20T03:54:54.844Z


The default username and password are admin and admin.


As before, it is an extremely straightforward process. The only point where you might be uncertain about what to choose is when Cloudera asks whether it should install using traditional packages (.rpm or .deb) or Cloudera's parcels method.


According to Cloudera, among other benefits, parcels provide a mechanism for upgrading the packages installed on a cluster from within the Cloudera Manager Admin Console with minimal disruption. So let's proceed using parcels.

Once it starts the cluster installation, it needs a fair bit of time to complete, so sit back and relax. Note that if the installation on a particular node appears to get stuck at "Acquiring Installation lock", just log on there, remove the lock:

[root@hadoop1 ~]# rm -f /tmp/.scm_prepare_node.lock

and abort and retry.


Once it's finished it will warn you that Cloudera recommends setting swappiness to 0, but other than that, everything should be fine.



After that, you need to pick which Hadoop services should run on which server, create your databases (at which point you should also note the usernames, passwords and database names for future reference), and review the base directory locations. We're going to do a pretty basic vanilla installation here, so we choose the custom option:



 
After that, we'll need to wait a bit for the manager to start all the services, and then we'll be good to go.

And that's what you get for installing a Hadoop cluster on tiny VMs!
All right, time to do something with our new cluster. Let's go to our hue server.

To do that, go to your Cloudera Manager UI, select "Hue" and click on "Hue Web UI".


Just pick the username and password you will use for Hue. As soon as you're in, it will run a few automatic checks and ask whether you want to create new users.

Which means that we have everything up and running and we can actually use our Hadoop cluster using a Web browser instead of going through everything manually!

References: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-6-2/Cloudera-Manager-Managing-Clusters/cmmc_parcel_upgrade.html


Saturday, March 14, 2015

Introduction to Parallel Computing Part 1e - Using Hadoop (Using Hue and Hive)

The first thing you need to do as the admin user (after checking that there are no misconfigurations) is to create a user that can upload and handle files on your HDFS system.

There should be an account named root if you installed via the Hortonworks/Ambari method, and/or one named hdfs if you installed manually. These users should have Superuser status. Please note that, for brevity, I use superuser accounts to actually process my data; obviously, for security reasons, that is not something you should do in practice.




The hdfs user should also exist, again with Superuser status.


Now, let's check our actual HDFS permissions. Log in to your HDFS system (hadoop1 for me) and execute:

[root@hadoop1 ~]# hadoop fs -ls /user
Found 6 items
drwxrwx---   - ambari-qa hdfs          0 2015-03-11 19:11 /user/ambari-qa
drwxr-xr-x   - hcat      hdfs          0 2015-03-11 19:09 /user/hcat
drwxr-xr-x   - hdfs      hdfs          0 2015-03-13 12:06 /user/hdfs
drwx------   - hive      hdfs          0 2015-03-11 19:02 /user/hive
drwxrwxr-x   - oozie     hdfs          0 2015-03-11 19:06 /user/oozie
drwxr-xr-x   - root      root          0 2015-03-13 14:46 /user/root

Oh, that's no good. We need to change the permissions on the /user/hive directory:

[root@hadoop1 ~]# su - hdfs
[hdfs@hadoop1 ~]$ hadoop fs -mkdir -p /user/hive/warehouse
[hdfs@hadoop1 ~]$ hadoop fs -chmod g+w /tmp
[hdfs@hadoop1 ~]$ hadoop fs -chmod g+w /user/hive/
[hdfs@hadoop1 ~]$ hadoop fs -chmod g+w /user/hive/warehouse

Let's add the root user to the hdfs and hadoop groups as well while we're at it.

[root@hadoop1 ~]# groups root
root : root
[root@hadoop1 ~]# usermod -a -G root,hdfs,hadoop root
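
Checking again should now show the extra groups:

[root@hadoop1 ~]# groups root
root : root hdfs hadoop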

Now, go to your web browser again and log out from your admin account, and log in as root.

Let's create a directory for our project and set proper permissions to it.

Go to "File Browser" and select "New Directory". Select a descriptive name for it.




Now let's make sure this directory has proper permissions.



Go into your newly created directory and select "Upload", "Files".


I'm just going to go ahead and upload this: https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz

It's stock ticker data from the New York Stock Exchange for the years 2000–2001, made available by the Hortonworks tutorial I cite in my references. Just go ahead and paste that URL into the dialog that pops up.


After the actual upload to your HDFS, the file should be visible. You can actually click on it and see its contents if you like.

Now, in order to process it using Hive and Pig, we'll need to register it with HCatalog first. We're going to take a shortcut though: go to Beeswax and select "Tables".


Select "Create new table from a file". Enter a name as a table name, choose the file we uploaded and make sure the "Import data from file" option is checked.


Choose Next. This file uses tabs as its delimiter, and its first row contains the column headers. So we choose the "Tab" delimiter and check "Read Column Headers":


Choose Next. Before going any further, check that the column data types are correct and adjust where necessary.


Finally, select Create Table. Let's put our cluster to the test. Go to "Query Editor" and run a query, for instance:
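
The queries themselves were shown in the screenshots, but to give you an idea, something along these lines produces the figure below. The table and column names are assumptions on my part (I'm assuming the table was created as nyse and that the file's own headers, such as stock_price_high and date, were kept); adjust them to whatever you chose. You can type the SELECT statement straight into the Query Editor, or equivalently run it from a shell on the node that has the Hive client (hadoop3 here):

[root@hadoop3 ~]# hive -e 'SELECT AVG(stock_price_high) FROM nyse WHERE year(`date`) = 2000;'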


OR:


And now sit back while Hadoop performs its magic.



So the average daily high stock price on the New York Stock Exchange in the year 2000 was 27.7649 dollars.

That's it! You know SQL, you know Hive. End of story.

References: http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/

Friday, March 13, 2015

Introduction to Parallel Computing Part 1e - Using Hadoop (Installing Hue on Hortonworks)

In our previous "Creating a Hadoop Cluster" post, we saw how we can install a Hadoop cluster using Hortonworks.

Great, we built a cluster, but how do we actually feed it data, and how do we get it to process that data?

The easy way is to install Hue and do pretty much everything using a web browser. Sounds good? Let's do it.

Hue supports the following operating systems:
  • Red Hat Enterprise Linux (RHEL) v6.x
  • Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
  • CentOS v6.x
  • CentOS v5.x (deprecated)
  • Oracle Linux v6.x
  • Oracle Linux v5.x (deprecated)
  • SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
  • Ubuntu Precise v12.04
As before, I'm going to use RHEL 6.6 for this.

You also need to have these Hadoop components installed:

Component   Required   Applications                    Notes
HDFS        Yes        Core, Filebrowser               HDFS access through WebHDFS or HttpFS
YARN        Yes        JobDesigner, JobBrowser, Hive   Transitive dependency via Hive or Oozie
Oozie       No         JobDesigner, Oozie              Oozie access through REST API
Hive        No         Hive, HCatalog                  Beeswax server uses the Hive client libraries
WebHCat     No         HCatalog, Pig                   HCatalog and Pig use the WebHCat REST API

And let's remember my cluster details:

Hadoop Cluster
Node Type and Number   Node Name   IP
Namenode               hadoop1     192.168.0.101
Secondary Namenode     hadoop2     192.168.0.102
Tertiary Services      hadoop3     192.168.0.103
Datanode #1            hadoop4     192.168.0.104
Datanode #2            hadoop5     192.168.0.105
Datanode #3            hadoop6     192.168.0.106
Datanode #4            hadoop7     192.168.0.107
Datanode #5            hadoop8     192.168.0.108


First of all, go to Ambari, select HDFS from the left hand side menu and go to "Configs". There you need to ensure that WebHDFS is enabled.


Then, you need to make the following adjustments:
Go to "Custom core-site" and add the following properties:

Key Value
hadoop.proxyuser.hue.hosts *
hadoop.proxyuser.hue.groups *
hadoop.proxyuser.hcat.groups *
hadoop.proxyuser.hcat.hosts *

Save your changes. Restart any services that might need it due to the config changes. Now from the left hand side menu, select Hive. Go to "Custom webhcat-site" and add the following properties:

Key Value
webhcat.proxyuser.hue.hosts *
webhcat.proxyuser.hue.groups *



Save your changes. Restart any services that might need it due to the config changes. From the left hand side menu, select Oozie. Go to "Custom oozie-site" and add the following properties:

Key Value
oozie.service.ProxyUserService.proxyuser.hue.hosts *
oozie.service.ProxyUserService.proxyuser.hue.groups *


Save your changes. Restart any services that might need it due to the config changes.

Finally, go to the left hand side menu, select HDFS, then "Service Actions" and Stop. This is needed because we will be installing Hue next.

OK, let's go to the system that will be our Hue server and install it (this should really be the same system that has Hive installed on it, hadoop3 in my case):

[root@hadoop3 ~]# yum -y install hue

We'll need a randomly-generated password:
[root@hadoop3 ~]# perl -e 'my @chars = ("A".."Z", "a".."z", "0".."9", "!", "@", "#", "\$", "%", "\^", "&", "*", "-", "\_", "=", "+", "\\", "|", "[", "{", "]", "}", ";", ":", ",", "<", ".", ">", "/", "?"); my $string; $string .= $chars[rand @chars] for 0..59; print "$string\n";'
QJy9@?s-g5UhS{I]IXkSC_ex%{@#za8?EcV#%@sasYX-ngI+|Qr$KHn/c]g]

Copy this string; you'll need it soon. Now, let's edit the hue.ini configuration file to suit our needs:

[root@hadoop3 ~]# vi /etc/hue/conf/hue.ini
....
  # Set this to a random string, the longer the better.
  # This is used for secure hashing in the session store.
  secret_key=QJy9@?s-g5UhS{I]IXkSC_ex%{@#za8?EcV#%@sasYX-ngI+|Qr$KHn/c]g]

  # Webserver listens on this address and port
  http_host=0.0.0.0
  http_port=8000

  # Time zone name
  time_zone=Etc/GMT
....

Paste your randomly generated password after secret_key=, then change the port Hue will listen on and set your correct time zone (if required).

We're not finished with this file yet, so let's continue editing. Go to the [hadoop] section:
....
###########################################################################
# Settings to configure your Hadoop cluster.
###########################################################################

[hadoop]

  # Configuration for HDFS NameNode
  # ------------------------------------------------------------------------
  [[hdfs_clusters]]

    [[[default]]]
      # Enter the filesystem uri
      fs_defaultfs=hdfs://hadoop1:8020

      # Use WebHdfs/HttpFs as the communication mechanism. To fallback to
      # using the Thrift plugin (used in Hue 1.x), this must be uncommented
      # and explicitly set to the empty value.
      webhdfs_url=http://hadoop1:50070/webhdfs/v1

      ## security_enabled=true


  [[yarn_clusters]]

    [[[default]]]
      # Whether to submit jobs to this cluster
      submit_to=true

      ## security_enabled=false

      # Resource Manager logical name (required for HA)
      ## logical_name=

      # URL of the ResourceManager webapp address (yarn.resourcemanager.webapp.address)
      resourcemanager_api_url=http://hadoop2:8088

      # URL of Yarn RPC adress (yarn.resourcemanager.address)
      resourcemanager_rpc_url=http://hadoop2:8050

      # URL of the ProxyServer API
      proxy_api_url=http://hadoop2:8088

      # URL of the HistoryServer API
      history_server_api_url=http://hadoop2:19888

      # URL of the NodeManager API
      node_manager_api_url=http://hadoop1:8042

      # HA support by specifying multiple clusters
      # e.g.

      # [[[ha]]]
        # Enter the host on which you are running the failover Resource Manager
        #resourcemanager_api_url=http://failover-host:8088
        #logical_name=failover
        #submit_to=True
....

Make sure you enter the correct hosts and the ports they listen on for the corresponding services. Next, configure JobDesigner and Oozie:

....
###########################################################################
# Settings to configure liboozie
###########################################################################

[liboozie]
  # The URL where the Oozie service runs on. This is required in order for
  # users to submit jobs.
  oozie_url=http://hadoop3:11000/oozie

  ## security_enabled=true

  # Location on HDFS where the workflows/coordinator are deployed when submitted.
  ## remote_deployement_dir=/user/hue/oozie/deployments
....

Moving on, we'll need to configure beeswax:

....
[beeswax]

  # Host where Hive server Thrift daemon is running.
  # If Kerberos security is enabled, use fully-qualified domain name (FQDN).
  hive_server_host=hadoop3

  beeswax_server_host=hadoop3

  # Port where HiveServer2 Thrift server runs on.
  hive_server_port=10000

  # Hive configuration directory, where hive-site.xml is located
  hive_conf_dir=/etc/hive/conf
  hive_home_dir=/usr/hdp/2.2.0.0-2041/hive

  # Timeout in seconds for thrift calls to Hive service
  ## server_conn_timeout=120

  # Set a LIMIT clause when browsing a partitioned table.
  # A positive value will be set as the LIMIT. If 0 or negative, do not set any limit.
  ## browse_partitioned_table_limit=250

  # A limit to the number of rows that can be downloaded from a query.
  # A value of -1 means there will be no limit.
  # A maximum of 65,000 is applied to XLS downloads.
  ## download_row_limit=1000000

  # Hue will try to close the Hive query when the user leaves the editor page.
  # This will free all the query resources in HiveServer2, but also make its results inaccessible.
  ## close_queries=false

  # Option to show execution engine choice.
  ## show_execution_engine=False

  # "Go to column pop up on query result page. Set to false to disable"
  ## go_to_column=true
....

Your hive_home_dir will be /usr/hdp/your_hdp_version/hive. You might need to check that manually on your Hive server. And finally:

....
###########################################################################
# Settings for the User Admin application
###########################################################################

[useradmin]
  # The name of the default user group that users will be a member of
  default_user_group=hadoop
  default_username=hue
  default_user_password=1111


[hcatalog]
  templeton_url=http://hadoop3:50111/templeton/v1/
  security_enabled=false

[about]
  tutorials_installed=false

[pig]
  udf_path="/tmp/udfs"
....
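
By the way, if you weren't sure which HDP version string to use for hive_home_dir above, a quick look on the Hive server will tell you (the path below is simply what it happens to be on my installation):

[root@hadoop3 ~]# ls -d /usr/hdp/*/hive
/usr/hdp/2.2.0.0-2041/hive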

That was it. Now go back to Ambari and start HDFS again; once that is done, start Hue:

[root@hadoop3 ~]# service hue start
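
You will probably also want Hue to come back up after a reboot; the package should have installed an init script, so something like this ought to do it:

[root@hadoop3 ~]# chkconfig hue on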

If you go to your Hive server's IP:8000 (http://hadoop3:8000 or http://192.168.0.103:8000 in my case), you'll be greeted with this:

Just pick the username and password you will use for Hue. As soon as you're in, select "Check for misconfiguration" to verify that everything is OK. If something is flagged, make sure you haven't missed a step, or perhaps you forgot to stop HDFS, or Hue was started before you edited its configuration file and needs a restart. After it's done you should get this:


Which means that we have everything up and running and we can actually use our Hadoop cluster using a Web browser instead of going through everything manually!

References: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.7/bk_installing_manually_book/content/rpm-chap-hue.html

Wednesday, March 11, 2015

Introduction to Parallel Computing Part 1d - Creating a Hadoop cluster (the easy way -- Hortonworks)

As in our previous "Creating a Hadoop Cluster" post, we'll need to use one of the following 64-bit operating systems:
  • Red Hat Enterprise Linux (RHEL) v6.x
  • Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
  • CentOS v6.x
  • CentOS v5.x (deprecated)
  • Oracle Linux v6.x
  • Oracle Linux v5.x (deprecated)
  • SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
  • Ubuntu Precise v12.04
I'm going to use RHEL 6.6 for this.

Let's press on with the environment setup. For this, I will have one namenode (hadoop1), two secondary servers (hadoop2 and hadoop3) and five datanodes (hadoop4, hadoop5, hadoop6, hadoop7 and hadoop8).

Hadoop Cluster
Node Type and Number   Node Name   IP
Namenode               hadoop1     192.168.0.101
Secondary Namenode     hadoop2     192.168.0.102
Tertiary Services      hadoop3     192.168.0.103
Datanode #1            hadoop4     192.168.0.104
Datanode #2            hadoop5     192.168.0.105
Datanode #3            hadoop6     192.168.0.106
Datanode #4            hadoop7     192.168.0.107
Datanode #5            hadoop8     192.168.0.108

Unfortunately, the menial tasks that involve system configuration cannot be avoided, so let's press on:

First of all, let's update everything (This should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y update
[root@hadoop1 ~]# yum -y install wget

We'll also need the SSH client tools, since we'll be managing the other nodes from hadoop1 (this too should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y install openssh-clients.x86_64

Note: It's always best to actually have your node FQDNs on your DNS server and skip the next two steps (editing the /etc/hosts and the /etc/host.conf files). 

Now, let's edit our /etc/hosts to reflect our cluster (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/hosts
192.168.0.101   hadoop1
192.168.0.102   hadoop2
192.168.0.103   hadoop3
192.168.0.104   hadoop4
192.168.0.105   hadoop5
192.168.0.106   hadoop6
192.168.0.107   hadoop7
192.168.0.108   hadoop8
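
A quick sanity check that the names now resolve locally (any of the entries above will do):

[root@hadoop1 ~]# getent hosts hadoop3
192.168.0.103   hadoop3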

We should also check our /etc/host.conf and our /etc/nsswitch.conf, to make sure our hostnames actually resolve through /etc/hosts:

[hadoop@hadoop1 ~]$ vi /etc/host.conf
multi on
order hosts bind
[hadoop@hadoop1 ~]$ vi /etc/nsswitch.conf
....
#hosts:     db files nisplus nis dns
hosts:      files dns
....

We'll need a large number of file descriptors (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/security/limits.conf
....
* soft nofile 65536
* hard nofile 65536
....
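
If you'd rather not edit limits.conf by hand on every node, you can push it out from hadoop1 (the new limits take effect on your next login, at which point ulimit -n should report 65536):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/security/limits.conf "$host":/etc/security/limits.conf; done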

We should make sure that our network interface comes up automatically:

[root@hadoop1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
....
ONBOOT="yes"
....

And of course make sure our other network settings, such as our hostname and gateway, are correct:

[root@hadoop1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop1
GATEWAY=192.168.0.1


We'll need to log in using SSH as root, so for the time being let's allow root logins. We might want to turn that off after we're done, as this is as insecure as they come:

[root@hadoop1 ~]# vi /etc/ssh/sshd_config
....
PermitRootLogin yes
....
[root@hadoop1 ~]# service sshd restart

NTP should be installed on every server in our cluster. Now that we've edited our hosts file, this is much easier, since we can simply loop over the node names:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "exec yum -y install ntp ntpdate"; done 
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig ntpd on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpdate pool.ntp.org; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" service ntpd start; done

Set up passwordless SSH authentication (note that this will be configured automatically during the actual installation, so it is not strictly necessary; it is useful though, since it saves us a lot of typing):

[root@hadoop1 ~]# ssh-keygen -t rsa 
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" mkdir -p .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 700 .ssh; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 640 .ssh/authorized_keys; done

Time to tone down our security a bit so that our cluster runs without problems. My PC's IP is 192.168.0.55 so I will allow that as well:

[root@hadoop1 ~]# iptables -F
[root@hadoop1 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.101,192.168.0.102,192.168.0.103,192.168.0.104,192.168.0.105,192.168.0.106,192.168.0.107,192.168.0.108 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.55 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -j DROP
[root@hadoop1 ~]# iptables -A FORWARD -j DROP
[root@hadoop1 ~]# iptables -A OUTPUT -j ACCEPT
[root@hadoop1 ~]# iptables-save > /etc/sysconfig/iptables
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/iptables "$host":/etc/sysconfig/iptables; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "iptables-restore < /etc/sysconfig/iptables"; done
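
To double-check that the rules are in place everywhere, list the INPUT chain on every node; you should see the ACCEPT rules for our addresses followed by the final DROP:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" iptables -nL INPUT; done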

Let's disable SELinux:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" setenforce 0; done
[root@hadoop1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/selinux "$host":/etc/sysconfig/selinux; done

Turn down swappiness:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "echo vm.swappiness = 1 >> /etc/sysctl.conf"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" sysctl -p; done
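
And a quick check that the new value is active on every node (each line should print 1):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" cat /proc/sys/vm/swappiness; done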


We've made quite a few changes, including kernel updates. Let's reboot and pick this up later.

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" reboot; done      
[root@hadoop1 ~]# reboot

Now, let's start installing Hadoop by downloading the Ambari repo and installing ambari-server.

[root@hadoop1 ~]# wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.7.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
[root@hadoop1 ~]# yum -y install ambari-server 
[root@hadoop1 ~]# ambari-server setup

At which time, we'll be prompted for the following:
  • If you have not temporarily disabled SELinux, you may get a warning. Accept the default (y), and continue.
  • By default, Ambari Server runs under root. Accept the default (n) at the Customize user account for ambari-server daemon prompt, to proceed as root.
    If you want to create a different user to run the Ambari Server, or to assign a previously created user, select y at the Customize user account for ambari-server daemon prompt, then provide a user name.
  • If you have not temporarily disabled iptables you may get a warning. Enter y to continue.
  • Select a JDK version to download. Enter 1 to download Oracle JDK 1.7.
  • Accept the Oracle JDK license when prompted. You must accept this license to download the necessary JDK from Oracle. The JDK is installed during the deploy phase.
  • Select n at Enter advanced database configuration to use the default, embedded PostgreSQL database for Ambari. The default PostgreSQL database name is ambari. The default user name and password are ambari/bigdata.
    Otherwise, to use an existing PostgreSQL, MySQL or Oracle database with Ambari, select y
    1.  To use an existing Oracle 11g r2 instance, and select your own database name, user name, and password for that database, enter 2. Select the database you want to use and provide any information requested at the prompts, including host name, port, Service Name or SID, user name, and password.
    2. To use an existing MySQL 5.x database, and select your own database name, user name, and password for that database, enter 3. Select the database you want to use and provide any information requested at the prompts, including host name, port, database name, user name, and password.
    3. To use an existing PostgreSQL 9.x database, and select your own database name, user name, and password for that database, enter 4. Select the database you want to use and provide any information requested at the prompts, including host name, port, database name, user name, and password. 
  • At Proceed with configuring remote database connection properties [y/n] choose y
  • Setup completes.
Now, we just have to start the server:

[root@hadoop1 ~]# ambari-server start

And we can navigate using our web browser to our namenode IP:8080 and continue any configuration and installation steps from there.
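
If the page doesn't come up right away, you can check that Ambari is actually up and listening on port 8080:

[root@hadoop1 ~]# ambari-server status
[root@hadoop1 ~]# netstat -ntpl | grep 8080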

The default credentials are
username: admin
password: admin


Then, we just need to select "Create a Cluster" and HDP takes us by the hand and pretty much does everything needed. And when I say everything, I mean it.

The only issue that you may encounter if you followed this guide is that Ambari will detect that iptables is running. We've made sure it allows everything we need so we can safely ignore this warning.


You can install just about any Hadoop module you want with the click of a button, saving hours and maybe in some cases days of work. Amazing.

It even installs and configures Nagios and Ganglia for you!
That's a whole lot of work done with the press of a button right here!
From there on, it's just a matter of waiting for the installation process to finish.

 

This is what we are greeted with:


And all this is for free. Wow.

References: http://docs.hortonworks.com/HDPDocuments/Ambari-1.7.0.0/Ambari_Install_v170/Ambari_Install_v170.pdf

Introduction to Parallel Computing Part 1c - Major Hadoop Distributions and their differences

In my previous article, we saw a few basics of how to install Hadoop.

Honestly, the process is rather tedious and time-consuming. That's one of the reasons a large number of companies out there prefer an out-of-the-box solution rather than going the manual way.

The top three Hadoop distributions are Cloudera, Hortonworks and MapR. Hortonworks provides an open-source and completely free distro, while the other two offer free as well as paid versions of their products.

Here are their differences and features:

From http://www.experfy.com/blog/cloudera-vs-hortonworks-comparing-hadoop-distributions/

From http://www.networkworld.com/article/2369327/software/comparing-the-top-hadoop-distributions.html