Wednesday, March 11, 2015

Introduction to Parallel Computing Part 1b - Creating a Hadoop cluster (the hard way)

Before we start deploying, we need to plan a few things ahead. For instance, is your system supported by Hadoop?

In theory, pretty much any system that can run Java can run Hadoop. In practice, it's best to use one of the following 64-bit operating systems:
  • Red Hat Enterprise Linux (RHEL) v6.x
  • Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
  • CentOS v6.x
  • CentOS v5.x (deprecated)
  • Oracle Linux v6.x
  • Oracle Linux v5.x (deprecated)
  • SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
  • Ubuntu Precise v12.04

If you intend to deploy on one of these distros, then great. I'm going to use RHEL 6.6.

Let's press on with the environment setup. For this, I will have just one namenode (hadoop1) and five datanodes (hadoop4, hadoop5, hadoop6, hadoop7 and hadoop8).

Hadoop Cluster
Node Type and Number    Node Name    IP
Namenode #1             hadoop1      192.168.0.101
Datanode #1             hadoop4      192.168.0.104
Datanode #2             hadoop5      192.168.0.105
Datanode #3             hadoop6      192.168.0.106
Datanode #4             hadoop7      192.168.0.107
Datanode #5             hadoop8      192.168.0.108


First of all, let's update everything (This should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y update

We'll need these (This should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y install openssh-clients.x86_64
[root@hadoop1 ~]# yum -y install wget

Note: It's always best to actually have your node FQDNs on your DNS server and skip the next two steps (editing the /etc/hosts and the /etc/host.conf files). 

Now, let's edit our /etc/hosts to reflect our cluster (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/hosts
192.168.0.101   hadoop1
192.168.0.104   hadoop4
192.168.0.105   hadoop5
192.168.0.106   hadoop6
192.168.0.107   hadoop7
192.168.0.108   hadoop8

We should also check our /etc/host.conf and our /etc/nsswitch.conf (unless our hostnames are already resolvable via DNS):

[hadoop@hadoop1 ~]$ vi /etc/host.conf
multi on
order hosts bind
[hadoop@hadoop1 ~]$ vi /etc/nsswitch.conf
....
#hosts:     db files nisplus nis dns
hosts:      files dns
....

We'll need a large number of file descriptors (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/security/limits.conf
....
* soft nofile 65536
* hard nofile 65536
....
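
Later on, once the hadoop user exists, a quick way to confirm the new limit took effect is to log in as that user and check (it should report 65536):

[hadoop@hadoop1 ~]$ ulimit -n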

We should make sure that our network interface comes up automatically:

[root@hadoop1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
....
ONBOOT="yes"
....

And of course make sure our other networking settings, such as our hostname, are correct:

[root@hadoop1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop1
GATEWAY=192.168.0.1


We'll need to log in using SSH as root, so for the time being let's allow root logins. We might want to turn that off after we're done, as this is as insecure as they come:

[root@hadoop1 ~]# vi /etc/ssh/sshd_config
....
PermitRootLogin yes
....
[root@hadoop1 ~]# service sshd restart

NTP should be installed on every server in our cluster. Now that we've edited our hosts file, this is much easier:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "exec yum -y install ntp ntpdate"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig ntpd on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpdate pool.ntp.org; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" service ntpd start; done

Let's set up our hadoop user:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" groupadd hadoop; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" useradd hadoop -g hadoop -m -s /bin/bash; done       
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" passwd hadoop; done

Set up passwordless ssh authentication:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ ssh-keygen -t rsa 
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" mkdir -p .ssh; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 700 .ssh; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 640 .ssh/authorized_keys; done
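
To confirm the passwordless logins actually work, a loop like this should print every hostname without asking for a password:

[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" hostname; done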

Time to tone down our security a bit so that our cluster runs without problems. My PC's IP is 192.168.0.55 so I will allow that as well:

[root@hadoop1 ~]# iptables -F
[root@hadoop1 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.101,192.168.0.104,192.168.0.105,192.168.0.106,192.168.0.107,192.168.0.108 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.55 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -j DROP
[root@hadoop1 ~]# iptables -A FORWARD -j DROP
[root@hadoop1 ~]# iptables -A OUTPUT -j ACCEPT
[root@hadoop1 ~]# iptables-save > /etc/sysconfig/iptables
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/iptables "$host":/etc/sysconfig/iptables; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "iptables-restore < /etc/sysconfig/iptables"; done

Let's disable SELinux:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" setenforce 0; done
[root@hadoop1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/selinux "$host":/etc/sysconfig/selinux; done

Turn down swappiness:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "echo vm.swappiness = 1 >> /etc/sysctl.conf"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" sysctl -p; done

Hadoop needs the Java SDK installed. Although it will probably work if you just pick the latest one, you'd better check which one best fits your Hadoop version. OpenJDK 1.7.0 was recommended for my version (2.6.0), so I'll go with a 1.7.0 JDK (Oracle's, in my case).

Download the latest Java SDK:

a) Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html
b) Select the proper Java SDK version (make sure it's the x64 version)
c) Accept the licence
d) Select the .tar.gz archive that is the proper architecture for your system
e) Copy its link location
f) And:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" 'wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jdk-7u75-linux-x64.tar.gz'; done

I'm going to install Java SDK in /opt/jdk:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" mkdir /opt/jdk; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" 'tar -zxf jdk-7u75-linux-x64.tar.gz -C /opt/jdk'; done
[root@hadoop1 ~]# /opt/jdk/jdk1.7.0_75/bin/java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
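
I only checked the namenode above; it doesn't hurt to confirm that the JDK unpacked correctly on every node:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" /opt/jdk/jdk1.7.0_75/bin/java -version; done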

Let's prepare our system environment variables. Here I'm going to install Hadoop in /opt/hadoop.

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ vi ~/.bashrc
....
export HADOOP_HOME=/opt/hadoop
export HADOOP_PREFIX=$HADOOP_HOME  
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop/lib/native
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/opt/jdk/jdk1.7.0_75
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HIVE_HOME=/opt/hive
export PATH=$PATH:/opt/jdk/jdk1.7.0_75/bin:/opt/hadoop/bin:/opt/hadoop/sbin:/opt/pig/bin:/opt/hive/bin:/opt/hive/hcatalog/bin:/opt/hive/hcatalog/sbin
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp ~/.bashrc "$host":~/.bashrc; done
[hadoop@hadoop1 ~]$ source ~/.bashrc

Time to download and extract the latest Hadoop:

Go to:

http://hadoop.apache.org/releases.html

select "Download a release now", select a mirror and copy its link location

[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh hadoop@"$host" 'wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz'; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" mkdir /opt/hadoop; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" 'tar -zxf /home/hadoop/hadoop-2.6.0.tar.gz -C /opt/hadoop/ --strip-components=1'; done
[hadoop@hadoop1 ~]$ exit

Let's edit our config files:

core-site.xml

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://hadoop1:9000/</value>
      <description>Hadoop Filesystem</description>
   </property>
   <property>
      <name>dfs.permissions</name>
      <value>false</value>
   </property>
</configuration>

Notice that I have put my intended namenode as the fs.defaultFS value (fs.default.name is the deprecated name for the same property).

hdfs-site.xml 

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/opt/hadoop/hdfs/datanode</value>       
      <description>DataNode directory for storing data chunks.</description>
      <final>true</final>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/opt/hadoop/hdfs/namenode</value>  
      <description>NameNode directory for namespace and transaction logs storage.</description>
      <final>true</final>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
      <description>Level of replication for each chunk.</description>
   </property>
</configuration>

The replication factor is a property set in the HDFS configuration file that allows you to adjust the global replication factor for the entire cluster.

For each block stored in HDFS, there will be n–1 duplicated blocks distributed across the cluster.
For example, if the replication factor was set to 3 (default value in HDFS) there would be one original block
and two replicas.
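
Once the cluster is up, the replication of an individual file can also be inspected or changed on the fly; here's a quick sketch (the path is just an example):

[hadoop@hadoop1 ~]$ hdfs dfs -setrep -w 3 /user/hadoop/somefile.txt
[hadoop@hadoop1 ~]$ hdfs fsck /user/hadoop/somefile.txt -files -blocks -locations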

mapred-site.xml

[root@hadoop1 ~]# cp /opt/hadoop/etc/hadoop/mapred-site.xml.template /opt/hadoop/etc/hadoop/mapred-site.xml
[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
   <property>
      <name>mapreduce.jobtracker.address</name>
      <value>hadoop1:9001</value>
   </property>
</configuration>

This is the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. 

yarn-site.xml 

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
   <property>
     <name>yarn.resourcemanager.hostname</name>
     <value>hadoop1</value>
     <description>The hostname of the ResourceManager</description>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle</value>
     <description>shuffle service for MapReduce</description>
   </property>
</configuration>


hadoop-env.sh  

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/hadoop-env.sh
....
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/opt/jdk/jdk1.7.0_75
....
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
....

Let's create the namenode and datanode directories that we set in hdfs-site.xml and do the rest of the menial configuration tasks:

[root@hadoop1 ~]# mkdir -p /opt/hadoop/hdfs/namenode
[root@hadoop1 ~]# mkdir -p /opt/hadoop/hdfs/datanode
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp -r /opt/hadoop/etc/* "$host":"/opt/hadoop/etc/."; done
[root@hadoop1 ~]# echo "hadoop1" > /opt/hadoop/etc/hadoop/masters
[root@hadoop1 ~]# echo $'hadoop4\nhadoop5\nhadoop6\nhadoop7\nhadoop8' > /opt/hadoop/etc/hadoop/slaves
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" chown -R hadoop:hadoop /opt/hadoop/; done

Okay, let's start it all up and see if it works:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ hdfs namenode -format

This will format our HDFS filesystem. And:

[hadoop@hadoop1 ~]$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/03/10 17:43:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop1]
hadoop1: starting namenode, logging to /opt/hadoop/logs/hadoop-hadoop-namenode-hadoop1.out
hadoop4: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop4.out
hadoop7: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop7.out
hadoop5: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop5.out
hadoop6: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop6.out
hadoop8: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop8.out

Looks like we're up and running, let's make sure:

[hadoop@hadoop1 ~]$ jps
13247 ResourceManager
12922 NameNode
13620 Jps
13113 SecondaryNameNode

Yup, how about on the other nodes?

[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh hadoop@"$host" jps; done
1897 DataNode
1990 NodeManager
2159 Jps
2164 Jps
1902 DataNode
1995 NodeManager
2146 Jps
1884 DataNode
1977 NodeManager
2146 Jps
1884 DataNode
1977 NodeManager
12094 DataNode
12795 Jps
12217 NodeManager
[hadoop@hadoop1 ~]$ exit


Great. Now you can navigate to your namenode IP:50070 with your web browser and you should have an overview of your cluster.
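
Before we move on, it's worth running a quick smoke test: an HDFS report plus one of the example MapReduce jobs that ship with the 2.6.0 tarball (the jar path below assumes that tarball's default layout):

[hadoop@hadoop1 ~]$ hdfs dfsadmin -report
[hadoop@hadoop1 ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 10 100

The first command should list all five datanodes; the second runs a small MapReduce job and prints an estimate of Pi (depending on your mapred-site.xml it may run locally or on YARN).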

To finish this off, let's install Pig and Hive. WebHCat and HCatalog are installed with Hive, starting with Hive release 0.11.0.

Go to http://pig.apache.org/releases.html, choose a version, choose a mirror and download:

[root@hadoop1 ~]# wget http://www.eu.apache.org/dist/pig/latest/pig-0.14.0.tar.gz
[root@hadoop1 ~]# mkdir /opt/pig
[root@hadoop1 ~]# tar -zxf pig-0.14.0.tar.gz -C /opt/pig --strip-components=1
[root@hadoop1 ~]# chown -R hadoop:hadoop /opt/pig

I've already taken care of the environment variables earlier (you need to add /opt/pig/bin to the PATH), so all I need to do now is:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ pig -x mapreduce
grunt> QUIT;
[hadoop@hadoop1 ~]$ exit

Great. Now go to https://hive.apache.org/downloads.html, choose download, choose mirror:

[root@hadoop1 ~]# wget http://www.eu.apache.org/dist/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
[root@hadoop1 ~]# mkdir /opt/hive
[root@hadoop1 ~]# tar -zxf apache-hive-1.1.0-bin.tar.gz -C /opt/hive --strip-components=1
[root@hadoop1 ~]# chown -R hadoop:hadoop /opt/hive

I've already taken care of the environment variables earlier (you need to add /opt/hive/bin to the PATH and export HIVE_HOME=/opt/hive), so all I need to do now is:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ hadoop fs -mkdir /tmp
[hadoop@hadoop1 ~]$ hadoop fs -mkdir -p /user/hive/warehouse
[hadoop@hadoop1 ~]$ hadoop fs -chmod g+w /tmp
[hadoop@hadoop1 ~]$ hadoop fs -chmod g+w /user/hive/warehouse
[hadoop@hadoop1 ~]$ hcat
usage: hcat { -e "<query>" | -f "<filepath>" } [ -g "<group>" ] [ -p "<perms>" ] [ -D"<name>=<value>" ]
 -D <property=value>    use hadoop value for given property
 -e <exec>              hcat command given from command line
 -f <file>              hcat commands in file
 -g <group>             group for the db/table specified in CREATE statement
 -h,--help              Print help information
 -p <perms>             permissions for the db/table specified in CREATE statement

Yup, looks like it all works.
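
If you want one last sanity check before calling it a day, a throwaway Hive table does nicely (the table name is just an example):

[hadoop@hadoop1 ~]$ hive -e 'CREATE TABLE smoke_test (id INT); SHOW TABLES; DROP TABLE smoke_test;'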

Monday, March 9, 2015

Introduction to Parallel Computing Part 1a - Introduction to Hadoop

What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A Scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications. 
The Apache Hadoop projects provide a series of tools designed to solve big data problems. The Hadoop cluster implements a parallel computing cluster using inexpensive commodity hardware. The cluster is partitioned across many servers to provide a near linear scalability. The philosophy of the cluster design is to bring the computing to the data. So each datanode will hold part of the overall data and be able to process the data that it holds.

From http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/

The overall framework for the processing software is called MapReduce. Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs perform:
  • The Map Task: This is the first task, which takes input data and converts it into a set of data, where individual elements are broken down into tuples (key/value pairs).
  • The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically both the input and the output are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption/availability and scheduling the job's component tasks on the slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and periodically provide task-status information to the master.

From http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm
  • Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
  • A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage (see the word-count sketch after this list).
    • Map stage : The map or mapper’s job is to process the input data. Generally the input data is in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
    • Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
  • The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
  • Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
  • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
From http://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
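
To make the map, shuffle and reduce stages concrete, here is a minimal word-count run against a cluster like the one built in Part 1b, using the example job that ships with the 2.6.0 tarball (the jar path assumes that tarball's default layout; the input and output paths are just examples):

[hadoop@hadoop1 ~]$ hdfs dfs -mkdir -p /user/hadoop/wc-in
[hadoop@hadoop1 ~]$ echo "hello hadoop hello world" | hdfs dfs -put - /user/hadoop/wc-in/sample.txt
[hadoop@hadoop1 ~]$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/hadoop/wc-in /user/hadoop/wc-out
[hadoop@hadoop1 ~]$ hdfs dfs -cat /user/hadoop/wc-out/part-r-00000

The mappers emit (word, 1) pairs, the shuffle stage groups and sorts them by word, and the reducers sum the counts, so the final output is one line per distinct word.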
What is HDFS?

The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed to use low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS
  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of the namenode and datanode help users to easily check the status of the cluster.
  • Streaming access to file system data.
  • HDFS provides file permissions and authentication.
HDFS Architecture
The architecture of the Hadoop File System and its main elements are described below.

From http://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm
HDFS follows the master-slave architecture and it has the following elements.

Namenode
The namenode is commodity hardware running the GNU/Linux operating system and the namenode software. The system hosting the namenode acts as the master server and performs the following tasks:
  • Manages the file system namespace.
  • Regulates client’s access to files.
  • It also executes file system operations such as renaming, closing, and opening files and directories.
Secondary Namenode
You might think that the SecondaryNameNode is a hot backup daemon for the NameNode. You’d be wrong. The SecondaryNameNode is a poorly understood component of the HDFS architecture, but one which provides the important function of lowering NameNode restart time.

The NameNode is responsible for the reliable storage and interactive lookup and modification of the metadata for HDFS. To maintain interactive speed, the filesystem metadata is stored in the NameNode’s RAM. Storing the data reliably necessitates writing it to disk as well. To ensure that these writes do not become a speed bottleneck, instead of storing the current snapshot of the filesystem every time, a list of modifications is continually appended to a log file called the EditLog. Restarting the NameNode involves replaying the EditLog to reconstruct the final system state.

The SecondaryNameNode periodically compacts the EditLog into a “checkpoint;” the EditLog is then cleared. A restart of the NameNode then involves loading the most recent checkpoint and a shorter EditLog containing only events since the checkpoint. Without this compaction process, restarting the NameNode can take a very long time. Compaction ensures that restarts do not incur unnecessary downtime.

The duties of the SecondaryNameNode end there; it cannot take over the job of serving interactive requests from the NameNode. Although, in the event of the loss of the primary NameNode, an instance of the NameNode daemon could be manually started on a copy of the NameNode metadata retrieved from the SecondaryNameNode.
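
On a cluster like the one from Part 1b you can actually see this machinery: the fsimage checkpoints and edit log segments live in the namenode directory we configured, and dfsadmin can fetch the most recent fsimage for safekeeping (a quick sketch):

[hadoop@hadoop1 ~]$ ls /opt/hadoop/hdfs/namenode/current
[hadoop@hadoop1 ~]$ hdfs dfsadmin -fetchImage /tmp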
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB in older Hadoop releases (128 MB in Hadoop 2.x), but it can be changed as needed in the HDFS configuration.
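
On the Part 1b cluster you can check the configured block size and even override it per file at write time; a quick sketch (128 MB expressed in bytes, and the file names are just examples):

[hadoop@hadoop1 ~]$ hdfs getconf -confKey dfs.blocksize
[hadoop@hadoop1 ~]$ hdfs dfs -D dfs.blocksize=134217728 -put bigfile.dat /user/hadoop/bigfile.dat
[hadoop@hadoop1 ~]$ hdfs fsck /user/hadoop/bigfile.dat -blocks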

Goals of HDFS
  • Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failures are frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications with huge datasets.
  • Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
Hadoop architecture, from http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

What are the components of Hadoop?
The Apache Hadoop project has two core components, the file store called Hadoop Distributed File System (HDFS), and the programming framework called MapReduce. There are a number of supporting projects that leverage HDFS and MapReduce. This article will provide a summary, and encourages you to get the O'Reilly book "Hadoop: The Definitive Guide", 3rd Edition, for more details.
The definitions below are meant to provide just enough background for you to use the code examples that follow. This article is really meant to get you started with hands-on experience with the technology. This is a how-to article more than a what-is or let's-discuss article.
  • HDFS: If you want 4000+ computers to work on your data, then you'd better spread your data across 4000+ computers. HDFS does this for you. HDFS has a few moving parts. The Datanodes store your data, and the Namenode keeps track of where stuff is stored. There are other pieces, but you have enough to get started.
  • MapReduce: This is the programming model for Hadoop. There are two phases, not surprisingly called Map and Reduce. To impress your friends tell them there is a shuffle-sort between the Map and Reduce phase. The JobTracker manages the 4000+ components of your MapReduce job. The TaskTrackers take orders from the JobTracker. If you like Java then code in Java. If you like SQL or other non-Java languages you are still in luck, you can use a utility called Hadoop Streaming.
  • Hadoop Streaming: A utility to enable MapReduce code in any language: C, Perl, Python, C++, Bash, etc. The examples include a Python mapper and an AWK reducer.
  • Hive: If you like SQL, you will be delighted to hear that you can write SQL and have Hive convert it to a MapReduce job. No, you don't get a full ANSI-SQL environment, but you do get 4000 nodes and multi-Petabyte scalability. Hue gives you a browser-based graphical interface to do your Hive work.
  • Hue: Hue aggregates the most common Apache Hadoop components into a single interface and targets the user experience. Its main goal is to have the users "just use" Hadoop without worrying about the underlying complexity or using a command line.
  • Pig: A higher-level programming environment to do MapReduce coding. The Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability. Pig is a language for expressing data analysis and infrastructure processes. Pig is translated into a series of MapReduce jobs that are run by the Hadoop cluster. Pig is extensible through user-defined functions that can be written in Java and other languages. Pig scripts provide a high level language to create the MapReduce jobs needed to process data in a Hadoop cluster.
  • Sqoop: Provides bi-directional data transfer between Hadoop and your favorite relational database.
  • Oozie: Manages Hadoop workflow. This doesn't replace your scheduler or BPM tooling, but it does provide if-then-else branching and control within your Hadoop jobs.
  • HBase: A super-scalable key-value store. It works very much like a persistent hash-map (for python fans think dictionary). It is not a relational database despite the name HBase.
  • FlumeNG: A real time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You'll want to get started with FlumeNG, which improves on the original flume.
  • Whirr: Cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.
  • Mahout: Machine learning for Hadoop. Used for predictive analytics and other advanced analysis.
  • Fuse: Makes the HDFS system look like a regular file system so you can use ls, rm, cd, and others on HDFS data.
  • Zookeeper: Used to manage synchronization for the cluster. You won't be working much with Zookeeper, but it is working hard for you. If you think you need to write a program that uses Zookeeper you are either very, very smart and could be a committer for an Apache project, or you are about to have a very bad day.
How Does Hadoop Work?

Stage 1

A user or application can submit a job to Hadoop (via a Hadoop job client) for processing by specifying the following items:
  1. The location of the input and output files in the distributed file system.
  2. The Java classes, in the form of a JAR file, containing the implementation of the map and reduce functions.
  3. The job configuration by setting different parameters specific to the job.

Stage 2

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client.

Stage 3

The TaskTrackers on the different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in output files on the file system.

Advantages of Hadoop
  • The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
  • Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.
  • Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
  • Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms, since it is Java-based.

References:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
http://hadoop.apache.org/
http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/
http://www.tutorialspoint.com/hadoop/hadoop_introduction.htm
http://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm 
http://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm 
http://blog.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/
http://en.wikipedia.org/wiki/Hue_(Hadoop)

Friday, March 6, 2015

Create your own Cloud PBX with Asterisk and FreePBX Part 3

We can now go to our web browser and type the IP of our system to manage our FreePBX. The first thing we'll be required to do is set up the admin account. After that, we can just go to "FreePBX Administration" and begin setting things up.



I guess that's expected when you use a t2.micro instance
We can go right ahead and upgrade; this interface makes the whole process too easy.

And after it's finished, if we check again:


After it's done, we want to "Apply Config", then more modules will become available, so feel free to update everything, applying the config and checking again until there's nothing more to upgrade.

If FreePBX complains about unsigned or tampered modules after all that updating, refreshing the module signatures should sort it out:

root@localhost:/usr/src/freepbx# amportal a ma refreshsignatures

After you enable and update the modules in FreePBX, you might see the following error: Symlink from modules failed

To correct this error just delete the list of failed files:

root@localhost:/usr/src/freepbx# cd /etc/asterisk
root@localhost:/etc/asterisk# rm ccss.conf confbridge.conf features.conf sip.conf iax.conf logger.conf extensions.conf sip_notify.conf

Then on the FreePBX webUI go to the ‘Module Admin’ and uninstall and reinstall the ‘Camp On’ module. This should resolve the Symlink issue.

Before we move any further, I'd like to recommend something: head over to Digium and buy their g729 codec. It's amazing. I am in no way affiliated with Digium, but please consider it: not only because Asterisk is a great product and you should support its development team, but because g729 is really, really worth every cent of its cost.

These are the steps required to install the g729 codec.

We need to know our Asterisk version and our platform before we install it:

root@localhost:~# asterisk -V
Asterisk 11.16.0
root@localhost:~# uname -a
Linux localhost 3.2.0-77-virtual #112-Ubuntu SMP Tue Feb 10 15:34:22 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Right, so we're running Asterisk 11.16.0 on a 64-bit system.

root@localhost:~# mkdir g729_codec
root@localhost:~# cd g729_codec/

After we've received the license key in our mailbox, we need to go to http://downloads.digium.com/pub/register and find which version of the register utility we need to download.

root@localhost:~/g729_codec# wget http://downloads.digium.com/pub/register/linux/register
root@localhost:~/g729_codec# chmod 700 register 
root@localhost:~/g729_codec# ./register

We now have to go to http://downloads.digium.com/pub/telephony/codec_g729/benchg729 and find which version of the file we need to download.

root@localhost:~/g729_codec# wget http://downloads.digium.com/pub/telephony/codec_g729/benchg729/x86-64/benchg729-1.0.8-x86_64
root@localhost:~/g729_codec# chmod 700 benchg729-1.0.8-x86_64
root@localhost:~/g729_codec# ./benchg729-1.0.8-x86_64 
Results:
Average for flavor 'generic' is 238 milliseconds of CPU time.
Average for flavor 'nocona' is 241 milliseconds of CPU time.
Average for flavor 'core2' is 237 milliseconds of CPU time.
Average for flavor 'opteron' is 440 milliseconds of CPU time.
Average for flavor 'opteron-sse3' is 438 milliseconds of CPU time.
Average for flavor 'barcelona' is 237 milliseconds of CPU time.

Recommended flavor for this system is 'core2' with an average of 237 milliseconds of CPU time.

So the recommended flavor for my Linode is 'core2'. Now go to http://my.digium.com/en/docs/G729/g729-download/ and select your platform and the recommended flavor, which in my case is core2.

root@localhost:~/g729_codec# wget http://downloads.digium.com/pub/telephony/codec_g729/asterisk-11.0/x86-64/codec_g729a-11.0_3.1.7-core2_64.tar.gz
root@localhost:~/g729_codec# tar zxvf codec*
root@localhost:~/g729_codec# cd codec*
root@localhost:~/g729_codec/codec_g729a-11.0_3.1.7-core2_64# cp codec*.so /usr/lib/asterisk/modules
root@localhost:~/g729_codec/codec_g729a-11.0_3.1.7-core2_64# amportal restart

That's it!

root@localhost:~# asterisk -rvvvv
Asterisk 11.16.0, Copyright (C) 1999 - 2013 Digium, Inc. and others.
Created by Mark Spencer 
Asterisk comes with ABSOLUTELY NO WARRANTY; type 'core show warranty' for details.
This is free software, with components licensed under the GNU General Public
License version 2 and other licenses; you are welcome to redistribute it under
certain conditions. Type 'core show license' for details.
=========================================================================
Connected to Asterisk 11.16.0 currently running on localhost (pid = 839)
localhost*CLI> core show translation
         Translation times between formats (in microseconds) for one second of data
          Source Format (Rows) Destination Format (Columns)

            gsm  ulaw  alaw  g726 adpcm  slin lpc10  g729  ilbc g726aal2  g722 slin16 testlaw slin12 slin24 slin32 slin44 slin48 slin96 slin192
      gsm     - 15000 15000 15000 15000  9000 15000 15000 15000    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
     ulaw 15000     -  9150 15000 15000  9000 15000 15000 15000    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
     alaw 15000  9150     - 15000 15000  9000 15000 15000 15000    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
     g726 15000 15000 15000     - 15000  9000 15000 15000 15000    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
    adpcm 15000 15000 15000 15000     -  9000 15000 15000 15000    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
     slin  6000  6000  6000  6000  6000     -  6000  6000  6000     6000  8250   8000    6000   8000   8000   8000   8000   8000   8000    8000
    lpc10 15000 15000 15000 15000 15000  9000     - 15000 15000    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
     g729 15000 15000 15000 15000 15000  9000 15000     - 15000    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
     ilbc 15000 15000 15000 15000 15000  9000 15000 15000     -    15000 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
 g726aal2 15000 15000 15000 15000 15000  9000 15000 15000 15000        - 17250  17000   15000  17000  17000  17000  17000  17000  17000   17000
     g722 15600 15600 15600 15600 15600  9600 15600 15600 15600    15600     -   9000   15600  17500  17000  17000  17000  17000  17000   17000
   slin16 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500  6000      -   14500   8500   8000   8000   8000   8000   8000    8000
  testlaw 15000 15000 15000 15000 15000  9000 15000 15000 15000    15000 17250  17000       -  17000  17000  17000  17000  17000  17000   17000
   slin12 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500 14000   8000   14500      -   8000   8000   8000   8000   8000    8000
   slin24 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500 14500   8500   14500   8500      -   8000   8000   8000   8000    8000
   slin32 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500 14500   8500   14500   8500   8500      -   8000   8000   8000    8000
   slin44 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500 14500   8500   14500   8500   8500   8500      -   8000   8000    8000
   slin48 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500 14500   8500   14500   8500   8500   8500   8500      -   8000    8000
   slin96 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500 14500   8500   14500   8500   8500   8500   8500   8500      -    8000
  slin192 14500 14500 14500 14500 14500  8500 14500 14500 14500    14500 14500   8500   14500   8500   8500   8500   8500   8500   8500       -


Don't forget to reboot after all this is done.

OK, let's begin setting up our FreePBX:

Go to Settings -> Asterisk SIP settings.



What we want to do there is Disallow Anonymous Inbound SIP Calls, enter our IP and our network, define a STUN server (although not really required since this is a public IP), and check the codecs we're planning on using. There's no reason to use codecs our SIP provider doesn't support.


Submit and Apply.

Now let's go to Applications -> Extensions. There we select "Generic CHAN SIP device" from the drop-down menu and click on submit.


This one's real easy. We only need to specify the extension number and maybe the Display Name. Everything else can be left at its defaults, except maybe the "secret" under "Device Options", which is the password your VoIP phone or softphone will need to give to your server, so you might want to change it to something more memorable. Or maybe not.


Submit and Apply Config and repeat the process for as many extensions you want.

Next, we'll create an IVR. We need this so that when someone dials our number, Asterisk will know that it needs to wait for the caller to enter the extension. Ideally we'd have a recording of a person saying "Please enter an extension" or the like, but many companies just go without one.

So go to Applications -> IVR and select "Add a new IVR". Select a name and a description for it and make sure that "Direct Dial" points to "Extensions". Make a few more choices about what Asterisk should do in the case of an invalid destination or a timeout and finally add all the extensions you added before on the bottom of the page, making sure that "Extensions" is selected under "Destination".


Submit and Apply Config.

Time to create our trunk. Go to Connectivity -> Trunks and select "Add SIP (chan_sip) Trunk".

For a trunk name, I usually select my SIP provider's name. Enter your DID number in the "Outbound CallerID" and select "Force Trunk CID" under CID options.

In Outgoing Settings -> Trunk Name, just put the same thing you put under General Settings. In the PEER Details box, you should put the settings provided by your VoIP provider. If you weren't provided with any, you could just ask them or google for them.


If your VoIP provider provided you with user (not peer!) connection parameters, this is what you need to enter in USER context and USER Details. Otherwise delete everything from there and leave them blank.


If you left USER context and USER Details blank, then you will need to enter a REGISTER string. That will be in the form of:

username:secret@VoIP_Provider_host/DID
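
For example, with a made-up account on a made-up provider, it might look like this:

myuser:s3cr3tpass@sip.exampleprovider.com/442012345678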

Submit and Apply Config.

Now, on to creating an Inbound route. Go to Connectivity -> Inbound Routes. Here, any Description will do. Just make sure your DID number is in the "DID number" field, and on the bottom of the page under "Set Destination" select IVR and your recently created ivr name (mine was ivr1). You might also want to check Signal RINGING and put 1 in the "Pause Before Answer" field.

Submit and Apply Config.

Finally, let's set up our outbound route. Go to Connectivity -> Outbound Routes. Just put anything in the "Description" field and make sure you put your DID number in the Route CID field. Make sure you enter a dial pattern for this. 

Some examples to get you started can be found here, but you can get away with just entering a dot (literally the character .) in the "match pattern" field. That means that it will route everything. Finally, choose the trunk you created earlier under "Trunk Sequence For Matched Routes".


Submit and Apply Config.

Some VoIP providers require you to have actually activated your DID, or allowed your account to make calls, etc. If you've gone through the process and you're sure there's nothing else to do on that end, you're set. All that remains is for you to get a softphone or a VoIP phone, set it up and start doing calls over IP.

Note that you might need to dial using an international prefix!

Create your own Cloud PBX with Asterisk and FreePBX Part 2


But what if we want to deploy on AWS? Well, here's the guide to that too.

a) Go to your AWS Management Console and under "Compute" select "EC2".
b) Allocate a new elastic IP
c) Launch a Red Hat Enterprise Linux 6.6 HVM instance
d) When asked about security groups, just allow all. We'll be using iptables for this purpose, which is more flexible than Amazon's firewall
e) Set up your key pairs
f) Launch
g) Associate your elastic IP with your instance
h) SSH to your server using the key pair and username ec2-user

Let's start with our iptables rules first. My server's IP is 1.2.3.4, my office's external IP is 2.3.4.5 and my SIP provider's IP is 3.4.5.6. Here I allow SIP, IAX, IAX2 and MGCP connections from my SIP provider. If you're not interested in IAX, IAX2 and MGCP, just skip the lines with ports 5036, 4569 and 2727:

[root@ip-1-2-3-4 ~]# iptables -F
[root@ip-1-2-3-4 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -s 1.2.3.4 -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -s 2.3.4.5 -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -s 3.4.5.6 -p tcp -m multiport --dports 5060:5070 -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -s 3.4.5.6 -p udp -m multiport --dports 5060:5070 -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -s 3.4.5.6 -p udp -m udp --dport 4569 -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -s 3.4.5.6 -p udp -m udp --dport 5036 -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -s 3.4.5.6 -p udp -m udp --dport 2727 -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables -A INPUT -j DROP
[root@ip-1-2-3-4 ~]# iptables -A FORWARD -j DROP
[root@ip-1-2-3-4 ~]# iptables -A OUTPUT -j ACCEPT
[root@ip-1-2-3-4 ~]# iptables-save > /etc/sysconfig/iptables

Let's disable SELinux:

[root@ip-1-2-3-4 ~]# setenforce 0
[root@ip-1-2-3-4 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted


OK, here's a fun fact: I couldn't find newt-devel or audiofile-devel anywhere on a trusted RHEL repo. So, I decided to cheat a bit:

[root@ip-1-2-3-4 ~]#  vi /etc/yum.repos.d/centos.repo
[centos]
name=CentOS $releasever - $basearch
baseurl=http://ftp.heanet.ie/pub/centos/6/os/$basearch/
enabled=0
gpgcheck=0

Let's install the EPEL and Remi repos, update our system and install any required dependencies:

[root@ip-1-2-3-4 ~]# wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
[root@ip-1-2-3-4 ~]# rpm -ivh epel-release-6-8.noarch.rpm
[root@ip-1-2-3-4 ~]# rpm --import http://rpms.famillecollet.com/RPM-GPG-KEY-remi
[root@ip-1-2-3-4 ~]# rpm -ivh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
[root@ip-1-2-3-4 ~]# yum -y update
[root@ip-1-2-3-4 ~]# yum -y groupinstall core
[root@ip-1-2-3-4 ~]# yum -y groupinstall base
[root@ip-1-2-3-4 ~]# yum -y install --enablerepo=epel,remi,centos gcc gcc-c++ lynx bison mysql-devel mysql-server php php-mysql php-pear php-mbstring tftp-server httpd make ncurses-devel libtermcap-devel sendmail sendmail-cf caching-nameserver sox newt newt-devel libxml2-devel libtiff-devel audiofile audiofile-devel sqlite-devel gtk2-devel kernel-devel git subversion php-process crontabs cronie cronie-anacron openssl-devel
[root@ip-1-2-3-4 ~]# yum -y install kernel-headers-`uname -r` kernel-devel-`uname -r` glibc-headers

Autostart MySQL and Apache:

[root@ip-1-2-3-4 ~]# chkconfig --level 345 mysqld on
[root@ip-1-2-3-4 ~]# chkconfig --level 345 httpd on

Time to get moving with the actual Asterisk installation. Here we install PearDB and Google voice dependencies:

[root@ip-1-2-3-4 ~]# pear install db-1.7.14
[root@ip-1-2-3-4 ~]# cd /usr/src
[root@ip-1-2-3-4 src]# wget https://iksemel.googlecode.com/files/iksemel-1.4.tar.gz
[root@ip-1-2-3-4 src]# tar xf iksemel-1.4.tar.gz
[root@ip-1-2-3-4 src]# cd iksemel-*
[root@ip-1-2-3-4 iksemel-1.4]# ./configure
[root@ip-1-2-3-4 iksemel-1.4]# make
[root@ip-1-2-3-4 iksemel-1.4]# make install
[root@ip-1-2-3-4 iksemel-1.4]# reboot

If you get a No releases available for package "pear.php.net/db" error while trying to reinstall PearDB just do:

[root@ip-1-2-3-4 ~]# pear install db-1.7.14
No releases available for package "pear.php.net/db"
[root@ip-1-2-3-4 ~]# mkdir /pear
[root@ip-1-2-3-4 pear]# cd /pear/
[root@ip-1-2-3-4 pear]# wget http://download.pear.php.net/package/DB-1.7.14.tgz
[root@ip-1-2-3-4 pear]# pear install DB-1.7.14.tgz

Download and install Asterisk, DAHDI, LIBPRI:

[root@ip-1-2-3-4 ~]# cd /usr/src/
[root@ip-1-2-3-4 src]# wget http://downloads.asterisk.org/pub/telephony/dahdi-linux-complete/dahdi-linux-complete-current.tar.gz
[root@ip-1-2-3-4 src]# wget http://downloads.asterisk.org/pub/telephony/libpri/libpri-1.4-current.tar.gz
[root@ip-1-2-3-4 src]# wget http://downloads.asterisk.org/pub/telephony/asterisk/asterisk-11-current.tar.gz
[root@ip-1-2-3-4 src]# tar xvfz dahdi-linux-complete-current.tar.gz
[root@ip-1-2-3-4 src]# cd dahdi-linux-complete-*
[root@ip-1-2-3-4 dahdi-linux-complete-2.10.0.1+2.10.0.1]# make all
[root@ip-1-2-3-4 dahdi-linux-complete-2.10.0.1+2.10.0.1]# make install
[root@ip-1-2-3-4 dahdi-linux-complete-2.10.0.1+2.10.0.1]# make config
[root@ip-1-2-3-4 dahdi-linux-complete-2.10.0.1+2.10.0.1]# cd /usr/src
[root@ip-1-2-3-4 src]# tar xvfz libpri-1.4-current.tar.gz
[root@ip-1-2-3-4 src]# cd libpri-*
[root@ip-1-2-3-4 libpri-1.4.15]# make
[root@ip-1-2-3-4 libpri-1.4.15]# make install
[root@ip-1-2-3-4 libpri-1.4.15]# cd /usr/src
[root@ip-1-2-3-4 src]# tar xvfz asterisk-11-current.tar.gz
[root@ip-1-2-3-4 src]# cd asterisk-*
[root@ip-1-2-3-4 asterisk-11.16.0]# ./configure
[root@ip-1-2-3-4 asterisk-11.16.0]# contrib/scripts/get_mp3_source.sh
[root@ip-1-2-3-4 asterisk-11.16.0]# make menuselect

This will bring up a menu. Make sure all the modules, sounds and features you want are included. You at least need to select Resource Modules -> res_xmpp, Channel Drivers -> chan_motif and Compiler Flags -> BUILD_NATIVE.



Save and Exit...

[root@ip-1-2-3-4 asterisk-11.16.0]# make
[root@ip-1-2-3-4 asterisk-11.16.0]# make install
[root@ip-1-2-3-4 asterisk-11.16.0]# make config
[root@ip-1-2-3-4 asterisk-11.16.0]# cd /var/lib/asterisk/sounds
[root@ip-1-2-3-4 sounds]# wget http://downloads.asterisk.org/pub/telephony/sounds/asterisk-extra-sounds-en-gsm-current.tar.gz
[root@ip-1-2-3-4 sounds]# tar xfz asterisk-extra-sounds-en-gsm-current.tar.gz
[root@ip-1-2-3-4 sounds]# rm asterisk-extra-sounds-en-gsm-current.tar.gz

That's it! Asterisk is now fully installed! Time to install FreePBX:

[root@ip-1-2-3-4 sounds]# cd /usr/src
[root@ip-1-2-3-4 src]# export VER_FREEPBX=2.11
[root@ip-1-2-3-4 src]# git clone http://git.freepbx.org/scm/freepbx/framework.git freepbx
[root@ip-1-2-3-4 src]# cd freepbx/
[root@ip-1-2-3-4 freepbx]# git checkout release/${VER_FREEPBX}
[root@ip-1-2-3-4 freepbx]# adduser asterisk -M -c "Asterisk User"
[root@ip-1-2-3-4 freepbx]# chown asterisk. /var/run/asterisk
[root@ip-1-2-3-4 freepbx]# chown -R asterisk. /etc/asterisk
[root@ip-1-2-3-4 freepbx]# chown -R asterisk. /var/{lib,log,spool}/asterisk
[root@ip-1-2-3-4 freepbx]# chown -R asterisk. /usr/lib/asterisk
[root@ip-1-2-3-4 freepbx]# mkdir /var/www/html
[root@ip-1-2-3-4 freepbx]# chown -R asterisk. /var/www/

Change PHP upload_max_filesize to 120MB and change the default Apache User and Group:

[root@ip-1-2-3-4 freepbx]# sed -i 's/\(^upload_max_filesize = \).*/\1120M/' /etc/php.ini
[root@ip-1-2-3-4 freepbx]# cp /etc/httpd/conf/httpd.conf /etc/httpd/conf/httpd.conf_orig
[root@ip-1-2-3-4 freepbx]# sed -i 's/^\(User\|Group\).*/\1 asterisk/' /etc/httpd/conf/httpd.conf
[root@ip-1-2-3-4 freepbx]# sed -i 's/^DocumentRoot.*/DocumentRoot \"\/var\/www\/admin\"/g' /etc/httpd/conf/httpd.conf
[root@ip-1-2-3-4 freepbx]# service httpd restart

Configure MySQL database:

[root@ip-1-2-3-4 freepbx]# mysql_secure_installation
[root@ip-1-2-3-4 freepbx]# export ASTERISK_DB_PW=amp109
[root@ip-1-2-3-4 freepbx]# mysqladmin -u root create asterisk -p
[root@ip-1-2-3-4 freepbx]# mysqladmin -u root create asteriskcdrdb -p
[root@ip-1-2-3-4 freepbx]# mysql -u root asterisk -p < SQL/newinstall.sql 
[root@ip-1-2-3-4 freepbx]# mysql -u root asteriskcdrdb -p < SQL/cdr_mysql_table.sql 
[root@ip-1-2-3-4 freepbx]# mysql -u root -p -e "GRANT ALL PRIVILEGES ON asterisk.* TO asteriskuser@localhost IDENTIFIED BY '${ASTERISK_DB_PW}';"
[root@ip-1-2-3-4 freepbx]# mysql -u root -p -e "GRANT ALL PRIVILEGES ON asteriskcdrdb.* TO asteriskuser@localhost IDENTIFIED BY '${ASTERISK_DB_PW}';"
[root@ip-1-2-3-4 freepbx]# mysql -u root -p -e "flush privileges;"
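
Before moving on, it doesn't hurt to check that asteriskuser can actually reach its databases (both asterisk and asteriskcdrdb should be listed):

[root@ip-1-2-3-4 freepbx]# mysql -u asteriskuser -p${ASTERISK_DB_PW} -e "SHOW DATABASES;"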

Time to start it all up:

[root@ip-1-2-3-4 freepbx]# ln -s /usr/lib/libasteriskssl.so.1 /usr/lib64/libasteriskssl.so.1
[root@ip-1-2-3-4 freepbx]# ./start_asterisk start
[root@ip-1-2-3-4 freepbx]# ./install_amp --username=asteriskuser --password=$ASTERISK_DB_PW --webroot /var/www
[root@ip-1-2-3-4 freepbx]# amportal a ma download manager
[root@ip-1-2-3-4 freepbx]# amportal a ma install manager
[root@ip-1-2-3-4 freepbx]# amportal a ma installall
[root@ip-1-2-3-4 freepbx]# amportal a reload

This will ask you a few questions; accepting the default values is recommended, EXCEPT for "Enter the IP ADDRESS or hostname used to access the AMP web-admin". Obviously you should enter your server's IP address there. In my case that would be 1.2.3.4.

Almost there:

[root@ip-1-2-3-4 freepbx]# ln -s /var/lib/asterisk/moh /var/lib/asterisk/mohmp3
[root@ip-1-2-3-4 freepbx]# amportal start


After you enable and update the modules in FreePBX, you might see the following error: Symlink from modules failed

To correct this error just delete the list of failed files:

[root@ip-1-2-3-4 freepbx]# cd /etc/asterisk
[root@ip-1-2-3-4 freepbx]# rm ccss.conf confbridge.conf features.conf sip.conf iax.conf logger.conf extensions.conf sip_notify.conf

References: http://wiki.freepbx.org/display/HTGS/Installing+FreePBX+2.11+on+Centos+6.3