Friday, March 13, 2015

Introduction to Parallel Computing Part 1e - Using Hadoop (Installing Hue on Hortonworks)

In our previous "Creating a Hadoop Cluster" post, we saw how we can install a Hadoop cluster using Hortonworks.

Great, we built a cluster, but how do we actually feed it with data and how do we make it process it?

The easy way is to install Hue and do pretty much everything from a web browser. Sounds good? Let's do it.

Hue supports the following operating systems:
  • Red Hat Enterprise Linux (RHEL) v6.x
  • Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
  • CentOS v6.x
  • CentOS v5.x (deprecated)
  • Oracle Linux v6.x
  • Oracle Linux v5.x (deprecated)
  • SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
  • Ubuntu Precise v12.04
As before, I'm going to use RHEL 6.6 for this.

You also need to have these Hadoop components installed:

Component   Required   Applications                    Notes
HDFS        Yes        Core, Filebrowser               HDFS access through WebHDFS or HttpFS
YARN        Yes        JobDesigner, JobBrowser, Hive   Transitive dependency via Hive or Oozie
Oozie       No         JobDesigner, Oozie              Oozie access through REST API
Hive        No         Hive, HCatalog                  Beeswax server uses the Hive client libraries
WebHCat     No         HCatalog, Pig                   HCatalog and Pig use the WebHCat REST API

And let's remember my cluster details:

Hadoop Cluster
Node Type and Number Node Name IP
Namenode hadoop1 192.168.0.101
Secondary Namenode hadoop2 192.168.0.102
Tertiary Services hadoop3 192.168.0.103
Datanode #1 hadoop4 192.168.0.104
Datanode #2 hadoop5 192.168.0.105
Datanode #3 hadoop6 192.168.0.106
Datanode #4 hadoop7 192.168.0.107
Datanode #5 hadoop8 192.168.0.108


First of all, go to Ambari, select HDFS from the left-hand menu and open "Configs". There, make sure that WebHDFS is enabled.
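
If you want to confirm that WebHDFS is actually answering, here's a quick optional check you can run from any node (the NameNode host and port below are the ones from my cluster; adjust for yours):

[root@hadoop3 ~]# curl -s "http://hadoop1:50070/webhdfs/v1/?op=LISTSTATUS"

This should return a JSON listing of the HDFS root directory.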


Then, you need to do the following adjustments:
Go to "Custom core-site" and add the following properties:

Key Value
hadoop.proxyuser.hue.hosts *
hadoop.proxyuser.hue.groups *
hadoop.proxyuser.hcat.groups *
hadoop.proxyuser.hcat.hosts *
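
For reference, once you save them, Ambari writes these out as ordinary properties in core-site.xml, roughly like the sketch below (let Ambari manage the file, though; add the entries through the UI rather than by hand). The webhcat-site and oozie-site entries further down follow the same key/value pattern.

<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hcat.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hcat.hosts</name>
  <value>*</value>
</property>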

Save your changes. Restart any services that might need it due to the config changes. Now from the left hand side menu, select Hive. Go to "Custom webhcat-site" and add the following properties:

Key Value
webhcat.proxyuser.hue.hosts *
webhcat.proxyuser.hue.groups *



Save your changes. Restart any services that might need it due to the config changes. From the left hand side menu, select Oozie. Go to "Custom oozie-site" and add the following properties:

Key Value
oozie.service.ProxyUserService.proxyuser.hue.hosts *
oozie.service.ProxyUserService.proxyuser.hue.groups *


Save your changes. Restart any services that might need it due to the config changes.

Finally, from the left-hand menu select HDFS, open "Service Actions" and choose Stop. HDFS needs to be stopped while we install and configure Hue.

OK, let's go to the system that will be our Hue server and install it (this should really be the same system that has Hive installed on it, hadoop3 in my case):

[root@hadoop3 ~]# yum -y install hue
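
The hue package pulls in a handful of hue-* subpackages (common, server, beeswax, oozie, pig and so on); if you want to see exactly what landed on the box, something like this will list them:

[root@hadoop3 ~]# rpm -qa | grep "^hue"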

We'll need a randomly-generated password:
[root@hadoop3 ~]# perl -e 'my @chars = ("A".."Z", "a".."z", "0".."9", "!", "@", "#", "\$", "%", "\^", "&", "*", "-", "\_", "=", "+", "\\", "|", "[", "{", "]", "}", ";", ":", ",", "<", ".", ">", "/", "?"); my $string; $string .= $chars[rand @chars] for 0..59; print "$string\n";'
QJy9@?s-g5UhS{I]IXkSC_ex%{@#za8?EcV#%@sasYX-ngI+|Qr$KHn/c]g]
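
If Perl isn't your thing, openssl can generate a similarly long random string (base64 characters only, which is still fine for secret_key):

[root@hadoop3 ~]# openssl rand -base64 45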

Copy whichever string you generated, you'll need it soon. Now, let's edit the hue.ini configuration file to suit our needs:

[root@hadoop3 ~]# vi /etc/hue/conf/hue.ini
....
  # Set this to a random string, the longer the better.
  # This is used for secure hashing in the session store.
  secret_key=QJy9@?s-g5UhS{I]IXkSC_ex%{@#za8?EcV#%@sasYX-ngI+|Qr$KHn/c]g]

  # Webserver listens on this address and port
  http_host=0.0.0.0
  http_port=8000

  # Time zone name
  time_zone=Etc/GMT
....

Paste your randomly generated string after secret_key=, then set the port that Hue will listen on and enter your correct time zone (if required).

We're not finished with this file yet, so let's continue editing. Go to the [hadoop] section:
....
###########################################################################
# Settings to configure your Hadoop cluster.
###########################################################################

[hadoop]

  # Configuration for HDFS NameNode
  # ------------------------------------------------------------------------
  [[hdfs_clusters]]

    [[[default]]]
      # Enter the filesystem uri
      fs_defaultfs=hdfs://hadoop1:8020

      # Use WebHdfs/HttpFs as the communication mechanism. To fallback to
      # using the Thrift plugin (used in Hue 1.x), this must be uncommented
      # and explicitly set to the empty value.
      webhdfs_url=http://hadoop1:50070/webhdfs/v1

      ## security_enabled=true


  [[yarn_clusters]]

    [[[default]]]
      # Whether to submit jobs to this cluster
      submit_to=true

      ## security_enabled=false

      # Resource Manager logical name (required for HA)
      ## logical_name=

      # URL of the ResourceManager webapp address (yarn.resourcemanager.webapp.address)
      resourcemanager_api_url=http://hadoop2:8088

      # URL of Yarn RPC address (yarn.resourcemanager.address)
      resourcemanager_rpc_url=http://hadoop2:8050

      # URL of the ProxyServer API
      proxy_api_url=http://hadoop2:8088

      # URL of the HistoryServer API
      history_server_api_url=http://hadoop2:19888

      # URL of the NodeManager API
      node_manager_api_url=http://hadoop1:8042

      # HA support by specifying multiple clusters
      # e.g.

      # [[[ha]]]
        # Enter the host on which you are running the failover Resource Manager
        #resourcemanager_api_url=http://failover-host:8088
        #logical_name=failover
        #submit_to=True
....
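
As an optional sanity check of the URLs above (hostnames and ports are the ones from my cluster; adjust for yours), the ResourceManager and HistoryServer REST APIs should answer as long as those services are running:

[root@hadoop3 ~]# curl -s http://hadoop2:8088/ws/v1/cluster/info
[root@hadoop3 ~]# curl -s http://hadoop2:19888/ws/v1/history/info

Both calls should come back with a short block of JSON describing the service.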

Make sure you enter the correct hostnames and the ports each of these services listens on for your own cluster. Next, configure JobDesigner and Oozie:

....
###########################################################################
# Settings to configure liboozie
###########################################################################

[liboozie]
  # The URL where the Oozie service runs on. This is required in order for
  # users to submit jobs.
  oozie_url=http://hadoop3:11000/oozie

  ## security_enabled=true

  # Location on HDFS where the workflows/coordinator are deployed when submitted.
  ## remote_deployement_dir=/user/hue/oozie/deployments
....
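
With Oozie running, its REST API will report the system status, which is a quick way to confirm the oozie_url above is right (hadoop3:11000 in my cluster):

[root@hadoop3 ~]# curl -s http://hadoop3:11000/oozie/v1/admin/status

A healthy Oozie should answer with something like {"systemMode":"NORMAL"}.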

Moving on, we'll need to configure beeswax:

....
[beeswax]

  # Host where Hive server Thrift daemon is running.
  # If Kerberos security is enabled, use fully-qualified domain name (FQDN).
  hive_server_host=hadoop3

  beeswax_server_host=hadoop3

  # Port where HiveServer2 Thrift server runs on.
  hive_server_port=10000

  # Hive configuration directory, where hive-site.xml is located
  hive_conf_dir=/etc/hive/conf
  hive_home_dir=/usr/hdp/2.2.0.0-2041/hive

  # Timeout in seconds for thrift calls to Hive service
  ## server_conn_timeout=120

  # Set a LIMIT clause when browsing a partitioned table.
  # A positive value will be set as the LIMIT. If 0 or negative, do not set any limit.
  ## browse_partitioned_table_limit=250

  # A limit to the number of rows that can be downloaded from a query.
  # A value of -1 means there will be no limit.
  # A maximum of 65,000 is applied to XLS downloads.
  ## download_row_limit=1000000

  # Hue will try to close the Hive query when the user leaves the editor page.
  # This will free all the query resources in HiveServer2, but also make its results inaccessible.
  ## close_queries=false

  # Option to show execution engine choice.
  ## show_execution_engine=False

  # "Go to column pop up on query result page. Set to false to disable"
  ## go_to_column=true
....

Your hive_home_dir will be /usr/hdp/your_hdp_version/hive; you might need to check the exact version string manually on your Hive server.
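
Two easy ways to find that version string are listing /usr/hdp, or asking hdp-select, the version-selection tool that ships with HDP 2.x:

[root@hadoop3 ~]# ls /usr/hdp/
[root@hadoop3 ~]# hdp-select versions

Both should show the installed version directory, 2.2.0.0-2041 in my case, which is exactly what goes into hive_home_dir.

And finally: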

....
###########################################################################
# Settings for the User Admin application
###########################################################################

[useradmin]
  # The name of the default user group that users will be a member of
  default_user_group=hadoop
  default_username=hue
  default_user_password=1111


[hcatalog]
  templeton_url=http://hadoop3:50111/templeton/v1/
  security_enabled=false

[about]
  tutorials_installed=false

[pig]
  udf_path="/tmp/udfs"
....
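
To double-check the templeton_url, WebHCat's status endpoint should answer once the service is up (hadoop3:50111 in my setup):

[root@hadoop3 ~]# curl -s http://hadoop3:50111/templeton/v1/status

A healthy WebHCat replies with {"status":"ok","version":"v1"}.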

That was it. Now go back to Ambari and start HDFS again; once it's up, start Hue:

[root@hadoop3 ~]# service hue start
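
To confirm it actually came up, the init script has a status action, and Hue keeps its logs under /var/log/hue if anything goes wrong (these are the defaults for the HDP hue package; adjust if your layout differs):

[root@hadoop3 ~]# service hue status
[root@hadoop3 ~]# ls /var/log/hue/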

If you go to your Hue server on port 8000 (http://hadoop3:8000 or http://192.168.0.103:8000 in my case), you'll be greeted with Hue's first-run login screen.

Just choose the username and password that you will use for Hue. As soon as you're in, select "Check for misconfiguration" to verify that everything is OK. If anything is flagged, make sure you haven't skipped a step or forgotten to stop HDFS first; another common culprit is Hue having been started before hue.ini was edited, in which case it simply needs a restart, as shown below. Re-run the check after fixing whatever it complains about.
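
If you do need that restart after fixing hue.ini, the same init script handles it:

[root@hadoop3 ~]# service hue restart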


Once the misconfiguration check comes back clean, everything is up and running and we can use our Hadoop cluster through a web browser instead of going through everything manually!

References: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.7/bk_installing_manually_book/content/rpm-chap-hue.html
