Saturday, March 14, 2015

Introduction to Parallel Computing Part 1e - Using Hadoop (Using Hue and Hive)

The first thing you need to do as the admin user (after checking that there are no misconfigurations) is to create a user that can upload and manage files on your HDFS system.

There should be an account named root if you installed via the Hortonworks/Ambari method, and/or one named hdfs if you installed manually. These users should have Superuser status. Please note that, for brevity, I use superuser accounts to actually process my data; obviously, for security reasons, that should not be the case in a real deployment.
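If you do want to follow the better practice, a regular (non-superuser) account mainly needs its own home directory on HDFS. A minimal sketch, assuming a hypothetical Linux user named analyst that already exists on the box:

[root@hadoop1 ~]# su - hdfs
[hdfs@hadoop1 ~]$ hadoop fs -mkdir -p /user/analyst
[hdfs@hadoop1 ~]$ hadoop fs -chown analyst:hdfs /user/analyst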




The hdfs user should also exist, again with Superuser status.


Now, let's check our actual HDFS permissions. Log in to your HDFS system (hadoop1 for me) and execute:

[root@hadoop1 ~]# hadoop fs -ls /user
Found 6 items
drwxrwx---   - ambari-qa hdfs          0 2015-03-11 19:11 /user/ambari-qa
drwxr-xr-x   - hcat      hdfs          0 2015-03-11 19:09 /user/hcat
drwxr-xr-x   - hdfs      hdfs          0 2015-03-13 12:06 /user/hdfs
drwx------   - hive      hdfs          0 2015-03-11 19:02 /user/hive
drwxrwxr-x   - oozie     hdfs          0 2015-03-11 19:06 /user/oozie
drwxr-xr-x   - root      root          0 2015-03-13 14:46 /user/root

Oh, that's no good. The /user/hive directory is accessible only by the hive user, so we need to change the permissions on /user/hive (and make /tmp group-writable while we're at it):

[root@hadoop1 ~]# su - hdfs
[hdfs@hadoop1 ~]$ hadoop fs -mkdir -p /user/hive/warehouse
[hdfs@hadoop1 ~]$ hadoop fs -chmod g+w /tmp
[hdfs@hadoop1 ~]$ hadoop fs -chmod g+w /user/hive/
[hdfs@hadoop1 ~]$ hadoop fs -chmod g+w /user/hive/warehouse
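If you want to double-check, you can list the directories themselves (the -d flag shows the directory entry rather than its contents); the group permission bits should now include w:

[hdfs@hadoop1 ~]$ hadoop fs -ls -d /tmp /user/hive /user/hive/warehouse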

Let's add the root user to the hdfs and hadoop groups as well while we're at it.

[root@hadoop1 ~]# groups root
root : root
[root@hadoop1 ~]# usermod -a -G root,hdfs,hadoop root

Now, go back to your web browser, log out from your admin account, and log in as root.

Let's create a directory for our project and set the proper permissions on it.

Go to "File Browser" and select "New Directory". Select a descriptive name for it.




Now let's make sure this directory has proper permissions.
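If you prefer the command line, the same two steps (creating the directory and opening up its permissions) look roughly like this; nyse_demo is just a hypothetical name standing in for whatever you picked:

[root@hadoop1 ~]# hadoop fs -mkdir /user/root/nyse_demo
[root@hadoop1 ~]# hadoop fs -chmod 775 /user/root/nyse_demo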



Go into your newly created directory and select "Upload", "Files".


I'm just going to go ahead and upload this: https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz

It's stock ticker data from the New York Stock Exchange for the years 2000–2001, made available by the Hortonworks tutorial I cite in my references. Just go ahead and paste that link into the dialog that pops up.
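If you'd rather do this from the shell instead of the File Browser, you can download the archive locally and push it to HDFS with hadoop fs -put (again, nyse_demo is my example directory name):

[root@hadoop1 ~]# wget https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz
[root@hadoop1 ~]# hadoop fs -put NYSE-2000-2001.tsv.gz /user/root/nyse_demo/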


Once the upload to HDFS completes, the file should be visible. You can click on it and view its contents if you like.

Now, in order for us to be able to process it using Hive and Pig, we'll need to register it with HCatalog first. We're going to take a shortcut, though: go to Beeswax and select "Tables".


Select "Create new table from a file". Enter a name as a table name, choose the file we uploaded and make sure the "Import data from file" option is checked.


Choose Next. This file uses TAB as its delimiter, and its first row is the column header. So, we choose "Tab" as the delimiter and check "Read Column Headers":


Choose Next. Before going any further, check that the column data types are correct and adjust where necessary.
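For reference, what the wizard does behind the scenes corresponds roughly to a HiveQL statement like the sketch below. The table and column names here are my own assumptions based on the dataset's header row; use whatever names and types the wizard detected for you.

CREATE TABLE nyse_stocks (
  exchange_name         STRING,
  stock_symbol          STRING,
  trade_date            STRING,
  stock_price_open      FLOAT,
  stock_price_high      FLOAT,
  stock_price_low       FLOAT,
  stock_price_close     FLOAT,
  stock_volume          BIGINT,
  stock_price_adj_close FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

The wizard also loads the uploaded file into the table for you; done by hand, that would be a LOAD DATA INPATH statement pointing at the file we put in HDFS.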


Finally, select Create Table. Let's put our cluster to the test. Go to the "Query Editor" and run a query.
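For example, to get the average daily high over the whole year 2000 (assuming the table and columns are named as in the sketch above):

-- YEAR() assumes dates stored as yyyy-MM-dd strings; adjust the filter if your date format differs
SELECT AVG(stock_price_high)
FROM nyse_stocks
WHERE YEAR(trade_date) = 2000;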


OR:
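For instance, a quick look at a single symbol (IBM here is just my example; any ticker present in the file will do):

SELECT stock_symbol, trade_date, stock_price_high
FROM nyse_stocks
WHERE stock_symbol = 'IBM'
LIMIT 10;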


And now sit back while Hadoop performs its magic.



So the average stock price high on the New York Stock Exchange in the year 2000 was 27.7649 dollars.

That's it! You know SQL, you know Hive. End of story.

References: http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/
