Friday, 7 October 2011

MapReduce with Hadoop and Hive - "Hello World!"

MapReduce is a concept I've been interested for several year mainly because it's at the core of how Google do some clever stuff at scale. Naturally this leads to having some interest in the Hadoop project. However none of this meant I got my hands dirty with the technology. The scale of the problems I was working on didn't warrant the large amount of time I may have had to spend in getting Hadoop infrastructure up and running. Only if somebody provided Hadoop-as-a-service I thought, and then Amazon came to the party with the Elastic MapReduce service - this was sometime ago obviously.

Over the last couple of days finally decided it was time to dive in. The Elastic MapReduce service means that I don't have to worry about setting up the infrastructure - my grandmother would be able to do that if she was alive - and get on with figuring out how to solve my problem with the tools. My chosen problem is to analyse IIS weblog files. At this time I've not found any other example of how to process IIS logs using Hadoop. So here we go...

Getting Started
My first task was to understand how to interact with Hadoop (on Elastic MapReduce). It turns out there are several ways, using the high level data analysis platforms Pig or Hive, or using a general purpose language such as Java, Ruby, and Python. I decided Pig or Hive sounded like they were designed to specifically to deal with my problem while I've only dabbled with Java and Python and not at all with Ruby. Hive has a query language very similar to SQL which was the deciding factor.

Next task, sign-in to one of our AWS accounts, spin up a new Hive Work Flow in interactive mode with just the master node [1]. Find the public DNS of the master node and connect via SSH.

Now the fun begins, stop chatting and start typing code (plus some comments of course). I'm starting with a simple set of data so I can get familiar with Elastic MapReduce, Hadoop, and Hive.

My Hadoop + Hive "Hello World"
Fire-up the Hive CLI with a variable called SOURCE_S3_BUCKET to make life a little easier later. You can define multiple.

hadoop@ip-1-2-3-4:~$ hive \
    -d SOURCE_S3_BUCKET=s3://<my_s3_bucket_name>/hadoop-test-output

Upload a space delimited text file to your S3 bucket. I uploaded it to s3://<my_s3_bucket_name>/hadoop-test-output. This is your source data file, the stuff you want to process. I used the following (the text between the lines).


In the example above nulls 
are inserted for the array
and map types in the
destination tables but potentially these
can also come from the
external table if the proper
row formats are specified loop


Define a logical table structure that the data in the above file can map to.

  col1  STRING,
  col2  STRING,
  col3  STRING,
  col4  STRING,
  col5  STRING


Time taken: 0.079 seconds

hive> select * from in_table;
In      the     example above   nulls
are     inserted        for     the     array
and     map     types   in      the
destination     tables  but     potentially     these
can     also    come    from    the
external        table   if      the     proper
row     formats are     specified       loop
In      the     example above   nulls
are     inserted        for     the     array
and     map     types   in      the
destination     tables  but     potentially     these
can     also    come    from    the
external        table   if      the     proper
row     formats are     specified       loop
Time taken: 13.174 seconds

Next Challege
Figure out how to load IIS format web log files in to Hadoop using Hive.


  • EXTERNAL allows you to point at none local storage such as Amazon S3
  • A CREATE TABLE can be thought of as creating the logical table schema and the associated physical file structure (the file on disk). OR defining a logical table schema and an appropriate making to an existing physical data file that was created by some other source. An example of such a data file would be a web log file.
  • STORED AS signals that there is no existing file, hence create. It also defines the file format.
  • LOCATION defines the path of the source or destination files. When S3 is used the path format is s3://<bucket_name>/<optional_folder>
  • AWS Elastic MapReduce nodes (Amazon Debian ami-2cf2c358) have "hadoop" as the default username.

No comments: