Friday, 7 October 2011

Load IIS 6.0 weblog files into Hadoop using Hive


Following on from part 1, here is how you load IIS 6.0 logs into Hadoop, ready for querying.

MapReduce with Hadoop and Hive - "Hello World!"


Background
MapReduce is a concept I've been interested in for several years, mainly because it's at the core of how Google do some clever stuff at scale. Naturally this led to some interest in the Hadoop project. However, none of this meant I got my hands dirty with the technology. The scale of the problems I was working on didn't warrant the large amount of time I might have had to spend getting Hadoop infrastructure up and running. If only somebody provided Hadoop-as-a-service, I thought, and then Amazon came to the party with the Elastic MapReduce service - this was some time ago, obviously.

Over the last couple of days I finally decided it was time to dive in. The Elastic MapReduce service means that I don't have to worry about setting up the infrastructure - my grandmother would be able to do that if she were alive - and can get on with figuring out how to solve my problem with the tools. My chosen problem is to analyse IIS weblog files. At this time I've not found any other example of how to process IIS logs using Hadoop. So here we go...