Friday, 7 October 2011

Load IIS 6.0 weblog files into Hadoop using Hive

Following on from part 1, here is how you load IIS 6.0 logs into Hadoop ready for querying.

MapReduce with Hadoop and Hive - "Hello World!"

MapReduce is a concept I've been interested in for several years, mainly because it's at the core of how Google does some clever stuff at scale. Naturally this led to an interest in the Hadoop project. However, none of this meant I got my hands dirty with the technology. The scale of the problems I was working on didn't warrant the large amount of time I would have had to spend getting Hadoop infrastructure up and running. If only somebody provided Hadoop-as-a-service, I thought, and then Amazon came to the party with the Elastic MapReduce service - this was some time ago, obviously.

Over the last couple of days I finally decided it was time to dive in. The Elastic MapReduce service means that I don't have to worry about setting up the infrastructure - my grandmother could have done that, were she still alive - and can get on with figuring out how to solve my problem with the tools. My chosen problem is to analyse IIS weblog files. At the time of writing I've not found any other example of how to process IIS logs using Hadoop. So here we go...

Monday, 26 September 2011

HOW TO: Install Ylastic CostAgent on Windows

Although one can simply schedule the Ylastic CostAgent as a task on Ylastic itself (see Can Ylastic host the cost agent for me?), I decided I didn't want to do this as it involves giving Ylastic the master account username and password for each of our AWS accounts. Although we probably trust Ylastic, this is not best practice. Therefore I decided to install the CostAgent on my workstation (I'll move it to a server later) and schedule it there. I quickly discovered that the main documentation only covered installation on Linux, and I didn't find anything else out there, hence this post.

Tuesday, 13 September 2011

Checkpoint SmartCenter Server Backup Automation Script

Today I finally got around to figuring out how to automate the backup of a Checkpoint SmartCenter Server. 

The following batch file can be run as a scheduled task. It generates a backup file whose name is composed of the current date and time, so previous backups are not overwritten. Note the undocumented "-n" option that makes upgrade_export.exe run completely silently.

This can be further extended to move the backup file off the server once complete - FTP it off, or upload it to an Amazon S3 bucket.

@echo off
rem Build a timestamp from the current time and date
set "filename=%TIME%_%DATE%"
rem echo.%filename%
rem Strip the characters that are not valid in a filename: / : . and spaces
set "filename=%filename:/=%"
set "filename=%filename::=%"
set "filename=%filename:.=%"
set "filename=%filename: =%"
rem Prefix the destination folder and add the .bak extension
set "filename=c:\<some_folder>\%filename%.bak"
rem echo.%filename%

rem The undocumented -n option makes upgrade_export.exe run silently
call C:\WINDOWS\FW1\R65\fw1\bin\upgrade_tools\upgrade_export.exe -n %filename%
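As a sketch of that follow-up step, the following Python script pushes the newest .bak file to an FTP server using the standard-library ftplib module. The host, credentials and folder below are placeholders, not anything from our setup.

```python
import ftplib
import os

def newest_backup(folder):
    """Return the most recently modified .bak file in folder, or None."""
    baks = [os.path.join(folder, f) for f in os.listdir(folder)
            if f.lower().endswith(".bak")]
    return max(baks, key=os.path.getmtime) if baks else None

def upload(path, host, user, password):
    """Upload the file at path to the FTP server's current directory."""
    with ftplib.FTP(host) as ftp:  # host and credentials are placeholders
        ftp.login(user, password)
        with open(path, "rb") as f:
            ftp.storbinary("STOR " + os.path.basename(path), f)

# Usage (all values are placeholders):
#   backup = newest_backup(r"c:\<some_folder>")
#   if backup:
#       upload(backup, "", "backup_user", "secret")
```

The same newest_backup helper would work unchanged if you swapped the FTP upload for an S3 one.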

Tuesday, 31 May 2011

error "Can't locate Net/DNS/"

If you are trying to run on Windows and get the following error, then there's a Python script that may work better. I've not tried it yet.

Can't locate Net/DNS/ in @INC (@INC contains: C:/strawberry/perl/site/lib C:/strawberry/perl/vendor/lib C:/strawberry/perl/lib .) at line 78.
BEGIN failed--compilation aborted at line 78.

Tuesday, 15 February 2011

TIP: Amazon Web Services Operates Split-horizon DNS

While making some tweaks to our AWS setup we learnt that Amazon runs a split-horizon DNS setup. This is only really relevant if you have services that are internal (or backend) to your overall system, such as a database (not RDS) behind a web server, and you want to run it off a static IP (an Elastic IP) as you should, plus a custom DNS entry under your own domain.

In the above scenario you would have your database secured with a rule allowing access only from the web server. You do this by adding a firewall rule by Security Group name, i.e. any EC2 instance with the given Security Group attached (the one named in the rule) is allowed access to the database (more on EC2 Security Groups). Then you attach an Elastic IP to your EC2 database instance, create an A record under your domain pointing to the Elastic IP, and use this A record in the connection string of your application running on the web server.

This will not behave as you expect: your web server will lose connectivity to the database. The reason is that, from an EC2 instance, the Public DNS name of the database server still resolves to its private IP address, whereas from your PC in the office or at home it resolves to a public IP address - i.e. split-horizon DNS. Firewall rules based on Security Group names are matched against private IP addresses to identify EC2 instances, which is why that traffic is allowed. When you connect using your custom DNS name (which always resolves to the Elastic IP), the database server is not receiving traffic from a private IP that the firewall can associate back to an instance, and hence back to an attached Security Group, so it blocks the traffic.

The solution is to always create custom DNS names as CNAMEs to the EC2 Public DNS record, rather than as A records resolving directly to the Elastic IP, and all will work nicely.
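In zone-file terms the difference looks something like this - the hostname and addresses below are made up for illustration:

```
; Breaks inside EC2: A record pinned to the Elastic IP
db.example.com.   IN  A      203.0.113.10

; Works everywhere: CNAME to the EC2 Public DNS name, which itself
; resolves to the private IP inside EC2 and the public IP outside
db.example.com.   IN  CNAME  ec2-203-0-113-10.compute-1.amazonaws.com.
```

With the CNAME, EC2's own split-horizon resolution does the right thing on both sides of the boundary, so the Security Group rule matches as intended.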