Thursday, 13 December 2012

HOW TO: Configure FreeSSHd on Windows 2012 and Connect with Putty

Recently I started playing around with Chef for a project. Chef needs SSH, and setting up SSH on Windows is something I'd not done before. I couldn't find a good set of instructions, so here goes.

This YouTube video helped http://www.youtube.com/watch?v=lwHktjugAYM

I used an AWS EC2 Windows 2012 x64 server and a Windows 8 x64 desktop.

Overview

  • Install and configure FreeSSHd on the server
  • Create keys
  • Configure Putty to connect to the server

Install FreeSSHd


Configure FreeSSHd

Open the FreeSSHd settings (you may have to kill the service and start it manually to get the GUI)


  • SSH tab:
    • Max number of connections = 2
    • Idle timeout = 600

  • Authentication tab
    • Pub key folder = C:\Program Files (x86)\freeSSHd\keys
    • Password auth = disabled
    • Pub key auth = required

  • Users tab
    • add
      • login=chef
      • auth = 'Pub key (ssh only)'
      • user can use = shell
    • click OK


Generate Public and Private keys

  • Open PuttyGen
  • Click ‘Generate’
    • move the mouse pointer around as instructed to generate the key
  • Save a Putty-compatible private key
    • Click ‘Save private key’
    • Save this to the client PC; Putty will need it
    • You should really save it with a passphrase for extra security
  • Save an OpenSSH-compatible private key for Chef's knife (see the knife example after this list)
    • ‘Conversions’ menu > ‘Export OpenSSH key’ > save as a *.pem
  • Save the public key
    • Copy the contents of ‘Public key for pasting into OpenSSH authorized_keys file:’ and paste it into a text file
    • Rename this file ‘chef’ (no file extension; the filename must match the user login name created above)
    • Drop this file into the public key folder C:\Program Files (x86)\freeSSHd\keys on the server
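
The exported .pem is the key knife will use later. As a rough sketch only (the search query, node name pattern, and key path below are made up for illustration), a knife command using it might look like:

knife ssh "name:web*" "hostname" -x chef -i C:\keys\chef.pem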


Connecting with Putty

  • Open Putty (or Putty portable)
  • Enter the IP address of the server
  • Connection type = SSH (obviously!)
  • In the left menu tree
    • Connection > SSH > Auth > ‘Private key file for authentication:’ > click browse
    • Select the private key that was generated above
    • Click ‘Open’
  • When prompted ‘login:’, enter ‘chef’ and hit Enter
  • If the private key was saved with a passphrase then enter this when prompted
  • You should now be connected to the server.
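
A quick way to check the key works non-interactively (closer to how Chef will use it) is PuTTY's command-line tool plink. A sketch only, assuming plink is on the PATH and the private key was saved to the path shown:

plink -ssh -i C:\keys\chef.ppk chef@<server_ip>

If this lands you at a shell prompt without asking for a password (only the passphrase, if you set one), the public key setup is working.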






Friday, 7 October 2011

Load IIS 6.0 weblog files into Hadoop using Hive


Following on from part 1, here is how to load IIS 6.0 logs into Hadoop, ready for querying.

Fire up the Hive CLI
hive \
    -d SOURCE1=s3://consumernewsweblogs/hadoop-test-output

Create an EXTERNAL TABLE representing the structure of the source log files.


CREATE EXTERNAL TABLE iislogs (
  date1 STRING,
  time1  STRING,
  s_sitename  STRING,
  s_computername  STRING,
  s_ip  STRING,
  cs_method  STRING,
  cs_uri_stem  STRING,
  cs_uri_query  STRING,
  s_port  STRING,
  cs_username  STRING,
  c_ip  STRING,
  cs_version  STRING,
  cs_user_agent  STRING,
  cs_cookie  STRING,
  cs_referer  STRING,
  cs_host  STRING,
  sc_status  STRING,
  sc_substatus  STRING,
  sc_win32_status  STRING,
  sc_bytes  STRING,
  cs_bytes  STRING,
  time_taken STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '${SOURCE1}/iislogs';

Verify this worked by running

hive> SHOW TABLES;


hive> SELECT COUNT(date1) FROM iislogs;

This should return a row count if successful, but only an OK if the data files don't parse.

The source files have field headers and comments which we don't need. We can remove these by copying all non-comment rows, i.e. rows that do not start with "#", into a new table.

We achieve this by first creating a new table with the same structure but a different S3 folder for storage.


CREATE EXTERNAL TABLE iislogsclean (
  date1 STRING,
  time1  STRING,
  s_sitename  STRING,
  s_computername  STRING,
  s_ip  STRING,
  cs_method  STRING,
  cs_uri_stem  STRING,
  cs_uri_query  STRING,
  s_port  STRING,
  cs_username  STRING,
  c_ip  STRING,
  cs_version  STRING,
  cs_user_agent  STRING,
  cs_cookie  STRING,
  cs_referer  STRING,
  cs_host  STRING,
  sc_status  STRING,
  sc_substatus  STRING,
  sc_win32_status  STRING,
  sc_bytes  STRING,
  cs_bytes  STRING,
  time_taken STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
STORED AS TEXTFILE  
LOCATION '${SOURCE1}/iislog_clean/';

Then we SELECT from one and INSERT into the other as follows.


INSERT OVERWRITE TABLE iislogsclean
  SELECT * FROM iislogs WHERE NOT date1 LIKE '#%';


hive> SELECT COUNT(date1) FROM iislogsclean;

This should return a count lower than that of the source table, since the header and comment rows have been removed.

Now the log data is in Hadoop ready for you to run some MapReduce jobs on.
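
For example, a first job on the cleaned table could be a count of requests per HTTP status code. A sketch only, using the columns defined above:

SELECT sc_status, COUNT(*) AS hits
FROM iislogsclean
GROUP BY sc_status
ORDER BY hits DESC;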


MapReduce with Hadoop and Hive - "Hello World!"


Background
MapReduce is a concept I've been interested in for several years, mainly because it's at the core of how Google does some clever stuff at scale. Naturally this leads to having some interest in the Hadoop project. However, none of this meant I got my hands dirty with the technology. The scale of the problems I was working on didn't warrant the large amount of time I might have had to spend getting Hadoop infrastructure up and running. Only if somebody provided Hadoop-as-a-service, I thought, and then Amazon came to the party with the Elastic MapReduce service - this was some time ago, obviously.

Over the last couple of days I finally decided it was time to dive in. The Elastic MapReduce service means that I don't have to worry about setting up the infrastructure - my grandmother would be able to do that if she were alive - and can get on with figuring out how to solve my problem with the tools. My chosen problem is to analyse IIS web log files. At this time I've not found any other example of how to process IIS logs using Hadoop.

Getting Started
My first task was to understand how to interact with Hadoop (on Elastic MapReduce). It turns out there are several ways: using a high-level data analysis platform such as Pig or Hive, or using a general-purpose language such as Java, Ruby, or Python. Pig and Hive sounded like they were designed specifically to deal with my problem, while I've only dabbled with Java and Python and not at all with Ruby. Hive has a query language very similar to SQL, which was the deciding factor.

Next task: sign in to one of our AWS accounts and spin up a new Hive Work Flow in interactive mode with just the master node [1]. Find the public DNS name of the master node and connect via SSH.

Now the fun begins: stop chatting and start typing code (plus some comments, of course). I'm starting with a simple set of data so I can get familiar with Elastic MapReduce, Hadoop, and Hive.

My Hadoop + Hive "Hello World"
Fire up the Hive CLI with a variable called SOURCE_S3_BUCKET to make life a little easier later. You can define multiple variables this way.


hadoop@ip-1-2-3-4:~$ hive \
    -d SOURCE_S3_BUCKET=s3://<my_s3_bucket_name>/hadoop-test-output



Upload a space-delimited text file to your S3 bucket. I uploaded it to s3://<my_s3_bucket_name>/hadoop-test-output. This is your source data file, the stuff you want to process. I used the following (the text between the lines); see after the sample text for one way to upload it.

------------

In the example above nulls 
are inserted for the array
and map types in the
destination tables but potentially these
can also come from the
external table if the proper
row formats are specified loop

------------
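
One way to get this into S3 (a sketch only: it assumes the AWS CLI or a similar S3 client is installed, and 'hello_world.txt' is just a made-up name for the file above; the S3 console or s3cmd work just as well):

aws s3 cp hello_world.txt s3://<my_s3_bucket_name>/hadoop-test-output/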


Define a logical table structure that the data in the above file can map to.

hive>
CREATE EXTERNAL TABLE IF NOT EXISTS in_table (
  col1  STRING,
  col2  STRING,
  col3  STRING,
  col4  STRING,
  col5  STRING
)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '${SOURCE_S3_BUCKET}';

hive> SHOW TABLES;

OK
in_table
Time taken: 0.079 seconds

hive> select * from in_table;
OK
In      the     example above   nulls
are     inserted        for     the     array
and     map     types   in      the
destination     tables  but     potentially     these
can     also    come    from    the
external        table   if      the     proper
row     formats are     specified       loop
In      the     example above   nulls
are     inserted        for     the     array
and     map     types   in      the
destination     tables  but     potentially     these
can     also    come    from    the
external        table   if      the     proper
row     formats are     specified       loop
Time taken: 13.174 seconds
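
A query with an aggregate will kick off an actual MapReduce job. A trivial sketch against the same in_table:

hive> SELECT col1, COUNT(*) AS occurrences FROM in_table GROUP BY col1;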

Next Challenge
Figure out how to load IIS format web log files in to Hadoop using Hive.


Notes:

  • EXTERNAL allows you to point at non-local storage such as Amazon S3
  • A CREATE TABLE can be thought of as either creating the logical table schema and the associated physical file structure (the file on disk), or defining a logical table schema and a mapping to an existing physical data file that was created by some other source. An example of such a data file would be a web log file.
  • STORED AS signals that there is no existing file, hence create. It also defines the file format.
  • LOCATION defines the path of the source or destination files. When S3 is used the path format is s3://<bucket_name>/<optional_folder>
  • AWS Elastic MapReduce nodes (Amazon Debian ami-2cf2c358) have "hadoop" as the default username.

Monday, 26 September 2011

HOW TO: Install Ylastic CostAgent on Windows

Although one can simply schedule the Ylastic CostAgent as a task on Ylastic itself (see Can Ylastic host the cost agent for me?), I decided I didn't want to do this as it involves giving Ylastic the master account username and password for each of our AWS accounts. Although we probably trust Ylastic, this is not best practice. Therefore I decided to install the CostAgent on my workstation (I'll move it to a server later) and schedule it. I quickly discovered that the main documentation only covered installation on Linux, and I didn't find anything else out there, hence this post.


  1. Install Strawberry perl from strawberryperl.com
  2. cmd> cpan
  3. cpan> install App::Ylastic::CostAgent
  4. Create a CostAgent config file called config.ini (INI format) next to ylastic-costagent.bat (more info here)
    Example:

    ylastic_id = 1234567890abdcef1234567890abcdef12345678
    [AWS Account Number]
    user = foo@example.com
    pass = trustno1

    [4421-0000-3510]
    user = foo@example.com
    pass = trustno1
  5. To run: c:\cost_agent_install_path\ylastic-costagent.bat config.ini (run ylastic-costagent.bat -h for help). Create a scheduled task to automate this; see the sketch after this list.
  6. Optionally, create a folder for temporary report downloads called 'ylastic-costagent-temp' (-d param) and one for logs called 'ylastic-costagent-logs' (-l param).
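
The scheduled task for step 5 can also be created from the command line. The task name and run time below are only examples (adjust the path to your install):

schtasks /create /tn "Ylastic CostAgent" /tr "c:\cost_agent_install_path\ylastic-costagent.bat config.ini" /sc daily /st 02:00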
    
    
UPDATE 28/09/2011: It would seem that if you use the -d param on Windows the upload includes the full path in the file name on the Ylastic servers and therefore doesn't work. Not confirmed whether the default download path works. Bug logged: https://rt.cpan.org/Public/Bug/Display.html?id=71332


ref:
  1. Ylastic Spending Analysis
  2. app-ylastic-costagent README

Tuesday, 13 September 2011

Checkpoint SmartCenter Server Backup Automation Script


Today I finally got around to figuring out how to automate the backup of a Checkpoint SmartCenter Server. 


The following batch file can be run as a scheduled task. It generates a backup file with a name composed of the current date and time, so previous backups are not overwritten. Note the undocumented "-n" option that makes upgrade_export.exe run completely silently.


This can be further extended to move the backup file off the server once complete: FTP it off or upload it to an Amazon S3 bucket (see the sketch after the script below).


-----
cls
@echo off
rem Build a timestamp from the current time and date
set "filename=%TIME%_%DATE%"
rem echo.%filename%
rem Strip /, :, . and spaces so the result is a valid filename
set "filename=%filename:/=%"
set "filename=%filename::=%"
set "filename=%filename:.=%"
set "filename=%filename: =%"
rem Prefix the destination folder and add a .bak extension
set "filename=c:\<some_folder>\%filename%.bak"
rem echo.%filename%

rem Run the export silently (undocumented -n switch)
call C:\WINDOWS\FW1\R65\fw1\bin\upgrade_tools\upgrade_export.exe -n %filename%
-----
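
For example, appending something like the following would push the backup to an S3 bucket. A sketch only: it assumes a command-line S3 client (here the AWS CLI) is installed on the server and that <backup_bucket> is replaced with a real bucket name.

-----
rem Copy the finished backup off the server
aws s3 cp %filename% s3://<backup_bucket>/smartcenter-backups/
-----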

Tuesday, 31 May 2011

bindtoroute53.pl error "Can't locate Net/DNS/RR.pm"

If you are trying to run bindtoroute53.pl on Windows and get the following error, there's a Python script that may work better. I've not tried it yet.


c:\>perl bindtoroute53.pl
Can't locate Net/DNS/RR.pm in @INC (@INC contains: C:/strawberry/perl/site/lib C:/strawberry/perl/vendor/lib C:/strawber
ry/perl/lib .) at bindtoroute53.pl line 78.
BEGIN failed--compilation aborted at bindtoroute53.pl line 78.
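
The error itself just means the Net::DNS Perl module isn't installed. Installing it through cpan may be enough to get the original script running, though I've not verified this:

c:\>cpan
cpan> install Net::DNS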


bindtoroute53.py
http://aws.amazon.com/developertools/Amazon-Route-53/9489892636320520

Tuesday, 15 February 2011

TIP: Amazon Web Services Operates Split-horizon DNS

While making some tweaks to our AWS setup we learnt that Amazon runs a split-horizon DNS setup. This is only really relevant if you have services that are internal (or backend) to your overall system, such as a database (not RDS) behind a web server, and you want to run it off a static IP (Elastic IP), as you should, plus a custom DNS entry such as mydatabase3.abc.com.

In the above scenario you would have your database secured with a rule only allowing access from the web server. You do this by adding a firewall rule by Security Group name, i.e. any EC2 instance with the given Security Group attached (the one named in the rule) is allowed access to the database (more on EC2 Security Groups). Then you attach an Elastic IP to your EC2 database instance, create an A record under your domain pointing to the Elastic IP, then use this A record in the connection string in your application running on the web server.

This will not behave as you expect: your web server will lose connectivity to the database. The reason is that the Public DNS name of the database server, when resolved from an EC2 instance, still resolves to its private IP address, whereas from your PC in the office or at home it resolves to a public IP address, i.e. split-horizon DNS. Firewall rules based on Security Group names resolve against private IP addresses to identify EC2 instances, therefore allowing this traffic. When you try to connect using your custom DNS name (which always resolves to the Elastic IP), the database server is not receiving traffic from a private IP that the firewall can associate back to an instance, and hence back to its attached Security Groups, so it blocks the traffic.

The solution is to always create custom DNS names as CNAMEs to the EC2 Public DNS record, rather than as A records resolving directly to the Elastic IP, and all will work nicely.
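
To illustrate (the EC2 public DNS name and IPs below are made up), the record you want is the first of these, not the second:

mydatabase3.abc.com.  IN  CNAME  ec2-203-0-113-10.compute-1.amazonaws.com.
mydatabase3.abc.com.  IN  A      203.0.113.10

The CNAME resolves via the EC2 Public DNS name, which returns the private IP when queried from inside EC2, so the Security Group rule still matches; the A record always returns the Elastic IP and the traffic gets blocked.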