Friday, 7 October 2011

Load IIS 6.0 weblog files into Hadoop using Hive


Following on from part 1, here is how you load IIS 6.0 logs into Hadoop, ready for querying.



Fire up the Hive CLI, passing the S3 location of the log files as a substitution variable (referenced later as ${SOURCE1}):
hive \
    -d SOURCE1=s3://consumernewsweblogs/hadoop-test-output

Create an EXTERNAL TABLE representing the structure of the source log files.


CREATE EXTERNAL TABLE iislogs (
  date1 STRING,
  time1  STRING,
  s_sitename  STRING,
  s_computername  STRING,
  s_ip  STRING,
  cs_method  STRING,
  cs_uri_stem  STRING,
  cs_uri_query  STRING,
  s_port  STRING,
  cs_username  STRING,
  c_ip  STRING,
  cs_version  STRING,
  cs_user_agent  STRING,
  cs_cookie  STRING,
  cs_referer  STRING,
  cs_host  STRING,
  sc_status  STRING,
  sc_substatus  STRING,
  sc_win32_status  STRING,
  sc_bytes  STRING,
  cs_bytes  STRING,
  time_taken STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '${SOURCE1}/iislogs';

Verify this worked by running:

hive> SHOW TABLES;

hive> SELECT COUNT(date1) FROM iislogs;

If successful, the SELECT should return a row count; if the data files don't parse, it returns only an OK with no count.

The source files contain field headers and comments which we don't need. We can remove these by copying all non-comment rows, i.e. rows that do not start with "#", into a new table.
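For reference, each IIS W3C-format log file starts with a header block along these lines (values illustrative; the exact #Fields line depends on the logging options enabled in IIS):

#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2011-10-07 00:00:00
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken

These are the rows the filter below strips out.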

We achieve this by first creating a new table with the same structure but a different S3 folder for storage.


CREATE EXTERNAL TABLE iislogsclean (
  date1 STRING,
  time1  STRING,
  s_sitename  STRING,
  s_computername  STRING,
  s_ip  STRING,
  cs_method  STRING,
  cs_uri_stem  STRING,
  cs_uri_query  STRING,
  s_port  STRING,
  cs_username  STRING,
  c_ip  STRING,
  cs_version  STRING,
  cs_user_agent  STRING,
  cs_cookie  STRING,
  cs_referer  STRING,
  cs_host  STRING,
  sc_status  STRING,
  sc_substatus  STRING,
  sc_win32_status  STRING,
  sc_bytes  STRING,
  cs_bytes  STRING,
  time_taken STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
STORED AS TEXTFILE  
LOCATION '${SOURCE1}/iislog_clean/';

Then we SELECT from one table and INSERT into the other as follows.


INSERT OVERWRITE TABLE iislogsclean
  SELECT * FROM iislogs WHERE NOT date1 LIKE '#%';


hive> SELECT COUNT(date1) FROM iislogsclean;

This should return a count lower than the source table's, since the header and comment rows have been dropped.

Now the log data is in Hadoop, ready for you to run MapReduce jobs against it.
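For example, a simple aggregation such as a breakdown of HTTP status codes will kick off a MapReduce job behind the scenes (query illustrative):

hive> SELECT sc_status, COUNT(*) AS hits
      FROM iislogsclean
      GROUP BY sc_status
      ORDER BY hits DESC;

Any GROUP BY, JOIN, or ORDER BY over these tables will be compiled to MapReduce in the same way.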

