Friday, 7 October 2011

Load IIS 6.0 weblog files into Hadoop using Hive


Following on from part 1, here is how you load IIS 6.0 logs into Hadoop, ready for querying.



Fire up the Hive CLI, passing the S3 location of the log files as a substitution variable (referenced later as ${SOURCE1}):
hive \
    -d SOURCE1=s3://consumernewsweblogs/hadoop-test-output

Create an EXTERNAL TABLE representing the structure of the source log files.


CREATE EXTERNAL TABLE iislogs (
  date1 STRING,
  time1  STRING,
  s_sitename  STRING,
  s_computername  STRING,
  s_ip  STRING,
  cs_method  STRING,
  cs_uri_stem  STRING,
  cs_uri_query  STRING,
  s_port  STRING,
  cs_username  STRING,
  c_ip  STRING,
  cs_version  STRING,
  cs_user_agent  STRING,
  cs_cookie  STRING,
  cs_referer  STRING,
  cs_host  STRING,
  sc_status  STRING,
  sc_substatus  STRING,
  sc_win32_status  STRING,
  sc_bytes  STRING,
  cs_bytes  STRING,
  time_taken STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '${SOURCE1}/iislogs';

Verify this worked by running:

hive> SHOW TABLES;

hive> SELECT COUNT(date1) FROM iislogs;

If successful, the SELECT should return a row count; if the data files don't parse, it returns only an OK with no count.

The source files contain field headers and comments which we don't need. We can remove these by copying all non-comment rows, i.e. rows that do not start with "#", into a new table.
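For reference, each IIS W3C-format log file starts with a header block along these lines (values illustrative; the exact #Fields line depends on the logging options enabled in IIS):

#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2011-10-07 00:00:00
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken

These are the rows the filter below strips out.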

We achieve this by first creating a new table with the same structure but a different S3 folder for storage.


CREATE EXTERNAL TABLE iislogsclean (
  date1 STRING,
  time1  STRING,
  s_sitename  STRING,
  s_computername  STRING,
  s_ip  STRING,
  cs_method  STRING,
  cs_uri_stem  STRING,
  cs_uri_query  STRING,
  s_port  STRING,
  cs_username  STRING,
  c_ip  STRING,
  cs_version  STRING,
  cs_user_agent  STRING,
  cs_cookie  STRING,
  cs_referer  STRING,
  cs_host  STRING,
  sc_status  STRING,
  sc_substatus  STRING,
  sc_win32_status  STRING,
  sc_bytes  STRING,
  cs_bytes  STRING,
  time_taken STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
STORED AS TEXTFILE  
LOCATION '${SOURCE1}/iislog_clean/';

Then we SELECT from one table and INSERT into the other as follows.


INSERT OVERWRITE TABLE iislogsclean
  SELECT * FROM iislogs WHERE NOT date1 LIKE '#%';


hive> SELECT COUNT(date1) FROM iislogsclean;

This should return a count lower than the source table's, since the header and comment rows have been dropped.

Now the log data is in Hadoop, ready for you to run MapReduce jobs against it.
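For example, a simple aggregation such as a breakdown of HTTP status codes will kick off a MapReduce job behind the scenes (query illustrative):

hive> SELECT sc_status, COUNT(*) AS hits
      FROM iislogsclean
      GROUP BY sc_status
      ORDER BY hits DESC;

Any GROUP BY, JOIN, or ORDER BY over these tables will be compiled to MapReduce in the same way.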

