Tags: #aws #athena #s3 #sql
Note: reference article
Basic query example:
SELECT status_code,
COUNT(status_code) AS requests
FROM fastly_logs.example_com
GROUP BY requests
ORDER BY requests DESC
Create a table:
CREATE EXTERNAL TABLE `www_buzzfeed_com`(
`client_ip` string COMMENT '',
`timestamp` timestamp COMMENT '',
`method` string COMMENT '',
`path` string COMMENT '',
`http_protocol` string COMMENT '',
`status_code` smallint COMMENT '',
`response_size` int COMMENT '',
`tls_version` string COMMENT '',
`cache_status` string COMMENT '')
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='(\\S+) \\[(.+)\\] \\\"(\\S+) (\\S+) (\\S+)\\\" (\\d{3}) (\\d+) (\\S+) (\\S+)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bf-logs-archive/Fastly/www.buzzfeed.com'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1512424557')
Note: the
input.regex
matches the following log format from Fastly:> > ```%h [%{%Y-%m-%d %H:%M:%S}t.%{msec_frac}t] "%r" %>s %B %{tls.client.protocol}V %{fastly_info.state}V``` Rebuild indices (e.g. Athena doesn't know about _new_ content otherwise): ```sql MSCK REPAIR TABLE www_buzzfeed_com
Select requests based on a specific timestamp value:
SELECT * FROM www_buzzfeed_com WHERE timestamp = timestamp '2017-12-04 21:50:01.646' AND cache_status LIKE 'HIT%'
Select requests based on datetime partition field:
SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02'
Select requests based on a specific date range:
SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND cache_status LIKE 'MISS%'
Faster than LIKE statement:
SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND regexp_like(cache_status, '^(MISS|PASS)')
Order data so the most recent appears at the top and only the most recent 10 records:
SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02' ORDER BY timestamp DESC limit 10
Note: see peformance tricks here
To drop a table:
DROP TABLE `www_stage_buzzfeed_com_json`;
https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html
CREATE EXTERNAL TABLE `www_stage_buzzfeed_com_json`(
`backend` string,
`bodySize` int,
`contentType` string,
`fastlyState` string,
`host` string,
`ip` string,
`method` string,
`protocol` string,
`statusCode` smallint,
`timestamp` timestamp,
`tlsVersion` string,
`url` string,
`userAgent` string)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'true')
LOCATION
's3://bf-logs-archive/Fastly/json/www-stage.buzzfeed.com'
TBLPROPERTIES (
'has_encrypted_data'='false')
Notice that camelCase fields are lowercased (e.g. fastlyState becomes fastlystate):
SELECT url, statuscode, fastlystate FROM www_stage_buzzfeed_com_json WHERE dt='2019-04-04' AND backend = '' ORDER BY timestamp DESC LIMIT 10
To format output from command line:
cat ~/Downloads/2019-04-04T11_55_00.000-EKBlIHaTVrnIxJ7EPna7.log | tail -n 1 | python -m json.tool
{
"backend": "",
"bodySize": "240",
"contentType": "application/json",
"fastlyState": "HIT",
"host": "www.buzzfeed.com",
"ip": "70.50.102.167",
"method": "GET",
"protocol": "HTTP/1.1",
"statusCode": "200",
"timestamp": "2019-04-04 11:59:59.650",
"tlsVersion": "TLSv1.2",
"url": "/manifest.json",
"userAgent": "Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G960W Build/PPR1.180610.011) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36"
}
If you set a field to have a field of type int
and it’s sent as an empty string “” by default by the provider of the data, then expect Hive DB to complain loudly when trying to search for data.