[AWS Athena SQL]

Tags: #aws #athena #s3 #sql

AWS Athena SQL.md

Note: reference article

Basic query example:

SELECT status_code,
         COUNT(status_code) AS requests
FROM fastly_logs.example_com
GROUP BY  requests
ORDER BY  requests DESC

Create a table:

CREATE EXTERNAL TABLE `www_buzzfeed_com`(
  `client_ip` string COMMENT '', 
  `timestamp` timestamp COMMENT '', 
  `method` string COMMENT '', 
  `path` string COMMENT '', 
  `http_protocol` string COMMENT '', 
  `status_code` smallint COMMENT '', 
  `response_size` int COMMENT '', 
  `tls_version` string COMMENT '', 
  `cache_status` string COMMENT '')
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
  'input.regex'='(\\S+) \\[(.+)\\] \\\"(\\S+) (\\S+) (\\S+)\\\" (\\d{3}) (\\d+) (\\S+) (\\S+)$') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bf-logs-archive/Fastly/www.buzzfeed.com'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1512424557')

Note: the input.regex matches the following log format from Fastly:

>
> ```%h [%{%Y-%m-%d %H:%M:%S}t.%{msec_frac}t] "%r" %>s %B %{tls.client.protocol}V %{fastly_info.state}V```

Rebuild indices (e.g. Athena doesn't know about _new_ content otherwise):


```sql
MSCK REPAIR TABLE www_buzzfeed_com

Select requests based on a specific timestamp value:

SELECT * FROM www_buzzfeed_com WHERE timestamp = timestamp '2017-12-04 21:50:01.646' AND cache_status LIKE 'HIT%'

Select requests based on datetime partition field:

SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02'

Select requests based on a specific date range:

SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND cache_status LIKE 'MISS%'

Faster than LIKE statement:

SELECT * FROM www_buzzfeed_com WHERE dt>='2019-04-02' AND dt<='2019-04-03' AND regexp_like(cache_status, '^(MISS|PASS)')

Order data so the most recent appears at the top and only the most recent 10 records:

SELECT * FROM www_buzzfeed_com WHERE dt='2019-04-02' ORDER BY timestamp DESC limit 10

Note: see peformance tricks here

To drop a table:

DROP TABLE `www_stage_buzzfeed_com_json`;

JSON Format

https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html

CREATE EXTERNAL TABLE `www_stage_buzzfeed_com_json`(
  `backend` string,
  `bodySize` int,
  `contentType` string,
  `fastlyState` string,
  `host` string,
  `ip` string,
  `method` string,
  `protocol` string,
  `statusCode` smallint,
  `timestamp` timestamp,
  `tlsVersion` string,
  `url` string,
  `userAgent` string)
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true')
LOCATION
  's3://bf-logs-archive/Fastly/json/www-stage.buzzfeed.com'
TBLPROPERTIES (
  'has_encrypted_data'='false')

Notice that camelCase fields are lowercased (e.g. fastlyState becomes fastlystate):

SELECT url, statuscode, fastlystate FROM www_stage_buzzfeed_com_json WHERE dt='2019-04-04' AND backend = '' ORDER BY timestamp DESC LIMIT 10

To format output from command line:

cat ~/Downloads/2019-04-04T11_55_00.000-EKBlIHaTVrnIxJ7EPna7.log | tail -n 1 | python -m json.tool

{
    "backend": "",
    "bodySize": "240",
    "contentType": "application/json",
    "fastlyState": "HIT",
    "host": "www.buzzfeed.com",
    "ip": "70.50.102.167",
    "method": "GET",
    "protocol": "HTTP/1.1",
    "statusCode": "200",
    "timestamp": "2019-04-04 11:59:59.650",
    "tlsVersion": "TLSv1.2",
    "url": "/manifest.json",
    "userAgent": "Mozilla/5.0 (Linux; Android 9; SAMSUNG SM-G960W Build/PPR1.180610.011) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36"
}

Caution!

If you set a field to have a field of type int and it’s sent as an empty string “” by default by the provider of the data, then expect Hive DB to complain loudly when trying to search for data.