Amazon Athena is a query service that enables users to analyze data in Amazon S3 using SQL. It uses Presto with ANSI SQL support and works with multiple data formats like CSV, JSON, Parquet, Avro and ORC. In this blog post I will go through below available features of Athena - CREATE TABLE AS SELECT … Continue reading Amazon Athena – CTAS, UNLOAD, Parameterized Prepared statements, Partition Projection
Add new partitions in AWS Glue Data Catalog from AWS Glue Job
Given that you have a partitioned table in AWS Glue Data Catalog, there are few ways in which you can update the Glue Data Catalog with the newly created partitions. Run MSCK REPAIR TABLE <database>.<table_name> in AWS Athena service.Rerun the AWS Glue crawler . Recently, AWS Glue service team has added a new feature (or … Continue reading Add new partitions in AWS Glue Data Catalog from AWS Glue Job
Parquet timestamp and Athena Query
In this blog I will walk you through the way timestamp is stored in Parquet file version 1.0 and 2.0, how the timestamp column data is displayed in Athena for each version and how you can cast the timestamp column in Athena to view timestamp for version 2.0 Parquet files. Using AWS Glue crawler, I … Continue reading Parquet timestamp and Athena Query
Rename Glue Tables using AWS Data Wrangler
I had a use case of renaming over 50 tables, adding "prod_" prefix to the existing Glue tables. AWS Athena does not support native Hive DDL "ALTER TABLE table_name RENAME TO" command. So one of the option was to - "Generate Create Table DDL" in AWS Athena.Modify the table name.Execute the DDL.Preview the new table.Drop the … Continue reading Rename Glue Tables using AWS Data Wrangler
AWS Athena – DML Queries
You can learn something new everyday, and today I learned that AWS Athena supports INSERT INTO queries. SELECT * FROM information_schema.columns WHERE table_schema = 'marvel' AND table_name = 'marvel_superheroes' Lets create table based on marvel_superheroes using CTAS command - # To create non-partitioned table based on where clause CREATE TABLE "marvel"."marvel_1936" WITH ( format = … Continue reading AWS Athena – DML Queries
Usecase with RDS Snapshot Export to S3
AWS recently announced "Amazon RDS Snapshot Export to S3" feature wherein you can now export Amazon Relational Database Service (Amazon RDS) or Amazon Aurora snapshots to Amazon S3 as Apache Parquet, an efficient open columnar storage format for analytics. I had a use-case to refresh Athena tables daily with full data set in Account B(us-east-1) … Continue reading Usecase with RDS Snapshot Export to S3
Athena: Extracting data from JSON
Suppose you have a table in Athena and its column contain JSON data. How can you extract the individual keys? In the example, the table has column "fixedproperties" which contain JSON data - How can you display the data is below format? select json_extract(fixedproperties, '$.objectId') as object_id, json_extract(fixedproperties, '$.custId') as cust_id, json_extract(fixedproperties, '$.score') as score … Continue reading Athena: Extracting data from JSON
Athena – SQL to get date of next Monday
I was recently asked how to get date of next Monday irrespective of which day of the week sql is executed. So thought to share it, in-case someone else has such requirement. select date_add('day', 8 - extract(day_of_week from current_date), current_date) Or, select date_trunc('week', current_date) + interval '7' day; Happy learning 🙂
Expanding array to multiple rows – Athena
A single row in Athena table is stored as -- select id, course, date from demo.course_tab where id='1234567892' id course date 1234567892 [95c3c1bc5873, 2e345b2eb678, 027b02599f4a, 8695a580520b, 5d453355d415, cdcc7682070b] 2019-06-13 The datatype for course column is array(string). Now, how can you get the output in below format - id course date 1 1234567892 95c3c1bc5873 2019-06-13 2 … Continue reading Expanding array to multiple rows – Athena