Implementing Glue ETL job with Job Bookmarks

AWS Glue is a fully managed ETL service for loading large datasets from various sources for analytics and data processing with Apache Spark ETL jobs. In this post I will discuss the use of the AWS Glue Job Bookmarks feature in the following architecture. AWS Glue Job Bookmarks help Glue maintain state information of … Continue reading Implementing Glue ETL job with Job Bookmarks
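Conceptually, a job bookmark persists state about what the last run already processed, so the next run picks up only new data. Below is a minimal pure-Python sketch of that idea only, not the Glue API itself (in Glue the feature is enabled via the job's bookmark option and per-source `transformation_ctx` values):

```python
# Conceptual sketch of what a job bookmark does: persist the position of
# the last successful run so the next run sees only newer input.
def run_incremental(files, bookmark):
    """files: list of {"name": str, "modified": int}; bookmark: mutable dict."""
    last = bookmark.get("last_modified", 0)
    new_files = [f for f in files if f["modified"] > last]
    if new_files:
        # "Commit" the bookmark only after the batch succeeds,
        # analogous to Glue advancing state on job.commit().
        bookmark["last_modified"] = max(f["modified"] for f in new_files)
    return [f["name"] for f in new_files]
```

Running the function twice over the same input returns the new files once and then nothing, which is the reprocessing-avoidance behaviour bookmarks provide.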

Reading Parquet files with AWS Lambda

I had a use case to read data (a few columns) from a Parquet file stored in S3 and write it to a DynamoDB table every time a file was uploaded. Planning to use AWS Lambda, I was looking at options for reading Parquet files within Lambda until I stumbled upon AWS Data Wrangler. From the document … Continue reading Reading Parquet files with AWS Lambda
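A sketch of what such a Lambda handler might look like, assuming AWS Data Wrangler (`awswrangler`) is packaged with the function; the bucket, table, and column names are hypothetical, and the imports sit inside the handler only so the sketch can be parsed without the AWS libraries installed:

```python
def lambda_handler(event, context):
    # Imports inside the handler so this sketch loads without the AWS SDKs
    # present; in a real Lambda they would sit at module level.
    import boto3
    import awswrangler as wr

    # The S3 event notification carries the bucket and key of the new file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read only the columns needed for DynamoDB (names are illustrative).
    df = wr.s3.read_parquet(
        path=f"s3://{bucket}/{key}",
        columns=["user_id", "name", "score"],
    )

    # Write each row into a DynamoDB table ("users" is a placeholder name).
    table = boto3.resource("dynamodb").Table("users")
    with table.batch_writer() as batch:
        for item in df.to_dict("records"):
            batch.put_item(Item=item)
```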

AWS Glue – Querying Nested JSON with Relationalize Transform

AWS Glue has a transform, Relationalize, that can convert nested JSON into columns that you can then write to S3 or import into relational databases. As an example, the initial schema:

>>> df.printSchema()
root
 |-- Id: string (nullable = true)
 |-- LastUpdated: long (nullable = true)
 |-- LastUpdatedBy: string (nullable = true)
 |-- Properties: struct (nullable …

Continue reading AWS Glue – Querying Nested JSON with Relationalize Transform
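Conceptually, Relationalize flattens nested struct fields into dotted column names (and splits arrays out into separate tables). A small pure-Python sketch of the struct-flattening part, using field names modelled on the schema above:

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, the way Relationalize
    turns struct fields into top-level columns. (Arrays, which Relationalize
    splits into separate tables, are not handled in this sketch.)"""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat
```

For example, a record with a nested `Properties` struct comes out with columns like `Properties.<field>` alongside the top-level `Id` and `LastUpdated` columns.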

Redshift: Convert TEXT to Timestamp

How do you convert TEXT to a timestamp in Redshift? If the score column has data in the given format, how can you display the timestamp?

{"Choices":null, "timestamp":"1579650266955", "scaledScore":null}

select cast(json_extract_path_text(score, 'timestamp') as timestamp) from schema.table limit 10;

This SQL will fail with:

ERROR: Invalid data
DETAIL:
-----------------------------------------------
error: Invalid data
code: 8001
context: Invalid format …

Continue reading Redshift: Convert TEXT to Timestamp
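The CAST fails because the extracted value is a Unix epoch in milliseconds, not a timestamp literal. Whatever SQL idiom the full post settles on, the conversion it has to perform is the following arithmetic, shown here as a Python sketch:

```python
from datetime import datetime, timezone

# json_extract_path_text returns the value as text; it is epoch
# *milliseconds*, so split off the milliseconds before converting
# from the Unix epoch.
millis = int("1579650266955")
seconds, remainder_ms = divmod(millis, 1000)
dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(dt)  # 2020-01-21 23:44:26+00:00
```

Casting the raw string directly fails for the same reason the Redshift CAST does: "1579650266955" is not in any timestamp format.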

Athena: Extracting data from JSON

Suppose you have a table in Athena and one of its columns contains JSON data. How can you extract the individual keys? In the example, the table has a column "fixedproperties" which contains JSON data. How can you display the data in the format below?

select json_extract(fixedproperties, '$.objectId') as object_id,
       json_extract(fixedproperties, '$.custId') as cust_id,
       json_extract(fixedproperties, '$.score') as score … Continue reading Athena: Extracting data from JSON
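Each `json_extract(fixedproperties, '$.key')` call in the query pulls one key out of the JSON document. The same extraction in plain Python (the sample values below are made up for illustration):

```python
import json

# A made-up value of the "fixedproperties" column.
fixedproperties = '{"objectId": "obj-1", "custId": "c-42", "score": 0.87}'

doc = json.loads(fixedproperties)
# One dictionary lookup per json_extract(..., '$.key') in the query:
row = {
    "object_id": doc["objectId"],
    "cust_id": doc["custId"],
    "score": doc["score"],
}
```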

Athena – SQL to get date of next Monday

I was recently asked how to get the date of the next Monday irrespective of which day of the week the SQL is executed, so I thought to share it in case someone else has such a requirement.

select date_add('day', 8 - extract(day_of_week from current_date), current_date);

Or:

select date_trunc('week', current_date) + interval '7' day;

Happy learning 🙂
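The first query relies on ISO `day_of_week` numbering (Monday = 1 … Sunday = 7), which Athena's Presto engine uses. Python's `date.isoweekday()` follows the same numbering, so the formula can be checked directly:

```python
from datetime import date, timedelta

def next_monday(current: date) -> date:
    # Mirrors date_add('day', 8 - extract(day_of_week from current_date), current_date):
    # with ISO day_of_week (Monday = 1 ... Sunday = 7), adding 8 - dow days
    # always lands on the following Monday (a full week ahead on a Monday).
    return current + timedelta(days=8 - current.isoweekday())
```

The two SQL forms are equivalent: `date_trunc('week', d)` is `d` minus `(dow - 1)` days, so adding 7 days gives the same `d + (8 - dow)`.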

S3 – fatal error: An error occurred (404) when calling the HeadObject operation

Make sure to use the --recursive parameter.

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/
fatal error: An error occurred (404) when calling the HeadObject operation: Key "dw/user/dt=2019-07-30/" does not exist

[desktop: test]:${PWD}> aws s3 cp s3://demo-beta/dw/user/dt=2019-07-30/ /tmp/dw/ --recursive
download: s3://demo-beta/dw/user/dt=2019-07-30/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet to ../../../tmp/dw/part-00002-fd866c-238-489-a44-739f1d04-c000.snappy.parquet

From the documentation:

--recursive (boolean) Command is performed on all files or objects under … Continue reading S3 – fatal error: An error occurred (404) when calling the HeadObject operation