Using AWS Data Wrangler with AWS Glue Job 2.0

I will admit, AWS Data Wrangler has become my go-to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. AWS Data Wrangler's integration with multiple AWS big data services such as S3, the Glue Catalog, Athena, databases, and EMR makes life simple for engineers. It also provides the ability to … Continue reading Using AWS Data Wrangler with AWS Glue Job 2.0
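As a taste of the API, here is a minimal sketch of querying Athena into pandas and writing the result back to S3 and the Glue Catalog; the bucket, filter, and output table names are assumptions for illustration.

import awswrangler as wr

# Query an existing Glue/Athena table straight into a pandas DataFrame.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM marvel_superheroes WHERE publish_year = 1936",  # hypothetical filter
    database="marvel",
)

# Write it back to S3 as Parquet and register the table in the Glue Catalog.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/marvel_1936/",  # hypothetical bucket
    dataset=True,
    database="marvel",
    table="marvel_1936",
)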

Parquet timestamp and Athena Query

In this blog I will walk you through how timestamps are stored in Parquet file format versions 1.0 and 2.0, how the timestamp column data is displayed in Athena for each version, and how you can cast the timestamp column in Athena to view timestamps for version 2.0 Parquet files. Using the AWS Glue crawler, I … Continue reading Parquet timestamp and Athena Query
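For flavor, here is a sketch of the kind of cast involved, assuming the version 2.0 file stores the timestamp as epoch microseconds and Athena therefore sees a bigint column; the table and column names are hypothetical.

import awswrangler as wr

# from_unixtime() expects seconds, so divide the microsecond epoch value by 1e6.
df = wr.athena.read_sql_query(
    sql="""
        SELECT from_unixtime(event_ts / 1e6) AS event_timestamp  -- event_ts: bigint epoch micros
        FROM "my_db"."my_parquet_v2_table"
    """,
    database="my_db",
)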

Rename Glue Tables using AWS Data Wrangler

I had a use case of renaming over 50 tables, adding a "prod_" prefix to the existing Glue tables. AWS Athena does not support the native Hive DDL "ALTER TABLE table_name RENAME TO" command. So one of the options was to:

1. "Generate Create Table DDL" in AWS Athena.
2. Modify the table name.
3. Execute the DDL.
4. Preview the new table.
5. Drop the … Continue reading Rename Glue Tables using AWS Data Wrangler
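The post does this with AWS Data Wrangler; as a rough sketch of the same rename mechanics against the Glue Catalog directly with boto3 (database and table names are hypothetical):

import boto3

glue = boto3.client("glue")
database = "my_db"  # hypothetical database

for table in ["sales", "customers"]:  # hypothetical tables
    definition = glue.get_table(DatabaseName=database, Name=table)["Table"]
    # Copy only the fields create_table accepts; the rest is read-only metadata.
    glue.create_table(
        DatabaseName=database,
        TableInput={
            "Name": f"prod_{table}",
            "StorageDescriptor": definition["StorageDescriptor"],
            "PartitionKeys": definition.get("PartitionKeys", []),
            "TableType": definition.get("TableType", "EXTERNAL_TABLE"),
            "Parameters": definition.get("Parameters", {}),
        },
    )
    glue.delete_table(DatabaseName=database, Name=table)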

Transform AWS CloudTrail data using AWS Data Wrangler

The AWS CloudTrail service captures actions taken by an IAM user, an IAM role, APIs, SDKs, and other AWS services. By default, AWS CloudTrail is enabled in your AWS account. You can create a "trail" to record ongoing events, which will be delivered in JSON format to an Amazon S3 bucket of your choice. CloudTrail Dashboard. Create Trail … Continue reading Transform AWS CloudTrail data using AWS Data Wrangler
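A rough sketch of the transform step, assuming the trail has already delivered its logs to S3 and glossing over details like compression handling; the bucket paths are hypothetical, and each CloudTrail file is a JSON object with a top-level "Records" array.

import awswrangler as wr
import pandas as pd

# Read the raw CloudTrail JSON files delivered by the trail.
df = wr.s3.read_json(path="s3://my-trail-bucket/AWSLogs/")  # hypothetical path

# Flatten the nested records into one row per API event.
records = pd.json_normalize(df["Records"].tolist())

# Persist the flattened events as Parquet for cheaper, faster queries.
wr.s3.to_parquet(
    df=records,
    path="s3://my-analytics-bucket/cloudtrail/",  # hypothetical path
    dataset=True,
)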

AWS Glue and PySpark Guide

In this post, I have penned down AWS Glue and PySpark functionalities that can be helpful when designing an AWS data pipeline and writing AWS Glue PySpark scripts. AWS Glue is a fully managed extract, transform, and load (ETL) service to process large datasets from various sources for analytics and data processing. While … Continue reading AWS Glue and PySpark Guide
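For context, this is the standard boilerplate every Glue PySpark script builds on; the database and table names below are assumptions.

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session

# Read a Glue Catalog table into a DynamicFrame (hypothetical names),
# then drop into a plain Spark DataFrame when convenient.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="marvel",
    table_name="marvel_superheroes",
)
df = dyf.toDF()  # DynamicFrame -> Spark DataFrame
df.printSchema()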

Filtering using Events Patterns – EventBridge

Amazon EventBridge, as the name suggests, is a serverless pub/sub service that lets applications connect via an "event bus". It helps build loosely coupled, distributed, event-driven architectures. EventBridge was formerly called CloudWatch Events. In this blog, I will give an example of setting a filter-based event pattern in Amazon EventBridge to send an SNS notification. … Continue reading Filtering using Events Patterns – EventBridge
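To make the idea concrete, here is a minimal sketch of a filtering event pattern wired to an SNS target with boto3; the rule name, pattern values, and topic ARN are hypothetical.

import json
import boto3

events = boto3.client("events")

# Match only EC2 instance state-change events whose state is "terminated".
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}

events.put_rule(Name="ec2-terminated", EventPattern=json.dumps(pattern))
events.put_targets(
    Rule="ec2-terminated",
    Targets=[{"Id": "sns-target", "Arn": "arn:aws:sns:us-east-1:123456789012:alerts"}],
)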

AWS Athena – DML Queries

You can learn something new every day, and today I learned that AWS Athena supports INSERT INTO queries.

SELECT * FROM information_schema.columns
WHERE table_schema = 'marvel' AND table_name = 'marvel_superheroes'

Let's create a table based on marvel_superheroes using the CTAS command:

-- To create a non-partitioned table based on a WHERE clause
CREATE TABLE "marvel"."marvel_1936"
WITH ( format = … Continue reading AWS Athena – DML Queries
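And the INSERT INTO itself, here sketched from Python with AWS Data Wrangler; the filter column is an assumption.

import awswrangler as wr

# Append rows from the source table into the CTAS-created table.
sql = """
INSERT INTO "marvel"."marvel_1936"
SELECT * FROM "marvel"."marvel_superheroes"
WHERE publish_year = 1936  -- hypothetical filter column
"""
query_id = wr.athena.start_query_execution(sql=sql, database="marvel")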

Implementing Glue ETL job with Job Bookmarks

AWS Glue is a fully managed ETL service to process large datasets from various sources for analytics and data processing with Apache Spark ETL jobs. In this post I will discuss the use of the AWS Glue Job Bookmarks feature in the following architecture. AWS Glue Job Bookmarks help Glue maintain state information of … Continue reading Implementing Glue ETL job with Job Bookmarks
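The moving parts bookmarks rely on are job.init(), a transformation_ctx on each source, and job.commit(); a minimal sketch with hypothetical names:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # bookmark state is keyed by the job name

# transformation_ctx is the handle the bookmark uses to remember
# which input files this source has already processed.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",      # hypothetical
    table_name="raw_orders",  # hypothetical
    transformation_ctx="raw_orders_source",
)

# ... transform and write dyf ...

job.commit()  # persists the bookmark so the next run skips processed files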

Aurora MySQL – Export data to S3

Using SELECT INTO OUTFILE S3 you can query data from an Aurora MySQL DB cluster and save it directly into text files stored in an S3 bucket.

1. Create an IAM policy for S3.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::bucket-name",
                … Continue reading Aurora MySQL – Export data to S3
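The statement itself looks roughly like this when run from Python; the host, credentials, table, and bucket are all hypothetical placeholders, and note that older Aurora versions expect an 's3-<region>://' URI prefix rather than 's3://'.

import pymysql

# Connect to the Aurora MySQL cluster endpoint (hypothetical values).
conn = pymysql.connect(
    host="my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    user="admin",
    password="********",
    database="sales_db",
)
with conn.cursor() as cur:
    # Aurora streams the result set straight to S3 as delimited text files.
    cur.execute(
        """
        SELECT * FROM orders
        INTO OUTFILE S3 's3://bucket-name/exports/orders'
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        """
    )
conn.close()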