In this blog post I will discuss the following scenarios for connecting to databases from an AWS Lambda function: connecting to an Amazon Aurora PostgreSQL database in a private subnet with public accessibility set to No in the same AWS account, and connecting to a cross-account Amazon Redshift database in a public subnet with public accessibility set to Yes. Connect to Amazon Aurora … Continue reading Connect to AWS Aurora PostgreSQL/Amazon Redshift Database from AWS Lambda
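A minimal sketch of the Lambda-to-Aurora connection the excerpt describes, assuming psycopg2 is packaged in the deployment artifact or a Lambda layer and the function runs inside a VPC that can reach the database; the environment-variable names and handler below are illustrative, not taken from the post:

```python
import os

def build_dsn(env=os.environ):
    # Assemble a libpq-style DSN from environment variables
    # (the variable names here are illustrative, not from the original post).
    return (
        f"host={env['DB_HOST']} port={env.get('DB_PORT', '5432')} "
        f"dbname={env['DB_NAME']} user={env['DB_USER']} password={env['DB_PASSWORD']}"
    )

def lambda_handler(event, context):
    # psycopg2 must be bundled with the function or provided via a layer; the
    # import is lazy so build_dsn() above remains testable without it.
    import psycopg2
    conn = psycopg2.connect(build_dsn())
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT version()")
            return {"version": cur.fetchone()[0]}
    finally:
        conn.close()
```

For a database with public accessibility set to No, the Lambda function's subnets and security groups must allow outbound traffic to the Aurora endpoint on port 5432.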
Add new partitions in AWS Glue Data Catalog from AWS Glue Job
Given that you have a partitioned table in the AWS Glue Data Catalog, there are a few ways to update the Data Catalog with newly created partitions: run MSCK REPAIR TABLE <database>.<table_name> in the AWS Athena service, or rerun the AWS Glue crawler. Recently, the AWS Glue service team added a new feature (or … Continue reading Add new partitions in AWS Glue Data Catalog from AWS Glue Job
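Besides MSCK REPAIR TABLE, partitions can be registered explicitly with an ALTER TABLE … ADD PARTITION statement. A small helper that builds such a statement from partition key/value pairs, assuming a Hive-style key=value S3 layout (the names and layout are illustrative, not from the post):

```python
def add_partition_ddl(database, table, partitions, s3_prefix):
    """Build an Athena 'ALTER TABLE ... ADD IF NOT EXISTS PARTITION' statement.

    `partitions` is a list of dicts, e.g. [{"year": "2021", "month": "01"}],
    and `s3_prefix` is the table's root S3 location.
    """
    clauses = []
    for p in partitions:
        spec = ", ".join(f"{k} = '{v}'" for k, v in p.items())
        path = "/".join(f"{k}={v}" for k, v in p.items())
        clauses.append(f"PARTITION ({spec}) LOCATION '{s3_prefix}/{path}/'")
    return f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS " + " ".join(clauses)
```

The resulting string can be submitted through Athena; unlike MSCK REPAIR TABLE, it only touches the partitions you name, which is cheaper on tables with many partitions.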
Use redshift-data api with AWS Glue Python Shell job
AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large amounts of data from various sources for analytics and data processing. When you add an AWS Glue job, you can choose Spark, Spark Streaming, or Python shell as the job type. For one of my use cases I … Continue reading Use redshift-data api with AWS Glue Python Shell job
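A rough sketch of calling the Redshift Data API from a Python Shell job with boto3: execute a statement, then poll until it reaches a terminal state. The cluster identifier, database, and user below are placeholders; the appeal of the Data API here is that the job needs no persistent JDBC connection or driver packaging:

```python
import time

def statement_finished(desc):
    # True once describe_statement reports a terminal state.
    return desc.get("Status") in ("FINISHED", "FAILED", "ABORTED")

def run_redshift_sql(sql, cluster_id, database, db_user, poll_seconds=2):
    """Run one statement via the Redshift Data API and wait for completion.

    cluster_id, database, and db_user are illustrative placeholders.
    """
    import boto3  # lazy import so statement_finished stays testable offline
    client = boto3.client("redshift-data")
    stmt = client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )
    while True:
        desc = client.describe_statement(Id=stmt["Id"])
        if statement_finished(desc):
            return desc
        time.sleep(poll_seconds)
```

For SELECT statements, the results can then be fetched with `get_statement_result` using the same statement Id.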
Sequential counter with groupby – Pandas DataFrame
A Pandas DataFrame is a 2-dimensional tabular data structure with labeled axes. For this blog, we have a table "person" in a database containing name, age, and city columns. As DML transactions are performed on this table, the new image of each record, along with the DML operation type, is captured and stored in a JSON file. The … Continue reading Sequential counter with groupby – Pandas DataFrame
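The core idea of a per-group sequential counter can be sketched with `groupby(...).cumcount()`, shown here on a small synthetic set of change-capture rows (the sample data is illustrative, not from the post):

```python
import pandas as pd

# Hypothetical change-capture records for the "person" table: each row is the
# new image of a record plus the DML operation type.
df = pd.DataFrame(
    {
        "name": ["Alice", "Alice", "Bob", "Alice", "Bob"],
        "op": ["I", "U", "I", "U", "D"],
    }
)

# Sequential counter per person, in arrival order: cumcount() numbers the rows
# within each group starting at 0, so add 1 for a 1-based sequence.
df["seq"] = df.groupby("name").cumcount() + 1
```

The counter makes it easy to pick the latest image per record later, e.g. by keeping the row with the maximum `seq` within each group.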
New Features in Amazon DynamoDB – PartiQL, Export to S3, Integration with Kinesis Data Streams
Every year around AWS re:Invent, AWS releases many new features over the course of a month. In this blog post I will touch on three new features introduced for Amazon DynamoDB. DynamoDB is a non-relational managed database with single-digit millisecond performance at any scale. New Features in Amazon DynamoDB - PartiQL - SQL-compatible … Continue reading New Features in Amazon DynamoDB – PartiQL, Export to S3, Integration with Kinesis Data Streams
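As a taste of the PartiQL feature, a statement can be submitted to DynamoDB with boto3's `execute_statement`. The table and attribute names below are illustrative:

```python
def build_partiql_select(table, key_attr, key_value):
    # PartiQL SELECT for matching items; double-quoting the table name allows
    # names containing dashes. Table/attribute names here are illustrative.
    return f"SELECT * FROM \"{table}\" WHERE {key_attr} = '{key_value}'"

def query_items(table, key_attr, key_value):
    import boto3  # lazy import so the statement builder is testable offline
    client = boto3.client("dynamodb")
    resp = client.execute_statement(
        Statement=build_partiql_select(table, key_attr, key_value)
    )
    return resp["Items"]
```

PartiQL does not change DynamoDB's cost model: a SELECT without a key condition still scans the table, so key-based predicates remain important.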
Using AWS Data Wrangler with AWS Glue Job 2.0 and Amazon Redshift connection
I will admit, AWS Data Wrangler has become my go-to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. AWS Data Wrangler's integration with multiple big-data AWS services like S3, Glue Catalog, Athena, databases, EMR, and others makes life simple for engineers. It also provides the ability to … Continue reading Using AWS Data Wrangler with AWS Glue Job 2.0 and Amazon Redshift connection
Parquet timestamp and Athena Query
In this blog I will walk you through the way timestamps are stored in Parquet file versions 1.0 and 2.0, how the timestamp column data is displayed in Athena for each version, and how you can cast the timestamp column in Athena to view timestamps for version 2.0 Parquet files. Using an AWS Glue crawler, I … Continue reading Parquet timestamp and Athena Query
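One common symptom of the version mismatch is a timestamp column surfacing as a raw epoch value instead of a readable timestamp, which Athena can fix with an expression like `from_unixtime(col / 1000)`. Assuming the column appears as epoch milliseconds (an assumption for illustration, not a claim about every version-2.0 file), the same conversion looks like this in plain Python:

```python
from datetime import datetime, timezone

def epoch_millis_to_iso(ms):
    # Mirrors what Athena's from_unixtime(col / 1000) would display for a
    # timestamp column that surfaced as a raw epoch-milliseconds bigint.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )
```

In Athena itself, the cast would live in the SELECT list or a view so downstream queries see a proper timestamp type.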
Rename Glue Tables using AWS Data Wrangler
I had a use case of renaming over 50 tables, adding a "prod_" prefix to the existing Glue tables. AWS Athena does not support the native Hive DDL "ALTER TABLE table_name RENAME TO" command. So one option was to "Generate Create Table DDL" in AWS Athena, modify the table name, execute the DDL, preview the new table, and drop the … Continue reading Rename Glue Tables using AWS Data Wrangler
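The post itself uses AWS Data Wrangler; as a rough boto3 sketch of the same copy-then-drop idea (Glue has no rename API, so you create a new table entry pointing at the same storage and delete the old one), with all names illustrative:

```python
def prod_name(table_name, prefix="prod_"):
    # New table name; idempotent if the prefix is already present.
    return table_name if table_name.startswith(prefix) else prefix + table_name

def rename_table(database, table_name):
    """'Rename' a Glue table by copying its definition under the new name
    and deleting the original catalog entry. The S3 data is untouched."""
    import boto3  # lazy import so prod_name stays testable offline
    glue = boto3.client("glue")
    old = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    new_input = {
        k: v
        for k, v in old.items()
        if k in ("StorageDescriptor", "PartitionKeys", "TableType", "Parameters")
    }
    new_input["Name"] = prod_name(table_name)
    glue.create_table(DatabaseName=database, TableInput=new_input)
    glue.delete_table(DatabaseName=database, Name=table_name)
```

For partitioned tables, the partition entries would also need to be copied to the new table before dropping the old one.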
Transform AWS CloudTrail data using AWS Data Wrangler
The AWS CloudTrail service captures actions taken by an IAM user, an IAM role, APIs, SDKs, and other AWS services. By default, AWS CloudTrail is enabled in your AWS account. You can create a "trail" to record ongoing events, which will be delivered in JSON format to an Amazon S3 bucket of your choice. CloudTrail Dashboard Create Trail … Continue reading Transform AWS CloudTrail data using AWS Data Wrangler
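The delivered log files hold a `Records` array of nested event objects, which flatten naturally into a tabular form with pandas. A sketch on a trimmed, synthetic record (the event values are made up for illustration):

```python
import json
import pandas as pd

# A trimmed, synthetic CloudTrail log file body; real files delivered to S3
# contain a "Records" array of many such event objects.
raw = json.loads(
    '{"Records": [{"eventTime": "2021-01-01T00:00:00Z", '
    '"eventName": "GetObject", "awsRegion": "us-east-1", '
    '"userIdentity": {"type": "IAMUser", "userName": "alice"}}]}'
)

# Flatten nested fields such as userIdentity.* into dotted column names.
df = pd.json_normalize(raw["Records"])
```

From here the flattened frame can be filtered, aggregated, or written back out to S3 in a columnar format for cheaper querying.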
AWS Glue and PySpark Guide
In this post, I have penned down AWS Glue and PySpark functionalities which can be helpful when creating an AWS data pipeline and writing AWS Glue PySpark scripts. AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large amounts of data from various sources for analytics and data processing. While … Continue reading AWS Glue and PySpark Guide