Guide on PySpark DataFrame Functionality

In this blog post I cover some common PySpark DataFrame functions, joins, and window functions. All the commands were executed in the new job-authoring Jupyter notebook, available in preview in AWS Glue Studio. Import libraries: from awsglue.job import Job; from awsglue.transforms import *; from awsglue.context import GlueContext; import pyspark.sql.functions as F …

Amazon Athena – CTAS, UNLOAD, Parameterized Prepared statements, Partition Projection

Amazon Athena is a query service that enables users to analyze data in Amazon S3 using SQL. It uses Presto with ANSI SQL support and works with multiple data formats such as CSV, JSON, Parquet, Avro, and ORC. In this blog post I will go through the following Athena features: CREATE TABLE AS SELECT …

Apache Kafka – Introduction & Basic Commands

In this blog post I will cover some of the basics of Apache Kafka and a few useful commands to keep handy. Introduction: Apache Kafka is an open-source distributed event streaming platform. It allows you to decouple your source system from your target system. It is optimized for ingesting and processing streaming data in real time. Due to …

How to load table with JSONB data type into Aurora PostgreSQL using AWS Glue

In this blog post I will cover what the JSON data type is, what options PostgreSQL offers for storing JSON data, how you can create an AWS Glue connection to an Aurora PostgreSQL database running in a private subnet, and how you can then use AWS Glue to write data into a table with the JSONB data type in Aurora/RDS PostgreSQL …

Export table from Aurora PostgreSQL to Amazon S3

In this blog post I discuss how to export a 100GB non-partitioned table from Aurora PostgreSQL to Amazon S3. I will walk you through two approaches that you can use to export the data. First I will demonstrate using aws_s3, a PostgreSQL extension that Aurora PostgreSQL provides, and then using the AWS Glue service. The post also …

Handling missing values in Pandas to Spark DataFrame conversion

Data transformation is an important aspect of data engineering and can be a challenging task, depending on the dataset and the transformation requirements. A bug in a data transformation can severely affect the final dataset and lead to data issues. In this blog I am going to share my experience of having …

Add new partitions in AWS Glue Data Catalog from AWS Glue Job

Given that you have a partitioned table in the AWS Glue Data Catalog, there are a few ways in which you can update the catalog with newly created partitions: run MSCK REPAIR TABLE <database>.<table_name> in Amazon Athena, or rerun the AWS Glue crawler. Recently, the AWS Glue service team added a new feature (or …

Sequential counter with groupby – Pandas DataFrame

A Pandas DataFrame is a 2-dimensional tabular data structure with labeled axes. For this blog, we have a table "person" in a database containing name, age, and city columns. As DML transactions are performed on this table, the new image of the record, along with the DML operation type, is captured and stored in a JSON file. The …
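The excerpt above is cut off before it shows the post's own approach, so the following is only a minimal sketch of one common way to build a sequential counter per group in pandas, using `groupby(...).cumcount()`. The column names (`name`, `city`, `op`, `seq`) and the sample rows are hypothetical and chosen purely for illustration.

```python
import pandas as pd

# Hypothetical change records for a "person" table: each row is the new
# image of a record plus the DML operation that produced it.
records = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Alice", "Bob", "Alice"],
        "city": ["Pune", "Delhi", "Mumbai", "Delhi", "Goa"],
        "op": ["insert", "insert", "update", "update", "delete"],
    }
)

# cumcount() numbers the rows within each group starting at 0, in their
# existing order; adding 1 gives a 1-based sequential counter per name.
records["seq"] = records.groupby("name").cumcount() + 1

print(records)
```

Because `cumcount()` follows the current row order, sorting the frame (for example by a timestamp column, if one exists) before grouping would control which change gets which sequence number.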

New Features in Amazon DynamoDB – PartiQL, Export to S3, Integration with Kinesis Data Streams

Every year around AWS re:Invent, AWS releases many new features over the course of a month. In this blog post I will touch on three new features that were introduced for Amazon DynamoDB. DynamoDB is a managed non-relational database with single-digit millisecond performance at any scale. New Features in Amazon DynamoDB - PartiQL - SQL-compatible …

Using AWS Data Wrangler with AWS Glue Job 2.0 and Amazon Redshift connection

I will admit, AWS Data Wrangler has become my go-to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. AWS Data Wrangler's integration with multiple big data AWS services, such as S3, Glue Catalog, Athena, databases, EMR, and others, makes life simple for engineers. It also provides the ability to …