Overview AWS Glue DataBrew is a no-code visual data preparation tool for cleaning and normalizing data. It comes with many pre-built transformations to help with data preparation tasks. The service is geared more towards business analyst and data scientist personas, to explore, discover, visualize, clean, transform, and get insights from terabytes of raw … Continue reading AWS Glue DataBrew – Overview
Guide on PySpark DataFrame Functionality
In this blog post I cover some of the common PySpark DataFrame functions, joins, and window functions. All the commands were executed in the new job authoring Jupyter notebook available in preview in AWS Glue Studio. Import Libraries
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
import pyspark.sql.functions as F
… Continue reading Guide on PySpark DataFrame Functionality
How to load table with JSONB data type into Aurora PostgreSQL using AWS Glue
In this blog post I will cover what the JSON data type is, what options PostgreSQL offers for storing JSON data, how you can create an AWS Glue connection to an Aurora PostgreSQL database running in a private subnet, and how you can then use AWS Glue to write data into a table with the JSONB data type in Aurora/RDS PostgreSQL … Continue reading How to load table with JSONB data type into Aurora PostgreSQL using AWS Glue
Handling missing values in Pandas to Spark DataFrame conversion
Data transformation is an important aspect of data engineering and can be a challenging task, depending on the dataset and the transformation requirements. A bug in a transformation can severely affect the final dataset and lead to data issues. In this blog I am going to share my experience of having … Continue reading Handling missing values in Pandas to Spark DataFrame conversion
AWS Glue Job Fails with CSV data source does not support map data type error
AWS Glue is a serverless ETL service for processing large volumes of data from various sources for analytics and data processing. Recently I came across the "CSV data source does not support map data type" error in a newly created Glue job. In a nutshell, the job was performing the steps below: Read the data from S3 … Continue reading AWS Glue Job Fails with CSV data source does not support map data type error
Using AWS Data Wrangler with AWS Glue Job 2.0 and Amazon Redshift connection
I will admit, AWS Data Wrangler has become my go-to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. Its integration with multiple big data AWS services such as S3, Glue Catalog, Athena, databases, EMR, and others makes life simple for engineers. It also provides the ability to … Continue reading Using AWS Data Wrangler with AWS Glue Job 2.0 and Amazon Redshift connection
AWS Glue and PySpark Guide
In this post, I have written down AWS Glue and PySpark functionality that can be helpful when designing an AWS pipeline and writing AWS Glue PySpark scripts. AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large volumes of data from various sources for analytics and data processing. While … Continue reading AWS Glue and PySpark Guide
Filtering using Events Patterns – EventBridge
Amazon EventBridge, as the name suggests, is a serverless pub/sub service that lets applications connect via an "event bus". It helps build loosely coupled, distributed, event-driven architectures. EventBridge was formerly called CloudWatch Events. In this blog, I will give an example of setting a filter-based event pattern in Amazon EventBridge to send an SNS notification. … Continue reading Filtering using Events Patterns – EventBridge
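The excerpt above stops before the post's own example, so the following is only a minimal sketch of how filter-based event patterns behave, not the author's configuration. The pattern, the sample events, and the `matches` helper are all hypothetical; the helper covers exact-value matching only (no prefix, numeric, or `anything-but` operators) and runs locally without calling AWS.

```python
import json

# Hypothetical event pattern: each key lists the values that are allowed
# for that field; nested objects are matched field by field.
event_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}


def matches(pattern, event):
    """Simplified local model of exact-value event pattern matching.

    Assumption: this mirrors only the basic rules (every field named in the
    pattern must exist in the event and carry one of the listed values).
    It is not the EventBridge implementation.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # Leaf pattern: a list of allowed values. If the event field is
            # itself a list, one matching element is enough.
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in candidates):
                return False
    return True


# Hypothetical sample events (the instance ID is made up).
stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
running_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"},
}

print(json.dumps(event_pattern, indent=2))
print(matches(event_pattern, stopped_event))
print(matches(event_pattern, running_event))
```

In a real setup, a pattern like this would be attached to an EventBridge rule whose target is an SNS topic, so only the matching events trigger a notification.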
Cross-account AWS Glue Data Catalog access with Glue ETL
To process data in AWS Glue ETL, a DataFrame or DynamicFrame is required. A DataFrame is similar to a table and supports functional-style operations (map/reduce/filter/etc.) along with SQL operations. The AWS Glue DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. It computes a schema on-the-fly when required, and … Continue reading Cross-account AWS Glue Data Catalog access with Glue ETL
AWS Glue – Querying Nested JSON with Relationalize Transform
AWS Glue has a Relationalize transform that can convert nested JSON into columns that you can then write to S3 or import into relational databases. As an example, the initial schema:
>>> df.printSchema()
root
 |-- Id: string (nullable = true)
 |-- LastUpdated: long (nullable = true)
 |-- LastUpdatedBy: string (nullable = true)
 |-- Properties: struct (nullable … Continue reading AWS Glue – Querying Nested JSON with Relationalize Transform
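To give a feel for what flattening a nested struct into columns means, here is a rough plain-Python sketch. It is not the Relationalize transform itself, which operates on Glue DynamicFrames and also pivots arrays into separate tables. The `flatten` helper and the sample record are hypothetical, and the fields inside `Properties` (`Color`, `Size`) are invented because the excerpt is cut off before listing them.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dot-separated keys.

    Assumption: this only imitates how a nested struct field ends up as
    columns named like "Properties.Color"; arrays are left untouched.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structs, carrying the parent name along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical record shaped like the schema in the excerpt; the contents
# of "Properties" are made up for illustration.
record = {
    "Id": "abc-123",
    "LastUpdated": 1600000000,
    "LastUpdatedBy": "etl-user",
    "Properties": {"Color": "red", "Size": 10},
}

flat = flatten(record)
for column, value in flat.items():
    print(column, "=", value)
```

Each nested field becomes its own top-level column, which is the shape a relational table or a flat file on S3 expects.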