Handling missing values in Pandas to Spark DataFrame conversion

Data transformation is an important part of data engineering, and it can be challenging depending on the dataset and the transformation requirements. A bug in a transformation can corrupt the final dataset and cause downstream data issues. In this blog I am going to share my experience of having …

Using AWS Data Wrangler with AWS Glue Job 2.0 and Amazon Redshift connection

I will admit, AWS Data Wrangler has become my go-to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. Its integration with multiple AWS big data services such as S3, Glue Catalog, Athena, databases, and EMR makes life simple for engineers. It also provides the ability to …

Rename Glue Tables using AWS Data Wrangler

I had a use case of renaming over 50 Glue tables by adding a "prod_" prefix to the existing table names. AWS Athena does not support the native Hive DDL command "ALTER TABLE table_name RENAME TO". So one option was to: "Generate Create Table DDL" in AWS Athena. Modify the table name. Execute the DDL. Preview the new table. Drop the …
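The "modify the table name" step of that manual route can be sketched in plain Python. This is only an illustration, not the approach the post goes on to describe with AWS Data Wrangler. The helper name `add_prefix_to_ddl` is hypothetical, and the sketch assumes the generated DDL begins with a `CREATE EXTERNAL TABLE` clause whose table name may be wrapped in backticks.

```python
import re


def add_prefix_to_ddl(ddl: str, old_name: str, prefix: str = "prod_") -> str:
    """Return the DDL with `prefix` added to the table name in its CREATE clause.

    Hypothetical helper: assumes the DDL contains a clause of the form
    CREATE [EXTERNAL] TABLE `old_name` (backticks optional).
    """
    pattern = re.compile(
        r"(CREATE\s+(?:EXTERNAL\s+)?TABLE\s+`?)"  # clause up to the table name
        + re.escape(old_name)
        + r"(?!\w)",  # do not match a longer name such as old_name + "_archive"
        flags=re.IGNORECASE,
    )
    # Rewrite only the first match, so column names and the S3 location stay untouched.
    new_ddl, count = pattern.subn(
        lambda m: m.group(1) + prefix + old_name, ddl, count=1
    )
    if count == 0:
        raise ValueError(f"no CREATE TABLE clause found for table {old_name!r}")
    return new_ddl
```

Only the CREATE clause is rewritten, so a table location that happens to contain the old name is left as it was. Executing the new DDL and dropping the old table would still be separate steps.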

Transform AWS CloudTrail data using AWS Data Wrangler

The AWS CloudTrail service captures actions taken by an IAM user, IAM role, APIs, SDKs and other AWS services. By default, AWS CloudTrail is enabled in your AWS account. You can create a "trail" to record ongoing events, which will be delivered in JSON format to an Amazon S3 bucket of your choice. …

Reading Parquet files with AWS Lambda

I had a use case to read data (a few columns) from a Parquet file stored in S3 and write it to a DynamoDB table every time a file was uploaded. Planning to use AWS Lambda, I was looking at options for reading Parquet files within Lambda until I stumbled upon AWS Data Wrangler. From the document …