AWS Glue DataBrew – Overview

Overview

AWS Glue DataBrew is a no-code, visual data preparation tool for cleaning and normalizing data. It comes with many pre-built transformations to help with data preparation tasks. The service is geared toward business analyst and data scientist personas who want to explore, discover, visualize, clean, transform, and get insights from terabytes of raw data without managing any infrastructure. You can read both structured and unstructured data into AWS Glue DataBrew from various sources. AWS Glue DataBrew can also be used to identify and transform personally identifiable information (PII) in a dataset. I will be writing more about the AWS services that can be used to identify and redact PII data in another blog.

Components of AWS Glue DataBrew

Dataset

The very first step in AWS Glue DataBrew is to connect a new dataset. You can create datasets from any of the supported sources. A dataset is a read-only connection to data, from which DataBrew collects a set of descriptive metadata to refer to the data. DataBrew neither stores nor alters the actual data. AWS Glue DataBrew supports the following file formats –

  • CSV – must be encoded with UTF-8
  • Microsoft Excel – must be encoded with UTF-8
  • JSON – must be encoded with UTF-8
  • TSV
  • ORC
  • Parquet
  • GlueParquet – Custom Parquet writer with performance optimizations for DynamicFrames, through the useGlueParquetWriter configuration key.

AWS Console > AWS Glue DataBrew

AWS Console > AWS Glue DataBrew > Connect new dataset

AWS Glue DataBrew currently supports the following sources (a small boto3 sketch follows this list) –

  1. File upload
  2. Amazon S3
  3. JDBC Connections – (Microsoft SQL Server / MySQL / Oracle / PostgreSQL)
  4. Amazon Redshift
  5. Amazon AppFlow
  6. AWS Data Exchange
  7. Snowflake
  8. AWS Glue Data Catalog
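If you prefer to script this step, connecting an S3-backed dataset can also be done with the boto3 DataBrew client. The snippet below is a minimal sketch; the bucket, key, and dataset name are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

# Register a read-only dataset that points at a CSV file in S3.
# Bucket, key, and dataset name are placeholders.
databrew.create_dataset(
    Name="sales-raw",
    Format="CSV",
    FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": True}},
    Input={
        "S3InputDefinition": {
            "Bucket": "my-databrew-input-bucket",
            "Key": "raw/sales.csv",
        }
    },
)
```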

A few points to note when Amazon S3 is the source

DataBrew can process all of the files in an S3 folder, but only if the following conditions are true:

  • All of the files in the folder have the same format.
  • All of the files in the folder have the same file extension.

With Amazon S3 as the source, you can create dynamic datasets to schedule data preparation on new incoming files and to apply transformations to filtered or conditionally chosen files or folders in S3. Using dynamic dataset parameters, you can select different input files for the dataset and create recurring data preparation jobs that dynamically update data from Amazon S3. When you use dynamic dataset parameters, they are evaluated at dataset loading time, both in a project and during a job run.

You can also choose how many files to include in the result set (Specify number of files to include) or how recent those files should be (Specify last updated date range).

With Specify number of files to include, you select either Latest or Oldest and the number of files to include.

With Specify last updated date range, you can choose between Past hour / Past 24 hours / Past week / Past month / Past year, or create a custom date range by providing “After” and “Before” absolute timestamps with a timezone, or by choosing a relative time.

Define dynamic dataset parameters
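The dynamic dataset parameters described above map to the PathOptions argument of the same create_dataset call. Here is a minimal boto3 sketch that keeps only the 10 most recently modified files under a folder; the bucket, prefix, and limits are assumptions.

```python
import boto3

databrew = boto3.client("databrew")

# Dynamic dataset over an S3 folder: keep only the 10 most recently
# modified files. Bucket, prefix, and limits are illustrative.
databrew.create_dataset(
    Name="sales-latest-files",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-databrew-input-bucket",
            "Key": "raw/",  # folder prefix; all files must share format and extension
        }
    },
    PathOptions={
        # "Specify number of files to include": Latest 10 files.
        "FilesLimit": {
            "MaxFiles": 10,
            "OrderedBy": "LAST_MODIFIED_DATE",
            "Order": "DESCENDING",
        }
        # "Specify last updated date range" maps to a LastModifiedDateCondition
        # entry in this same PathOptions block.
    },
)
```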

Data Profile

Once the dataset has been created, you can preview it, run a data profile job, set data quality rules, and look at the data lineage. Here we are going to create a data profile job (an optional step).

Datasets – Data profile overview

When creating a data profile job you have a few choices to make (a minimal API sketch follows the list) –

  1. Job run sample – You can define the scope of the dataset the job runs on: either the full dataset or a custom sample of rows.
  2. Job output settings – The data profile job generates an output file in JSON format containing the profile metadata. You can write the output file to the current AWS account or to another AWS account.
  3. Data profile configurations – There are two levels of configuration –
    • Dataset level configurations –
      • Duplicate values – Identifies the number of duplicate rows in the dataset. It’s enabled by default.
      • PII statistics – Identifies column statistics with PII. It’s disabled by default. If enabled, you can specify the PII categories to evaluate or choose all categories. The current list of entity types is available here.
      • Correlations widget – Per my observation, enabling PII statistics automatically enables the Correlations widget (the first time). It can be disabled if not needed. The Correlations widget provides a matrix visualization of correlation coefficients.
    • Column level configurations – Various column-level statistics are collected – unique values count, missing values count, distinct values count, entropy, mode, min, etc. By default they are calculated for all columns, but the configuration can be edited to include only the required columns.
  4. Advanced job settings – Under the advanced job settings you can set the maximum number of DataBrew nodes used when the job runs. By default it is set to 5, with the job timeout set to 48 hours. Based on the amount of data you have to process, you can increase or decrease the number of units/nodes allocated. You can also enable Amazon CloudWatch Logs for the job. Every time you run the job, you will need to set the number of nodes to the desired value; as of now there isn’t an option to look at the details of each execution to confirm the number of units used per run.
  5. Permissions – As usual, choose the IAM role that has access to connect to the data source.
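For reference, here is a minimal boto3 sketch of the same choices. The dataset name, output bucket, IAM role, and the PII entity type names are placeholders or assumptions.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_profile_job(
    Name="sales-raw-profile",
    DatasetName="sales-raw",
    # 1. Job run sample: full dataset or a custom number of rows.
    JobSample={"Mode": "FULL_DATASET"},
    # 2. Job output settings: the profile JSON is written to this location.
    OutputLocation={"Bucket": "my-databrew-output-bucket", "Key": "profiles/"},
    # 3. Data profile configurations: PII statistics are off unless configured.
    Configuration={
        "EntityDetectorConfiguration": {
            # Entity type names are illustrative; see the docs for the full list.
            "EntityTypes": ["EMAIL", "USA_SSN"],
        }
    },
    # 4. Advanced job settings: nodes, timeout (minutes), CloudWatch logs.
    MaxCapacity=5,
    Timeout=2880,  # 48 hours
    LogSubscription="ENABLE",
    # 5. Permissions.
    RoleArn="arn:aws:iam::111122223333:role/DataBrewServiceRole",
)
```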

Run the job, and once it completes it will populate both the “Data profile overview” and the “Column statistics” tabs. To get the job run details, go to JOBS > Profile jobs. Click the job name and it will take you to the Job run history, Job details, and Data lineage. The Output tab under Job run history contains the JSON file name and destination path.

AWS Console > AWS Glue DataBrew > Jobs > Profile jobs
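The same run history is also available programmatically; a quick sketch, assuming the profile job name used in the earlier example:

```python
import boto3

databrew = boto3.client("databrew")

# Job run history for the profile job (job name is a placeholder).
runs = databrew.list_job_runs(Name="sales-raw-profile")
for run in runs["JobRuns"]:
    print(run["RunId"], run["State"], run.get("CompletedOn"))
```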

Projects

A project creates a workspace to explore, understand, clean, and transform the data. When you create a project you have the option to create a new recipe that captures all the steps you perform, or to edit an existing one.

To create a project, provide at minimum the project name, recipe details, dataset selection, and permissions. If you have an existing recipe, you can attach and edit it or import steps from it. You can also select the sampling type and the size of the sample. By default, the type is set to First n rows with a 500-row sample.

Project creates a workspace to explore and transform data.

As you transform the sample data, DataBrew helps you build and refine the project recipe. Once you have added all the transformations you can create a job.

Project = Dataset + Recipe
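A minimal boto3 sketch of project creation, assuming the dataset and recipe names from the earlier examples and the default First n rows sample of 500 rows:

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_project(
    Name="sales-cleanup-project",
    DatasetName="sales-raw",
    RecipeName="sales-cleanup-recipe",
    # Default sampling: first 500 rows of the dataset.
    Sample={"Type": "FIRST_N", "Size": 500},
    RoleArn="arn:aws:iam::111122223333:role/DataBrewServiceRole",
)
```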

Recipes

A recipe is a step-by-step series of transformations that will be applied to the dataset. The recipe is automatically saved as you keep working, adding new steps or deleting existing ones. Once the recipe is ready you have the option to publish it so that it can later be applied to the entire dataset. You also have options to download the recipe as YAML or JSON. You can include up to 100 data transformations in a single recipe.
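To make the shape of a recipe concrete, here is a hedged boto3 sketch that defines and publishes a small recipe. The operation names and parameters (UPPER_CASE, RENAME, sourceColumn, targetColumn) are illustrative examples of DataBrew recipe actions, and the column names are placeholders; check the recipe actions reference for the exact operations you need.

```python
import boto3

databrew = boto3.client("databrew")

# Each step is an Action with an Operation and its Parameters.
# Operations and parameters below are illustrative placeholders.
databrew.create_recipe(
    Name="sales-cleanup-recipe",
    Steps=[
        {"Action": {"Operation": "UPPER_CASE",
                    "Parameters": {"sourceColumn": "country"}}},
        {"Action": {"Operation": "RENAME",
                    "Parameters": {"sourceColumn": "cust_id",
                                   "targetColumn": "customer_id"}}},
    ],
)

# Publishing creates an immutable version that recipe jobs can reference.
databrew.publish_recipe(Name="sales-cleanup-recipe")
```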

Jobs

Once you have the dataset and recipe, the next step is to create a “recipe job”. This job performs the transformations defined in the recipe on the DataBrew dataset and writes the result to a target. The following targets are currently supported by AWS Glue DataBrew –

  1. Amazon S3
  2. JDBC Connections
  3. Amazon Redshift
  4. AWS Glue Data Catalog
  5. Snowflake

To create a job, you can either create it directly from the Projects console or choose JOBS from the navigation pane. If you create the job by selecting JOBS from the navigation pane, you can select from two types of jobs:

  • Recipe Job – This runs all the transformation steps mentioned in a recipe and will produce a transformed dataset.
  • Profile Job – We have discussed this in the Data Profile section. A profile job examines the dataset and generates summary statistics about it.

Under the Job output settings you have the option to select the destination for your job output. If your target is Amazon S3, you have further choices (see the sketch after this list) –

  • You can choose one of the available data output formats for Amazon S3.
  • Based on the choice of file type, you can choose from the available compression types.
  • For CSV output, you have the option to choose from the available delimiters.
  • If you click Settings, you have further options:
    • File output storage – You can choose to either create a new folder for each job run (default) or replace output files for each job run.
    • File partitioning – Under File partitioning you have options for file output and for custom partitioning by column values. There are three file output options with Amazon S3.
      • Autogenerate files (default) – DataBrew determines the optimal number of files.
      • Single file output – The job generates a single output file.
      • Multiple file output – This gives you the option to specify the number of output files, between 2 and 999.
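Here is a minimal boto3 sketch of a recipe job with S3 output settings. The project name, bucket, and values such as the compression type and MaxOutputFiles are assumptions for illustration.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="sales-cleanup-job",
    # A project already bundles the dataset and the recipe; alternatively,
    # pass DatasetName plus RecipeReference instead of ProjectName.
    ProjectName="sales-cleanup-project",
    Outputs=[{
        "Location": {"Bucket": "my-databrew-output-bucket", "Key": "cleaned/"},
        "Format": "CSV",
        "FormatOptions": {"Csv": {"Delimiter": ","}},
        "CompressionFormat": "GZIP",
        "Overwrite": False,       # keep output from previous runs
        "MaxOutputFiles": 10,     # multiple file output (2-999)
    }],
    RoleArn="arn:aws:iam::111122223333:role/DataBrewServiceRole",
)
```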

You can automate the DataBrew job to run at a particular time or on a recurring basis using a DataBrew job schedule.
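Scheduling can also be set up through the API; a small sketch assuming the job name above and a daily 02:00 UTC cron expression:

```python
import boto3

databrew = boto3.client("databrew")

# Run the recipe job every day at 02:00 UTC.
databrew.create_schedule(
    Name="sales-cleanup-nightly",
    JobNames=["sales-cleanup-job"],
    CronExpression="cron(0 2 * * ? *)",
)
```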

After you create and run the recipe job, you can monitor its progress by checking the status while the job is running. Once the job completes successfully, the status changes to Succeeded and the transformed dataset is written to the destination.
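Programmatically, the same start-and-monitor loop looks roughly like this (job name as above; the polling interval is arbitrary):

```python
import time

import boto3

databrew = boto3.client("databrew")

# Start the recipe job and poll until it reaches a terminal state.
run_id = databrew.start_job_run(Name="sales-cleanup-job")["RunId"]
while True:
    state = databrew.describe_job_run(Name="sales-cleanup-job", RunId=run_id)["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Job finished with state:", state)
        break
    time.sleep(30)  # arbitrary polling interval
```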

Glue DataBrew Recipe Job

AWS Glue DataBrew also shows the lineage for the job run

So a DataBrew recipe job runs a project, and a project consists of a dataset and a recipe.

Conclusion

To conclude, AWS Glue DataBrew is a visual data preparation tool to interactively discover, clean, normalize, and transform raw data without writing code. The service supports various data sources and destinations to prepare a dataset and write it to the desired output location. You can prepare your data and save the transformation steps as a recipe. You can create a project and run a job to write the result to an output location of your choice. If required, you can also run a profile job on a DataBrew dataset to generate summary statistics about it. You can use AWS Glue DataBrew to mask or encrypt sensitive PII data. You can also ensure the quality of your datasets by defining a list of data quality rules in a ruleset.
