AWS Glue DataBrew – Overview

Overview: AWS Glue DataBrew is a no-code visual data preparation tool for cleaning and normalizing data. It comes with many pre-built transformations to help with data preparation tasks. The service is geared toward business analysts and data scientists who need to explore, discover, visualize, clean, transform, and get insights from terabytes of raw …

Use Amazon MSK Connect with Lenses plugin to sink data from Amazon MSK to Amazon S3

Apache Kafka is an open-source distributed event streaming platform consisting of servers and clients that communicate over a high-performance TCP network protocol. It lets you decouple your source and target systems. It is optimized for ingesting and processing streaming data in real time. Due to its distributed nature, it provides high throughput, scalability, a resilient architecture …

Guide on PySpark DataFrame Functionality

In this blog post I cover some of the common PySpark DataFrame functions, joins, and window functions. All the commands were executed in the new job-authoring Jupyter notebook available in preview in AWS Glue Studio. Import Libraries:
    from awsglue.job import Job
    from awsglue.transforms import *
    from awsglue.context import GlueContext
    import pyspark.sql.functions as F
    …

Amazon Athena – CTAS, UNLOAD, Parameterized Prepared statements, Partition Projection

Amazon Athena is a query service that enables users to analyze data in Amazon S3 using SQL. It uses Presto with ANSI SQL support and works with multiple data formats such as CSV, JSON, Parquet, Avro, and ORC. In this blog post I will go through the following Athena features: CREATE TABLE AS SELECT …

Apache Kafka – Introduction & Basic Commands

In this blog post I will cover some of the basics of Apache Kafka and a few useful commands to keep handy. Introduction: Apache Kafka is an open-source distributed event streaming platform. It lets you decouple your source and target systems. It is optimized for ingesting and processing streaming data in real time. Due to …

How to retrieve credentials stored in AWS Secrets Manager from AWS Lambda running in VPC

AWS Secrets Manager is a secrets management service that enables you to store credentials and retrieve them dynamically when you need them. It helps protect access to your applications and services. With AWS Secrets Manager you can:
- Programmatically retrieve encrypted secret values at runtime
- Store different types of secrets
- Encrypt secret data
- Automate secret rotation
When you …
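As a rough illustration of retrieving a secret programmatically at runtime, the sketch below wraps the Secrets Manager `get_secret_value` call. It assumes the secret was stored as a JSON string, which is common for credentials but not guaranteed. The helper name `get_secret`, the `FakeSecretsClient` stand-in, and the secret name `demo/db` are hypothetical and exist only so the example runs without AWS access. In a Lambda function the client would normally come from `boto3.client("secretsmanager")`.

```python
import json


def get_secret(client, secret_id):
    """Fetch a secret and parse its SecretString as JSON.

    `client` is expected to behave like a boto3 Secrets Manager client.
    Assumes the secret value is a JSON string, not binary data.
    """
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Hypothetical stand-in for the boto3 client, for local testing only."""

    def __init__(self, secrets):
        self._secrets = secrets

    def get_secret_value(self, SecretId):
        # Mimics only the two response fields this example relies on.
        return {
            "Name": SecretId,
            "SecretString": json.dumps(self._secrets[SecretId]),
        }


# Illustrative values only; a real handler would pass a boto3 client instead.
demo_client = FakeSecretsClient(
    {"demo/db": {"username": "app", "password": "example"}}
)
credentials = get_secret(demo_client, "demo/db")
```

Passing the client in as a parameter keeps the helper easy to test and lets the Lambda create the real client once, outside the handler, so warm invocations can reuse it.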

FastAI: Multi-Label Classification [Chapter-6]

PyTorch and fastai have two main classes for representing and accessing a training set or validation set:
- Dataset: A collection that returns a tuple of your independent and dependent variable for a single item
- DataLoader: An iterator that provides a stream of mini-batches, where each mini-batch is a tuple of a batch of independent variables and …
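The distinction can be sketched in plain Python, without PyTorch or fastai installed. This is only an analogy under simplifying assumptions: `make_dataset` and `iter_batches` are hypothetical helpers, not the library classes, and the real `DataLoader` also handles shuffling and collating items into tensors.

```python
def make_dataset(xs, ys):
    """Dataset analogue: indexing it returns one (independent, dependent) tuple."""
    return list(zip(xs, ys))


def iter_batches(dataset, batch_size):
    """DataLoader analogue: yields mini-batches as (batch of xs, batch of ys)."""
    for start in range(0, len(dataset), batch_size):
        chunk = dataset[start:start + batch_size]
        xb = [x for x, _ in chunk]
        yb = [y for _, y in chunk]
        yield xb, yb


dataset = make_dataset([0.0, 1.0, 2.0, 3.0, 4.0], ["a", "b", "c", "d", "e"])
batches = list(iter_batches(dataset, batch_size=2))
```

Here `dataset[0]` is a single item tuple, while each element of `batches` groups several items; the last batch is smaller because five items do not divide evenly by two.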

How to load table with JSONB data type into Aurora PostgreSQL using AWS Glue

In this blog post I will cover what the JSON data type is, what options PostgreSQL offers for storing JSON data, how to create an AWS Glue connection to an Aurora PostgreSQL database running in a private subnet, and how to then use AWS Glue to write data into a table with the JSONB data type in Aurora/RDS PostgreSQL …

FastAI – Image Classification – [Chapter 5]

DataLoaders is a class that just stores whatever DataLoader objects you pass to it and makes them available as train and valid. To turn downloaded data into a DataLoaders object we need at least four things:
- What kind of data we are working with
- How to get the list of items
- How to label …

FastAI : Training a Digit Classifier - [Chapter-4] - Part II

A metric drives human understanding, while the loss drives automated learning. Stochastic Gradient Descent: Arthur Samuel described machine learning as follows: "Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering …"
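The two ingredients in that description, a way to test the current weights and a way to alter them, can be sketched with a one-parameter model in plain Python. This is a minimal illustration and not the chapter's code: the model `y = w * x`, the squared-error loss, the learning rate, and the toy data are all assumptions chosen for the example.

```python
import random


def mean_squared_error(w, xs, ys):
    """Loss: tests how well the current weight w performs on the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd_fit_slope(xs, ys, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x by updating w after each individual sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in random order each epoch
        for i in indices:
            error = w * xs[i] - ys[i]
            grad = 2 * error * xs[i]  # derivative of (w*x - y)**2 w.r.t. w
            w -= lr * grad  # step against the gradient
    return w


# Toy data generated from y = 3 * x, so the ideal weight is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w = sgd_fit_slope(xs, ys)
final_loss = mean_squared_error(w, xs, ys)
```

With this data the weight settles very close to 3 and the loss approaches zero. A larger learning rate could make the per-sample updates overshoot, so the value here is deliberately small.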