AWS Glue and PySpark Guide

In this post, I have penned down AWS Glue and PySpark functionalities which can be helpful when thinking of creating AWS pipeline and writing AWS Glue PySpark scripts. AWS Glue is a fully managed extract, transform, and load (ETL) service to process large amount of datasets from various sources for analytics and data processing. While … Continue reading AWS Glue and PySpark Guide


Pandas Scratchpad – I

This blog is scratchpad for day-to-day Pandas commands. pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. 1. Few quick ways to create Pandas DataFrame DataFrame from Dict of List - DataFrame from List of List - DataFrame from List of Dict - DataFrame … Continue reading Pandas Scratchpad – I

Merge json files using Pandas

Quick demo for merging multiple json files using Pandas - import pandas as pd import glob import json file_list = glob.glob("*.json") >>> file_list ['b.json', 'c.json', 'a.json'] Use enumerate to assign counter to files. allFilesDict = {v:k for v, k in enumerate(file_list, 1)} >>> allFilesDict {1: 'b.json', 2: 'c.json', 3: 'a.json'} Append the data into list … Continue reading Merge json files using Pandas