Handling missing values in Pandas to Spark DataFrame conversion

Data transformation is an important aspect of data engineering and can be challenging, depending on the dataset and the transformation requirements. A bug in a transformation can severely affect the final dataset and lead to data issues. In this blog I am going to share my experience of having … Continue reading Handling missing values in Pandas to Spark DataFrame conversion

Sequential counter with groupby – Pandas DataFrame

A Pandas DataFrame is a 2-dimensional tabular data structure with labeled axes. For this blog, we have a table "person" in a database with name, age and city columns. As DML transactions are performed on this table, the new image of each record, along with the DML operation type, is captured and stored in a JSON file. The … Continue reading Sequential counter with groupby – Pandas DataFrame
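
The excerpt is cut off before it shows the post's own solution, so the following is only a minimal sketch of one common way to build a sequential counter per group, using groupby with cumcount. The sample records, the "op" column name and the "seq" column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical change records for the "person" table (values are made up).
# "op" stands in for the DML operation type: I = insert, U = update, D = delete.
records = [
    {"name": "alice", "age": 30, "city": "Pune", "op": "I"},
    {"name": "bob", "age": 41, "city": "Delhi", "op": "I"},
    {"name": "alice", "age": 31, "city": "Pune", "op": "U"},
    {"name": "alice", "age": 31, "city": "Mumbai", "op": "U"},
    {"name": "bob", "age": 41, "city": "Delhi", "op": "D"},
]
df = pd.DataFrame(records)

# cumcount() numbers the rows inside each group from 0, in row order,
# so adding 1 gives a 1-based counter per name.
df["seq"] = df.groupby("name").cumcount() + 1

print(df)
```

The counter follows the existing row order, so if the records need to be numbered by time, the frame would have to be sorted on a timestamp column first.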

Pandas Scratchpad – I

This blog is a scratchpad for day-to-day Pandas commands. pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. 1. A few quick ways to create a Pandas DataFrame: DataFrame from Dict of List - DataFrame from List of List - DataFrame from List of Dict - DataFrame … Continue reading Pandas Scratchpad – I
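
As a rough illustration of the first three construction styles named in the excerpt, here is a small sketch. The column names and values are made up, and the post itself may use different examples.

```python
import pandas as pd

# DataFrame from a dict of lists: keys become column names.
from_dict_of_lists = pd.DataFrame({"name": ["a", "b"], "age": [1, 2]})

# DataFrame from a list of lists: each inner list is a row,
# so the column names have to be supplied separately.
from_list_of_lists = pd.DataFrame([["a", 1], ["b", 2]], columns=["name", "age"])

# DataFrame from a list of dicts: each dict is a row, keys become columns.
from_list_of_dicts = pd.DataFrame([{"name": "a", "age": 1}, {"name": "b", "age": 2}])

print(from_dict_of_lists)
```

All three calls build the same two-row table here; which form is most convenient usually depends on the shape the source data already has.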

Pandas – ValueError: If using all scalar values, you must pass an index

Reading a JSON file with Pandas read_json can fail with "ValueError: If using all scalar values, you must pass an index". Let's see an example - cat a.json { "creator": "CaptainAmerica", "last_modifier": "NickFury", "title": "Captain America: The First Avenger", "view_count": 12000 } >>> import pandas as pd >>> import glob >>> for f in glob.glob('*.json'): … Continue reading Pandas – ValueError: If using all scalar values, you must pass an index
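
The excerpt stops before the post's fix, so the sketch below is not the author's solution. It assumes the error comes from building a DataFrame out of a flat JSON object whose values are all scalars, reproduces that with the DataFrame constructor and the standard json module instead of read_json, and shows two commonly used workarounds.

```python
import json

import pandas as pd

# Same content as a.json in the excerpt, kept inline so the sketch is self-contained.
raw = (
    '{"creator": "CaptainAmerica", "last_modifier": "NickFury", '
    '"title": "Captain America: The First Avenger", "view_count": 12000}'
)
record = json.loads(raw)

# A dict of scalars gives pandas no way to infer how many rows to build.
try:
    pd.DataFrame(record)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Workaround 1: wrap the record in a list so it is treated as one row.
df_from_list = pd.DataFrame([record])

# Workaround 2: pass an explicit index of length one.
df_with_index = pd.DataFrame(record, index=[0])

print(error_message)
print(df_from_list)
```

Wrapping each record in a list also fits the glob loop from the excerpt, since the one-row frames from several files can then be combined with pd.concat.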