Merge json files using Pandas

Quick demo for merging multiple json files using Pandas –

import pandas as pd
import glob
import json

file_list = glob.glob("*.json")
>>> file_list
['b.json', 'c.json', 'a.json']

Use enumerate to assign counter to files.


allFilesDict = {v:k for v, k in enumerate(file_list, 1)}
>>> allFilesDict
{1: 'b.json', 2: 'c.json', 3: 'a.json'}

Append the data into list –

>>> data = []

for k,v in allFilesDict.items():
    if 1 <= k <= 2:
        with open(v, 'r') as d:
            jdata = json.load(d)
            if jdata:
                data.append(jdata)
...

data
[{'creator': 'NickFury', 'last_modifier': 'NickFury', 'title': 'Avengers', 'view_count': 300000}, {'creator': 'TonyStark', 'last_modifier': 'IronMan', 'title': 'Iron Man 3', 'view_count': 1899000}]


Create pandas dataframe –

df = pd.DataFrame(data)

Write the dataframe to json

df.to_json(r'/tmp/data_output_default.json')

dev-dsk% cat /tmp/data_output_default.json | jq
{
  "creator": {
    "0": "NickFury",
    "1": "TonyStark"
  },
  "last_modifier": {
    "0": "NickFury",
    "1": "IronMan"
  },
  "title": {
    "0": "Avengers",
    "1": "Iron Man 3"
  },
  "view_count": {
    "0": 300000,
    "1": 1899000
  }
}

df.to_json(r'/tmp/data_output_records.json', orient='records')

dev-dsk % cat /tmp/data_output_records.json|jq
[
  {
    "creator": "NickFury",
    "last_modifier": "NickFury",
    "title": "Avengers",
    "view_count": 300000
  },
  {
    "creator": "TonyStark",
    "last_modifier": "IronMan",
    "title": "Iron Man 3",
    "view_count": 1899000
  }
]

One caveat of using dataframe is that its memory intensive. So if you have alot of files to merge, it can consume good amount of memory on host.

More about jq and to_json

One issue with above merged file is it doesn’t read the data as individual records –

dev-dsk % cat /tmp/data_output_records.json | wc -l
0

Using jq you can rewite it as –


dev-dsk % cat /tmp/data_output_records.json| jq -c '.[]' > /tmp/data_output_records_new.json

dev-dsk % cat /tmp/data_output_records_new.json|jq
{
  "creator": "NickFury",
  "last_modifier": "NickFury",
  "title": "Avengers",
  "view_count": 300000
}
{
  "creator": "TonyStark",
  "last_modifier": "IronMan",
  "title": "Iron Man 3",
  "view_count": 1899000
}

dev-dsk % cat /tmp/data_output_records_new.json | wc -l
2

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s