Pandas DataFrame is a 2-dimensional tabular data structure with labeled axes.
For this blog, we have a table “person” in database containing name, age and city column. As dml transactions are performed on this table, the new image of the record along with the dml operation type is captured and stored in json file. The dml optype is captured in eventname column.
name | age | city | eventname |
John | 23 | Seattle | INSERT |
Steve | 28 | Portland | INSERT |
John | 24 | Arizona | UPDATE |
Jack | 32 | New York | INSERT |
Marie | 28 | Spokane | INSERT |
Marie | 29 | Utah | UPDATE |
Marie | 30 | Spokane | UPDATE |
Marie | 26 | Spokane | UPDATE |
One thing to note is that the timestamp is not captured as part of the transaction. But the transactions are stored in sequential order in which they were performed in the database.
So now if you are tasked to read the json file and retrieve the most recent record grouped by name, how would you go about it?
The first thought in my mind was to read the json file into Pandas DataFrame and simply use “rank” as it computes numerical data ranks along axis.
df['rank'] = df.groupby(['name']).rank(ascending=True, method='first')
name | age | city | eventname | rank | |
0 | John | 23 | Seattle | INSERT | 1 |
1 | Steve | 28 | Portland | INSERT | 1 |
2 | John | 24 | Arizona | UPDATE | 2 |
3 | Jack | 32 | New York | INSERT | 1 |
4 | Marie | 28 | Spokane | INSERT | 2 |
5 | Marie | 29 | Utah | UPDATE | 3 |
6 | Marie | 30 | Spokane | UPDATE | 4 |
7 | Marie | 26 | Spokane | UPDATE | 1 |
Looking closely at the above DataFrame, the ranking order for “John” is as expected, but for “Marie” the index 7 with eventname “UPDATE” is ranked as 1 (due to age) instead of index 4 with “INSERT” eventname. The transaction sequence order ranking for Marie is not correct in this case.
To achieve this you can use Pandas groupby.cumcount(). It numbers each item in each group from 0 to the length of that group – 1.
df['rank'] = df.groupby(['name']).cumcount().add(1)
name | age | city | eventname | rank | |
0 | John | 23 | Seattle | INSERT | 1 |
1 | Steve | 28 | Portland | INSERT | 1 |
2 | John | 24 | Arizona | UPDATE | 2 |
3 | Jack | 32 | New York | INSERT | 1 |
4 | Marie | 28 | Spokane | INSERT | 1 |
5 | Marie | 29 | Utah | UPDATE | 2 |
6 | Marie | 30 | Spokane | UPDATE | 3 |
7 | Marie | 26 | Spokane | UPDATE | 4 |
Now, to get recent most record grouped by name –
df[df['rank'] == df.groupby('name')['rank'].transform('max')]
name | age | city | eventName | rank | |
1 | Steve | 28 | Portland | INSERT | 1 |
2 | John | 24 | Arizona | UPDATE | 2 |
3 | Jack | 32 | New York | INSERT | 1 |
7 | Marie | 26 | Spokane | UPDATE | 4 |