Sequential counter with groupby – Pandas DataFrame

Pandas DataFrame is a 2-dimensional tabular data structure with labeled axes.

For this blog, we have a table “person” in database containing name, age and city column. As dml transactions are performed on this table, the new image of the record along with the dml operation type is captured and stored in json file. The dml optype is captured in eventname column.

nameagecityeventname
John23SeattleINSERT
Steve28PortlandINSERT
John24ArizonaUPDATE
Jack32New YorkINSERT
Marie28SpokaneINSERT
Marie29UtahUPDATE
Marie30SpokaneUPDATE
Marie26SpokaneUPDATE

One thing to note is that the timestamp is not captured as part of the transaction. But the transactions are stored in sequential order in which they were performed in the database.

So now if you are tasked to read the json file and retrieve the most recent record grouped by name, how would you go about it?

The first thought in my mind was to read the json file into Pandas DataFrame and simply use “rank” as it computes numerical data ranks along axis.

df['rank'] = df.groupby(['name']).rank(ascending=True, method='first')

nameagecityeventnamerank
0John23SeattleINSERT1
1Steve28PortlandINSERT1
2John24ArizonaUPDATE2
3Jack32New YorkINSERT1
4Marie28SpokaneINSERT2
5Marie29UtahUPDATE3
6Marie30SpokaneUPDATE4
7Marie26SpokaneUPDATE1

Looking closely at the above DataFrame, the ranking order for “John” is as expected, but for “Marie” the index 7 with eventname “UPDATE” is ranked as 1 (due to age) instead of index 4 with “INSERT” eventname. The transaction sequence order ranking for Marie is not correct in this case.

To achieve this you can use Pandas groupby.cumcount(). It numbers each item in each group from 0 to the length of that group – 1.

df['rank'] = df.groupby(['name']).cumcount().add(1)

nameagecityeventnamerank
0John23SeattleINSERT1
1Steve28PortlandINSERT1
2John24ArizonaUPDATE2
3Jack32New YorkINSERT1
4Marie28SpokaneINSERT1
5Marie29UtahUPDATE2
6Marie30SpokaneUPDATE3
7Marie26SpokaneUPDATE4

Now, to get recent most record grouped by name –

df[df['rank'] == df.groupby('name')['rank'].transform('max')]

nameagecityeventNamerank
1Steve28PortlandINSERT1
2John24ArizonaUPDATE2
3Jack32New YorkINSERT1
7Marie26SpokaneUPDATE4
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s