PANDAS groupings

david.wylie · April 6, 2020, 8:02pm

Following on from my previous question Working out summaries) …

Here’s the code I have so far :

import pandas
import numpy as np

data = [
    {'group_name': 'GROUP 1', 'client': 'CLIENT 1', 'messages': 100, 'supplier': 'SUP 1', 'blocked': 0}, 
    {'group_name': 'GROUP 2', 'client': 'CLIENT 1', 'messages': 200, 'supplier': 'SUP 1', 'blocked': 27}, 
    {'group_name': 'GROUP 3', 'client': 'CLIENT 1', 'messages': 300, 'supplier': 'SUP 1', 'blocked': 0}, 
    {'group_name': 'GROUP 1', 'client': 'CLIENT 2', 'messages': 400, 'supplier': 'SUP 1', 'blocked': 4}, 
    {'group_name': 'GROUP 2', 'client': 'CLIENT 2', 'messages': 500, 'supplier': 'SUP 1', 'blocked': 0}, 
    {'group_name': 'GROUP 4', 'client': 'CLIENT 3', 'messages': 600, 'supplier': 'SUP 1', 'blocked': 9}, 
]

newdata = {}
newdata['client'] = []
newdata['supplier'] = []
newdata['group_name'] = []
newdata['messages'] = []
newdata['blocked'] = []

for d in data:
    newdata['client'].append(d['client'])
    newdata['supplier'].append(d['supplier'])
    newdata['group_name'].append(d['group_name'])
    newdata['messages'].append(d['messages'])
    newdata['blocked'].append(d['blocked'])

df = pandas.DataFrame(newdata)

grouped = df.groupby(['client','supplier'])
res = grouped[['messages','blocked']].agg(np.sum)

print(res.to_dict("records"))

and it produces this :

[
  {
    'messages': 600, 
    'blocked': 27
  }, 
  {
    'messages': 900, 
    'blocked': 4
  }, 
  {
    'messages': 600, 
    'blocked': 9
  }
]

I want the respective “client” and “supplier” to be added to each record, like this :

[
  {
    'client': 'CLIENT 1',
    'supplier': 'SUP 1',
    'messages': 600, 
    'blocked': 27
  }, 
  { etc etc

but I cannot work it out. Can anyone help?

edit sorry I did not explain the intended result - each record should have two non-aggregated fields (client and supplier) and two aggregated fields (messages & blocked). I don’t really get pandas (or itertools groupby), and I think I’m just being a bit slow this month.

stucork · April 6, 2020, 11:28pm

here’s the pandas option:

import pandas as pd
import numpy as np

data = [
    {'group_name': 'GROUP 1', 'client': 'CLIENT 1', 'messages': 100, 'supplier': 'SUP 1', 'blocked': 0}, 
    {'group_name': 'GROUP 2', 'client': 'CLIENT 1', 'messages': 200, 'supplier': 'SUP 1', 'blocked': 27}, 
    {'group_name': 'GROUP 3', 'client': 'CLIENT 1', 'messages': 300, 'supplier': 'SUP 1', 'blocked': 0}, 
    {'group_name': 'GROUP 1', 'client': 'CLIENT 2', 'messages': 400, 'supplier': 'SUP 1', 'blocked': 4}, 
    {'group_name': 'GROUP 2', 'client': 'CLIENT 2', 'messages': 500, 'supplier': 'SUP 1', 'blocked': 0}, 
    {'group_name': 'GROUP 4', 'client': 'CLIENT 3', 'messages': 600, 'supplier': 'SUP 1', 'blocked': 9}, 
] 

df = pd.DataFrame(data)
grouped = df.groupby(['client','supplier'])
res = grouped[['messages','blocked']].agg(np.sum)
res['index'] = range(len(res))
res = res.reset_index()
res.to_dict('records')

it was a case of creating a new index before using to_dict

stucork · April 7, 2020, 12:23am

and the itertools groupby

from itertools import groupby 

def sorter(row):
  return (row['client'], row['supplier'])

data = sorted(data, key=sorter)
grouped = groupby(data, key=sorter)

option 1

records = []

for (client, supplier), group in grouped:
  record = {'client':client, 'supplier':supplier, 'messages':0, 'blocked':0}
  for g in group:
    record['messages'] += g['messages']
    record['blocked']  += g['blocked']
  records.append(record)

option 2

from functools import reduce

def agg(x, y):
  x['messages'] += y['messages']
  x['blocked'] += y['blocked']
  return x

records = []

for (client, supplier), group in grouped:
  record = {'client':client, 'supplier':supplier, 'messages':0, 'blocked':0}
  record = reduce(agg, group, record)
  records.append(record)

option 3

(if you like one(ish) liners)

from functools import reduce

def agg(x, y):
  x['messages'] += y['messages']
  x['blocked'] += y['blocked']
  return x

records = [reduce(agg, group, {'client':client, 'supplier':supplier, 'messages':0, 'blocked':0}) 
                 for (client, supplier), group in grouped]

david.wylie · April 7, 2020, 7:15am

You, Stu, are an absolute star.

Between you and @owen.campbell my problem is solved.

Next question - is there somewhere that explains all this slightly more simply than a “for dummies” book would? Something like “Data processing and pandas for 7 year olds”? Whilst my problem is solved, I still don’t understand it. I could probably get what you’ve shown me here, but I should really understand how it does it and the underlying principles.

stucork · April 7, 2020, 7:50am

whenever i work with pandas i try to use an ipython style console so that I can mess about with the dataframe quickly and then later solidify what I did into code.

that, and a combination of stackexchange and pandas documentation

I rated this free online book which has a chapter on numpy and a chapter on pandas:

https://jakevdp.github.io/PythonDataScienceHandbook/

the set up for groupby is usually the same.

data = sorted(data, key=sorter)
grouped = groupby(data, key=sorter)

groupby is a generator object
iterating through groupby gives you (key, group) pairs.
a group is also a generator object
iterating through the group gives you the data with that key

that’s why in all the code with itertools groupby you have to iterate through the group generator object.

robert · April 7, 2020, 3:45pm

I teach Pandas at a university, and I do recommend the Python Data Science Handbook that Stu mentioned.

If you don’t mind reading and trial and error work, then the book is probably the standard out there for getting started with Pandas, Numpy, and the other libraries common in data science.

If you like videos and short practices that scaffold and build up (like me), then I would recommend Datacamp. It’s not free but very excellent. We use it in all of our Python classes.

Here is a pretty decent video that goes into Pandas: https://www.youtube.com/watch?v=vmEHCJofslg

If you want to see a more lecture style of me talking WAY too much in front of a lecture hall here is my unlisted playlist of most of my lectures from my Intro to Python for Data Science course from last semester: https://www.youtube.com/playlist?list=PLFtx_MNb-J_pGCxZg8vi3hEeSZQkSXzld

campopianoa · April 7, 2020, 4:08pm

I would highly recommend Kevin Markham’s YouTube channel. He is well recognized in the Python community for his excellent Pandas and Data Science teaching.

hjp · April 8, 2020, 9:31am

I’ve had similar issues, and found tinydb is quite fast, simple and capable. In your use case, see this stackoverflow question for a workable example.

hjp · April 8, 2020, 9:34am

I didn’t check if it’s available on the free account plan, but it is part of the anvil python environment, so should be available on your business account.

david.wylie · April 8, 2020, 9:35am

Interesting - I would have thrown it back out to a DB had I not had the help with pandas. An in memory sql-like db is worth looking at, thanks.