TLDR
In this Mage Academy lesson on feature engineering, we’ll learn how we’d aggregate, or sum up, values by group to find total items sold.
Glossary
How it works
Example code
Magical no-code solution ✨🔮
How it works
Since the concepts, aggregate, group by, and total sum, are a little tech jargon-y, this section defines these concepts by example to help those who are new to these functions.
Cropped Mr. Sun menu
Imagine you own this shop, and you’re looking through today’s sales
historyat the end of the day. Each drink sold is logged with a name, price, and category.
Group by
is a function that groups the rows by a distinct value in a column. Let’s say that Mage made a group order of 13 bobas one Thursday afternoon. Instead of throwing them all into one bag, the manager placed all 7 drinks from the “Milk Tea” category in one bag, 2 of the “Mr. Sun Specialties” in another, etc to see, at a glance, which category is most popular with Mage employees. Thus, group by is a function that groups rows that have matching values in a certain column; in our boba example, we grouped by menu category, but we can group by any column.
Why would grouping by category be useful to you? Perhaps some drink categories require more labor, time, or ingredients to make, so we need to find out and ensure they’re all profitable.
Art from San-X
Since we’d like to sum the profits by category to see if the profits outweigh the input, using an aggregate function would be perfect in this case. An
aggregate
function is often used in conjunction with
group by
to perform a calculation on a column to create new data from existing features. While there are many kinds of aggregate functions (ie. count, average, minimum, max), today we’ll be discussing
total sum
, which is summing values in the “Price” column based on groups in the “Category” column.
Example code
In this section, we’ll implement aggregate by total sum using the example boba shop scenario mentioned above.
The general procedure implementing aggregate by total sum in code is:
Group rows by unique value in the “Category” column using a dictionary
For each row entry in each group, sum the prices to obtain profits per category
1
2
3
import pandas as pd
df = pd.read_csv('boba-history.csv')
df
Hypothetical sales for today
Our first order of business is to re-organize our data and group them by category. The best way to do this is to go through row by row (iterate) and save the drink name and price to a unique key in a dictionary.
In this step, we are essentially creating boxes labeled by a key (ie. ‘Fresh Tea’, ‘Milk Foam Series’) and storing information about the drink (name and price) for every drink that falls under that category as a tuple. Here’s an example of a category “Smoothies” saved as a dictionary key, with the values as a list that contains drink name and price:
{‘Smoothies/Slushies’ : [ (‘Mango Smoothie’, $6.25) }
1
2
3
4
5
6
7
8
9
10
11
12
categories = {}
for _, row in df.iterrows():
# create new key for each unique category we encounter
key = row['Category']
if key not in categories:
categories[key] = []
# add name + price to each group
categories[key].append((row['Drink'], row['Price']))
categories
This dictionary/object format is loosely how Pandas’s built-in function groupby() saves groups, so remember it for the next section!
Now that we have sorted our boba drinks by category, we can finally sum the price by group. Thus, for every group (key), we’ll sum all of the prices (found in the second index of each drink) to get the total sales of each boba drink category.
1
2
3
4
5
6
price = {}
for key in categories:
price[key] = 0
# for each group, sum the price
for drink in categories[key]:
price[key] += drink[1] # the price
From our results, we can see that the “Milk Tea” category has earned us the most, which makes sense since we sold 7 cups of boba from this category.
Pandas supports
groupby()and aggregate functions, so we can simplify the process we fleshed out in the previous section. We can use the Pandas function to compartmentalize our drinks into their respective categories.
However, since the groupby function is used with an aggregate function (like finding the average, count, total sum, etc), the function returns a weird object that doesn’t quite know what to do with itself because it’s expecting you to call an aggregate function after.
In this next section, there’s no need to run this code as it’s just dismantling the groupby object to explicate that the drinks are indeed grouped by categories:
Notes about the code:
We can use
groups.keys()to view the unique categories that we’ve grouped all of our drinks into. In other words, these are all the unique values found in the “Category” column of our dataset, and all drinks are categorized under one of these groups.
Since “key” is one of the group names, we use
get_group(key)
to view the drinks that are categorized under this group.
We observe that there are seven orders under the category “Milk Tea,” which we will mention later
Now that we’ve confirmed that all the drinks are sorted into categories, we can now apply sum() to total the profits made in each category. This is the only code you have to run, convenient, right?
1
2
gp = df.groupby(['Category'])['Price'].sum()
gp
We can also observe that the total sums we obtained in this section matches those from the “From scratch” section earlier, which (probably) means it’s correct!
The complete code to obtain the sum of price by category is:
Magical no-code solution ✨🔮
On days when you don’t feel like writing even one line of code, try our service, Mage! Our “Aggregate” function will make you a column that tells you the total sum of each group. Take a look at our demo below:
Start building for free
No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.
No need for a credit card to get started. Trying out Mage to build ranking models won’t cost a cent.