Getting started with artificial intelligence (AI): Part 3 - Transforming dataframes

September 30, 2021 · 10 minute read

Nathaniel Tjandra

Growth

TLDR

Terraforming means reshaping a planet on a large scale so that it can support life. We’ll start smaller by “terraforming” datasets to calculate the cost of survival on the Titanic.

  • Introduction

  • Before we begin

  • Functional programming

  • Applying Function

  • Aggregating Data

  • Transforming Data

  • Data Analysis

  • Conclusion

From the SHAP article, we know that people in some groups were more likely to survive when the Titanic sank. But what does it cost to survive the Titanic?

Titanic meets Iceberg (Source: Britannica)

In “Product developers’ guide to getting started with AI - Part 3: Terraforming dataframes”, we’ll look at the price point of a “golden ticket” that ensures the best chance of survival. Based on the SHAP values calculated, survival is strongly associated with sex, passenger class, fare, and age.

Mage Analyzer Page (Source: SHAP)

Manipulating a dataset is a quick and easy way to rearrange data and extract what you need. So far in this series we’ve gone over how to pick and search through data, so it’s time to look at transforming the underlying data.

It is highly advised to read part 2 before continuing.

In this guide, we’ll be using the Titanic dataset along with Google Colab.

 

I’ll briefly reuse techniques from previous parts, such as surfing and extracting, to quickly set up a dataframe suited to applying transformations and functions.

Part 2: Surfing through dataframes

Python supports a functional programming style: functions are first-class values, so they can be passed as arguments to other functions. This is important because later in this guide we’ll create functions and pass lambda expressions to apply and transform. If you’re already comfortable with Python, you may skip this section. Otherwise, keep reading for a quick refresher on the syntax for defining functions and lambda expressions.

In Python, a function is defined with the “def” keyword and takes zero or more arguments.

Basic Adder that adds 1 to the value
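The original code screenshot is not reproduced here. A minimal sketch of such an adder, using the hypothetical name `adder`, might look like this:

```python
def adder(value):
    """Return the given value plus 1."""
    return value + 1


print(adder(4))  # prints 5
```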

The adder function can be rewritten as a lambda expression, which serves as a shorthand.

Lambda expression of the adder
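As a rough illustration (the name `adder` is again hypothetical), the same add-1 behavior written as a lambda expression could look like this:

```python
# A lambda expression is an anonymous, single-expression function.
# Binding it to a name here is only to make it easy to call in this sketch.
adder = lambda value: value + 1

print(adder(4))  # prints 5
```

In practice a lambda is usually passed directly as an argument rather than assigned to a name.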

For a small operation, like the adder above, a lambda expression is a convenient shorthand. For more complex calculations that are used multiple times, define a named function instead. When in doubt, consider whether there is a simpler way and how often the logic will be repeated.

The simplest way to manipulate a dataframe is with apply. Apply takes in a function and runs it on either every column or every row of a dataframe. This makes it handy for quickly calculating or encrypting data.
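Here is a small sketch of apply on a made-up dataframe. The values are hypothetical and are not taken from the Titanic dataset; only the column names "Fare" and "Age" are borrowed from it.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic dataset.
df = pd.DataFrame({"Fare": [10.0, 20.0, 30.0], "Age": [22.0, 38.0, 26.0]})

# axis=0 (the default): the function receives each column as a Series.
column_max = df.apply(lambda column: column.max())

# axis=1: the function receives each row as a Series.
fare_per_year = df.apply(lambda row: row["Fare"] / row["Age"], axis=1)

# On a single column (a Series), apply runs the function on each value.
fare_plus_one = df["Fare"].apply(lambda fare: fare + 1)

print(column_max)
print(fare_plus_one.tolist())  # [11.0, 21.0, 31.0]
```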

Based on the SHAP values, we form a hypothesis that women and children were more likely to survive, possibly because they could board the lifeboats first. Passengers living in the upper-class areas of the ship also faced lower population density, which would have let them escape more quickly than those in the lower class.

Lifeboats on the Titanic (Source: DailyMail)

To find the average price point of the winning ticket, a ticket for a young lady in 1st class, we first need to filter down our rows and columns. In the dataframe, “Pclass” represents whether a passenger is located in the 1st class, 2nd class, or 3rd class area of the Titanic. The average is calculated as the sum of the prices divided by the total number, or count, of items, but it may also be calculated directly with the mean method.

Using what we’ve learned in part 2, we filter the rows based on the sex, passenger class, and age columns. We define our filter as

  1. Sex is female

  2. Passenger class is 1st class

  3. Age must be no higher than 40 years old

Then we reduce the result to show only the relevant information: ‘Fare’, the price of the golden ticket.
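The filtering steps can be sketched as follows. The rows are hypothetical toy data that reuse the Titanic column names, and the names `AGE_CUTOFF`, `is_golden`, and `golden_fares` are illustrative. The direction of the age comparison is an assumption, chosen to match the "young lady" description.

```python
import pandas as pd

# Hypothetical toy rows using the Titanic column names; not the real data.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "female", "female"],
    "Pclass": [1, 1, 3, 1, 1],
    "Age": [29.0, 35.0, 22.0, 58.0, 38.0],
    "Fare": [100.0, 80.0, 8.0, 150.0, 60.0],
})

AGE_CUTOFF = 40  # assumption: "young" is read as 40 or under

# Combine the three conditions with & (each one wrapped in parentheses).
is_golden = (
    (df["Sex"] == "female")
    & (df["Pclass"] == 1)
    & (df["Age"] <= AGE_CUTOFF)
)

# Keep only the matching rows and only the 'Fare' column.
golden_fares = df.loc[is_golden, "Fare"]
print(golden_fares.tolist())  # [100.0, 60.0]
```

Rows with a missing age fail the comparison, so they are dropped by this filter.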

Then, we take the sum of the ‘Fare’ column and divide by the total number of items.

The total price of all golden tickets is $6484.80

Average price of $113.77
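As a sketch of the sum-divided-by-count calculation, the snippet below uses three made-up fares in place of the filtered 'Fare' column, so its numbers differ from the article's results. The variable name `average_price` follows the article; the other names are illustrative.

```python
import pandas as pd

# Hypothetical fares standing in for the filtered 'Fare' column.
golden_fares = pd.Series([100.0, 60.0, 80.0], name="Fare")

total_price = golden_fares.sum()      # 240.0
ticket_count = golden_fares.count()   # 3 (count skips missing values)

# Store the result in a new variable so the original data is left untouched.
average_price = total_price / ticket_count

print(average_price)  # 80.0
```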

Unlike part 2, where we overwrote the values, here we store the result in a new variable called average_price. This lets us preserve the old data.

We can confirm this is the same as calculating the mean of the prices.

The mean matches the average price of $113.77
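A quick way to check that equivalence, shown here on hypothetical fares rather than the real filtered column:

```python
import pandas as pd

golden_fares = pd.Series([100.0, 60.0, 80.0], name="Fare")  # hypothetical fares

manual_average = golden_fares.sum() / golden_fares.count()
built_in_mean = golden_fares.mean()

print(manual_average == built_in_mean)  # True
```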

Pandas has many other built-in mathematical functions, such as median.

Median is $86.50
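The median can sit well below the mean when a few fares are much higher than the rest. A small sketch with made-up fares shows the effect:

```python
import pandas as pd

golden_fares = pd.Series([100.0, 60.0, 80.0, 200.0], name="Fare")  # hypothetical fares

print(golden_fares.median())  # 90.0: midpoint of the two middle values, 80 and 100
print(golden_fares.mean())    # 110.0: pulled upward by the single 200.0 fare
```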

Unfortunately, each of these calculations must be done separately. That makes apply fine for short, one-off functions, but what about running several calculations at once? That’s where aggregate, or agg, shines by removing repetition.

If you know which aggregations you want ahead of time, use agg instead. When calculating several statistics, such as the sum, mean, or standard deviation, aggregate is neater than calling apply repeatedly.

For instance, with aggregate we can grab several statistics all at once. For our next section, we’ll need the standard deviation, so let’s calculate that as well. Note: the shorthand is agg, which is functionally equivalent to aggregate.

1 liner for sum, mean, max, and median
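A sketch of what such a one-liner might look like, run on hypothetical fares and with the standard deviation ("std") included as well:

```python
import pandas as pd

golden_fares = pd.Series([100.0, 60.0, 80.0], name="Fare")  # hypothetical fares

# Pass a list of aggregation names to compute them all in one call.
# The result is a Series indexed by the aggregation names.
summary = golden_fares.agg(["sum", "mean", "max", "median", "std"])
print(summary)
```

Individual results can then be read by name, for example `summary["mean"]`.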

Another way of manipulating a dataframe is by using transform. This is similar to apply, except that the result always has the same shape as the input: the function is applied to each column, and every row gets a value back. Because the output lines up with the original data, the results can be passed straight back into the dataframe, for example as a new column, and chained into further operations.

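Because transform returns a result aligned with its input, the output must be the same length as the original input. This means that functions such as sum(), mean(), and max()/min() don’t work on their own, as they condense or aggregate all the data into 1 value. (After a groupby, transform instead broadcasts the aggregated value back to every row in the group.)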
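The length rule can be seen in a short sketch on hypothetical fares. The flag `rejected` is only there to record what happened.

```python
import pandas as pd

fares = pd.Series([100.0, 60.0, 80.0], name="Fare")  # hypothetical fares

# An element-wise function keeps the length, so transform accepts it.
doubled = fares.transform(lambda fare: fare * 2)
print(len(doubled) == len(fares))  # True

# A reducing function collapses the data to one value, so transform rejects it.
rejected = False
try:
    fares.transform("sum")
except ValueError as error:
    rejected = True
    print("transform refused:", error)
```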

Calculate individual percentages

Back to the original problem: finding out what percentage of passengers have a “golden ticket”. Using transform, we can combine an aggregate with a series to calculate each individual value’s share. This makes transform more useful for looking at the finer details.

Calculate individual percentages
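One way to compute each ticket's share of the total, sketched on hypothetical fares (this is not necessarily the exact code from the original screenshot):

```python
import pandas as pd

golden_fares = pd.Series([100.0, 60.0, 40.0], name="Fare")  # hypothetical fares

# Divide every fare by the aggregated total; the result keeps one value per ticket.
fare_share = golden_fares.transform(lambda fare: fare / golden_fares.sum())
print(fare_share.tolist())  # [0.5, 0.3, 0.2]
```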

Likewise, summing the individual results should give 1.0 (100%).

Sanity Check
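A sketch of that check on hypothetical fares. Because the shares are floating-point numbers, it compares against 1.0 with a small tolerance rather than testing exact equality.

```python
import pandas as pd

golden_fares = pd.Series([100.0, 60.0, 40.0], name="Fare")  # hypothetical fares
fare_share = golden_fares / golden_fares.sum()

# Allow for floating-point rounding instead of testing for exact equality.
total_share = fare_share.sum()
print(abs(total_share - 1.0) < 1e-9)  # True
```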

To find out how many passengers paid top dollar, we first take the original dataset and calculate the percentages. We leverage transform’s ability to preserve length, along with groupby to group our data.

What slice of the “pie” do the golden ticket passengers make up?

23% of all income on the ship is from golden ticket sales.
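A rough sketch of how groupby and transform can produce an income share like this. The rows are hypothetical toy data, so the resulting number differs from the article's. The column names `golden`, `group_fare_total`, and `group_income_share` are invented for illustration, and the age rule of 40 or under is an assumption.

```python
import pandas as pd

# Hypothetical toy rows using the Titanic column names; not the real data.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "female", "male"],
    "Pclass": [1, 1, 3, 1, 3],
    "Age": [29.0, 35.0, 22.0, 38.0, 40.0],
    "Fare": [100.0, 50.0, 10.0, 60.0, 30.0],
})

# Label each row; the age rule (40 or under) is an assumption of this sketch.
df["golden"] = (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] <= 40)

# transform("sum") broadcasts each group's fare total back to every row in that group,
# so the new column has the same length as the dataframe.
df["group_fare_total"] = df.groupby("golden")["Fare"].transform("sum")

# Share of all fare income that comes from each row's group.
df["group_income_share"] = df["group_fare_total"] / df["Fare"].sum()

golden_income_share = df.loc[df["golden"], "group_income_share"].iloc[0]
print(golden_income_share)  # 0.64
```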

What percentage of passengers own a golden ticket?

Only 6% of all passengers purchased a golden ticket.
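The passenger share can be sketched in a similar spirit. This version uses a hypothetical boolean column and relies on the fact that the mean of boolean values is the fraction that are True; the original article may have computed it differently.

```python
import pandas as pd

# Hypothetical flags: True marks a golden ticket holder.
golden = pd.Series([True, False, False, True, False], name="golden")

# The mean of a boolean column is the fraction of True values.
golden_passenger_share = golden.mean()
print(golden_passenger_share)  # 0.4
```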

Key Differences

  1. Transform returns a result aligned with its input, so the output must be the same length as the input. On its own, transform therefore can’t handle aggregate methods (sum, mean, standard deviation, etc.).

  2. Apply takes a single function at a time, while agg can take several aggregations at once.

  1. Transform is best used to create a new column in a table to see fine detail.

  2. Aggregate and apply are useful for calculating a single summary value.

That’s it! You’re now ready to tackle future problems in data science. Using your newfound knowledge, I suggest modifying the steps to calculate what percentage of golden ticket holders survived as your next step in familiarizing yourself with these core AI concepts. As always, stay tuned for future guides, where we’ll go over more topics ranging from joining datasets to deploying a machine learning model to the cloud.

I’ve got a Golden Ticket! (Source: South Park)



 2024 Mage Technologies, Inc.