Feature Engineering - Category value mapping
Nathaniel Tjandra
Growth
TLDR
In this Mage Academy lesson on feature engineering, we’ll learn about category value mapping. Also known as encoding in Data Science, it is used to help machines understand the meaning behind the information given.
Glossary
Definition
Concept
How to code
Magical no-code solution ✨🔮
Definition
Category Value Mapping, otherwise known as encoding in Data Science (One Hot Encoding is one common form), is a method used to make human-readable data more machine-readable. No, this doesn’t mean turning all the features into binary.
Categorical data can be broken down into two types, nominal and ordinal. You can read more about them in our Guide to Model Training, but the gist is that it’s based on whether the categories contain a deeper meaning: nominal categories are plain labels with no inherent order, while ordinal categories follow a meaningful order.
Concept
The concept of Category Value Mapping takes a column containing categorical data and proceeds to expand it into more columns. In order to map it to be more machine readable, we first need to understand what mapping and weights do.
The treasures in your brain (Source: MIT)
Mapping in programming, or encoding in Data Science, is when we take a value and turn all instances of it into another value. For our data, we’ll want to map each textual value into a numeric value so computers can understand it.
Weight stack (Source: Cyberfit)
Similar to weights in real life, weights tip the odds to favor certain parts of your data. Depending on the type of data, you can adjust these weights to add bias to your data to favor one category or another. Otherwise, you may assign these weights to be the same to avoid being biased towards certain categories.
Once you have expanded your columns, you should see a number of new columns equal to the number of unique values in the original column. In each row, the new column that matches the row’s original value holds a 1, and every other new column holds a 0.
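The expansion described above can be sketched in a few lines of plain Python. This is only an illustration: the `fruits` column and its values are made up, and a real project would more likely lean on a library than on hand-written loops.

```python
# Hypothetical example column; the values are made up for illustration.
fruits = ["apple", "banana", "apple", "cherry"]

# One new column per unique value, sorted so the column order is reproducible.
categories = sorted(set(fruits))

# Each row gets a 1 in the column matching its original value and a 0 elsewhere.
one_hot = [
    [1 if value == category else 0 for category in categories]
    for value in fruits
]

print(categories)
for value, row in zip(fruits, one_hot):
    print(value, row)
```

Every row contains exactly one 1, so no category ends up numerically "larger" than another.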
How to Code
For categories that are labels, we can do basic mapping where everything will be mapped but share the same weights. Since the labels carry no deeper meaning, this ensures there is no bias towards any of them. This rule should also be applied to sensitive information such as religion, race, or gender, as making decisions based on it can be illegal!
We’ll be using this list of fruits in a basket as our dataset for labeling.
We’ll perform a basic mapping of each unique value to an incrementing number.
Here we’ll start from 0 and increment it by 1 for each unique value.
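As a rough sketch of this basic mapping, the snippet below assigns an incrementing number to each unique value in order of first appearance. The `basket` list is a hypothetical stand-in for the fruit basket, not the lesson's original data.

```python
# Hypothetical fruit basket, used only for illustration.
basket = ["apple", "banana", "apple", "orange", "banana"]

# Give each unique value the next unused number, starting from 0.
mapping = {}
for fruit in basket:
    if fruit not in mapping:
        mapping[fruit] = len(mapping)

# Replace every value in the basket with its mapped number.
encoded = [mapping[fruit] for fruit in basket]

print(mapping)  # {'apple': 0, 'banana': 1, 'orange': 2}
print(encoded)  # [0, 1, 0, 2, 1]
```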
Inside the sklearn library, there is a module called preprocessing which contains the LabelEncoder class. The label encoder (le) starts by taking a list of values to fit. This builds the mapping: each unique value, in sorted order, is assigned an integer starting from 0 and incrementing by 1.
Once we have the mapping, we can apply it to a dataset, in this case our fruits basket.
As is usual for scikit-learn, the output is a NumPy array, which prints without commas between its elements.
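The fit and transform steps above might look like the following sketch. It assumes scikit-learn is installed, and the `basket` list is again a hypothetical stand-in for the fruit basket.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical fruit basket, used only for illustration.
basket = ["apple", "banana", "apple", "orange", "banana"]

le = LabelEncoder()
le.fit(basket)                   # learn the unique values (stored sorted)
encoded = le.transform(basket)   # apply the mapping to the basket

print(le.classes_)  # ['apple' 'banana' 'orange']
print(encoded)      # [0 1 0 2 1]

# The mapping can be reversed to recover the original labels.
decoded = le.inverse_transform(encoded)
```

Note that these integers are only identifiers. For nominal data they imply no order, so a model that treats them as magnitudes may read meaning into them that is not there.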
Magical no-code solution ✨🔮
Mage is a product that handles the encoding portion and gives the user the ability to assign weights. Since Mage handles all the encoding, the only step left for labeled categorical data is to assign every value the same weight, which avoids bias towards any category.
Start building for free
No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.