Feature Engineering - Extract column from JSON

March 23, 2022 · 7 minute read

Felicia Kuan

Growth

TLDR

Occasionally, columns contain pesky JSON objects that require parsing. Work smarter, not harder– learn how we can create features from a column of JSON values!

Glossary

  • Why does data get stored as JSON?

  • JSON: a conceptual analogy

  • How to implement in code 👩‍💻

  • Magical no-code solution ✨🔮

Why does data get stored as JSON?

It seems like such a hassle to work with, so why give ourselves the trouble? 

Simple answer

: we want to store additional information but don’t want to modify the schema.

When data analysts work with data logs on a website’s traffic, they typically have a way of formatting their database, called a 

schema

. This is like an outline so we understand how the data is organized before inundating ourselves with 60 columns and 10k rows of data.

Therefore, when logging website traffic, user event logs, which are unstructured, and often modified, are saved as a JSON  so the data doesn’t affect the overall schema. 

Similarly, APIs like 

Amplitude

log event data as JSON columns. To fully work with the data, we need to extract the relevant fields from the JSON column. 

JSON: a conceptual analogy

JSON

is essentially a format for people to store named textual data. You can think of it as a multi-compartment

makeup box

that stores objects of all shapes, sizes, and variety. Some compartments take up more space than others and have more objects nested within (ie. The top-most compartment is split into five nested compartments which each store different brushes, pens and bottles), but all the data stored is organized by some high-level logic. 

Product image of a makeup organizer

As code, JSON objects are characterized by being wrapped by curly brackets { }, with data stored as a named key and value. Values can be any data type, from strings, numbers, lists, and other JSONs. An example of a JSON describing the makeup box above would be:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
makeup_box = {
    'top_compartment': {
        'section_1': {'jar': [
                          'long flat brush',
                           'thick fluffy brush'
                        ],
                      'bottle': 1},
        'section_2': '3 perfumes',
        'section_3': 'jar',
        'section_4': 'blending sponge',
        'section_5': ['lip gloss', 'some pen', 'lipstick',
                       'brown brow pen', 'darker brow pen']
    },
    'left_compartment': 'towels',
    'right_compartment': 'bottles'
}

Now that you have a gist of how JSONs work, we can move on to learning how we parse these bad boys (because they’re “Jasons,” get it? 😏)

Jason wants to cut off this extended analogy (Source: Friday the 13th Part III)

How you’d code this 👩‍💻

As mentioned before, JSONs commonly manifest in datasets as user event data. We created a dummy 

dataset

logging the pages a single user viewed on a 

shop

on Etsy. 

The “event_properties” column contains JSON that we aren’t able to analyze in columns unless we’re able to extract it, so first I’ll show you how we’d parse the JSON using regular Python.

First, I saved just the “event_properties” column to a list so we can use a for loop to iterate through the different user events.

1
2
3
col = df['event_properties']
event_properties = list(col)
event_properties

In an URL, the information following the “ref” tag tells us which page the user viewed before coming to the current page. An example of a referral is bolded below:

https://www.etsy.com/shop/thousandskies?ref=

simple-shop-header-name&listing_id=679038887

This referral tells us that the user landed on the seller page, “

thousandskies

” by clicking on the “shop header.” Thus, we’re interested in scrutinizing the “ref” field of our event properties and figuring out which page the user viewed before coming to our Etsy shop. This data is stored under the key “ref,” so I’d reference this when going through the for loop:

1
2
for item in event_properties:
  print(item['ref'])

From this output, we can see that the user first found us and came to our site from a targeted Instagram ad and began exploring our products!

In user data tracking, 

page referrals

are essential for examining a user’s previous page views. It can tell us:

  • The very first ref for our dataset linked to Instagram, which demonstrates that our targeted Instagram ad brought an interested buyer to our site.

  • How the user browsed the items in the shop (ie. Did they view the seller’s list of bestsellers, click on recommended items at the footer, or go through the shop’s index?) And this draws a path of links of how users typically find the items they like.

  • Which could then tell us whether the recommendation flow of the seller’s products is effective or not

Since the referral data is useful to model training, we can save the refs as a list and add it to our original dataframe using Pandas. 

1
2
3
4
5
6
refs = []
for item in event_properties:
  refs.append(item['ref'])

df['ref'] = refs
df

If you decide that the other JSON fields are important to transform into columns as well, you can replace the key “ref” with the other keys available, like “category,” “page_url,” and “title.” All of this data is here for you to leverage! I suppose the challenge, then, is figuring out 

what

data is helpful.  

Magical no-code solution ✨🔮

Although it’s not too challenging to parse JSON (especially the type that isn’t nested), it breaks your immersion in your data analysis if you need to leave your spreadsheet app just to write a Python parser. We hope to keep you focused on data alone at 

Mage

.

To use our “Extract” JSON function, first go to

  1. “Edit data” > “Extract value”

  2. Enter the column that contains the JSON you’re extracting data from

  3. In the yellow field labeled “logic,” select “JSON path.” 

  4. You’ll need to know the key of the JSON field you want to extract the column data from. In our example, I entered “ref,” which is short for referrals to know how the user got to the current page  

  5. Then, the new column is created!

Want to learn more about machine learning (ML)? Visit

Mage Academy

! ✨🔮

Start building for free

No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.

No need for a credit card to get started. Trying out Mage to build ranking models won’t cost a cent.

 2024 Mage Technologies, Inc.
 2024 Mage Technologies, Inc.