Schema: a simple data tool#

Note

Although excelbird’s Frame integrates with the xb.Schema class, Schema on its own is not related to or dependent on excelbird, and in the future will be its own separate installable package.

xb.Schema is a simple tool for data wrangling pipelines that helps you stay organized, improves readability and helps avoid bugs.

Where traditional dataframe schema classes define what should go inside a dataframe’s columns, Schema defines when those columns should be, and how they got there. It defines a dataframe’s state at a given point in time, and provides the tools/methods needed to help you arrive at that state, change it, and move to new states seamlessly.

It’s a simple class, designed for ease of use and exceptional readability.

We’ll cover functionality shortly, but first, here’s what we mean by ‘readability’:

[ ]:

sch_person = Schema(
    first_name=("FName", "First Name"),
    last_name=("LName", "Last Name"),
    favorite_food="Favorite Food",
)

sch_company = Schema(
    comp_name=("Companyname", "Company Name"),
    market_cap="Market Capitalization",
    favorite_food="Preferred Employee Favorite Food",
)

sch_output = Schema(
    sch_person[[
        'last_name',
        'age',
    ]],
    sch_company[[
        'comp_name',
    ]],
    is_executive="Person is Executive"
)

Without touching any data, the reader already knows exactly what’s supposed to happen in the script that follows:

There are two input sources, person and company. We know what their columns are called in the input data, what we’d like to call them when they’re eventually used to generate a final output of some kind, and given them short, python-friendly names to use during our workflow.
The output schema indicaets that the script must join person and company to get its output, and where each field is coming from.
The script needs to add a new custom column, is_executive

Creating a Schema#

Schema is a subclass of dictionary, where

Positional args: existing Schemas, or slices from them
Keyword args:
- Keys: python-friendly variable names
- Values: str, or tuple. If tuple:
  - Item 0: Input column name
  - Item 1: Output column name (optional)

Schema Examples#

[1]:

from excelbird import Schema
import pandas as pd

First we’ll create some sample data. Assume df_employee_raw was read in from somewhere else

[60]:

df_employee_raw = pd.DataFrame( [["Jared", "Richards", 45, 80, "Marketing"], ["Emily", "Seitz", 87, 60, "Analytics"], ["Nick", "Smarts", 23, 50, "Marketing"]], columns=["Emp FName", "Emp LName", "Hours Worked", "Emp Hourly Rate", "Domain"],)
df_employee_raw

[60]:

	Emp FName	Emp LName	Hours Worked	Emp Hourly Rate	Domain
0	Jared	Richards	45	80	Marketing
1	Emily	Seitz	87	60	Analytics
2	Nick	Smarts	23	50	Marketing

Elsewhere, we’ve defined the following

[61]:

schema = Schema(
    # var_name = ("Input Col Name", "Output Col Name")
    last_name=("Emp LName", "Last Name"),
    first_name=("Emp FName", "First Name"),
    hours="Hours Worked",  # will be applied to both input and output
    rate=("Emp Hourly Rate", "Hourly Rate"),
)
schema_partial = schema[["rate", "last_name"]]
schema

[61]:

	Input	Output
last_name	Emp LName	Last Name
first_name	Emp FName	First Name
hours	Hours Worked	Hours Worked
rate	Emp Hourly Rate	Hourly Rate

Now we can use the schema after reading in new data

[62]:

df = schema.select_inputs(df_employee_raw)
df

[62]:

	last_name	first_name	hours	rate
0	Richards	Jared	45	80
1	Seitz	Emily	87	60
2	Smarts	Nick	23	50

Calling .select_inputs() did the following

Check if all desired columns are present in input data, and throw an error if any are missing, informing you of all the missing columns at once
Select only the desired columns
Rename them to your desired standardized variable names
Re-order them to the order they follow in the schema

In the real world, your code will probably combine things from multiple datasets, and create new columns. Creating a composite schema is easy

[63]:

schema_composite = Schema(
    schema[[
        "hours",
        "first_name"
    ]],
    schema_partial,  # include all from schema_partial
    daily_rate="Daily Rate",
    bonus="Bonus",
)
schema_composite

[63]:

	Input	Output
hours	Hours Worked	Hours Worked
first_name	Emp FName	First Name
rate	Emp Hourly Rate	Hourly Rate
last_name	Emp LName	Last Name
daily_rate	Daily Rate	Daily Rate
bonus	Bonus	Bonus

We’ll be notified with a descriptive error if we’re missing anything at output time

[64]:

df_out = schema_composite.select_outputs(df)
df_out

---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
Cell In[64], line 1
----> 1 df_out = schema_composite.select_outputs(df)
      2 df_out

File ~/Library/CloudStorage/OneDrive-EY/repo/python-packages/excelbird/excelbird/schema.py:427, in Schema.select_outputs(self, df)
    425 missing = [col for col in self.keys() if col not in df.columns]
    426 if len(missing) > 0:
--> 427     raise SchemaError(f"Please add columns, {missing} before outputting.")
    428 df = self.rename_vars_to_outputs(df)
    429 return df[[k for k in self.outputs()]]

SchemaError: Please add columns, ['daily_rate', 'bonus'] before outputting.

[65]:

df["daily_rate"] = df.rate * 8
df["bonus"] = 100

df_out = schema_composite.select_outputs(df)
df_out

[65]:

	Hours Worked	First Name	Hourly Rate	Last Name	Daily Rate	Bonus
0	45	Jared	80	Richards	640	100
1	87	Emily	60	Seitz	480	100
2	23	Nick	50	Smarts	400	100

A more realistic use-case#

Assume we’re reading the following input data from files

[66]:

df_emp_raw = pd.DataFrame( [["Jared", "Richards", 24, "red"], ["Emily", "Seitz", 55, "green"], ["Nick", "Smarts", 33, "blue"]], columns=["Employee First Name", "Employee Last Name", "Age", "Favorite Color"],)
df_roster_raw = pd.DataFrame( [["Jared", "Richards", 45, 80, "Marketing"], ["Emily", "Seitz", 87, 60, "Analytics"], ["Nick", "Smarts", 23, 50, "Marketing"]], columns=["Emp FName", "Emp LName", "Hours Worked", "Emp Hourly Rate", "Domain"],)
display(df_emp_raw, df_roster_raw)

	Employee First Name	Employee Last Name	Age	Favorite Color
0	Jared	Richards	24	red
1	Emily	Seitz	55	green
2	Nick	Smarts	33	blue

	Emp FName	Emp LName	Hours Worked	Emp Hourly Rate	Domain
0	Jared	Richards	45	80	Marketing
1	Emily	Seitz	87	60	Analytics
2	Nick	Smarts	23	50	Marketing

We’ve got two input data sources: employee data, and roster data (see above).

They have shared columns (first and last name) which we want to join on, but are named differently.

In our code, we would need to rename these columns before joining. We would also need to reference the exact names of the input columns, which is not only annoying (because they’re long) but also inconvenient if the input column names change.

We also might write our finished data to a file that might later be read by another script in the pipeline. What if we change the column names of our outputted data? Then all other scripts who read our outputted data will need to be changed. That’s also inconvenient.

Solution: Declare ahead of time:

A schema for each of our input datasets
A schema for our final output. Then, another script in the pipeline who reads our output can reference this schema when reading data.

[67]:

sch_employee = Schema(
    last_name=("Employee Last Name", "Last Name"),
    first_name=("Employee First Name", "First Name"),
    age="Age",
)
sch_roster = Schema(
    last_name="Emp LName",
    first_name="Emp FName",
    hours="Hours Worked",
    rate=("Emp Hourly Rate", "Hourly Rate"),
)
sch_output = Schema(
    sch_employee[[
        "first_name",
        "last_name",
        "age"
    ]],
    sch_roster[["rate"]],
    pay="Total Pay",
    notes="Notes",
)

Here’s another look at the source data

[68]:

display(df_emp_raw, df_roster_raw)

	Employee First Name	Employee Last Name	Age	Favorite Color
0	Jared	Richards	24	red
1	Emily	Seitz	55	green
2	Nick	Smarts	33	blue

	Emp FName	Emp LName	Hours Worked	Emp Hourly Rate	Domain
0	Jared	Richards	45	80	Marketing
1	Emily	Seitz	87	60	Analytics
2	Nick	Smarts	23	50	Marketing

Now let’s apply the schemas to make selections and format our input data

[69]:

df_emp = sch_employee.select_inputs(df_emp_raw)
df_roster = sch_roster.select_inputs(df_roster_raw)
display(df_emp, df_roster)

	last_name	first_name	age
0	Richards	Jared	24
1	Seitz	Emily	55
2	Smarts	Nick	33

	last_name	first_name	hours	rate
0	Richards	Jared	45	80
1	Seitz	Emily	87	60
2	Smarts	Nick	23	50

[70]:

df = df_emp.merge(df_roster, on=["first_name", "last_name"])
df

[70]:

	last_name	first_name	age	hours	rate
0	Richards	Jared	24	45	80
1	Seitz	Emily	55	87	60
2	Smarts	Nick	33	23	50

[71]:

df["pay"] = df.hours * df.rate
df["notes"] = ""

# Let's try that again
df_output = sch_output.select_outputs(df)
df_output

[71]:

	First Name	Last Name	Age	Hourly Rate	Total Pay
0	Jared	Richards	24	80	3600
1	Emily	Seitz	55	60	5220
2	Nick	Smarts	33	50	1150

Now, how can we read in the output of the previous operations, and select from the new output names without creating a new schema?

Call .reset_inputs()

[72]:

# Pretend we're reading in the previous output from a file here
df_raw = df_output.copy()
df_raw

[72]:

	First Name	Last Name	Age	Hourly Rate	Total Pay
0	Jared	Richards	24	80	3600
1	Emily	Seitz	55	60	5220
2	Nick	Smarts	33	50	1150

[73]:

schema = sch_output.reset_inputs()

df = schema.select_inputs(df_raw)
df

[73]:

	first_name	last_name	age	rate	pay
0	Jared	Richards	24	80	3600
1	Emily	Seitz	55	60	5220
2	Nick	Smarts	33	50	1150

Create a schema that knows how to read its input data#

If you have some prep-work that’s always required after reading a certain dataset, it would be quite a luxury if the schema could handle this itself.

Or, what if you have multiple input sources that need to be joined immediately after being read? Instead of creating a separate schema for each, just create a single schema with the shared fields from each, and tell it how to populate each field.

Example: Concat first and last name into full name on read

Let’s go back to the first example. Instead of declaring first_name and last_name, we’ll just declare name, and tell the schema how to populate it

[74]:

# Input source
df_employee_raw

[74]:

	Emp FName	Emp LName	Hours Worked	Emp Hourly Rate	Domain
0	Jared	Richards	45	80	Marketing
1	Emily	Seitz	87	60	Analytics
2	Nick	Smarts	23	50	Marketing

[75]:

# Create a subclass of Schema with a read() function
class EmployeeSchema(Schema):
    def read(self):
        # pretend to read in data from a source
        df = df_employee_raw
        # Create name column
        df["Employee Name"] = df["Emp FName"] + " " + df["Emp LName"]
        return df

schema = EmployeeSchema(
    name="Employee Name",
    hours="Hours Worked",
    rate=("Emp Hourly Rate", "Hourly Rate"),
)

df = schema.read()
df = schema.select_inputs(df)
df

[75]:

	name	hours	rate
0	Jared Richards	45	80
1	Emily Seitz	87	60
2	Nick Smarts	23	50