{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Schema: a simple data tool\n",
    "\n",
    ".. note::\n",
    "   Although excelbird's `Frame` integrates with the `xb.Schema` class, Schema on its own\n",
    "   is not related to or dependent on excelbird, and in the future will be its own separate\n",
    "   installable package.\n",
    "\n",
    "\n",
    "`xb.Schema` is a simple tool for data wrangling pipelines that helps you stay organized, improves readability and helps avoid bugs.\n",
    "\n",
    "Where traditional dataframe schema classes define what should go *inside* a dataframe's\n",
    "columns, `Schema` defines **when** those columns should be, and **how** they got there. It\n",
    "defines a dataframe's **state** at a given point in time, and provides the tools/methods\n",
    "needed to help you arrive at that state, change it, and move to new states seamlessly.\n",
    "\n",
    "It's a simple class, designed for ease of use and exceptional readability.\n",
    "\n",
    "We'll cover functionality shortly, but first, here's what we mean by 'readability':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sch_person = Schema(\n",
    "    first_name=(\"FName\", \"First Name\"),\n",
    "    last_name=(\"LName\", \"Last Name\"),\n",
    "    favorite_food=\"Favorite Food\",\n",
    ")\n",
    "\n",
    "sch_company = Schema(\n",
    "    comp_name=(\"Companyname\", \"Company Name\"),\n",
    "    market_cap=\"Market Capitalization\",\n",
    "    favorite_food=\"Preferred Employee Favorite Food\",\n",
    ")\n",
    "\n",
    "sch_output = Schema(\n",
    "    sch_person[[\n",
    "        'last_name',\n",
    "        'age',\n",
    "    ]],\n",
    "    sch_company[[\n",
    "        'comp_name',\n",
    "    ]],\n",
    "    is_executive=\"Person is Executive\"\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Without touching any data, the reader already knows exactly what's supposed to happen in the script that follows:\n",
    "\n",
    "* There are two input sources, person and company. We know what their columns are called in the input data, what we'd like to call them when they're eventually used to generate a final output of some kind, and given them short, python-friendly names to use during our workflow.\n",
    "* The output schema indicaets that the script must join person and company to get its output, and where each field is coming from.\n",
    "* The script needs to add a new custom column, `is_executive`\n",
    "\n",
    "## Creating a Schema\n",
    "\n",
    "Schema is a subclass of dictionary, where\n",
    "\n",
    "* **Positional args**: existing Schemas, or slices from them\n",
    "* **Keyword args**:\n",
    "  * *Keys*: python-friendly variable names\n",
    "  * *Values*: str, or tuple. If tuple:\n",
    "    * Item 0: Input column name\n",
    "    * Item 1: Output column name **(optional)**\n",
    "\n",
    "\n",
    "# Schema Examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from excelbird import Schema\n",
    "import pandas as pd"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "First we'll create some sample data. Assume ``df_employee_raw`` was read in from somewhere else"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Emp FName</th>\n",
       "      <th>Emp LName</th>\n",
       "      <th>Hours Worked</th>\n",
       "      <th>Emp Hourly Rate</th>\n",
       "      <th>Domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "      <td>Analytics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Emp FName Emp LName  Hours Worked  Emp Hourly Rate     Domain\n",
       "0     Jared  Richards            45               80  Marketing\n",
       "1     Emily     Seitz            87               60  Analytics\n",
       "2      Nick    Smarts            23               50  Marketing"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_employee_raw = pd.DataFrame( [[\"Jared\", \"Richards\", 45, 80, \"Marketing\"], [\"Emily\", \"Seitz\", 87, 60, \"Analytics\"], [\"Nick\", \"Smarts\", 23, 50, \"Marketing\"]], columns=[\"Emp FName\", \"Emp LName\", \"Hours Worked\", \"Emp Hourly Rate\", \"Domain\"],)\n",
    "df_employee_raw"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Elsewhere, we've defined the following"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Input</th>\n",
       "      <th>Output</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>last_name</th>\n",
       "      <td>Emp LName</td>\n",
       "      <td>Last Name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>first_name</th>\n",
       "      <td>Emp FName</td>\n",
       "      <td>First Name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>hours</th>\n",
       "      <td>Hours Worked</td>\n",
       "      <td>Hours Worked</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rate</th>\n",
       "      <td>Emp Hourly Rate</td>\n",
       "      <td>Hourly Rate</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "{'last_name': Column(input='Emp LName', output='Last Name'),\n",
       " 'first_name': Column(input='Emp FName', output='First Name'),\n",
       " 'hours': Column(input='Hours Worked', output='Hours Worked'),\n",
       " 'rate': Column(input='Emp Hourly Rate', output='Hourly Rate')}"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "schema = Schema(\n",
    "    # var_name = (\"Input Col Name\", \"Output Col Name\")\n",
    "    last_name=(\"Emp LName\", \"Last Name\"),\n",
    "    first_name=(\"Emp FName\", \"First Name\"),\n",
    "    hours=\"Hours Worked\",  # will be applied to both input and output\n",
    "    rate=(\"Emp Hourly Rate\", \"Hourly Rate\"),\n",
    ")\n",
    "schema_partial = schema[[\"rate\", \"last_name\"]]\n",
    "schema"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "Now we can use the schema after reading in new data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>last_name</th>\n",
       "      <th>first_name</th>\n",
       "      <th>hours</th>\n",
       "      <th>rate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Richards</td>\n",
       "      <td>Jared</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Seitz</td>\n",
       "      <td>Emily</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Smarts</td>\n",
       "      <td>Nick</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  last_name first_name  hours  rate\n",
       "0  Richards      Jared     45    80\n",
       "1     Seitz      Emily     87    60\n",
       "2    Smarts       Nick     23    50"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = schema.select_inputs(df_employee_raw)\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling `.select_inputs()` did the following\n",
    "\n",
    "* Check if all desired columns are present in input data, and **throw an error** if any are missing, informing you of **all** the missing columns at once\n",
    "* Select only the desired columns\n",
    "* Rename them to your desired standardized variable names\n",
    "* Re-order them to the order they follow in the schema"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the real world, your code will probably combine things from multiple datasets,\n",
    "and create new columns. Creating a composite schema is easy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Input</th>\n",
       "      <th>Output</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>hours</th>\n",
       "      <td>Hours Worked</td>\n",
       "      <td>Hours Worked</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>first_name</th>\n",
       "      <td>Emp FName</td>\n",
       "      <td>First Name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>rate</th>\n",
       "      <td>Emp Hourly Rate</td>\n",
       "      <td>Hourly Rate</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>last_name</th>\n",
       "      <td>Emp LName</td>\n",
       "      <td>Last Name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>daily_rate</th>\n",
       "      <td>Daily Rate</td>\n",
       "      <td>Daily Rate</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bonus</th>\n",
       "      <td>Bonus</td>\n",
       "      <td>Bonus</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "{'hours': Column(input='Hours Worked', output='Hours Worked'),\n",
       " 'first_name': Column(input='Emp FName', output='First Name'),\n",
       " 'rate': Column(input='Emp Hourly Rate', output='Hourly Rate'),\n",
       " 'last_name': Column(input='Emp LName', output='Last Name'),\n",
       " 'daily_rate': Column(input='Daily Rate', output='Daily Rate'),\n",
       " 'bonus': Column(input='Bonus', output='Bonus')}"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "schema_composite = Schema(\n",
    "    schema[[\n",
    "        \"hours\",\n",
    "        \"first_name\"\n",
    "    ]],\n",
    "    schema_partial,  # include all from schema_partial\n",
    "    daily_rate=\"Daily Rate\",\n",
    "    bonus=\"Bonus\",\n",
    ")\n",
    "schema_composite"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll be notified with a descriptive error if we're missing anything at output time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "ename": "SchemaError",
     "evalue": "Please add columns, ['daily_rate', 'bonus'] before outputting.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mSchemaError\u001b[0m                               Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[64], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m df_out \u001b[39m=\u001b[39m schema_composite\u001b[39m.\u001b[39;49mselect_outputs(df)\n\u001b[1;32m      2\u001b[0m df_out\n",
      "File \u001b[0;32m~/Library/CloudStorage/OneDrive-EY/repo/python-packages/excelbird/excelbird/schema.py:427\u001b[0m, in \u001b[0;36mSchema.select_outputs\u001b[0;34m(self, df)\u001b[0m\n\u001b[1;32m    425\u001b[0m missing \u001b[39m=\u001b[39m [col \u001b[39mfor\u001b[39;00m col \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mkeys() \u001b[39mif\u001b[39;00m col \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m df\u001b[39m.\u001b[39mcolumns]\n\u001b[1;32m    426\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(missing) \u001b[39m>\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[0;32m--> 427\u001b[0m     \u001b[39mraise\u001b[39;00m SchemaError(\u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mPlease add columns, \u001b[39m\u001b[39m{\u001b[39;00mmissing\u001b[39m}\u001b[39;00m\u001b[39m before outputting.\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m    428\u001b[0m df \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mrename_vars_to_outputs(df)\n\u001b[1;32m    429\u001b[0m \u001b[39mreturn\u001b[39;00m df[[k \u001b[39mfor\u001b[39;00m k \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39moutputs()]]\n",
      "\u001b[0;31mSchemaError\u001b[0m: Please add columns, ['daily_rate', 'bonus'] before outputting."
     ]
    }
   ],
   "source": [
    "df_out = schema_composite.select_outputs(df)\n",
    "df_out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Hours Worked</th>\n",
       "      <th>First Name</th>\n",
       "      <th>Hourly Rate</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Daily Rate</th>\n",
       "      <th>Bonus</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>45</td>\n",
       "      <td>Jared</td>\n",
       "      <td>80</td>\n",
       "      <td>Richards</td>\n",
       "      <td>640</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>87</td>\n",
       "      <td>Emily</td>\n",
       "      <td>60</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>480</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>23</td>\n",
       "      <td>Nick</td>\n",
       "      <td>50</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>400</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Hours Worked First Name  Hourly Rate Last Name  Daily Rate  Bonus\n",
       "0            45      Jared           80  Richards         640    100\n",
       "1            87      Emily           60     Seitz         480    100\n",
       "2            23       Nick           50    Smarts         400    100"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"daily_rate\"] = df.rate * 8\n",
    "df[\"bonus\"] = 100\n",
    "\n",
    "df_out = schema_composite.select_outputs(df)\n",
    "df_out"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A more realistic use-case\n",
    "\n",
    "Assume we're reading the following input data from files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Employee First Name</th>\n",
       "      <th>Employee Last Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Favorite Color</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>55</td>\n",
       "      <td>green</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>33</td>\n",
       "      <td>blue</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Employee First Name Employee Last Name  Age Favorite Color\n",
       "0               Jared           Richards   24            red\n",
       "1               Emily              Seitz   55          green\n",
       "2                Nick             Smarts   33           blue"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Emp FName</th>\n",
       "      <th>Emp LName</th>\n",
       "      <th>Hours Worked</th>\n",
       "      <th>Emp Hourly Rate</th>\n",
       "      <th>Domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "      <td>Analytics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Emp FName Emp LName  Hours Worked  Emp Hourly Rate     Domain\n",
       "0     Jared  Richards            45               80  Marketing\n",
       "1     Emily     Seitz            87               60  Analytics\n",
       "2      Nick    Smarts            23               50  Marketing"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_emp_raw = pd.DataFrame( [[\"Jared\", \"Richards\", 24, \"red\"], [\"Emily\", \"Seitz\", 55, \"green\"], [\"Nick\", \"Smarts\", 33, \"blue\"]], columns=[\"Employee First Name\", \"Employee Last Name\", \"Age\", \"Favorite Color\"],)\n",
    "df_roster_raw = pd.DataFrame( [[\"Jared\", \"Richards\", 45, 80, \"Marketing\"], [\"Emily\", \"Seitz\", 87, 60, \"Analytics\"], [\"Nick\", \"Smarts\", 23, 50, \"Marketing\"]], columns=[\"Emp FName\", \"Emp LName\", \"Hours Worked\", \"Emp Hourly Rate\", \"Domain\"],)\n",
    "display(df_emp_raw, df_roster_raw)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "We've got two input data sources: employee data, and roster data (see above).\n",
    "\n",
    "They have shared columns (first and last name) which we want to join on, but are named differently.\n",
    "\n",
    "In our code, we would need to rename these columns before joining. We would also need to reference the exact names of the input columns, which is not only annoying (because they're long) but also inconvenient if the input column names change.\n",
    "\n",
    "We also might write our finished data to a file that might later be read by another script in the pipeline. What if we change the column names of our outputted data? Then all other scripts who read our outputted data will need to be changed. That's also inconvenient.\n",
    "\n",
    "**Solution**: Declare ahead of time:\n",
    "\n",
    "* A schema for each of our input datasets\n",
    "* A schema for our final output. Then, another script in the pipeline who reads our output can reference this schema when reading data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "sch_employee = Schema(\n",
    "    last_name=(\"Employee Last Name\", \"Last Name\"),\n",
    "    first_name=(\"Employee First Name\", \"First Name\"),\n",
    "    age=\"Age\",\n",
    ")\n",
    "sch_roster = Schema(\n",
    "    last_name=\"Emp LName\",\n",
    "    first_name=\"Emp FName\",\n",
    "    hours=\"Hours Worked\",\n",
    "    rate=(\"Emp Hourly Rate\", \"Hourly Rate\"),\n",
    ")\n",
    "sch_output = Schema(\n",
    "    sch_employee[[\n",
    "        \"first_name\",\n",
    "        \"last_name\",\n",
    "        \"age\"\n",
    "    ]],\n",
    "    sch_roster[[\"rate\"]],\n",
    "    pay=\"Total Pay\",\n",
    "    notes=\"Notes\",\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's another look at the source data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Employee First Name</th>\n",
       "      <th>Employee Last Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Favorite Color</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>55</td>\n",
       "      <td>green</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>33</td>\n",
       "      <td>blue</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Employee First Name Employee Last Name  Age Favorite Color\n",
       "0               Jared           Richards   24            red\n",
       "1               Emily              Seitz   55          green\n",
       "2                Nick             Smarts   33           blue"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Emp FName</th>\n",
       "      <th>Emp LName</th>\n",
       "      <th>Hours Worked</th>\n",
       "      <th>Emp Hourly Rate</th>\n",
       "      <th>Domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "      <td>Analytics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Emp FName Emp LName  Hours Worked  Emp Hourly Rate     Domain\n",
       "0     Jared  Richards            45               80  Marketing\n",
       "1     Emily     Seitz            87               60  Analytics\n",
       "2      Nick    Smarts            23               50  Marketing"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(df_emp_raw, df_roster_raw)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's apply the schemas to make selections and format our input data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>last_name</th>\n",
       "      <th>first_name</th>\n",
       "      <th>age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Richards</td>\n",
       "      <td>Jared</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Seitz</td>\n",
       "      <td>Emily</td>\n",
       "      <td>55</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Smarts</td>\n",
       "      <td>Nick</td>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  last_name first_name  age\n",
       "0  Richards      Jared   24\n",
       "1     Seitz      Emily   55\n",
       "2    Smarts       Nick   33"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>last_name</th>\n",
       "      <th>first_name</th>\n",
       "      <th>hours</th>\n",
       "      <th>rate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Richards</td>\n",
       "      <td>Jared</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Seitz</td>\n",
       "      <td>Emily</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Smarts</td>\n",
       "      <td>Nick</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  last_name first_name  hours  rate\n",
       "0  Richards      Jared     45    80\n",
       "1     Seitz      Emily     87    60\n",
       "2    Smarts       Nick     23    50"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_emp = sch_employee.select_inputs(df_emp_raw)\n",
    "df_roster = sch_roster.select_inputs(df_roster_raw)\n",
    "display(df_emp, df_roster)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>last_name</th>\n",
       "      <th>first_name</th>\n",
       "      <th>age</th>\n",
       "      <th>hours</th>\n",
       "      <th>rate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Richards</td>\n",
       "      <td>Jared</td>\n",
       "      <td>24</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Seitz</td>\n",
       "      <td>Emily</td>\n",
       "      <td>55</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Smarts</td>\n",
       "      <td>Nick</td>\n",
       "      <td>33</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  last_name first_name  age  hours  rate\n",
       "0  Richards      Jared   24     45    80\n",
       "1     Seitz      Emily   55     87    60\n",
       "2    Smarts       Nick   33     23    50"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df_emp.merge(df_roster, on=[\"first_name\", \"last_name\"])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Hourly Rate</th>\n",
       "      <th>Total Pay</th>\n",
       "      <th>Notes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>24</td>\n",
       "      <td>80</td>\n",
       "      <td>3600</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>55</td>\n",
       "      <td>60</td>\n",
       "      <td>5220</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>33</td>\n",
       "      <td>50</td>\n",
       "      <td>1150</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  First Name Last Name  Age  Hourly Rate  Total Pay Notes\n",
       "0      Jared  Richards   24           80       3600      \n",
       "1      Emily     Seitz   55           60       5220      \n",
       "2       Nick    Smarts   33           50       1150      "
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"pay\"] = df.hours * df.rate\n",
    "df[\"notes\"] = \"\"\n",
    "\n",
    "# Let's try that again\n",
    "df_output = sch_output.select_outputs(df)\n",
    "df_output"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "Now, how can we read in the output of the previous operations, and select from the new output names without creating a new schema?\n",
    "\n",
    "**Call** `.reset_inputs()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Hourly Rate</th>\n",
       "      <th>Total Pay</th>\n",
       "      <th>Notes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>24</td>\n",
       "      <td>80</td>\n",
       "      <td>3600</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>55</td>\n",
       "      <td>60</td>\n",
       "      <td>5220</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>33</td>\n",
       "      <td>50</td>\n",
       "      <td>1150</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  First Name Last Name  Age  Hourly Rate  Total Pay Notes\n",
       "0      Jared  Richards   24           80       3600      \n",
       "1      Emily     Seitz   55           60       5220      \n",
       "2       Nick    Smarts   33           50       1150      "
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Pretend we're reading in the previous output from a file here\n",
    "df_raw = df_output.copy()\n",
    "df_raw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>last_name</th>\n",
       "      <th>age</th>\n",
       "      <th>rate</th>\n",
       "      <th>pay</th>\n",
       "      <th>notes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>24</td>\n",
       "      <td>80</td>\n",
       "      <td>3600</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>55</td>\n",
       "      <td>60</td>\n",
       "      <td>5220</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>33</td>\n",
       "      <td>50</td>\n",
       "      <td>1150</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  first_name last_name  age  rate   pay notes\n",
       "0      Jared  Richards   24    80  3600      \n",
       "1      Emily     Seitz   55    60  5220      \n",
       "2       Nick    Smarts   33    50  1150      "
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "schema = sch_output.reset_inputs()\n",
    "\n",
    "df = schema.select_inputs(df_raw)\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a schema that knows how to read its input data\n",
    "\n",
    "If you have some prep-work that's always required after reading a certain dataset, it would be quite a luxury if the schema could handle this itself.\n",
    "\n",
    "Or, what if you have multiple input sources that need to be joined immediately after being read? Instead of creating a separate schema for each, just create a single schema with the shared fields from each, and tell it how to populate each field."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example: Concat first and last name into full name on read**\n",
    "\n",
    "---\n",
    "\n",
    "Let's go back to the first example. Instead of declaring `first_name` and `last_name`, we'll just declare `name`, and tell the schema how to populate it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Emp FName</th>\n",
       "      <th>Emp LName</th>\n",
       "      <th>Hours Worked</th>\n",
       "      <th>Emp Hourly Rate</th>\n",
       "      <th>Domain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared</td>\n",
       "      <td>Richards</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily</td>\n",
       "      <td>Seitz</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "      <td>Analytics</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick</td>\n",
       "      <td>Smarts</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "      <td>Marketing</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Emp FName Emp LName  Hours Worked  Emp Hourly Rate     Domain\n",
       "0     Jared  Richards            45               80  Marketing\n",
       "1     Emily     Seitz            87               60  Analytics\n",
       "2      Nick    Smarts            23               50  Marketing"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Input source\n",
    "df_employee_raw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>hours</th>\n",
       "      <th>rate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jared Richards</td>\n",
       "      <td>45</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Emily Seitz</td>\n",
       "      <td>87</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Nick Smarts</td>\n",
       "      <td>23</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             name  hours  rate\n",
       "0  Jared Richards     45    80\n",
       "1     Emily Seitz     87    60\n",
       "2     Nick Smarts     23    50"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a subclass of Schema with a read() function\n",
    "class EmployeeSchema(Schema):\n",
    "    def read(self):\n",
    "        # pretend to read in data from a source\n",
    "        df = df_employee_raw\n",
    "        # Create name column\n",
    "        df[\"Employee Name\"] = df[\"Emp FName\"] + \" \" + df[\"Emp LName\"]\n",
    "        return df\n",
    "\n",
    "schema = EmployeeSchema(\n",
    "    name=\"Employee Name\",\n",
    "    hours=\"Hours Worked\",\n",
    "    rate=(\"Emp Hourly Rate\", \"Hourly Rate\"),\n",
    ")\n",
    "\n",
    "df = schema.read()\n",
    "df = schema.select_inputs(df)\n",
    "df"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.1"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "6ee8c39a5e4f5a5ffe92ac0abdca1f10885ded7e28c74450614be57680931831"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}