Predicting SAT Scores for NYC Schools
An Introduction to Pandas
This tutorial will get you started with Pandas - a data analysis library for Python that is great for data preparation, joining, and ultimately generating well-formed, tabular data that's easy to use in a variety of visualization tools or (as we will see here) machine learning applications. This tutorial assumes a solid understanding of core Python functionality, but nothing about machine learning or Pandas.
You can download the Python code for this tutorial here, download the IPython notebook file (ipynb) here. Lastly you can read this entire post as IPython Notebook HTML output here, which is significantly easier to read.
You can download the Python code for this tutorial here, download the IPython notebook file (ipynb) here. Lastly you can read this entire post as IPython Notebook HTML output here, which is significantly easier to read.
Goals
- Using data from [NYC Open Data] (https://data.cityofnewyork.us/), build a unified, tabular dataset ready for use with machine learning algorithms to predict student SAT scores on a per school basis.
- Learn and use the Pandas data analysis package.
- Learn how data is typically prepared for machine learning algorithms (ingestion, cleaning, joining, feature generation).
First, let's ingest the data and get the lay of the land. You can download the data sets referenced below from NYC Open Data, or directly download a zip file with the relevant data.
In [8]:
In [9]:
Out[9]:
Outline
Pandas has read the data files without issue. Next let's create a rough map of where we are going.
We have five datasets here, each with information about either schools or districts. We're going to need to join all of this information together into a tabular file, with one row for each school, joined with as much information we can gather about that school & its district, including our dependent variables, which will be the mean SAT scores for each school in 2010.
Drilling down one level of detail, let's look at the dataset dsSATs, which contains the target variables:
In [10]:
Out[10]:
Target Variable and Joining Strategy
We are going to build a dataset to predict Critical Reading Mean, Mathematics Mean, and Writing Mean for each school (identified by DBN).
After digging around in Excel (or just taking my word for it) we identify the following join strategy (using SQL-esque pseudocode):
dsSATS join dsClassSize on dsSATs['DBN'] = dsClassSize['SCHOOL CODE']
join dsProgReports on dsSATs['DBN'] = dsProgReports['DBN']
join dsDistrct on dsProgReports['DISTRICT'] = dsDistrict['JURISDICTION NAME']
join dsAttendEnroll on dsProgReports['DISTRICT'] = dsAttendEnroll['District']
Now that we have the strategy identified at a high level, there are a number of details we have to identify and take care of first.
Primary Keys - Schools
Before we can join these three datasets together, we need to normalize their primary keys. Below we see the mismatch between the way the DBN (school id) field is represented in the different datasets. We then write code to normalize the keys and correct this problem.
In [11]:
Out[11]:
In [12]:
Out[12]:
Primary Keys - Districts
We have a similar story with the district foreign keys. Again, we need to normalize the keys. The only additional complexity here is that dsProgReports['DISTRICT'] is typed numerically, whereas the other two district keys are typed as string. We do some type conversions following the key munging.
In [13]:
Out[13]:
In [14]:
Out[14]:
Additional Cleanup
At this point we could do the joins, but there is messiness in the data still. First let's reindex the DataFrames so the semantics come out a bit cleaner. Pandas indexing is beyond the scope of this tutorial but suffice it to say it makes these operations easier.
In [15]:
Let's take a look at one of our target variables. Right away we see the "s" value, which shouldn't be there.
We'll filter out the rows without data.
In [16]:
Out[16]:
In [17]:
Out[17]:
Feature Construction
dsClassSize will be a many-to-one join with dsSATs because dsClassSize contains multiple entries per school. We need to summarize and build features from this data in order to get one row per school that will join neatly to dsSATs.
Additionally, the data has an irregular format, consisting of a number of rows per school describing different class sizes, then a final row for that school which contains no data except for a number in the final column, SCHOOLWIDE PUPIL-TEACHER RATIO.
We need to extract the SCHOOLWIDE PUPIL-TEACHER RATIO rows, at which point we'll have a regular format and can build features via aggregate functions. We'll also drop any features that can't be easily summarized or aggregated and likely have no bearing on the SAT scores (like School Name).
In [18]:
In [19]:
Joining
One final thing before we join - dsProgReports contains distinct rows for separate grade level blocks within one school. For instance one school (one DBN) might have two rows: one for middle school and one for high school. We'll just drop everything that isn't high school.
And finally we can join our data. Note these are inner joins, so district data get joined to each school in that district.
In [20]:
In [21]:
(Even More) Additional Cleanup
We should be in a position to build a predictive model for our target variables right away but unfortunately there is still messy data floating around in the dataframe that machine learning algorithms will choke on. A pure feature matrix should have only numeric features, but we can see that isn't the case. However for many of these columns, the right approach is obvious once we've dug in.
In [22]:
Out[22]:
In [23]:
Categorical Variables
We can see above that the remaining non-numeric field are grades . Intuitively, they might be important so we don't want to drop them, but in order to get a pure feature matrix we need numeric values. The approach we'll use here is to explode these into multiple boolean columns. Some machine learning libraries effectively do this for you under the covers, but when the cardinality of the categorical variable is relatively low, it's nice to be explicit about it.
In [24]:
That's it!
We now have a feature matrix that's trivial to use with any number of machine learning algorithms. Feel free to stop here and run the two lines of code below to get nice CSV files written to disk that you can easily use in Excel, Tableau, etc. Or run the larger block of code to see how easy it is to build a random forest model against this data, and look at which variables are most important.
In [ ]:
Hi there! I could have sworn I've been to this site before but after reading through some of the post I realized it's new to me.
ReplyDeleteNonetheless, I'm definitely glad I found it and I'll be bookmarking and checking back frequently!
My page > decals
Hi there! I'm at work browsing your blog from my new iphone 3gs! Just wanted to say I love reading your blog and look forward to all your posts! Keep up the excellent work!
ReplyDeleteMy site - asuntoturkista.Net
The FIA FDA for U. S. or CFIA for Canada officials who are reviewing
ReplyDeletelabels on food for thought before products are allowed
launching into the country. The fact is, most multitude actually respond
to clear and uncomplicated headlines present building complex to particular Here.
My homepage ... business cards flyers
Also see my web site > Mailing Labels Bumper
Ahaa, its nice discussion about this article at this place at this webpage,
ReplyDeleteI have read all that, so now me also commenting at this place.
Also visit my web blog :: Movers Dayton