Saturday, November 30, 2013

The No-Frills Guide to PGP on OS X

Want to get started using PGP on your Mac, but confused by the morass of professor-doctor style sites with seemingly out-of-date software and plugins? Have no fear! I'll have you up and running in minutes with this handy guide. You'll learn how to use public key servers, how to send and receive encrypted email, and how to sign and verify messages using GPG. By the time you're done, you might actually find the GPG manual pages for simple tasks illuminating instead of totally incoherent.

Really explaining how public key encryption works or why you'd want to use it is not the goal of this post by a long shot - this is just a simple block-and-tackle tutorial on how to use the stuff. If you want an easy-to-read, non-technical intro to asymmetric cryptography, go read the first section of chapter 10 of Little Brother by Cory Doctorow. But with that said, I'd be remiss if I didn't at least provide...

Part 0 - A Two-Sentence Introduction to Public Key Encryption

Everyone who wants to communicate securely generates their own pair of keys, one of which they publicize, and one of which they keep private. To send someone a secure message, just encrypt your message with the recipient's public key, and that person (and only that person) will be able to decrypt it using their private key.

Got it?

Part 1 - A Quick Disambiguation of PGP, OpenPGP, and GPG

PGP stands for Pretty Good Privacy and is a piece of software that implements the OpenPGP public key (or asymmetric) encryption standard. The PGP implementation is owned by Symantec. GPG (GNU Privacy Guard) is another, free implementation of OpenPGP. GPG is very common on *nix systems and it's what we'll use here. For our purposes, GPG and PGP are functionally equivalent.

Part 2 - Installing Stuff

We're going to use Enigmail, an extension for Thunderbird (Mozilla's excellent email client), to send and receive GPG-encrypted email. Let's get installing!

  1. Get Thunderbird and hook it up to an email account you use. Thunderbird is pretty great, and works well with Gmail (you only need to do the super short instructions in the "configuring your gmail" section - it's that easy). Ok done? Good.
  2. Now we need to install GPG. For OS X, you'll want GPG Suite. It's super easy to install, and will walk you through creating your first GPG key pair just after installation. It will default to an RSA key of 2048 bits, but I'd recommend using 4096. It's quite a bit more secure and really doesn't have any downsides - it's a bit more computationally expensive, but that doesn't matter unless you're using the key for something like SSL. We're just encrypting emails here. (If you'd rather create your key pair from the command line, see the sketch just after this list.)
  3. Finally, install Enigmail. This is a Thunderbird extension that provides simple integration with GPG so you don't have to muck about with a bunch of command line tools just to deal with sending and receiving email. There are a few configuration options - odds are, unless you are a character in Cryptonomicon, you don't want to encrypt or sign your messages by default. The official Enigmail homepage makes it look a little long in the tooth, but it works great with the latest version of Thunderbird.
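For the command-line inclined, generating the key pair from step 2 looks roughly like this (a sketch - gpg walks you through the same prompts the GPG Keychain assistant does, and the exact wording varies a bit by version):

  gpg --gen-key
  # When prompted: choose "RSA and RSA" for the kind of key,
  # enter 4096 for the keysize, pick an expiration (or none),
  # then supply your name, email address, and a passphrase.

Either way, you end up with a public/private key pair tied to your email address, which is all the rest of this guide needs.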

Part 3 - Using Key Servers

Now that you have a key pair created, you need to share it with the world - so that anyone who wants to communicate with you securely can encrypt messages to you with your public key. There are a number of organizations that maintain public keyservers to do just this. Key servers are just searchable directories of public keys. I use MIT's because Phil Zimmermann, the inventor of PGP, is an MIT guy, and the server has been around a heck of a long time.

Go ahead and search for me. That's my key!

If you click on the key, you'll actually see the long block of gibberish that is my public RSA key. But you don't even have to interact with that - GPG Suite makes it super easy to publish and install public keys. In the GPG Keychain Access tool (under Applications on your Mac), open its Preferences and point it at the MIT key server:


Now you can publish your public key to the server by right-clicking your key and hitting "Send public key to Keyserver":


Now go search for your email on MIT's servers and you can find your public key! Not only that, MIT's key server will propagate your key to other keyservers all around the world.

Installing keys is just as easy. You can install mine by going to Key > Retrieve From Keyserver and putting in my key ID from MIT's server (0x6f0eff6b2e0593ad). That's me! And now the world can find your public key just as easily.
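If you'd rather skip the GUI, the same two operations are one-liners with gpg at the terminal (assuming MIT's keyserver at pgp.mit.edu, the server the preferences point at above):

  # Publish your own public key (substitute your key ID)
  gpg --keyserver hkp://pgp.mit.edu --send-keys YOUR_KEY_ID

  # Fetch and import someone else's key - this one happens to be mine
  gpg --keyserver hkp://pgp.mit.edu --recv-keys 0x6F0EFF6B2E0593AD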


Part 4 - Sending & Receiving Encrypted Messages

With Enigmail installed, this is really pretty easy. Open Thunderbird and compose a new message. You'll see the OpenPGP drop-down menu at the top, and you can elect to encrypt the message.


When you go to send the message, you'll of course have to select the public key with which to encrypt the message. Enigmail will detect the key automatically if you have a key with a matching email address on file with your GPG Keychain already, but if not you'll have to select one (or, more likely, go retrieve and install the correct one from a keyserver).

So if you write me a message like this:


And choose to encrypt it with my public key, here is the email that actually gets sent (note that the subject line is NOT encrypted):


And if I open it in Thunderbird, Enigmail detects the encryption, prompts for my passphrase, decrypts it, and lets me know that it's decrypted a message:


Couldn't be easier! The harder thing is actually finding someone who also knows how to send and receive encrypted emails. Hah hah.
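Incidentally, Enigmail is just driving gpg for you behind the scenes. If you ever need to encrypt something outside of email, the equivalent terminal commands look roughly like this (the file name and address are made up for illustration):

  # Encrypt a file to a recipient's public key (ASCII-armored output)
  gpg --encrypt --armor --recipient friend@example.com message.txt

  # ...which produces message.txt.asc, and which the recipient
  # decrypts with their private key
  gpg --decrypt message.txt.asc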


Part 5 - Signing & Verifying Emails

Besides keeping communication private, public key cryptography has another superpower - identity verification. The properties of private/public key pairs mean that, not only can someone else use your public key to encrypt a message that is only decryptable by you, but you can encrypt a message using your private key that can only be decrypted by your public key. Now, you may be asking yourself "why is that useful? Why bother encrypting a message that anyone can then read?". Well, if the message can be decrypted using your public key, then it must have been signed using your private key - which means you must have sent the message! Thus we can exploit public key cryptography for identity verification, as well as secret-message sending.

A signed message looks like this:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Really!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: GPGTools - https://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBCgAGBQJSmOyIAAoJEG8O/2suBZOtmSUQAMSEbn13rjey0L9trG/Qc3eI
DEvbnl1Vkx+37d7o8GVtwqmD5uwfh2RphyRq/l1ML3fz00pFzmMH7mfNib6BVZ9U
nMXkN+r3x2VUUCNcYbn6i3lXWkxNwYiUIiUQpgwj9DKKv3n7ujNxC/u/1d+dcVK4
hlrf1oHCMMPIcqyQKDRKkPNOB9LGm1Y1KlKFKA6C/ElEAn/48kJKlCKZA55VJSk2
TuxutrduaPOkPjuY9zBGHWlcfF5d1CtZaQALPxjcBbS0z9mSW4vqJRO4kRUY7SAV
jWFBfmPKNVPcZQTqvPd6CfqyQqmgCnl1fIHThHviWAMbX3GdGjsscRSzC3DiJRgZ
cMcXLqjlqWYaml/6Iq8V8+Azk6Ph2ORxZOsKDiOAz+VwZRQwlyjMu9SSkA/VABEY
dN7OrNZZwTwz2A1/QH/SHNvRVAB3kLRpzJTaLJsuFBeh5aycjjbhETHtYccYsxDf
rmuGnnfxBrglkbBaYExxzKutaE/yVeFLSegO9clxa2biSk31X51kOjS/2Vy8UQgd
iVRNQb/3ArfjOFiQvXIylGkJS0aiVmJXrkEOyiSzy5h0JGxpa2T4JWZ9VyrpGzLx
8PJUPPcYUGcfTEcB0dRvBC7/GpTn/ChEcMBfrPAqI+srsPG0CIBp+aIDAQEsSH61
tWVnIgeDuHCOFrNrB3bK
=SYaF
-----END PGP SIGNATURE-----

Under the hood, the signature was generated by hashing the message contents (in this case you can see Enigmail inserted the hashing algorithm it used - SHA512), then using the sender's private key to encrypt the hash along with a timestamp. The recipient can then verify the message by decrypting the signature block, re-hashing the contents, and comparing. Not only is the sender's identity verified, but so are the message contents and the sending time. Cool!
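If you want to produce a block like the one above yourself without going through Thunderbird, gpg can clearsign any text file (the file name here is just a stand-in):

  gpg --clearsign message.txt
  # Writes message.txt.asc containing the original text plus the signature block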

Sending and receiving signed messages through Enigmail is just as easy as sending and receiving encrypted messages - just select the option when sending, and Enigmail handles everything for you. Signed messages are automatically verified. Here's what the above message looks like when viewed in Enigmail:


Sometimes you might want to verify blocks of text (or an email you received through a client other than Thunderbird). That's really simple with GPG as well, though you'll have to go to the command line. To try it out for yourself, save the above signed message as test.txt and then run the following from a terminal window:
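  gpg --verify test.txt

gpg should respond with a "Good signature from" line showing my name and key ID (the exact output varies a little between versions), along with a warning about trust - more on that in a moment.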


Ta-da! Piece of cake.

A Quick Digression About Trusted Signatures

You'll see an ominous WARNING in the gpg output above, letting you know that this is not a trusted signature. This simply means that you have not indicated that you trust the source of the public key you used to verify the message. For instance, what if I am not Chris Clark at all, but some nefarious impersonator who has broken into Chris' blog and provided a public key in this post that is not Chris' at all? What then?? Who can we trust?
Turns out this is actually a pretty hard problem to solve. The PGP community has a concept called the web of trust, which you can read about on your own. Another approach is key signing parties. Ultimately you will need to make the trust determination on your own. You can then set your trust level for each public key in the GPG Keychain tool. By default all keys except your own are "undefined", and this warning won't go away until you've indicated you have "ultimate" trust in the key. 
Back to verification - for kicks, try modifying the message slightly - replace my "Really!" with "Really." and re-run the verification (also note that I've now set my own key back to "Ultimate" trust so the warning is gone). This time gpg reports a BAD signature instead of a good one, because the message contents no longer match the hash that was signed:


And that's it! You know how to use the GPG tool to communicate securely. There are all sorts of intricacies and details that are loads of fun to learn about, and you should now have a bit of a foundation to go exploring. Maybe those manual pages aren't so cryptic now after all. Good luck, and leave a comment if you'd like to exchange some encrypted emails with yours truly. You know where to find me.

Wednesday, September 11, 2013

Form Over Function & The New iPhone

I probably found this more profound than it really is, but Matt Buchanan said it well in the New Yorker, regarding the latest iPhone announcements:

Fundamental technology, like manufacturing processes for processors and imaging sensors and displays, have evolved to the point that the basic shape and sense of a phone—a thin rectangle with a four-to-five-inch high-resolution touch screen stuffed with a variety of sensors—is determined now largely based on its merits rather than its outright technical limitations, much the same way that the basic shape of a knife is defined by its function rather than our ability to produce it.

It's amazing to think that this is true; that the technological achievements of these devices are perhaps being held back more by the size and shape of our hands than by our manufacturing capabilities.

Thursday, August 22, 2013

Inline Vertically Scrolling Text with jQuery

We needed some vertically scrolling text on the epantry.com landing page. I found this vertical ticker example, but my use case proved a bit trickier because I wanted static text on either side of the vertically scrolling words. Based on that code, I cooked up this (view the source on jsFiddle):

(The embedded demo renders the sentence "This is how you ___ inline", with the middle word ticking vertically through "scroll", "tick", and "vertically".)
Perhaps someone else will find it useful. Ultimately we didn't need the trailing static text on the ePantry home page, so I wound up needing quite a bit less CSS to achieve the effect.
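For reference, here's a stripped-down sketch of the approach (class names, sizes, and timings are illustrative rather than the exact fiddle code): put the scrolling words in an inline-block list clipped to one line's height, then periodically animate the first item up and out and recycle it to the back of the list.

  <!-- markup: static text on either side of the scrolling list -->
  <p>This is how you
    <span class="ticker"><ul><li>scroll</li><li>tick</li><li>vertically</li></ul></span>
  inline</p>

  /* css: clip the list to a single line */
  .ticker { display: inline-block; height: 1.2em; overflow: hidden; vertical-align: bottom; }
  .ticker ul { list-style: none; margin: 0; padding: 0; }

  // js: slide the top word out, then move it to the bottom of the list
  setInterval(function () {
    var first = $('.ticker li:first');
    first.animate({ marginTop: '-1.2em' }, 500, function () {
      first.css('marginTop', 0).appendTo(first.parent());
    });
  }, 2000);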

Thursday, August 1, 2013

Declarative Approach to Nesting Backbone Models

Backbone doesn't have great (any?) support for nested models. Here's my approach. I think it's kind of fun (we get to write recursive functions!), and hopefully useful. I'll report back when I've lived with it for a bit longer.

Here's the problem: We have a Backbone model with attributes that ought to be other Backbone models, and some of those models' attributes ought to also be Backbone models. We want Backbone models all the way down! But when our REST endpoint returns data from the server, Backbone doesn't know that and hydrates only the top-level model to a first-class Backbone object. So we end up with ugly code inside _.each loops calling new App.Models.MyModel(data) all over the place. It's gross.

More concretely, after we fetch() a model (in this example, we'll use a Customer model with some nested data), we typically get a data structure back that looks like this:

  customer [App.Models.Customer]
    attributes : [Object]
      name : (String)
      email : (String)
      address : [Object]
      shipments : [Array]
        0 : [Object]
          date : (String)
          total : (Float)
          items : [Array]
            0 : [Object]
            1 : [Object]
        1 : [Object]
          date : (String)
          total : (Float)
          items : [Array]
            0 : [Object]
            1 : [Object]

But I actually have a bunch of Models and Collections that correspond to various pieces of my Customer model, so what I really want is this:

  customer : [App.Models.Customer]
    attributes : [Object]
      name : (String)
      email : (String)
      address : [App.Models.Address]
      shipments : [App.Collections.Shipments]
        models : [Array]
          0 : [App.Models.Shipment]
            attributes : [Object]
              date : (String)
              total : (Float)
              items : [App.Collections.ShipmentItems]
                models : [Array]
                  0 : [App.Models.ShipmentItem]
                  1 : [App.Models.ShipmentItem]
          1 : [App.Models.Shipment]
            attributes : [Object]
              date : (String)
              total : (Float)
              items : [App.Collections.ShipmentItems]
                models : [Array]
                  0 : [App.Models.ShipmentItem]
                  1 : [App.Models.ShipmentItem]

In other words, I want to walk down my root object, and map models and collections onto it where appropriate (we aren't trying to turn EVERYTHING into a Backbone object, just the objects we have models and collections for). You can of course just write a bunch of loops and do it by hand, but I've tried to create a more general-purpose method that can be reused across an app. Basically, each model can provide a "map" of how it should be rehydrated into a complete, nested model when it returns from the server. So let's write that (all examples are in CoffeeScript):

  class App.Models.Customer extends Backbone.Model
    
    map: () ->
      [
        { name: "address", obj: App.Models.Address },
        {
          name: "shipments", obj: App.Collections.Shipments,
          children: [
            { name: "items", obj: App.Collections.ShipmentItems }
          ]
        },
      ]

Pretty simple! Using plain JSON-style objects, we declare which attributes we want to turn into Backbone objects, and provide a reference to the relevant class. For deeper nesting, we just specify the child mappings in the same manner. With this simple pattern, we can easily write arbitrarily complex mappings, with deep nesting. Also, Backbone is helpful during the Collection instantiation process and converts the contents of an array into Backbone models, which is why we can stop the map at the Collections level (although we could go deeper).

Note also that map is an anonymous function, rather than just a property. This is because we want to evaluate map at the time we hydrate the object to ensure that all of the relevant Backbone types are loaded.

Now we need a way to walk the Customer model and apply the mapping. We'll write a helper function called "hydrate" for that, which will recursively walk the map and hydrate the objects:
       
  App.hydrate = (root, map) =>
      if root instanceof Backbone.Model
        App.hydrate(root.attributes, map)
      else
        _.each map, (field) ->
          if field.children
            root[field.name] = App.hydrate(root[field.name], field.children)
          if _.isArray root
            _.each root, (i) -> i[field.name] = new field.obj(i[field.name])
          else
            root[field.name] = new field.obj(root[field.name])
      root

Let's look at this in a bit more depth because I am barely smart enough to write recursive functions and it will give me confidence that I did it correctly if I can explain it to you.

First, if the root object is already a Backbone object, we'll walk the attributes instead of the Model itself. This is mostly a convenience thing so that we can call hydrate(model) instead of hydrate(model.attributes), but it also adds a bit of robustness in case you're doing something weird and call hydrate on a model where some, but not all, of the nested models are already Backbone objects. This'll still work and just hydrate the ones that still need it. I don't have a guard here against already-hydrated Collections, but it would be easy enough to add one.

The function then walks through each field in the map, recursively calling hydrate if there are any child maps present. Once we've reached the bottom of the map, the function checks whether the object is an array (in which case each item needs to be hydrated), or just a simple object (in which case just the target field needs to be hydrated).

This version of the function isn't tail-recursive because JavaScript VMs don't have tail-call optimization, so there's no point. It's easy to flip the order of the field.children check, though, and make it tail-recursive for when ECMAScript 6 comes out. The tail-recursive version of the function is just a little trickier to explain, so I opted for this one.

Ok - got all that? The next step is to actually call the darn thing! Because I'm in an auto-magical mood, we'll stick it on the model's parse() method. Parse fires after every fetch() and save() of the model, so we're guaranteed to get the hydrated version back after every server call. The full Customer class thus looks like this:

  class App.Models.Customer extends Backbone.Model
      
    map: () ->
        ...

    parse: (response) ->
        App.hydrate response, @map()

And voila! Every time we sync from the server, the model will hydrate itself! Neato!
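To make the whole flow concrete, here's roughly what using it looks like (a sketch with a made-up id, using only the classes defined above and assuming Customer has a url configured):

  customer = new App.Models.Customer(id: 42)
  customer.fetch().done ->
    # The nested attributes are now first-class Backbone objects
    console.log customer.get("shipments") instanceof App.Collections.Shipments
    console.log customer.get("shipments").first().get("items").first() instanceof App.Models.ShipmentItem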

Edit: After writing this, I found this answer on Stack Overflow, which is very similar and which I like a lot. In some situations I prefer my approach because there is a single map of the entire nested model in one place, on the root model, so it's very quick to get your head around the model relationships. On the other hand, rycfung's approach is nicely encapsulated and the parsing code is much simpler. And there is a lot to be said for simple code when you are up late at night debugging recursive functions.







Wednesday, July 10, 2013

How to Write a Bug Report

Writing good bug reports is the difference between actually seeing your bug get fixed and sending protracted emails over the course of a week convincing a developer that there is, in fact, a bug. By far the most important thing you can provide in a bug report is clear reproduction steps. Here's my guidance on how to write these:

1. Pretend you are writing to someone who has a very basic understanding of the product. 
2. Now also pretend that this person is an idiot.
3. Tell them, step-by-step, how to reproduce the issue.

So good reproduction steps look like this:

1. Log in as user test@test.com with password "foo" on the live website
2. View the dashboard
3. Note the dollar amount in the "Your subscription plan" area in the lower left
4. Hit the "pay" button next to that amount
5. Note the "Today's payment" field is not populated in step 1 of the checkout dialog.

Yes this can be a little tedious, but if you get in the habit of doing it, it only takes like 30 seconds more per bug, and if it takes longer it's because it's forcing you to actually reproduce the thing consistently, which is the whole point in the first place.

And using this numbered, step-wise, idiot-proof format will cut down on the back-and-forth between you and whoever is fixing the bug by an order of magnitude. So it's ultimately a huge time saver and a great habit to get into. Your life will contain 25% fewer bugs and developers will hug you.

Bonus tip: Whenever possible, provide screenshots!

Changing Our Development Process

This is a post about change management at a start-up.

Kaggle, at just under 20 people, is an interesting size because we are right on the cusp between being able to wing it in terms of process and communication, and needing more formal processes to make sure changes and plans are communicated to everyone who needs to know.

For a while our engineering team was just winging it - we used Asana (a sophisticated to-do list application) to assign and track tasks, but didn't have much estimation rigor, work would get scheduled "just in time", and priorities were constantly changing. While this roughly worked, and product was going out the door, we could have been doing better on a few fronts:

  • Productivity. With better prioritization and specification, task-switching time drops, and there are fewer questions to answer while an item is in flight.
  • Morale. With an ad-hoc process it's very easy to agree to deliver a new feature without a clear understanding of what's getting traded off. This leads to misset expectations or poorly communicated changes, and stresses everyone out.
  • Predictability. People are crummy at estimating how long complex tasks will take. So it's no surprise that, lacking good estimation techniques, we were consistently inaccurate estimating delivery dates for features.
  • Transparency. Asana is great for many things, but it was ultimately too flexible and inconsistent for anyone to look to it as a single source of truth for all development activities. So instead, people would ask me or individual engineers when something was going to ship.

We wanted a process that addressed these issues, without creating a bunch of bureaucracy that is anathema to start-ups (or, as one of our engineers put it, "better to manage, but worse"). I'll detail the process itself in a future post, but wanted to discuss here how we managed the change.

There were two basic approaches we could take:

  1. Make incremental changes, learning along the way, and evolve a process the whole team was happy with. We could research and try a few different tools, experiment with stand-ups, and ultimately figure out what makes the most sense for our team.
  2. Make a sudden, larger, opinionated change and tweak from there. Do a big-bang roll-out of an agile development process and incorporate the agile tools I've come to really like over the years.

I went for option 2. Management strong-arming is not likely to go over well with any engineering team, but I thought it actually made sense in this case for a few reasons:

  • The developers on the team had either good experiences with agile, or no experience with agile; there was no scar tissue.
  • While our company culture is contemplative, flat, and autonomous, we also place a lot of value on just trying stuff quickly.
  • I had earned enough political capital as an engineering manager that the team would give it a fair shake even if I was being a dictatorial jerk. In fact, I had tried to do something like this much earlier in my tenure at Kaggle and the team pushed back because I hadn't earned credibility yet.

And crucially, during the roll-out meeting, I explained that yes, I am being unilateral here, but here's why I think it makes sense anyway, and here is the meeting we are having in 3 weeks to evaluate how you like it, and at that same meeting you can decide to blow it all up if you hate it.

The final piece of the puzzle was to make it feel like a real change. When changing largely invisible things like process, there's the risk that, in spite of the change, people go back to doing roughly what they were doing before, just with different window dressing. To make it feel tangible and different, we mounted a big TV on the wall that displayed our current sprint, rearranged the floor plan to give the team a clear sense of space, and made sure our tracking tool (Pivotal) was integrated with our company chat room on day one, so everyone would see the stream of activity generated by the new process.

And bang! On Friday we were in ad-hoc winging-it land, and on Monday we were starting our first sprint, sitting in our first sizing meeting, with Fibonacci planning poker cards in front of us.

Thursday, January 17, 2013

Getting Started with Pandas - Predicting SAT Scores for New York City Schools


Predicting SAT Scores for NYC Schools

An Introduction to Pandas

This tutorial will get you started with Pandas - a data analysis library for Python that is great for data preparation, joining, and ultimately generating well-formed, tabular data that's easy to use in a variety of visualization tools or (as we will see here) machine learning applications. This tutorial assumes a solid understanding of core Python functionality, but nothing about machine learning or Pandas.

You can download the Python code for this tutorial here, or download the IPython Notebook file (.ipynb) here. Lastly, you can read this entire post as IPython Notebook HTML output here, which is significantly easier to read.

Goals

  1. Using data from NYC Open Data (https://data.cityofnewyork.us/), build a unified, tabular dataset ready for use with machine learning algorithms to predict student SAT scores on a per-school basis.
  2. Learn and use the Pandas data analysis package.
  3. Learn how data is typically prepared for machine learning algorithms (ingestion, cleaning, joining, feature generation).
First, let's ingest the data and get the lay of the land. You can download the data sets referenced below from NYC Open Data, or directly download a zip file with the relevant data.
In [8]:
import pandas as pd
import numpy as np   # used later for aggregation and the final model
import math          # used later to compute RMSE
 
# Load the data
dsProgReports = pd.read_csv('C:/data/NYC_Schools/School_Progress_Reports_-_All_Schools_-_2009-10.csv')
dsDistrict = pd.read_csv('C:/data/NYC_Schools/School_District_Breakdowns.csv')
dsClassSize = pd.read_csv('C:/data/NYC_Schools/2009-10_Class_Size_-_School-level_Detail.csv')
dsAttendEnroll = pd.read_csv('C:/data/NYC_Schools/School_Attendance_and_Enrollment_Statistics_by_District__2010-11_.csv')[:-2] #last two rows are bad
dsSATs = pd.read_csv('C:/data/NYC_Schools/SAT__College_Board__2010_School_Level_Results.csv') # Dependent
In [9]:
dsSATs
Out[9]:

Int64Index: 460 entries, 0 to 459
Data columns:
DBN                      460  non-null values
School Name              460  non-null values
Number of Test Takers    460  non-null values
Critical Reading Mean    460  non-null values
Mathematics Mean         460  non-null values
Writing Mean             460  non-null values
dtypes: object(6)

Outline

Pandas has read the data files without issue. Next let's create a rough map of where we are going.
We have five datasets here, each with information about either schools or districts. We're going to need to join all of this information together into a tabular file, with one row for each school, joined with as much information as we can gather about that school & its district, including our dependent variables, which will be the mean SAT scores for each school in 2010.
Drilling down one level of detail, let's look at the dataset dsSATs, which contains the target variables:
In [10]:
dsSATs
Out[10]:

Int64Index: 460 entries, 0 to 459
Data columns:
DBN                      460  non-null values
School Name              460  non-null values
Number of Test Takers    460  non-null values
Critical Reading Mean    460  non-null values
Mathematics Mean         460  non-null values
Writing Mean             460  non-null values
dtypes: object(6)

Target Variable and Joining Strategy

We are going to build a dataset to predict Critical Reading Mean, Mathematics Mean, and Writing Mean for each school (identified by DBN).
After digging around in Excel (or just taking my word for it) we identify the following join strategy (using SQL-esque pseudocode):
dsSATS join dsClassSize on dsSATs['DBN'] = dsClassSize['SCHOOL CODE']
join dsProgReports on dsSATs['DBN'] = dsProgReports['DBN']
join dsDistrict on dsProgReports['DISTRICT'] = dsDistrict['JURISDICTION NAME']
join dsAttendEnroll on dsProgReports['DISTRICT'] = dsAttendEnroll['District']
Now that we have the strategy identified at a high level, there are a number of details we have to identify and take care of first.

Primary Keys - Schools

Before we can join these three datasets together, we need to normalize their primary keys. Below we see the mismatch between the way the DBN (school id) field is represented in the different datasets. We then write code to normalize the keys and correct this problem.
In [11]:
pd.DataFrame(data=[dsProgReports['DBN'].take(range(5)), dsSATs['DBN'].take(range(5)), dsClassSize['SCHOOL CODE'].take(range(5))])
Out[11]:
             0       1       2       3       4
DBN          01M015  01M019  01M020  01M034  01M063
DBN          01M292  01M448  01M450  01M458  01M509
SCHOOL CODE  M015    M015    M015    M015    M015
In [12]:
#Strip the first two characters off the DBNs so we can join to School Code
dsProgReports.DBN = dsProgReports.DBN.map(lambda x: x[2:])
dsSATs.DBN = dsSATs.DBN.map(lambda x: x[2:])
 
#We can now see the keys match
pd.DataFrame(data=[dsProgReports['DBN'].take(range(5)), dsSATs['DBN'].take(range(5)), dsClassSize['SCHOOL CODE'].take(range(5))])
Out[12]:
             0     1     2     3     4
DBN          M015  M019  M020  M034  M063
DBN          M292  M448  M450  M458  M509
SCHOOL CODE  M015  M015  M015  M015  M015

Primary Keys - Districts

We have a similar story with the district foreign keys. Again, we need to normalize the keys. The only additional complexity here is that dsProgReports['DISTRICT'] is typed numerically, whereas the other two district keys are typed as string. We do some type conversions following the key munging.
In [13]:
#Show the key mismatches
#For variety's sake, using slicing ([:3]) syntax instead of .take()
pd.DataFrame(data=[dsProgReports['DISTRICT'][:3], dsDistrict['JURISDICTION NAME'][:3], dsAttendEnroll['District'][:3]])
Out[13]:
                   0                 1                 2
DISTRICT           1                 1                 1
JURISDICTION NAME  CSD 01 Manhattan  CSD 02 Manhattan  CSD 03 Manhattan
District           DISTRICT 01       DISTRICT 02       DISTRICT 03
In [14]:
#Extract well-formed district key values
#Note the astype(int) at the end of these lines to coerce the column to a numeric type
import re
dsDistrict['JURISDICTION NAME'] = dsDistrict['JURISDICTION NAME'].map(lambda x: re.match( r'([A-Za-z]*\s)([0-9]*)', x).group(2)).astype(int)
dsAttendEnroll.District = dsAttendEnroll.District.map(lambda x: x[-2:]).astype(int)
 
#We can now see the keys match
pd.DataFrame(data=[dsProgReports['DISTRICT'][:3], dsDistrict['JURISDICTION NAME'][:3], dsAttendEnroll['District'][:3]])
Out[14]:
                   0  1  2
DISTRICT           1  1  1
JURISDICTION NAME  1  2  3
District           1  2  3

Additional Cleanup

At this point we could do the joins, but there is messiness in the data still. First let's reindex the DataFrames so the semantics come out a bit cleaner. Pandas indexing is beyond the scope of this tutorial but suffice it to say it makes these operations easier.
In [15]:
#Reindexing
dsProgReports = dsProgReports.set_index('DBN')
dsDistrict = dsDistrict.set_index('JURISDICTION NAME')
dsClassSize = dsClassSize.set_index('SCHOOL CODE')
dsAttendEnroll = dsAttendEnroll.set_index('District')
dsSATs = dsSATs.set_index('DBN')
Let's take a look at one of our target variables. Right away we see the "s" value, which shouldn't be there.
We'll filter out the rows without data.
In [16]:
#We can see the bad value
dsSATs['Critical Reading Mean'].take(range(5))
Out[16]:
DBN
M292    391
M448    394
M450    418
M458    385
M509      s
Name: Critical Reading Mean
In [17]:
#Now we filter it out
 
#We create a boolean vector mask. Open question as to whether this is semantically ideal...
mask = dsSATs['Number of Test Takers'].map(lambda x: x != 's')
dsSATs = dsSATs[mask]
#Cast fields to integers. Ideally we should not need to be this explicit.
dsSATs['Number of Test Takers'] = dsSATs['Number of Test Takers'].astype(int)
dsSATs['Critical Reading Mean'] = dsSATs['Critical Reading Mean'].astype(int)
dsSATs['Mathematics Mean'] = dsSATs['Mathematics Mean'].astype(int)
dsSATs['Writing Mean'] = dsSATs['Writing Mean'].astype(int)
 
#We can see those values are gone
dsSATs['Critical Reading Mean'].take(range(5))
Out[17]:
DBN
M292    391
M448    394
M450    418
M458    385
M515    314
Name: Critical Reading Mean

Feature Construction

dsClassSize will be a many-to-one join with dsSATs because dsClassSize contains multiple entries per school. We need to summarize and build features from this data in order to get one row per school that will join neatly to dsSATs.
Additionally, the data has an irregular format, consisting of a number of rows per school describing different class sizes, then a final row for that school which contains no data except for a number in the final column, SCHOOLWIDE PUPIL-TEACHER RATIO.
We need to extract the SCHOOLWIDE PUPIL-TEACHER RATIO rows, at which point we'll have a regular format and can build features via aggregate functions. We'll also drop any features that can't be easily summarized or aggregated and likely have no bearing on the SAT scores (like School Name).
In [18]:
#The shape of the data
print dsClassSize.columns
print dsClassSize.take([0,1,10]).values
array([BORO, CSD, SCHOOL NAME, GRADE , PROGRAM TYPE,
       CORE SUBJECT (MS CORE and 9-12 ONLY),
       CORE COURSE (MS CORE and 9-12 ONLY), SERVICE CATEGORY(K-9* ONLY),
       NUMBER OF CLASSES, TOTAL REGISTER, AVERAGE CLASS SIZE,
       SIZE OF SMALLEST CLASS, SIZE OF LARGEST CLASS, DATA SOURCE,
       SCHOOLWIDE PUPIL-TEACHER RATIO], dtype=object)
[[M 1 P.S. 015 ROBERTO CLEMENTE 0K GEN ED - - - 1.0 21.0 21.0 21.0 21.0 ATS
  nan]
 [M 1 P.S. 015 ROBERTO CLEMENTE 0K CTT - - - 1.0 21.0 21.0 21.0 21.0 ATS
  nan]
 [M 1 P.S. 015 ROBERTO CLEMENTE nan nan nan nan nan nan nan nan nan nan nan
  8.9]]
In [19]:
#Extracting the Pupil-Teacher Ratio
 
#Take the column
dsPupilTeacher = dsClassSize.filter(['SCHOOLWIDE PUPIL-TEACHER RATIO'])
#And filter out blank rows
mask = dsPupilTeacher['SCHOOLWIDE PUPIL-TEACHER RATIO'].map(lambda x: x > 0)
dsPupilTeacher = dsPupilTeacher[mask]
#Then drop from the original dataset
dsClassSize = dsClassSize.drop('SCHOOLWIDE PUPIL-TEACHER RATIO', axis=1)
 
#Drop non-numeric fields
dsClassSize = dsClassSize.drop(['BORO','CSD','SCHOOL NAME','GRADE ','PROGRAM TYPE',\
'CORE SUBJECT (MS CORE and 9-12 ONLY)','CORE COURSE (MS CORE and 9-12 ONLY)',\
'SERVICE CATEGORY(K-9* ONLY)','DATA SOURCE'], axis=1)
 
#Build features from dsClassSize
#In this case, we'll take the max, min, and mean
#Semantically equivalent to select min(*), max(*), mean(*) from dsClassSize group by SCHOOL CODE
#Note that SCHOOL CODE is not referenced explicitly below because it is the index of the dataframe
grouped = dsClassSize.groupby(level=0)
dsClassSize = grouped.aggregate(np.max).\
    join(grouped.aggregate(np.min), lsuffix=".max").\
    join(grouped.aggregate(np.mean), lsuffix=".min", rsuffix=".mean").\
    join(dsPupilTeacher)
 
print dsClassSize.columns
array([NUMBER OF CLASSES.max, TOTAL REGISTER.max, AVERAGE CLASS SIZE.max,
       SIZE OF SMALLEST CLASS.max, SIZE OF LARGEST CLASS.max,
       NUMBER OF CLASSES.min, TOTAL REGISTER.min, AVERAGE CLASS SIZE.min,
       SIZE OF SMALLEST CLASS.min, SIZE OF LARGEST CLASS.min,
       NUMBER OF CLASSES.mean, TOTAL REGISTER.mean,
       AVERAGE CLASS SIZE.mean, SIZE OF SMALLEST CLASS.mean,
       SIZE OF LARGEST CLASS.mean, SCHOOLWIDE PUPIL-TEACHER RATIO], dtype=object)

Joining

One final thing before we join - dsProgReports contains distinct rows for separate grade level blocks within one school. For instance one school (one DBN) might have two rows: one for middle school and one for high school. We'll just drop everything that isn't high school.
And finally we can join our data. Note that the district-level data gets joined onto each school in that district.
In [20]:
mask = dsProgReports['SCHOOL LEVEL*'].map(lambda x: x == 'High School')
dsProgReports = dsProgReports[mask]
In [21]:
final = dsSATs.join(dsClassSize).\
join(dsProgReports).\
merge(dsDistrict, left_on='DISTRICT', right_index=True).\
merge(dsAttendEnroll, left_on='DISTRICT', right_index=True)

(Even More) Additional Cleanup

We should be in a position to build a predictive model for our target variables right away but unfortunately there is still messy data floating around in the dataframe that machine learning algorithms will choke on. A pure feature matrix should have only numeric features, but we can see that isn't the case. However for many of these columns, the right approach is obvious once we've dug in.
In [22]:
final.dtypes[final.dtypes.map(lambda x: x=='object')]
Out[22]:
School Name                      object
SCHOOL                           object
PRINCIPAL                        object
PROGRESS REPORT TYPE             object
SCHOOL LEVEL*                    object
2009-2010 OVERALL GRADE          object
2009-2010 ENVIRONMENT GRADE      object
2009-2010 PERFORMANCE GRADE      object
2009-2010 PROGRESS GRADE         object
2008-09 PROGRESS REPORT GRADE    object
YTD % Attendance (Avg)           object
In [23]:
#Just drop string columns.
#In theory we could build features out of some of these, but it is impractical here
final = final.drop(['School Name','SCHOOL','PRINCIPAL','SCHOOL LEVEL*','PROGRESS REPORT TYPE'],axis=1)
 
#Remove % signs and convert to float
final['YTD % Attendance (Avg)'] = final['YTD % Attendance (Avg)'].map(lambda x: x.replace("%","")).astype(float)
 
#The last few columns we still have to deal with
final.dtypes[final.dtypes.map(lambda x: x=='object')]

Categorical Variables

We can see above that the remaining non-numeric fields are grades. Intuitively, they might be important, so we don't want to drop them, but in order to get a pure feature matrix we need numeric values. The approach we'll use here is to explode these into multiple boolean columns. Some machine learning libraries effectively do this for you under the covers, but when the cardinality of the categorical variable is relatively low, it's nice to be explicit about it.
In [24]:
gradeCols = ['2009-2010 OVERALL GRADE','2009-2010 ENVIRONMENT GRADE','2009-2010 PERFORMANCE GRADE','2009-2010 PROGRESS GRADE','2008-09 PROGRESS REPORT GRADE']
 
grades = np.unique(final[gradeCols].values) #[nan, A, B, C, D, F]
 
for c in gradeCols:
    for g in grades:
        final = final.join(pd.Series(data=final[c].map(lambda x: 1 if x is g else 0), name=c + "_is_" + str(g)))
 
final = final.drop(gradeCols, axis=1)
 
#Uncomment to generate csv files
#final.drop(['Critical Reading Mean','Mathematics Mean','Writing Mean'],axis=1).to_csv('C:/data/NYC_Schools/train.csv')
#final.filter(['Critical Reading Mean','Mathematics Mean','Writing Mean']).to_csv('C:/data/NYC_Schools/target.csv')
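As an aside, pandas has a built-in helper for this kind of one-hot explosion. Something like the following should be roughly equivalent (a sketch I haven't run in this notebook - the generated column names differ slightly from the ones built above):

#Hypothetical alternative using pandas' get_dummies
for c in gradeCols:
    final = final.join(pd.get_dummies(final[c].fillna('nan'), prefix=c + "_is"))
final = final.drop(gradeCols, axis=1)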

That's it!

We now have a feature matrix that's trivial to use with any number of machine learning algorithms. Feel free to stop here and run the two lines of code below to get nice CSV files written to disk that you can easily use in Excel, Tableau, etc. Or run the larger block of code to see how easy it is to build a random forest model against this data, and look at which variables are most important.
In [ ]:
#Uncomment to generate csv files
#final.drop(['Critical Reading Mean','Mathematics Mean','Writing Mean'],axis=1).to_csv('C:/data/NYC_Schools/train.csv')
#final.filter(['Critical Reading Mean','Mathematics Mean','Writing Mean']).to_csv('C:/data/NYC_Schools/target.csv')
In [25]:
from sklearn.ensemble import RandomForestRegressor
 
target = final.filter(['Critical Reading Mean'])
#We drop all three dependent variables because we don't want them used when trying to make a prediction.
train = final.drop(['Critical Reading Mean','Writing Mean','Mathematics Mean'],axis=1)
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, compute_importances = True)
model.fit(train, target)
 
predictions = np.array(model.predict(train))
rmse = math.sqrt(np.mean((np.array(target.values) - predictions)**2))
imp = sorted(zip(train.columns, model.feature_importances_), key=lambda tup: tup[1], reverse=True)
 
print "RMSE: " + str(rmse)
print "10 Most Important Variables:" + str(imp[:10])
RMSE: 80.13105688
10 Most Important Variables:[('PEER INDEX*', 0.81424747874371173), ('TOTAL REGISTER.min', 0.060086333792196724), ('2009-2010 ENVIRONMENT CATEGORY SCORE', 0.023810405565050669), ('2009-2010 ADDITIONAL CREDIT', 0.021788425210174274), ('2009-2010 OVERALL SCORE', 0.019046860376900468), ('AVERAGE CLASS SIZE.mean', 0.0094882658926829649), ('2009-2010 PROGRESS CATEGORY SCORE', 0.0094678349064146652), ('AVERAGE CLASS SIZE.min', 0.0063723026953534942), ('2009-2010 OVERALL GRADE_is_nan', 0.0057710237481254567), ('Number of Test Takers', 0.0053660239780210584)]