My son is interested in playing soccer in college, and I wanted to get a dataset of colleges and universities with their associated conference and division, average SAT score, acceptance rate and annual cost. Maybe there’s a way to do this using existing tools, but what better way to deal with the stress than by turning this into a data problem?
I started by using wikitable2csv to download the tables from the following Wikipedia pages:
Then I downloaded the most recent College Scorecard data from:
All these files are in the data directory. Then I ran the convert.py program to:
I originally tried using the Wikidata reconciliation service on both datasets and then joining on the Wikidata ID. But I ran into false positives, perhaps because I wasn’t also using the city and state as part of the reconciliation.
A simple school name match wasn’t good enough (562/854 matches). After some experimentation I ended up using the state to limit the candidate matches for each school, and then using Levenshtein distance to find the best match (785/854 matches). When I get around to it I’ll manually match up the remaining 69.
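Here is a minimal sketch of the state-limited matching idea, assuming each dataset can be reduced to (name, state) pairs. The function names, the `max_distance` cutoff and the sample rows are hypothetical and are not taken from convert.py. The edit distance is written out in plain Python so the example has no dependencies.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, using a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(name.lower().split())


def match_schools(soccer_schools, scorecard_schools, max_distance=5):
    """Map each (name, state) soccer school to the closest scorecard name.

    Only scorecard schools in the same state are considered. Schools with no
    candidate within max_distance (an assumed cutoff) map to None.
    """
    by_state = defaultdict(list)
    for name, state in scorecard_schools:
        by_state[state].append(name)

    matches = {}
    for name, state in soccer_schools:
        best, best_distance = None, None
        for candidate in by_state.get(state, []):
            distance = levenshtein(normalize(name), normalize(candidate))
            if best_distance is None or distance < best_distance:
                best, best_distance = candidate, distance
        if best is not None and best_distance <= max_distance:
            matches[(name, state)] = best
        else:
            matches[(name, state)] = None
    return matches


# Made-up sample rows, for illustration only.
soccer = [("Williams Colege", "MA"), ("Nowhere Tech", "ZZ")]
scorecard = [("Williams College", "MA"), ("Williams College", "XX"),
             ("Amherst College", "MA")]
result = match_schools(soccer, scorecard)
```

Grouping the candidates by state first keeps a similarly named school in another state from winning the match, and it also cuts down the number of distance computations per school. Anything that maps to None would go on the list for manual matching.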