Some notes about the data set

After spending some time with the data, I’m starting to realize that it has a number of limitations. There are 100 lines in the data set, but unfortunately there are not 100 data points: 5 people entered the survey but exited after the first question. This is not a large data set, and we have no way of knowing if it is representative of the population of graduate philosophy applicants (we in fact have some excellent reasons to believe that it is not: the response bias inherent in online surveys, and the fact that TGC attracts … a certain crowd).

There were several ‘summary’ questions added to the survey after it had opened. These questions asked how many programs in the top 20, top 50, and T-7 (for Master’s programs) each applicant had applied to and been accepted to. Some people used these questions in addition to reporting their school-by-school results, but some reported their results only in these questions. Since the questions were added late, people who had already taken the survey did not have an opportunity to report their results using them at all.

This poses a few problems for anyone interested in analyzing this data. Basically, there are two sources of data in the spreadsheet: the summary questions and the school-by-school reporting. They report related, but not identical, pieces of information. One or both is missing or incomplete for almost every person who took the survey. It might be possible to answer the summary questions based on the school-by-school data, but it would be time consuming, to say the least. The answers to the summary questions are also of limited use for continental students; many top continental schools are unlisted or ranked low on PGR.
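For anyone who wants to try filling in the summary questions from the school-by-school data, here is a rough sketch in Python of what that might look like. Everything in it is an assumption made for illustration: the school names, the rank lookup, the outcome labels, and the `summarize` function are all hypothetical, and the real spreadsheet is not laid out this neatly. Schools that are missing from the rank lookup are simply skipped, which is exactly the weakness for continental applicants described above.

```python
from collections import defaultdict

# Hypothetical rank lookup (placeholder names and ranks, not real data).
# A real one would have to be built by hand from the PGR listing.
RANKS = {"School A": 5, "School B": 18, "School C": 42}

# Hypothetical school-by-school rows: (respondent id, school, outcome).
ROWS = [
    (1, "School A", "accepted"),
    (1, "School C", "rejected"),
    (1, "School D", "accepted"),  # not in RANKS, e.g. an unlisted program
    (2, "School B", "waitlisted"),
    (2, "School C", "accepted"),
]


def summarize(rows, ranks, cutoffs=(20, 50)):
    """Count applications and acceptances per respondent within each rank cutoff."""
    def empty_counts():
        return {
            f"top{c}_{kind}": 0
            for c in cutoffs
            for kind in ("applied", "accepted")
        }

    summary = defaultdict(empty_counts)
    for respondent, school, outcome in rows:
        counts = summary[respondent]  # keep the respondent even if nothing is ranked
        rank = ranks.get(school)
        if rank is None:
            continue  # unranked or unlisted schools cannot be counted
        for c in cutoffs:
            if rank <= c:
                counts[f"top{c}_applied"] += 1
                if outcome == "accepted":
                    counts[f"top{c}_accepted"] += 1
    return dict(summary)


result = summarize(ROWS, RANKS)
for respondent, counts in sorted(result.items()):
    print(respondent, counts)
```

Note that respondent 1's acceptance at the unlisted "School D" never shows up in any of the counts, so a derived summary like this would understate their results.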

The method I used to count acceptances, rejections, and waitlists in the post about traditions only captured the data contained in the school-by-school reporting. I want to be honest and forthcoming about that method and the flaws it has as a result. The people I identified as ‘high achievers’ were also only those captured by the school-by-school reporting (although since there was no statistical analysis, the observations I made are still valid; they just don’t capture all the people who fit into that category. Consider it a limited sample). I was only made aware of this problem because someone recognized that they hadn’t been included in that post, although they had been extremely successful this application season.

All of this is to say that while I’ve found it interesting to look at the data more closely, it should all be taken with a grain of salt. The data has its limitations, just as I have mine (as I’ve said before, I am not a mathematician).

I am still very grateful that Ian was willing to take this on, and I think he did a great job. But if someone were to take up this mantle next year, I hope they would look to design a survey that could avoid these issues.



