PhD students should probably avoid the best schools

MViz
Nov 18, 2020

Malcolm Gladwell is a wonderful storyteller. In his book David and Goliath¹, he tells stories about being an underdog. In particular, he talks about the advantages of being a big fish in a small pond.

He tells a story about dropout rates among STEM students, explaining that the best predictor of whether a student will drop out of a STEM degree is not their test scores or IQ, but their class rank. Students are more likely to drop out if they are in the bottom half of the class, even if they’re brilliant, so choosing the best school could be detrimental to a student’s success.

This blog is about one of Gladwell’s follow-up stories: a paper by John P. Conley and Ali Sina Önder². They looked at the publishing rates of economics PhDs six years after graduation and found something interesting: only a few students from each graduating class end up regularly publishing in their early careers. It invites a comparison to the STEM dropouts.

This post takes a look at the underlying data from that study and makes the case that PhD students should choose a program where they will be in the top 10% of their class.

The study collected data from around fourteen thousand graduates of economics PhD programs between the years 1985 and 2000. Here’s what the data might look like for two graduates:

[Figure: A peek at the data, two example graduates. Columns: student id, graduation year, school rank, and number of AER-equivalent papers published.]
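If it helps to see that shape in code, here’s a minimal sketch of the same table as a pandas DataFrame. The two rows are entirely hypothetical; only the columns match the study’s data.

```python
import pandas as pd

# Two hypothetical graduates; only the column layout mirrors the study's data.
graduates = pd.DataFrame(
    {
        "student_id": ["grad_001", "grad_002"],
        "graduation_year": [1992, 1997],
        "school_rank": [1, 30],
        "aer_equivalent_papers": [1.1, 0.2],
    }
)
print(graduates)
```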

Each pseudonymous graduate is tied to a school rank and a publishing score. The school ranks come from Coupé (2003); for example:

  • Harvard is Rank 1
  • UPenn is Rank 2
  • Carnegie Mellon is Rank 30
  • etc.

The publishing score was created by the authors as a way of comparing the publishing success of graduates who publish papers in many different journals. They weighted each journal against the gold standard of economics journals: The American Economic Review (AER). Here’s an example of two graduates, and some fake journals:

[Figure: Example publication scoring, showing how papers in different journals might be weighted to create a publication score.]
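Here’s a minimal sketch of how that weighting might work in code. The journal names and weights are invented for illustration; the study defines its own weighting scheme.

```python
# Hypothetical journal weights, expressed relative to the AER.
JOURNAL_WEIGHTS = {
    "American Economic Review": 1.0,   # the benchmark: one AER paper counts as 1.0
    "Journal of Fake Results": 0.5,    # hypothetical: worth half an AER paper
    "Obscure Economics Letters": 0.1,  # hypothetical: worth a tenth of an AER paper
}

def publication_score(papers):
    """Sum the AER-equivalent weight of every paper a graduate published."""
    return sum(JOURNAL_WEIGHTS[journal] for journal in papers)

# Two made-up graduates, six years out:
grad_a = ["American Economic Review", "Obscure Economics Letters"]
grad_b = ["Journal of Fake Results", "Journal of Fake Results", "Obscure Economics Letters"]

print(publication_score(grad_a))  # 1.1 AER-equivalent papers
print(publication_score(grad_b))  # 1.1 AER-equivalent papers
```

In this toy example, one AER paper plus a minor publication ends up worth the same as three papers in weaker journals, which is exactly the kind of comparison the score is meant to allow.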

So the question is: does going to a higher ranked school help students publish more papers when they graduate?

Let’s imagine two scenarios for the sake of examining our biases.

  1. School rank does not affect how many papers students publish
  2. Higher ranked schools help students publish more papers than lower ranked schools

In both scenarios we assume students choose schools at random. This helps us look at these scenarios in isolation.

We’re going to compare students at equivalent percentiles: comparing the best students from each school, the average students from each school, etc.

If school rank does not affect the number of papers published, you would expect most schools to look pretty similar, like this:

Now, let’s look at the possibility that higher ranked schools help students publish more papers than the lower ranked schools. This is the standard thinking, isn’t it?

For example, you might guess that graduates of the top ranked school publish twice as much as the 30th ranked school. Here’s what that might look like:

The scale here doesn’t really matter, nor does the amount that the school helps in this fake world. The point is to demonstrate how the data might appear if the common wisdom were true. Notice that in this world, the median student at School 1 performs as well as or better than the top students from the bottom ranked schools.
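To make the two scenarios concrete, here’s a small simulation sketch. The distribution, the school sizes, and the size of the boost in scenario 2 are all made up; the point is only to show how comparing students at equivalent percentiles would look in each world.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, grads_per_school = 30, 100  # hypothetical sizes, not the study's

def simulate(boost):
    """Draw fake publication scores per school; boost(rank) scales the school's mean."""
    return {
        rank: rng.gamma(shape=0.5, scale=2.0 * boost(rank), size=grads_per_school)
        for rank in range(1, n_schools + 1)
    }

# Scenario 1: school rank has no effect at all.
no_effect = simulate(lambda rank: 1.0)
# Scenario 2: the top school doubles output relative to the 30th-ranked school.
rank_helps = simulate(lambda rank: 2.0 - (rank - 1) / (n_schools - 1))

# Compare students at equivalent percentiles (the median and the 95th percentile).
for name, world in [("no effect", no_effect), ("rank helps", rank_helps)]:
    medians = [np.percentile(world[r], 50) for r in (1, 15, 30)]
    top = [np.percentile(world[r], 95) for r in (1, 15, 30)]
    print(f"{name}: median at ranks 1/15/30 = {np.round(medians, 2)}, "
          f"95th percentile = {np.round(top, 2)}")
```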

I think you can guess that this is not what the real world looks like.

Here’s the actual data:

Here are some high level observations:

  • The overall top publishers are from the highest ranked schools
  • The highest performing graduates from each school have strong publishing records
  • The median graduates aren’t publishing much, regardless of the school

The data seems to say that being an average graduate from a top school isn’t so great.

Analyzing the data

Feel free to skip to the conclusion if you don’t care about the analysis :)

The data from each school does look different, but are the differences statistically significant? Or, what is the most likely average publication rate of each school, given the observed data? To answer that question I used a technique called Bayesian inference.

You can see the analysis here, and a three-dimensional rendering of the data here.

I modeled the data by grouping the students by their schools and looking at each school’s overall performance over the years. The biggest benefit of the model is that we get an uncertainty estimate for each school, which helps us make comparisons between schools.

[Figure: Sorting students by school rank]
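For the curious, here’s a minimal sketch of that kind of grouped model, written with PyMC and ArviZ. The priors, the likelihood, and the stand-in data are my own assumptions for illustration; the notebook linked above may set things up differently.

```python
import numpy as np
import pymc as pm
import arviz as az

# Stand-in data: one entry per graduate, with a school index and an
# AER-equivalent publication score. Sizes and values are made up here.
n_schools = 30
school_idx = np.repeat(np.arange(n_schools), 50)  # 30 schools, 50 grads each
pubs = np.random.default_rng(1).gamma(0.5, 1.5, size=school_idx.size)

with pm.Model() as model:
    # Partial pooling: each school's average publication rate is drawn from a
    # shared distribution, so every school also gets its own uncertainty.
    mu = pm.HalfNormal("mu", sigma=2.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    school_avg = pm.Gamma("school_avg", mu=mu, sigma=sigma, shape=n_schools)

    # Likelihood: each graduate's score scatters around their school's average.
    pm.Exponential("obs", lam=1 / school_avg[school_idx], observed=pubs)

    idata = pm.sample(1000, tune=1000, random_seed=1)

# One 94% interval per school, which is what the forest plots below show.
az.plot_forest(idata, var_names=["school_avg"], hdi_prob=0.94, combined=True)
```

Partial pooling is what gives each school an interval rather than a single number, and it keeps schools with few graduates from looking artificially extreme.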

The model produces an estimate and an uncertainty range that tells us, with 94% certainty, where we believe the real average lies. Each school gets an estimated range for the average number of papers published by its graduates. Here’s an example:

[Figure: Explaining the forest plot bars]

Here’s what we get from the schools:

The model says schools are statistically different when there is a gap between the blue bars. If there is no gap (i.e. there is overlap), then we failed to show a difference between the schools — we didn’t prove there is no difference (innocent until proven guilty, if you will).

[Figure: Identifying statistical difference in the forest plot]
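If you’d rather check the gap-between-the-bars rule in code than by eye, here’s one way to do it with ArviZ, continuing from the model sketch above (the school indices are just examples):

```python
import arviz as az

# 94% highest-density interval for each school's estimated average.
hdi = az.hdi(idata, var_names=["school_avg"], hdi_prob=0.94)["school_avg"]

def overlaps(school_a, school_b):
    """True if the two schools' 94% intervals overlap, i.e. we failed to show a difference."""
    lo_a, hi_a = hdi.values[school_a]
    lo_b, hi_b = hdi.values[school_b]
    return lo_a <= hi_b and lo_b <= hi_a

print(overlaps(0, 29))  # e.g. compare the top-ranked school with the 30th-ranked one
```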

When you look at the chart above, you might notice that there is a lot of overlap between schools, so what conclusions can we draw from the data?

The data is retrospective: it’s a history of what has happened, and can only give us clues as to what might happen in the future. The biggest limiting factor is that we know the best students choose top ranked programs, so we don’t know to what extent graduate performance is a result of good students or good teaching. If only students chose PhD programs at random, right?

Of course it’s possible that these outcomes reflect the fact that publishing papers is really hard, but that raises the question of whether these programs are doing a good job training students in the first place.

Unfortunately the data can’t tell you where you are most likely to publish the most papers.

Where does that leave us?

If trends continue, we can expect top ranked schools to produce the highest average number of papers published. However, it seems only a few graduates from each school end up with strong early publishing careers. If the necessary skills to publish are provided by the school, then only a handful of students are getting this benefit. If the skills are innate to the students, then who cares what school you pick? Just choose a cool advisor at a cheap school.

Huh. Remember Malcolm Gladwell? His takeaway is that it’s better to be a big fish in a small pond, i.e. choose a school where you think you can be one of the higher performing graduates.

It can be easy to forget that these statistics represent real people with their own advantages and challenges. I believe this data is good evidence to challenge the prevailing wisdom that the top schools should be everyone’s goal. Students should look at actual outcomes of school programs, and find the best fit for themselves. While Gladwell’s conclusions stretch beyond the limits of correlated data, I think there is wisdom in putting your ego aside and finding a program where you can celebrate your own success.

1. Gladwell, M. (2013). David and Goliath: Underdogs, Misfits, and the Art of Battling Giants (First edition). New York: Little, Brown and Company.

2. Conley, John P., and Ali Sina Önder. 2014. “The Research Productivity of New PhDs in Economics: The Surprisingly High Non-success of the Successful.” Journal of Economic Perspectives, 28 (3): 205–16.
