Warning: Spoilers ahead if you have not seen the first two episodes of the new season.
In the first episode of the season, Brooke, after winning the quickfire, claimed she was in a good position because the winner of the first challenge often goes on to win the whole thing. Actually, only one contestant has won the first quickfire and gone on to be top chef (Richard in season 8), and that was a team win. The winner of the first elimination challenge has won the competition 5 of 12 times (not counting season 9, when a whole team won the elimination challenge). This got me wondering whether there were other predictors of who would win Top Chef.
There's not much data after the first elimination challenge, but I tried building a predictive model using each chef's gender, age, quickfire and elimination performance, and current residence (though I ultimately kept only the most predictive features from that list). Using these features, with elimination number as the target variable, I built a gradient-boosted decision tree model to predict when this season's chefs would be eliminated. I validated the model on seasons 12 and 13 and then applied it to season 14. The metric I optimized during validation was the total distance between the predicted and actual placings of the contestants. The model correctly predicted the winner of both validation seasons, but seasons 12 and 13 both happened to be seasons where the winner of the first elimination challenge became top chef.
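To make the validation metric concrete, here is a minimal sketch of one way to compute a total placement distance. It assumes the distance is the sum of absolute differences between predicted and actual finishing places, and that predicted places come from ranking the predicted elimination numbers. The function name and the contestant data are hypothetical, not taken from the actual code.

```python
def placement_distance(predicted_elimination, actual_places):
    """Sum of |predicted place - actual place| over all contestants.

    predicted_elimination: dict of name -> predicted elimination number
        (assumption: a higher number means the chef lasts longer).
    actual_places: dict of name -> actual finishing place (1 = winner).
    """
    # Rank contestants by predicted elimination number, latest exit first.
    # Ties keep their input order because sorted() is stable.
    ranked = sorted(predicted_elimination,
                    key=lambda name: -predicted_elimination[name])
    predicted_places = {name: i + 1 for i, name in enumerate(ranked)}
    return sum(abs(predicted_places[name] - actual_places[name])
               for name in actual_places)


# Hypothetical four-chef season, purely for illustration.
predicted = {"A": 14.2, "B": 9.1, "C": 11.5, "D": 3.0}
actual = {"A": 1, "B": 2, "C": 3, "D": 4}
distance = placement_distance(predicted, actual)
```

In this made-up example the model swaps the second and third place chefs, so each is off by one place and the total distance is 2. A perfect ordering would give 0.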
The most significant features in predicting the winner were: elimination challenge 1 performance, season (catching general trends across seasons), gender, home state advantage, being from Boston, being from California, and being from Chicago. Male chefs do happen to do better, as do chefs from the state where Top Chef is being filmed. Being from Chicago is a little better than being from California, which is better than being from Boston. To visualize this better, I took these significant features and performed a PCA to plot the data in two dimensions. This shows how the data cluster, without using any knowledge of the ultimate placement of the contestants.
A plot of the PCA components using the key identified features. The colors represent the ultimate position of the contestants. Blue represents more successful contestants, while red represents less successful contestants. The horizontal direction corresponds mostly to first elimination success (with more successful contestants on the right), and the vertical direction corresponds primarily to gender (with male on top). The smaller spreads indicate the other features, such as the contestant's home city. We see that even toward the left there are dark blue points, meaning that nothing is an absolute deal-breaker to become top chef, but of course, winning the first challenge puts contestants in a better position.
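For readers curious how a two-dimensional PCA projection like the one plotted above can be produced, here is a rough sketch using NumPy. It assumes the features are only centered, not rescaled, and the feature rows are hypothetical. The actual preprocessing may differ.

```python
import numpy as np


def pca_2d(features):
    """Project the rows of `features` onto their first two principal components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)  # center each feature column
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T


# Hypothetical rows: [won first elimination, is male, home state advantage]
features = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
coords = pca_2d(features)  # one (x, y) point per contestant
```

The final placings are never passed to `pca_2d`, so any clustering in the output comes from the features alone. The placings would only be used afterward to color the points. Note that the sign of each component is arbitrary, so a plot may come out mirrored.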
My prediction model quite predictably puts Casey as the favorite for winning it all, with Katsuji in second place. The odds are a bit stacked against Casey though. If she were male or from Chicago or if this season's Top Chef were taking place in California, she would have a higher chance of winning. Katsuji's elevated prediction is coming from being on the winning team in the first elimination while being male and from California. He struggled a bit when he was last on the show, though, so I don't know if my personal prediction would put him so high. Brooke, even though she thought she was in a good position this season, is tied for fifth place according to my prediction. My personal prediction would probably put her higher since she did so well in her previous season.
Of course, there's only so much the models can predict. For one thing, there's not enough data to reliably figure out how returning chefs do. This season, it's half new and half old contestants. The model probably learned a bit of this, though, since the experienced chefs won the first elimination challenge, which was included in the model. One thing I thought about adding but didn't was what the chefs actually cooked. Specific ingredients or cooking techniques might be relevant features for the predictive model. However, this data wasn't easy to find without re-watching all the episodes, and given the constraints of all the challenges, I wasn't sure these features would be all that relevant (e.g., season 11 was probably the only time turtle was cooked in an elimination challenge). With more data the model would get better; most winners rack up some wins by the time a few elimination challenges have passed.
The code is available here.