There have been numerous attempts to predict marathon performance based on physiological measurements and training data; Dr. Christof Schweining gives a good overview of these. However, many of these models have been fitted to very small data samples and do not generalise well to new data.
Now that we have access to a large data set of marathon runners and their training, we can test these models and quantify their predictive power. Unfortunately, beyond age and gender, we have no access to physiological measurements, so we are restricted to models based on training data. As discussed by Christof, the only such model which stands up to more than basic scrutiny is Tanda's prediction formula. However, as we will see, this model fails to generalise outside of its very small training data set.
The Tanda model predicts male marathon time by the following formula.
$$P_m = 17.1 + 140.0\,e^{-0.0053K} + 0.55P$$
where $P_m$ is the predicted marathon pace in seconds per kilometre, $K$ is the average distance run per week in kilometres, and $P$ is the average training pace in seconds per kilometre. The training block used to obtain $K$ and $P$ is the 8 weeks prior to the week of the race (i.e. 9 weeks from race day until 1 week from race day). As an example, an athlete running 60 miles per week at an average pace of 7:30 per mile would be expected to run 2:59:11. A calculator for predicting marathon time based on these statistics is available at http://www.paceguru.co.uk/.
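As a quick check of the worked example, here is a minimal Python sketch of the formula. The coefficients are those published in Tanda's paper; the function names, the mile-to-kilometre conversion and the 42.195 km marathon distance are my own additions for illustration.

```python
import math

MARATHON_KM = 42.195     # standard marathon distance
KM_PER_MILE = 1.609344   # international mile


def tanda_pace(weekly_km, training_pace_s_per_km):
    """Predicted marathon pace (s/km) from weekly distance (km) and training pace (s/km)."""
    return 17.1 + 140.0 * math.exp(-0.0053 * weekly_km) + 0.55 * training_pace_s_per_km


def format_hms(seconds):
    """Format a duration as h:mm:ss, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# The athlete from the example: 60 miles per week at 7:30 per mile.
weekly_km = 60 * KM_PER_MILE
training_pace = (7 * 60 + 30) / KM_PER_MILE  # seconds per kilometre
predicted_seconds = tanda_pace(weekly_km, training_pace) * MARATHON_KM
print(format_hms(predicted_seconds))
```

With fractional seconds truncated, this reproduces the 2:59:11 quoted above.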
As we can see from the equation, the model predicts faster marathon times for those who run more and those who run faster. However, while a faster average training pace always improves the predicted marathon time by a proportional amount, increasing the average distance run per week yields diminishing returns. Christof has written a more detailed discussion of the Tanda model.
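To make the diminishing returns concrete, the sketch below compares the predicted gain from adding 10 km per week at different starting volumes with the gain from training 10 s/km faster. It uses Tanda's published coefficients; the helper names and the chosen training profiles are purely illustrative.

```python
import math

MARATHON_KM = 42.195


def tanda_time(weekly_km, training_pace):
    """Predicted marathon time in seconds under the Tanda formula."""
    pace = 17.1 + 140.0 * math.exp(-0.0053 * weekly_km) + 0.55 * training_pace
    return pace * MARATHON_KM


def gain_from_extra_km(weekly_km, training_pace, extra_km=10.0):
    """Seconds saved by running extra_km more per week at the same pace."""
    return tanda_time(weekly_km, training_pace) - tanda_time(weekly_km + extra_km, training_pace)


def gain_from_faster_pace(weekly_km, training_pace, faster_by=10.0):
    """Seconds saved by training faster_by s/km quicker at the same volume."""
    return tanda_time(weekly_km, training_pace) - tanda_time(weekly_km, training_pace - faster_by)


# Illustrative profiles: 40, 80 and 120 km per week at 5:00 per km.
for base_km in (40.0, 80.0, 120.0):
    km_gain = gain_from_extra_km(base_km, 300.0)
    pace_gain = gain_from_faster_pace(base_km, 300.0)
    print(f"{base_km:.0f} km/week: +10 km saves {km_gain:.0f} s, 10 s/km faster saves {pace_gain:.0f} s")
```

The saving from extra distance shrinks as the starting volume grows, because the distance term decays exponentially, whereas the saving from a faster training pace is the same at every volume, because the pace term is linear.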
Running the Tanda prediction model on the men in our data set, we can plot estimated finish time against actual finish time.
As we can see, the Tanda formula systematically underestimates the finish time, and more so for the quicker athletes. This could be due to training which was not logged on Strava, but in my experience athletes, especially faster runners, are very diligent about uploading all of their training data.
The Tanda model was fitted to training from 22 athletes partaking in 46 races, with marathon times between 2:47 and 3:36. Moreover, the athletes had fairly similar training profiles in terms of distance per week and average pace. It is possible that the model performs well on athletes with such training profiles.
The plot below highlights the Strava data that has a training profile within 1, 2 and 3 standard deviations of the sample mean of the Tanda data. I am assuming that the Tanda data is drawn from a multivariate Gaussian with a diagonal covariance matrix.
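One way to implement this selection is sketched below. With a diagonal covariance matrix, the Mahalanobis distance reduces to the Euclidean norm of the per-feature z-scores, and I take "within k standard deviations" to mean that this distance is at most k; a per-feature box test would be an equally plausible reading. The mean and standard deviation values are hypothetical placeholders, not the statistics of Tanda's sample.

```python
import math

# HYPOTHETICAL sample statistics (weekly km, training pace in s/km).
# These are placeholders for illustration, not values from Tanda's paper.
SAMPLE_MEAN = (70.0, 300.0)
SAMPLE_STD = (20.0, 25.0)


def diagonal_mahalanobis(point, mean, std):
    """Mahalanobis distance when the covariance matrix is diagonal."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(point, mean, std)))


def within_k_std(point, k, mean=SAMPLE_MEAN, std=SAMPLE_STD):
    """True if the training profile lies within k standard deviations of the mean."""
    return diagonal_mahalanobis(point, mean, std) <= k


# Each athlete is (weekly km, training pace in s/km).
athletes = [(70.0, 300.0), (90.0, 325.0), (130.0, 240.0)]
for k in (1, 2, 3):
    subset = [a for a in athletes if within_k_std(a, k)]
    print(k, subset)
```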
We can now see how the Tanda model performs on these subsets.
|Data||RMSE (Average Error) (mm:ss)|
3 Std. Devs
2 Std. Devs
1 Std. Devs
We can see the Tanda prediction model performs better on data points similar to the original data, but there is still a lot of unexplained variance, and a bias towards underestimating the performance.
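For reference, the two quantities discussed here can be computed as in the following sketch: the RMSE measures the typical size of the error, while the mean signed error reveals any systematic bias. The function names and the finish times are made up for illustration and are not taken from the data set.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual finish times (seconds)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mean_error(predicted, actual):
    """Mean signed error; negative means predictions are faster than reality on average."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)


def format_mmss(seconds):
    """Format a non-negative duration as mm:ss, rounded to the nearest second."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


# Made-up finish times in seconds, for illustration only.
predicted = [10800.0, 12000.0, 13200.0]
actual = [11400.0, 12300.0, 13200.0]
print("RMSE:", format_mmss(rmse(predicted, actual)))
print("Mean signed error (s):", mean_error(predicted, actual))
```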
It is possible that Tanda has an appropriate model for predicting marathon time, but insufficient data to correctly tune the model. By this I mean a formula of the form
$$P_m = a + b\,e^{-cK} + dP$$
may produce accurate predictions for marathon pace, for suitable values of $a$, $b$, $c$ and $d$. If we fit such a function to our data by least squares, we obtain the following equation (parameters given to 3 s.f.)
We can now plot predictions for this improved Tanda model.
This new model has an improved RMSE (11:21) and an $R^2$ of 0.501, but still systematically underestimates performance for faster runners and overestimates it for slower runners. This suggests that the Tanda model is actually a bad model for the task at hand.
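A least-squares fit of a Tanda-style function (a constant, plus an exponentially decaying term in weekly distance, plus a linear term in training pace) can be sketched as follows. This is not necessarily how the fit above was performed: it is one simple approach that exploits the fact that, for a fixed decay rate c, the model is linear in the remaining parameters, so we can scan c over a grid and solve an ordinary linear least-squares problem at each value. The function name, the grid and the synthetic data (generated without noise from Tanda's published coefficients) are assumptions for illustration.

```python
import numpy as np


def fit_tanda_form(K, P, pace, c_grid):
    """Fit pace ~ a + b*exp(-c*K) + d*P by scanning c and solving for a, b, d."""
    best = None
    for c in c_grid:
        # For fixed c the model is linear in (a, b, d).
        A = np.column_stack([np.ones_like(K), np.exp(-c * K), P])
        coef, *_ = np.linalg.lstsq(A, pace, rcond=None)
        sse = float(np.sum((A @ coef - pace) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], c, coef[2])
    _, a, b, c, d = best
    return a, b, c, d


# Synthetic, noiseless data for illustration only (not the Strava data set).
rng = np.random.default_rng(0)
K = rng.uniform(20.0, 150.0, size=200)    # weekly distance, km
P = rng.uniform(240.0, 400.0, size=200)   # training pace, s/km
pace = 17.1 + 140.0 * np.exp(-0.0053 * K) + 0.55 * P

c_grid = np.linspace(0.001, 0.02, 191)    # step of 0.0001
a, b, c, d = fit_tanda_form(K, P, pace, c_grid)
print(f"a={a:.3g}, b={b:.3g}, c={c:.3g}, d={d:.3g}")
```

On real, noisy data a general nonlinear optimiser would serve equally well; the grid scan is used here only because it needs nothing beyond linear least squares.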
To further illustrate the shortcomings of the Tanda model, we can construct an extremely simple linear model
$$T = \beta_0 + \beta_1 K + \beta_2 S$$
Here $T$ is the predicted marathon finish time, $K$ is the average distance covered per week, $S$ is the average running speed, and $\beta_0$, $\beta_1$ and $\beta_2$ are the coefficients to be fitted. Notice that I am using speed instead of pace, and will continue to do so from here on, as it is a more physically meaningful quantity than pace. I do not claim that this is necessarily the optimal model, but we will see that it still outperforms the Tanda models.
Fitting this linear model to our data produces the following predictions
Even this extremely simple model has a smaller RMSE (11:02) and a greater $R^2$ (0.528) than the Tanda model, further suggesting that the Tanda model is inappropriate for this task. Once again, it is important to note that the black line is not a regression line; it simply helps to illustrate which points (those far from the line) have poor predictions. As an example, the models we have seen so far underestimate performance for the faster athletes and overestimate it for the slower athletes.
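For readers who want to try this themselves, here is a minimal sketch of fitting a linear model of finish time on weekly distance and average speed by ordinary least squares, and then computing the RMSE and $R^2$. The helper names are my own, and the data are synthetic, generated from hypothetical coefficients, so the printed numbers bear no relation to the results quoted above.

```python
import numpy as np


def fit_linear(K, S, T):
    """Ordinary least squares for T ~ intercept + slope_K * K + slope_S * S."""
    A = np.column_stack([np.ones_like(K), K, S])
    coef, *_ = np.linalg.lstsq(A, T, rcond=None)
    return coef


def predict_linear(coef, K, S):
    """Predicted finish time in seconds for the fitted coefficients."""
    return coef[0] + coef[1] * K + coef[2] * S


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data from HYPOTHETICAL coefficients, for illustration only.
rng = np.random.default_rng(0)
K = rng.uniform(20.0, 150.0, size=500)   # weekly distance, km
S = rng.uniform(2.5, 4.5, size=500)      # average speed, m/s
T = 25000.0 - 20.0 * K - 3500.0 * S + rng.normal(0.0, 600.0, size=500)

coef = fit_linear(K, S, T)
pred = predict_linear(coef, K, S)
rmse = float(np.sqrt(np.mean((pred - T) ** 2)))
print("coefficients:", coef)
print("RMSE (s):", rmse, "R^2:", r_squared(T, pred))
```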
In the next post, I will attempt to devise an improved model by analysing the effect of other training factors on marathon performance. At this point, it may be prudent to remark that by its very nature our data set is extremely noisy, and it is unlikely that we will be able to get an extremely accurate model, but I believe we can still do quite well!
 Tanda G. Prediction of marathon performance time on the basis of training indices. J. Hum. Sport Exerc. Vol. 6, No. 3, pp. 511-520, 2011.