There have been numerous attempts to predict marathon performance from physiological measurements and training data; Dr. Christof Schwiening gives a good overview of these [1]. However, many of these models were fitted to very small data samples and do not generalise well to new data.

Now that we have access to a large data set of marathon runners and their training, we can test these models and quantify their predictive power. Beyond age and gender, however, we have no physiological measurements, so we are restricted to models based on training data. As discussed by Christof, the only such model that stands up to more than basic scrutiny is Tanda's prediction formula [2]. Unfortunately, as we will see, this model fails to generalise outside its very small training data set.

The Tanda model predicts male marathon time by the following formula.

$$P_m = 17.1 + 140.0\,e^{-0.0053K} + 0.55P$$

where $P_m$ is the predicted marathon pace in seconds per kilometre, $K$ is the average distance run per week in kilometres, and $P$ is the average training pace in seconds per kilometre. The training block used to obtain $K$ and $P$ is the 8 weeks prior to the week of the race (i.e. from 9 weeks before race day until 1 week before race day). As an example, an athlete running 60 miles per week at an average pace of 7:30 per mile would be expected to run 2:59:11. A calculator for predicting marathon time based on these statistics is available at http://www.paceguru.co.uk/.
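The worked example can be reproduced with a few lines of Python. This is a minimal sketch, assuming the coefficients of Tanda's published formula [2] (17.1, 140.0, 0.0053 and 0.55) and a marathon distance of 42.195 km; the function names are my own and are not taken from any existing calculator.

```python
import math

MARATHON_KM = 42.195
KM_PER_MILE = 1.609344


def tanda_marathon_pace(weekly_km, training_pace_s_per_km):
    """Predicted marathon pace (s/km), assuming Tanda's published coefficients."""
    return 17.1 + 140.0 * math.exp(-0.0053 * weekly_km) + 0.55 * training_pace_s_per_km


def tanda_marathon_time(weekly_km, training_pace_s_per_km):
    """Predicted marathon finish time in seconds."""
    return MARATHON_KM * tanda_marathon_pace(weekly_km, training_pace_s_per_km)


def format_hms(seconds):
    """Format a duration as h:mm:ss, truncating fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Worked example from the text: 60 miles per week at 7:30 per mile.
weekly_km = 60 * KM_PER_MILE
pace_s_per_km = (7 * 60 + 30) / KM_PER_MILE
print(format_hms(tanda_marathon_time(weekly_km, pace_s_per_km)))
```

Converting the inputs to kilometres before applying the formula matters, since the coefficients are only meaningful in seconds per kilometre and kilometres per week.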

As we can see from the equation, the model predicts faster marathon times for those who run more and for those who run faster. However, while a faster average training pace always improves the predicted marathon time, increasing the average weekly distance yields diminishing returns. Christof has written a more detailed discussion of the Tanda model [3].
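To make the diminishing returns concrete, here is a short derivation. It assumes the published form of Tanda's formula [2], $P_m = 17.1 + 140.0\,e^{-0.0053K} + 0.55P$, with $P_m$ the marathon pace and $P$ the training pace in seconds per kilometre, and $K$ the weekly distance in kilometres; the symbol names are my own labels.

$$\frac{\partial P_m}{\partial P} = 0.55, \qquad \frac{\partial P_m}{\partial K} = -140.0 \times 0.0053\,e^{-0.0053K} = -0.742\,e^{-0.0053K}$$

The first derivative is constant: each 1 s/km improvement in training pace is worth 0.55 s/km of marathon pace, whatever the athlete's level. The second shrinks in magnitude as $K$ grows. At $K = 50$ it is about $-0.57$ s/km per extra weekly kilometre (roughly 24 seconds of finish time), while at $K = 150$ it is about $-0.34$ s/km (roughly 14 seconds).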

Running the Tanda prediction model on the men in our data set, we can plot estimated finish time against actual finish time.

As we can see, the Tanda formula systematically underestimates the finish time, and more so for the quicker athletes. This could be due to training that was not logged on Strava but, in my experience, athletes (especially faster runners) are diligent about uploading all of their training data.

The Tanda model was fitted to training from 22 athletes partaking in 46 races, with marathon times between 2:47 and 3:36. Moreover, the athletes had fairly similar training profiles in terms of distance per week and average pace. It is possible that the model performs well on athletes with such training profiles.

The plot below highlights the Strava data whose training profile lies within 1, 2 and 3 standard deviations of the sample mean of the Tanda data. I am assuming that the Tanda data are drawn from a multivariate Gaussian with a diagonal covariance matrix.
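One way to carry out this selection is sketched below. I am assuming here that "within k standard deviations" means a Mahalanobis distance of at most k, which for a diagonal covariance matrix is just the Euclidean norm of the per-feature z-scores. The means, standard deviations and training profiles in the example are hypothetical placeholders, not values from Tanda's sample or from Strava.

```python
import math


def within_k_std(point, mean, std, k):
    """True if `point` lies within k standard deviations of `mean`.

    With a diagonal covariance matrix the Mahalanobis distance reduces to
    the Euclidean norm of the per-feature z-scores.
    """
    z_scores = [(x - m) / s for x, m, s in zip(point, mean, std)]
    return math.sqrt(sum(z * z for z in z_scores)) <= k


# Hypothetical summary statistics for (weekly km, training pace in s/km);
# placeholders only, not the values from Tanda's sample.
sample_mean = (70.0, 300.0)
sample_std = (20.0, 30.0)

# Hypothetical training profiles to be filtered.
profiles = [(75.0, 310.0), (110.0, 300.0), (40.0, 380.0)]

for k in (1, 2, 3):
    subset = [p for p in profiles if within_k_std(p, sample_mean, sample_std, k)]
    print(k, subset)
```

An alternative reading is a per-feature box, where each z-score must individually be at most k; that choice would keep slightly more points in the corners of the region.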

We can now see how the Tanda model performs on these subsets.

Data | RMSE (Average Error) (mm:ss) |
---|---|
All | 14:34 |
3 Std. Devs | 13:06 |
2 Std. Devs | 12:30 |
1 Std. Dev | 11:57 |

We can see that the Tanda prediction model performs better on data points similar to the original data, but there is still a lot of unexplained variance, and a bias towards underestimating finish time.
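For reference, the error figures above can be computed along the following lines. This is a minimal sketch: RMSE measures the typical size of the error, while the mean signed error is what reveals a bias. The finish times in the example are hypothetical and are not taken from the Strava data.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mean_error(predicted, actual):
    """Mean signed error; a negative value means finish times are underestimated."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)


def format_mmss(seconds):
    """Format a non-negative duration in seconds as mm:ss."""
    total = round(seconds)
    return f"{total // 60}:{total % 60:02d}"


# Hypothetical finish times in seconds (not real data).
predicted = [10500.0, 11800.0, 13000.0]
actual = [10800.0, 12600.0, 13900.0]

print("RMSE:", format_mmss(rmse(predicted, actual)))
print("Mean error (s):", mean_error(predicted, actual))
```

A model can have a small mean error and a large RMSE (unbiased but noisy), or the reverse, so it is worth looking at both.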

It is possible that Tanda has an appropriate model for predicting marathon time, but insufficient data to tune it correctly. By this I mean that a formula of the form

$$P_m = a + b\,e^{-cK} + dP$$

(with $P_m$ the marathon pace, $K$ the average weekly distance and $P$ the average training pace) may produce accurate predictions for marathon pace, for suitable values of $a$, $b$, $c$ and $d$. If we fit such a function to our data by minimising least squares, we obtain the following equation (parameters given to 3 s.f.)

We can now plot predictions for this improved Tanda model.

This new model has an improved RMSE (11:21) and an $R^2$ of 0.501, but it still systematically underestimates performance for faster runners and overestimates it for slower runners. This suggests that the Tanda model is actually a bad model for the task at hand.
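A least-squares fit of a Tanda-style formula can be sketched as follows. This is not necessarily how the fit above was carried out; it is one simple approach, assuming the form "constant + b·exp(−c·K) + d·P" with parameter names a, b, c, d chosen by me. For a fixed decay rate c the model is linear in the other three parameters, so the sketch scans candidate values of c and solves an ordinary least-squares problem for each. The data are synthetic, generated from Tanda's published coefficients purely to check that the procedure recovers them.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_given_rate(weekly_km, train_pace, race_pace, c):
    """For a fixed decay rate c, fit a, b, d via the normal equations."""
    rows = [[1.0, math.exp(-c * k), p] for k, p in zip(weekly_km, train_pace)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, race_pace)) for i in range(3)]
    a, b, d = solve_linear_system(xtx, xty)
    sse = sum((a + b * r[1] + d * r[2] - y) ** 2 for r, y in zip(rows, race_pace))
    return sse, (a, b, c, d)


def fit_tanda_form(weekly_km, train_pace, race_pace, rates):
    """Scan candidate decay rates; keep the fit with the smallest squared error."""
    return min(fit_given_rate(weekly_km, train_pace, race_pace, c) for c in rates)[1]


# Synthetic, noise-free data generated from the published coefficients.
weekly_km = [30.0, 50.0, 70.0, 90.0, 110.0, 130.0]
train_pace = [330.0, 290.0, 310.0, 270.0, 300.0, 260.0]
race_pace = [17.1 + 140.0 * math.exp(-0.0053 * k) + 0.55 * p
             for k, p in zip(weekly_km, train_pace)]

rates = [i / 10000 for i in range(10, 201)]  # candidate c from 0.001 to 0.02
a, b, c, d = fit_tanda_form(weekly_km, train_pace, race_pace, rates)
print(a, b, c, d)
```

On real, noisy data a general-purpose optimiser such as `scipy.optimize.curve_fit` would be a more natural choice; the grid scan is used here only to keep the sketch dependency-free.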

To further illustrate the shortcomings of the Tanda model, we can construct an extremely simple linear model

$$T = \alpha + \beta D + \gamma S$$

Here $T$ is the predicted marathon finish time, $D$ is the average distance covered per week, and $S$ is the average running speed. Note that I am using speed instead of pace, and will continue to do so from here on, as it is a more physically meaningful quantity. I do not claim that this is necessarily the optimal model, but we will see that it still outperforms the Tanda models.

Fitting this linear model to our data produces the following predictions.

Even this extremely simple model has a smaller RMSE (11:02) and a greater $R^2$ (0.528) than either Tanda model, further suggesting that the Tanda model is inappropriate for this task. Once again, it is important to note that the black line is not a regression line; it simply helps to illustrate which points (those far from the line) have poor predictions. As an example, the models we have seen so far underestimate performance for the faster athletes and overestimate it for the slower athletes.
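The second statistic used to compare the models can be computed as below. I am assuming it is the usual coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$; the numbers in the example are hypothetical.

```python
def r_squared(predicted, actual):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical finish times in seconds (not real data).
actual = [10800.0, 12600.0, 13900.0]
perfect = list(actual)
mean_only = [sum(actual) / len(actual)] * len(actual)

print(r_squared(perfect, actual))    # a perfect model scores 1
print(r_squared(mean_only, actual))  # always predicting the mean scores 0
```

A value around 0.5 therefore means the model explains roughly half of the variance in finish times, and a model that predicts worse than the mean would score below zero.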

In the next post, I will attempt to devise an improved model by analysing the effect of other training factors on marathon performance. At this point, it may be prudent to remark that by its very nature our data set is extremely noisy, and it is unlikely that we will be able to get an extremely accurate model, but I believe we can still do quite well!

### References:

[1] http://christofschwiening.blogspot.co.uk/2016/01/predicting-marathon-performance-from.html

[2] Tanda G. *Prediction of marathon performance time on the basis of training indices.* J. Hum. Sport Exerc. Vol. 6, No. 3, pp. 511-520, 2011.

[3] http://christofschwiening.blogspot.co.uk/2016/01/tanda-2011-viewpoint.html

Dear Will,

your analysis is very interesting.

I am actually testing my model in the field from 2h10min to 2h45min (a range not covered by my original sample group), and to do this I am using the Strava database. I was very selective in choosing the athletes: I considered only runners with a fairly uniform mass of work during the observed 8 weeks, without other activities such as skiing, cycling or swimming. Where feasible, I contact the runners in person to ask whether all workouts were uploaded. For carefully checked athletes, the prediction is still accurate down to 2h10min (this race time was obtained by Orlando Pizzolato over 30 years ago, who recently put his training data on the web).

At the same time, for some people randomly taken from Strava, my prediction seems to be underestimated by 10-20 min for race times around 2h30min, which is too much in my experience. I think that these athletes did not upload all of their workouts to Strava, for instance workouts done on the track, where a GPS device is of little use.

Since my 2011 study, I have accumulated about 100 training records (30 of them from Strava, for fast runners), and I am continuing to store and process data. If you are interested, I can send you my recent developments.

Yours sincerely

Giovanni Tanda

Hi Giovanni,

I'm glad you liked the post. I've spoken to Christof, who is also interested in your work, and he pointed out that I have been somewhat hasty in my criticism of your model. There are two main faults with my current approach, which I hope to remedy. The first is that, as you have pointed out, it is possible that not all training data has been logged on Strava. The second is that I haven't taken into account that some people paced their marathon better (i.e. more evenly) than others.

I'm very interested to hear what you are working on now; I'll drop you an e-mail.

Thanks

Will