There are differing schools of thought when it comes to training for distance running. At one end of the spectrum are the low-volume, high-intensity advocates, who believe that relatively short repetitions at race pace or faster are the path to success; at the other are those who believe that total running volume, or mileage, is the most important training factor, regardless of intensity. While the optimal training load will no doubt fall somewhere between these two extremes, I am of the view that it is better to err on the side of volume rather than intensity.
"Mileage - that's the key. To try and get as much mileage as we can." - Lewis Hamilton
Okay, so Lewis Hamilton may not even compete in the correct sport, but I'm not going to let that dissuade me.
Despite much anecdotal evidence, both for and against, there is limited research available on the effect of high volume on distance running performance. The most detailed paper I found was by Tanda (2011), who successfully predicts marathon finish time as a function of training volume and average training pace. However, the study was performed on a small sample of just 22 runners.
As is the case in many areas, modern times have seen a vast increase in the quantity of data available, so we should be able to use this to investigate our question. Strava, "the social network for athletes", provides information about the training of a huge number of athletes.
I decided to analyse the training of athletes in the build up to the 2015 Leeds Abbey Dash, a 10 kilometer road race. Not only was this one of the largest 10k races in the country, but it doubled as the national championships, ensuring data on all standards of athletes.
Our data set consists of 532 male athletes and 369 female athletes; a further 44 chose not to disclose their gender.
Plotting athletes' finishing times against their average running mileage in the 6 weeks prior to the race results in the following graph:
I'm predominantly interested in the effects on trained runners, so I restrict the sample to those who covered the race distance in under 40:00 for men and 50:00 for women. This leaves us with 154 men and 84 women, and performance in this zone appears to be linearly related to training volume. Performing a linear regression yields the following results:
|Weekly Mileage||Est. Time (Male)||Est. Time (Female)|
This data suggests that each extra mile run per week would produce an improvement of 6.2 seconds over 10k for men, and 9.0 seconds for women. A rather significant improvement, and hopefully enough to convince undecided readers that mileage is an important ingredient in endurance running success!
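For readers who want to try this kind of fit themselves, here is a minimal sketch of an ordinary least-squares line in plain Python. The numbers are made up purely for illustration and are not the Abbey Dash data, and the names `fit_line`, `weekly_miles` and `finish_secs` are placeholders of my own.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustrative values (NOT the Abbey Dash data):
# average weekly miles and 10k finish time in seconds.
weekly_miles = [20, 30, 40, 50, 60]
finish_secs = [2360, 2300, 2240, 2170, 2110]

intercept, slope = fit_line(weekly_miles, finish_secs)
# slope is in seconds per extra weekly mile; a negative value means
# higher mileage goes with faster times in this toy sample.
```

The slope is read as seconds per extra weekly mile. Taking the male estimate of 6.2 seconds per mile at face value, an extra 10 miles a week would correspond to roughly 62 seconds over 10k, provided the runner stays within the range of mileages the fit covers.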
Interestingly, our model suggests that female athletes benefit more from increased mileage than male athletes. I believe this is because the women in our sample are more systematically undertrained than the men: they are running significantly less volume than the men, and have comparatively weaker times when considered as a percentage of the world records.
One should read these results with caution, especially when extrapolating beyond the range of our data set. In addition, our data source inherently has a large amount of noise in it: what people upload may not be a complete representation of their training.
Obviously this is not the complete story: we've taken a simple, one-dimensional measure of an athlete's training and attempted to extrapolate their performance. The way the miles are made up, through average pace and pace variation, will also be very important, as suggested by Tanda. I am not advocating athletes go and run 100 miles a week at the expense of all other facets of their training!
We restricted our sample to consider the faster runners, and by doing so biased our data set towards more talented runners. What information can we learn by studying the entire population? If we plot the logarithm of average mileage against finish time, we obtain a much more linear graph, which suggests we will have a relationship of the form $t = a + b \log m$, where $t$ is finish time, $m$ is average weekly mileage, and $a$ and $b$ are constants. Due to technical considerations detailed below, we run a weighted linear regression to obtain the following results:
|Weekly Mileage||Est. Time (Male)||Est. Time (Female)|
We note that it should be expected that towards the faster end of the spectrum more athletes fall below the regression line, as these athletes are more talented than the average runner.
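One consequence of a logarithmic relationship is worth spelling out. Writing $t$ for finish time, $m$ for average weekly mileage and $a$, $b$ for the fitted constants (symbols chosen here for illustration, and taking natural logarithms, since another base only rescales $b$), a short worked derivation gives:

```latex
t(m) = a + b \ln m
\quad\Longrightarrow\quad
t(2m) - t(m) = b \ln 2,
\qquad
\frac{dt}{dm} = \frac{b}{m}.
```

With $b < 0$, every doubling of mileage is worth the same number of seconds under this model, while the gain from each additional mile shrinks in proportion to $1/m$. In other words, a model of this shape predicts diminishing returns, in contrast to the constant per-mile gain of the straight-line fit on the faster runners.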
To summarise, I hope to have provided evidence that large training volumes are crucial for success at distance running, and to have quantified, up to some degree of uncertainty, what improvements an athlete can expect to see when increasing their training volume. In addition, I hope I have illustrated the huge potential Strava has for training analysis, and I hope to perform a more detailed investigation in the future.
I would be very interested to hear opinions on this from both physiologists and statisticians as my knowledge in both areas is somewhat rudimentary.
Reference: Tanda, G. Prediction of marathon performance time on the basis of training indices. Journal of Human Sport and Exercise, 6 Oct 2011.
For a more in depth discussion of how the data was obtained and analysed, see below:
Strava exposes an API, but it is somewhat limited in what can be accessed. While you can access segment (race) times, it isn't possible to obtain training information from anyone other than yourself. However, when navigating through the web interface while logged in, it is possible to see other athletes' training information, albeit with varying levels of detail depending on the users' privacy settings.
We proceed as follows: First, use stravalib, a Python wrapper for the REST API. From this, we can pull off all race times from the 2015 Leeds Abbey Dash, along with the corresponding athlete ids.
Using this list of athlete ids, we can use scrapy, with suitable HTTP headers to convince Strava we are logged in, to trawl the pages for these athletes. On each of these pages we see the following graph:
This graph shows the weekly mileage covered by athletes for the last year, and we can extract this data by parsing the HTML. Unfortunately, however, the graph includes mileage from all sports. For athletes with lax privacy settings we can drill down into each week and see exactly what constituted these miles. However, for a simpler, one-size-fits-all strategy, we make the simplifying assumption that athletes use Strava for just cycling and running, and we extract the total all-time distance run and cycled from the side bar, which is obtained by an XHR request. We then ignore all athletes whose cycled distance is more than 20% of their distance run, as cycling would otherwise make a large contribution to their mileage.
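The cycling filter amounts to a single ratio test. Below is a minimal sketch of it, assuming the all-time totals have already been extracted into a list of records. The field names (`run_km`, `ride_km`), the helper `is_mostly_runner` and the sample records are all hypothetical, not taken from the original script.

```python
def is_mostly_runner(run_distance, ride_distance, max_ride_ratio=0.2):
    """True if all-time cycling distance is at most 20% of running distance."""
    if run_distance <= 0:
        # No recorded running at all: nothing to analyse for this athlete.
        return False
    return ride_distance / run_distance <= max_ride_ratio


# Made-up records for illustration only.
athletes = [
    {"athlete_id": 1, "run_km": 5000.0, "ride_km": 400.0},   # ratio 0.08, kept
    {"athlete_id": 2, "run_km": 3000.0, "ride_km": 2500.0},  # ratio 0.83, dropped
    {"athlete_id": 3, "run_km": 0.0, "ride_km": 1200.0},     # no running, dropped
]

kept_ids = [
    a["athlete_id"]
    for a in athletes
    if is_mostly_runner(a["run_km"], a["ride_km"])
]
```

Note that the test uses all-time totals, so an athlete who cycled heavily only in the weeks before the race could still slip through. That is one source of the noise mentioned earlier.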
Finally, to ease the data analysis, we run a simple Python script which computes the average mileage in the build-up to the race for various time periods, and performs an SQL-style join of this data with the race times on the athlete ids.
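A rough sketch of that last step, using only the standard library, might look as follows. The data layout (a dict of week-start dates to miles per athlete), the names `average_weekly_miles`, `training` and `race_times`, and the `RACE_DAY` value are assumptions made for illustration; this is not the original script.

```python
from datetime import date, timedelta

# Placeholder race date for illustration; substitute the actual date.
RACE_DAY = date(2015, 11, 15)


def average_weekly_miles(weekly_miles, race_day, n_weeks):
    """Average miles per week over the n_weeks before race_day.

    weekly_miles maps each week's start date to the miles logged that week.
    Weeks missing from the mapping count as zero miles (an assumption).
    """
    window_start = race_day - timedelta(weeks=n_weeks)
    total = sum(
        miles
        for week_start, miles in weekly_miles.items()
        if window_start <= week_start < race_day
    )
    return total / n_weeks


# Made-up training logs keyed by athlete id.
training = {
    101: {
        date(2015, 9, 28): 99.0,  # falls outside the 6-week window
        date(2015, 10, 5): 30.0, date(2015, 10, 12): 35.0,
        date(2015, 10, 19): 40.0, date(2015, 10, 26): 40.0,
        date(2015, 11, 2): 35.0, date(2015, 11, 9): 30.0,
    },
    102: {date(2015, 11, 2): 12.0},  # has training data but no race time
}
race_times = {101: 2150, 103: 2400}  # athlete id -> finish time in seconds

# Inner join on athlete id: keep only athletes present in both tables.
joined = [
    {
        "athlete_id": aid,
        "finish_secs": race_times[aid],
        "avg_miles_6wk": average_weekly_miles(weeks, RACE_DAY, 6),
    }
    for aid, weeks in sorted(training.items())
    if aid in race_times
]
```

Calling `average_weekly_miles` with different values of `n_weeks` gives the "various time periods" without touching the join.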
Code can be found here.
I chose to use R to study the resulting data set. Our initial least squares regression runs smoothly: the residuals appear to be normally distributed with constant variance, and the QQ plots support this.
When dealing with regressions of the form $t = a + b \log m$, with $t$ the finish time and $m$ the average weekly mileage, we experience heteroscedasticity, as illustrated in the residual plot below:
To cope with this we perform a weighted linear regression with weights
The R workbook can be found here.
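For anyone who would rather experiment in Python than R, a weighted least-squares fit of finish time against log mileage can be sketched in a few lines. Two loud caveats: the data below are made up, and the weights (proportional to weekly mileage, i.e. assuming the residual variance shrinks as mileage grows) are a guess for illustration only, not the weights used in the analysis above. The function name `weighted_fit` is likewise my own.

```python
import math


def weighted_fit(xs, ys, ws):
    """Weighted least squares for y = a + b * x.

    Minimises sum(w * (y - a - b * x) ** 2) and returns (a, b).
    """
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Made-up illustrative values (NOT the Abbey Dash data).
weekly_miles = [5, 10, 20, 40, 80]
finish_secs = [3700, 3150, 2820, 2390, 2010]

# Regress finish time on the natural log of weekly mileage.
log_miles = [math.log(m) for m in weekly_miles]

# ASSUMPTION: weights proportional to mileage; purely illustrative.
weights = [float(m) for m in weekly_miles]

a, b = weighted_fit(log_miles, finish_secs, weights)
```

Only the relative sizes of the weights matter: multiplying them all by a constant leaves `a` and `b` unchanged. Setting every weight to 1 recovers the ordinary unweighted fit, which makes it easy to see how much the weighting moves the line.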