Want to run faster? Run more!

There are differing schools of thought when it comes to training for distance running. At one end of the spectrum are the low volume, high intensity advocates, who believe that relatively short repetitions at race pace or faster are the path to success. At the other end are those who believe that total running volume, or mileage, is the most important training factor, regardless of intensity. While the optimal training load will no doubt fall somewhere between these two extremes, I am of the view that it is better to err on the side of volume rather than intensity.

"Mileage - that's the key. To try and get as much mileage as we can." - Lewis Hamilton

Okay, so Lewis Hamilton may not even compete in the correct sport, but I'm not going to let that dissuade me.

Despite much anecdotal evidence, both for and against, there is limited research available on the effect of high volume on distance running performance. The most detailed paper I found was by Tanda (2011), who successfully predicts marathon finish time as a function of training volume and average training pace. However, the study was performed on a small sample of just 22 runners.

As is the case in many areas, modern times have seen a vast increase in the quantity of data available, and so we should be able to use this to investigate our question. Strava, "the social network for athletes", provides information about the training of a huge number of athletes.

I decided to analyse the training of athletes in the build-up to the 2015 Leeds Abbey Dash, a 10 kilometer road race. Not only was this one of the largest 10k races in the country, but it doubled as the national championships, ensuring data on athletes of all standards.

Our data set consists of 532 male athletes and 369 female athletes; a further 44 chose not to disclose their gender.

Plotting athletes' finishing times against their average weekly running mileage in the 6 weeks prior to the race results in the following graph:

All runners at Leeds Abbey Dash 2015

I'm predominantly interested in the effects on trained runners, so I restrict the sample to those who covered the race distance in under 40:00 for men and under 50:00 for women. This leaves us with 154 men and 84 women, and performance in this zone appears to be linearly related to training volume. Performing a linear regression yields the following results:

Sub 40:00 men and sub 50:00 women at the Leeds Abbey Dash 2015

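Time_{men} = 2380.682 - 6.217 \times mileage
Time_{women} = 2896.784 - 9.061 \times mileage

Here Time is the 10k finish time in seconds and mileage is the average weekly mileage in miles.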

Weekly mileage (miles) | Est. time, men (mm:ss) | Est. time, women (mm:ss)
0 | 39:41 | 48:17
10 | 38:39 | 46:46
20 | 37:36 | 45:16
30 | 36:34 | 43:45
40 | 35:32 | 42:14
50 | 34:30 | 40:44
60 | 33:28 | 39:13
70 | 32:25 | 37:43
80 | 31:23 | 36:12
90 | 30:21 | 34:41
100 | 29:19 | 33:11

This data suggests that each extra mile run per week would produce an improvement of 6.2 seconds over 10k for men, and 9.1 seconds for women. A rather significant improvement, and hopefully enough to convince undecided readers that mileage is an important ingredient in endurance running success!
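As a quick check of the table above, the fitted lines can be evaluated directly. The following is a minimal Python sketch: the function and variable names are my own, and the coefficients are the ones quoted above, with time in seconds and mileage in miles per week.

```python
# Coefficients (intercept, slope) of the linear fits quoted above.
# Times are in seconds, mileage is in miles per week.
LINEAR_FIT = {
    "men": (2380.682, -6.217),
    "women": (2896.784, -9.061),
}


def predict_seconds(group, weekly_mileage):
    """Estimated 10k time in seconds under the linear model."""
    intercept, slope = LINEAR_FIT[group]
    return intercept + slope * weekly_mileage


def format_time(seconds):
    """Format a number of seconds as m:ss, rounded to the nearest second."""
    total = int(seconds + 0.5)  # round half up
    minutes, secs = divmod(total, 60)
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    # Print one row per 10 miles per week, in the same layout as the table.
    for mileage in range(0, 101, 10):
        men = format_time(predict_seconds("men", mileage))
        women = format_time(predict_seconds("women", mileage))
        print(f"{mileage:>3}  {men}  {women}")
```

Bear in mind that the fit only used sub 40:00 men and sub 50:00 women, so the rows at very low or very high mileage are extrapolations.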

Interestingly, our model suggests that female athletes benefit more from increased mileage than male athletes. I believe this is because the women in our sample are more systematically undertrained than the men: they are running significantly less volume than the men, and have comparatively weaker times when considered as a percentage of the world records.

One should read these results with caution, especially when extrapolating beyond the range of our data set. In addition, our data source inherently has a large amount of noise in it: what people upload may not be a complete representation of their training.

Obviously this is not the complete story: we've taken a simple, one-dimensional measure of an athlete's training and attempted to predict their performance from it. The way the miles are made up, through average pace and pace variation, will also be very important, as suggested by Tanda. I am not advocating athletes go and run 100 miles a week at the expense of all other facets of their training!

We restricted our sample to consider the faster runners, and by doing so biased our data set towards more talented runners. What information can we learn by studying the entire population? If we plot the logarithm of average mileage against finish time, we obtain a much more linear graph, which suggests we will have a relationship of the form time = \alpha + \beta \times log(mileage). Due to technical considerations detailed below, we run a weighted linear regression to obtain the following results:

All runners at Leeds Abbey Dash 2015

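Time_{men} = 4125.2 - 510.4 \times log(mileage)
Time_{women} = 5010.6 - 652.3 \times log(mileage)

Here log is the natural logarithm, Time is in seconds and mileage is the average weekly mileage in miles.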

Weekly mileage (miles) | Est. time, men (mm:ss) | Est. time, women (mm:ss)
10 | 49:10 | 58:29
20 | 43:16 | 50:57
30 | 39:49 | 46:32
40 | 37:23 | 43:24
50 | 35:29 | 40:59
60 | 33:56 | 39:00
70 | 32:37 | 37:19
80 | 31:29 | 35:52
90 | 30:29 | 34:35
100 | 29:35 | 33:26
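The logarithmic model implies diminishing returns: the same ten extra miles per week are worth far more to a low mileage runner than to a high mileage one. Here is a small Python sketch of that calculation. The names are my own, the coefficients are the rounded ones quoted above, and I am assuming the natural logarithm, which agrees with the table to within about a second.

```python
import math

# Coefficients (alpha, beta) of time = alpha + beta * log(mileage),
# as quoted above. Time in seconds, mileage in miles per week.
LOG_FIT = {
    "men": (4125.2, -510.4),
    "women": (5010.6, -652.3),
}


def predict_seconds(group, weekly_mileage):
    """Estimated 10k time in seconds under the logarithmic model."""
    if weekly_mileage <= 0:
        raise ValueError("the log model needs a positive weekly mileage")
    alpha, beta = LOG_FIT[group]
    return alpha + beta * math.log(weekly_mileage)


def seconds_saved(group, current_mileage, extra_miles):
    """Predicted improvement from adding extra_miles per week."""
    before = predict_seconds(group, current_mileage)
    after = predict_seconds(group, current_mileage + extra_miles)
    return before - after


if __name__ == "__main__":
    for start in (20, 50, 80):
        saved = seconds_saved("men", start, 10)
        print(f"men, {start} -> {start + 10} miles/week: {saved:.0f} s faster")
```

Because the model is linear in log(mileage), the predicted gain depends only on the ratio of new to old mileage, so going from 20 to 30 miles is worth the same as going from 60 to 90.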

We note that it should be expected that towards the faster end of the spectrum more athletes fall below the regression line, as these athletes are more talented than the average runner.

To summarise, I hope to have provided evidence that large training volumes are crucial for success at distance running, and to have quantified, up to some degree of uncertainty, what improvements an athlete can expect to see when increasing their training volume. In addition, I hope I have illustrated the huge potential Strava has for training analysis, and I hope to perform a more detailed investigation in the future.

I would be very interested to hear opinions on this from both physiologists and statisticians as my knowledge in both areas is somewhat rudimentary.

Reference: Tanda, G. Prediction of marathon performance time on the basis of training indices. Journal of Human Sport and Exercise, 6 Oct 2011.

For a more in-depth discussion of how the data was obtained and analysed, see below:

Data Mining

Strava exposes an API, but it is somewhat limited in what can be accessed. While you can access segment (race) times, it isn't possible to obtain training information from anyone other than yourself. However, navigating through the web interface when logged in, it is possible to see other athletes' training information, albeit with varying levels of detail depending on the users' privacy settings.

We proceed as follows: First, use stravalib, a Python wrapper for the REST API. From this, we can pull off all race times from the 2015 Leeds Abbey Dash, along with the corresponding athlete ids.

Using this list of athlete ids, we can use scrapy, with suitable HTTP headers to convince Strava we are logged in, to trawl the pages for these athletes. On each of these pages we see the following graph:

An example Strava mileage graph

This graph shows the weekly mileage covered by each athlete over the last year, and we can extract this data by parsing the HTML. Unfortunately, however, the graph includes mileage from all sports. For athletes with lax privacy settings we can drill down into each week and see exactly what constituted these miles. However, for a simpler, one-size-fits-all strategy, we make the simplifying assumption that athletes use Strava for just cycling and running, and we extract the total all-time distance run and cycled from the side bar, which is obtained by an XHR request. We then ignore all athletes who have cycled more than 20% of their distance run, as cycling would make a large contribution to their recorded mileage.
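The cycling filter described above can be sketched in a few lines of Python. This is an illustration rather than the original script: the function name, the field names and the sample figures are all made up, and only the 20% threshold comes from the text.

```python
def is_mostly_runner(run_distance, ride_distance, max_ride_fraction=0.2):
    """Return True if all-time cycled distance is at most 20% of distance run.

    Both distances must be in the same unit. Athletes with no recorded
    running are rejected, since the ratio is meaningless for them.
    """
    if run_distance <= 0:
        return False
    return ride_distance <= max_ride_fraction * run_distance


def filter_runners(athletes):
    """Keep only the athletes whose mileage graph is dominated by running."""
    return [a for a in athletes if is_mostly_runner(a["run"], a["ride"])]


if __name__ == "__main__":
    # Hypothetical all-time totals, as might be read from the side bar.
    sample = [
        {"athlete_id": 1, "run": 5000.0, "ride": 300.0},
        {"athlete_id": 2, "run": 1200.0, "ride": 4000.0},
        {"athlete_id": 3, "run": 800.0, "ride": 100.0},
    ]
    print([a["athlete_id"] for a in filter_runners(sample)])
```

Note that the filter uses all-time totals, so an athlete who cycled a lot only during the build-up to the race would still slip through.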

Finally, to ease the data analysis, we run a simple Python script which computes the average mileage in the build-up to the race for various time periods, and performs an SQL-style join of this data with the race times on the athlete ids.
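A minimal sketch of this step is given below. It is not the original script: the data layout (a dict of race times and a dict of weekly mileage lists, each list ordered oldest to newest and ending with the week before the race) and all names are assumptions made for illustration.

```python
def average_mileage(weekly_miles, weeks):
    """Mean weekly mileage over the last `weeks` weeks before the race.

    Returns None if fewer than `weeks` weeks of data are available.
    """
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    if len(weekly_miles) < weeks:
        return None
    return sum(weekly_miles[-weeks:]) / weeks


def join_on_athlete(race_times, mileage_by_athlete, weeks=6):
    """Inner join of race times and average mileage on the athlete id.

    Returns a list of (athlete_id, finish_seconds, average_mileage) rows.
    Athletes missing from either source, or with too little history,
    are dropped.
    """
    rows = []
    for athlete_id, finish_seconds in race_times.items():
        weekly = mileage_by_athlete.get(athlete_id)
        if weekly is None:
            continue
        avg = average_mileage(weekly, weeks)
        if avg is None:
            continue
        rows.append((athlete_id, finish_seconds, avg))
    return rows


if __name__ == "__main__":
    # Hypothetical athletes: finish time in seconds, mileage in miles per week.
    times = {101: 2100, 102: 2700, 103: 3000}
    mileage = {101: [40, 45, 50, 55, 60, 50, 30], 102: [10, 12]}
    print(join_on_athlete(times, mileage, weeks=6))
```

Calling `join_on_athlete` with different values of `weeks` gives the averages for the various time periods.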

Code can be found here.

Data Analysis

I chose to use R to study the resulting data set. Our initial least squares regression runs smoothly: the residuals appear to be normally distributed with constant variance, and the QQ plots support this.

When dealing with regressions of the form time = \alpha + \beta \times log(mileage), we encounter heteroscedasticity, as illustrated in the residual plot below:

Heteroscedasticity in the residuals

To cope with this, we perform a weighted linear regression with weights log(mileage)^{-1}.
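For readers who don't use R, here is a rough Python sketch of the same idea, using the closed-form solution for a weighted least squares line. It is not the code from the workbook: the function names are my own, and I am assuming natural logarithms and weekly mileage above 1 mile, so that the weights log(mileage)^{-1} are positive.

```python
import math


def weighted_linear_fit(x, y, w):
    """Fit y = a + b * x by minimising sum(w_i * (y_i - a - b * x_i)**2).

    Returns the pair (a, b).
    """
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def fit_log_mileage(mileage, times):
    """Weighted fit of time = alpha + beta * log(mileage).

    Each observation gets weight 1 / log(mileage), so every mileage
    must be greater than 1.
    """
    if any(m <= 1 for m in mileage):
        raise ValueError("mileage must exceed 1 for positive weights")
    x = [math.log(m) for m in mileage]
    w = [1.0 / xi for xi in x]
    return weighted_linear_fit(x, times, w)


if __name__ == "__main__":
    # Synthetic data lying exactly on time = 4000 - 500 * log(mileage).
    miles = [5, 10, 20, 40, 80]
    secs = [4000 - 500 * math.log(m) for m in miles]
    print(fit_log_mileage(miles, secs))
```

Passing equal weights to `weighted_linear_fit` reduces it to an ordinary least squares fit.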

The R workbook can be found here.

15 Replies to “Want to run faster? Run more!”

  1. Great analysis, especially given the difficulties in getting hold of decent data. There's clearly a correlation, but maybe not causation, as it's plausible that faster runners, through e.g. increased commitment to their sport, choose to run more often and for further distances, but in less time in comparison to less experienced runners.
    Having said that, looking at elite endurance runners and their 120+ miles per week shows how the evolution of training protocols has concluded that extra mileage is needed in order to run fast (but not to the detriment of the quality speed sessions in the plan).

    1. Yes, good points made here. I suspect that given the lack of elite endurance runners (as far as I know) on relatively low mileage programmes, that it wouldn't be too bold to conclude the high mileage is essential. Bernard Lagat is typically the example given, running around 60 miles per week, but he has experienced most success at the shorter distances of 1500m and 5000m.

  2. what if naturally superior runners choose to run larger volumes
    or if the ability to sustain high training volume is related to natural ability

    1. Yes, the first question would be a typical example of correlation vs. causation. I certainly would answer the second affirmatively, in my experience the ability to handle high training loads is a characteristic of more elite runners.

  3. Interesting analysis. Have you seen/read Bounce? Well worth a look at trying to explain why some people excel whilst others only achieve a high standard.

  4. Really good and interesting article Will!!

    I think I might actually be persuaded, and I have always been someone who thinks mileage is overrated. I think if I was going to train for >5k, I would take this into account.
    Did you get a P-value or anything similar?
    And how many of the people that finished within your time limits were not on Strava?
    Could you do a graph of men running 50 miles or more?

    I really like your idea of using strava more for these kind of studies as you are right that atm most of the ones out there are a bit rubbish.

  5. I love the article and have shared it. I am however very interested in HIT and believe that the missing parameter here is understanding what benefit a lower mileage group would have if they performed their training at very high intensities rather than just doing less training overall.
    In basic terms, a non runner who runs a marathon will most likely be a better runner by the end, but the principle of 'junk miles' has more credibility than this information suggests.
    I would hypothesise that those with higher overall mileage are achieving the distances with interval sessions and hill reps etc. alongside long runs, while the lower mileage counterparts are more likely to be just running distance.

    Food for thought.

    Still love the article and am interested to see more of your work as it evolves.

    Would you be interested in a collaborative study using some participants training for a marathon to test both our theories?

    1. Hi Paul, thanks for the share and your comments. I agree with your hypothesis that the majority of higher mileage runners are no doubt running more sophisticated training sessions; however, I believe it unlikely that the majority of the lower mileage runners are not doing such training. While I have no data to back this up, in my experience interacting with athletes, many athletes run less mileage under the guise of "saving themselves for hard sessions". However, these athletes also tend to be predominantly shorter distance runners and so need a different style of training.

      Certainly worth investigating further.

      I would be interested in some form of study to test these theories, although I'm not sure how much time I'll be able to commit, I'll drop you an email some point in the week.

  6. This is a great post. In case, you've not seen it, this website uses the Tanda data to predict marathon time off mileage:

  7. Would you be willing to make the dataset available for download? I'd like to replicate your analysis and do some more work.
