Marathon Data Analysis Part 1: Initial Thoughts

The marathon season is well and truly upon us. Whether you are recovering from the Boston Marathon or one of the 40 000 gearing up for London this weekend, it is likely your running shoes are beginning to look somewhat the worse for wear.

Marathon running is not incomprehensibly complicated, and in the age of data it is surprising that no extensive study of the factors affecting performance has been carried out. Strava, a social network for athletes, collates detailed training data from a large number of athletes, and while it provides a summary of some of this data [1], the insights that can be gained from that summary are limited. The Guardian published a brief list of results that can be obtained from this data [2].

Knowing which training factors affect performance, and to what extent they matter, is useful for two main reasons. First, it allows objective, scientific design of training schedules that produce the best possible performance for an athlete, subject to constraints on training time and the need to minimise injury risk. Second, if we can form an accurate prediction of an athlete's performance, we can decide on an appropriate pacing strategy for race day and limit the chance of hitting the dreaded 'wall' [3].

Using a web crawler, I have collected the training data of all athletes on Strava who completed the 2016 London Marathon in under 3:30, covering the 16 weeks prior to the race. After filtering to remove clearly false data and athletes who only upload to Strava sporadically, we can already begin to gain some interesting insights.

Scatter plots of training variables against marathon performance.

Unsurprisingly, those who record faster marathon performances tend to run more and faster in training. We can also observe that a significant proportion (13%) of athletes run a marathon in training before the race.

For me, the most interesting observation is that while the average speed (reciprocal of pace) of training increases roughly linearly as marathon time decreases (as seen in the middle left graph), an increase in average distance run per week seems to have little effect on athletes slower than 3 hours, but correlates with vast improvements on marathon time for those faster than 3 hours (as seen in the top right graph).

Running at different speeds produces a varying training stimulus and it is generally agreed that training at a variety of paces is beneficial to performance. However, it is a matter of debate as to what the optimal proportion of running volume at various paces should be.

Using an athlete's marathon time, we use the Riegel formula [4] to infer estimated easy running and race paces for a variety of distances.  We now plot the volume and proportion of mileage spent in a collection of training zones for the fastest, median, and slowest 10 athletes. Each coloured bar in each graph corresponds to a different athlete.

Mileage in pace zones (Fastest 10 male athletes) [Times between 2:15:38 and 2:25:13]
Mileage in pace zones (Median 10 male athletes) [Times between 3:03:12 and 3:03:31]
Mileage in pace zones (Slowest 10 male athletes) [Times between 3:29:59 and 3:30:11]

From these graphs, we can observe that the fastest runners not only run greater mileage, but they run a much larger proportion of mileage significantly slower than race pace.

In the next post, I will discuss the Tanda prediction [5] of marathon performance, and hopefully come up with an improved model.

Technical details on data collection and cleaning:

To extract the data from Strava, I used a Scrapy spider to download 20 GB of data, including all the available GPS streams from 16 weeks of training of those athletes on Strava with public profiles who ran 3:30:11 or under at the 2016 London Marathon.

I then parsed this file to extract useful statistics, such as average pace, total mileage etc. for a variety of time periods prior to the race. In doing so, I discarded any data which was clearly an incorrectly tagged or corrupted GPS file (e.g. average speed greater than 10 metres per second). For manual run uploads (those without a GPS trace), I assumed constant speed for the duration of the activity.
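As a rough illustration of the cleaning rules just described, here is a minimal sketch. The `Activity` record and its field names are hypothetical and not the structure used in the original parser; only the 10 metres per second threshold and the constant-speed assumption for manual uploads come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold quoted in the text: anything faster on average is treated as bad data.
MAX_PLAUSIBLE_SPEED = 10.0  # metres per second


@dataclass
class Activity:
    """Hypothetical record for one upload (field names are assumptions)."""
    distance_m: float
    duration_s: float
    speeds: Optional[List[float]] = None  # GPS-derived speeds; None for manual uploads


def average_speed(activity: Activity) -> float:
    return activity.distance_m / activity.duration_s


def is_plausible(activity: Activity) -> bool:
    """Reject zero-length durations and implausibly fast average speeds."""
    return activity.duration_s > 0 and average_speed(activity) <= MAX_PLAUSIBLE_SPEED


def speed_series(activity: Activity, n_samples: int = 1) -> List[float]:
    """Return the GPS speeds, or a constant-speed series for manual uploads."""
    if activity.speeds is not None:
        return list(activity.speeds)
    return [average_speed(activity)] * n_samples
```

A 10 km run in 50 minutes passes the check, while a 40 km "run" in the same time (over 13 metres per second, most likely a mis-tagged bike ride) is discarded.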

The velocity time series obtained from Strava via the GPS trace was still noisier than desirable. After trialling a few methods, I found that a rolling median with a window size of 10 GPS points cleans up the time series nicely.
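A rolling median of this kind can be sketched in a few lines. Whether the original window was trailing or centred is not stated, so the trailing window below is an assumption, as is the function name.

```python
from statistics import median
from typing import List


def rolling_median(values: List[float], window: int = 10) -> List[float]:
    """Trailing rolling median (assumed variant).

    Each output point is the median of the current sample and up to
    `window - 1` samples before it, so the output has the same length
    as the input and early points use a shorter window.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(median(values[start:i + 1]))
    return smoothed
```

The median is a natural choice here because a single GPS spike (say, one sample at 50 metres per second) is ignored entirely rather than being averaged into its neighbours, as a rolling mean would do.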

To remove athletes who only use Strava to log some of their training, or who joined Strava relatively close to race day, I required that athletes have a minimum of 16 uploads, with at least one of those in the first 3 weeks of the 16-week block prior to the race.
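The inclusion rule can be written as a small predicate. This is a sketch under assumptions: the function name and the representation of uploads as a list of dates are illustrative, and the original may have counted boundary days differently.

```python
from datetime import date, timedelta
from typing import Iterable


def uses_strava_consistently(
    upload_dates: Iterable[date],
    race_day: date,
    block_weeks: int = 16,
    min_uploads: int = 16,
    early_weeks: int = 3,
) -> bool:
    """Keep athletes with enough uploads, at least one of them early in the block."""
    block_start = race_day - timedelta(weeks=block_weeks)
    early_end = block_start + timedelta(weeks=early_weeks)
    # Only uploads inside the training block (race day itself excluded) count.
    in_block = [d for d in upload_dates if block_start <= d < race_day]
    has_early_upload = any(d < early_end for d in in_block)
    return len(in_block) >= min_uploads and has_early_upload
```

An athlete who uploads once a week for the whole block passes, while one with plenty of uploads that all fall in the final twelve weeks does not.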

I used genderize.io to classify athletes who didn't provide their gender to Strava.

The paces for steady, easy and very easy running are defined by the Riegel pace prediction for distances of 100km, 1000km and 10000km respectively. Even though the predictor scales poorly to such distances, it still gives reasonable estimates of steady, easy and very easy paces. We now define the pace zones as follows.

For each of the distances d (one of 1500m, 3k, 5k, 10k, HM, M, Steady, Easy, Very Easy), let v_d denote the speed corresponding to the Riegel estimate for this distance. The pace zone corresponding to d is the set of speeds that are closer to v_d than to v_{d'} for any other distance d'.
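The zone definition above can be sketched as follows. The Riegel formula predicts a time T2 = T1 * (D2 / D1)^1.06 for distance D2 from a known time T1 over distance D1; the exponent 1.06 is the commonly quoted value and is assumed here, since the post does not state it. The function and constant names are illustrative.

```python
MARATHON_M = 42195.0
RIEGEL_EXPONENT = 1.06  # commonly quoted value; an assumption, not stated in the post

# Reference distance in metres for each zone, following the definitions in the text.
ZONE_DISTANCES_M = {
    "1500m": 1500.0,
    "3k": 3000.0,
    "5k": 5000.0,
    "10k": 10000.0,
    "HM": 21097.5,
    "M": MARATHON_M,
    "Steady": 100e3,
    "Easy": 1000e3,
    "Very Easy": 10000e3,
}


def riegel_speed(marathon_time_s: float, distance_m: float) -> float:
    """Speed v_d (m/s) implied by the Riegel prediction for distance_m."""
    predicted_time_s = marathon_time_s * (distance_m / MARATHON_M) ** RIEGEL_EXPONENT
    return distance_m / predicted_time_s


def zone_speeds(marathon_time_s: float) -> dict:
    return {name: riegel_speed(marathon_time_s, d) for name, d in ZONE_DISTANCES_M.items()}


def pace_zone(speed: float, marathon_time_s: float) -> str:
    """Name of the zone whose reference speed v_d is closest to `speed`."""
    speeds = zone_speeds(marathon_time_s)
    return min(speeds, key=lambda name: abs(speeds[name] - speed))
```

For a 3:00:00 marathoner (10800 seconds), marathon pace is about 3.9 metres per second, so `pace_zone(3.9, 10800)` falls in the "M" zone, while a jog at 2.0 metres per second lands in "Very Easy".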

References:

[1] https://www.strava.com/running-races/2016-london-marathon
[2] https://www.theguardian.com/lifeandstyle/the-running-blog/2016/apr/21/sub-3-marathon-data-strava-london
[3] https://en.wikipedia.org/wiki/Hitting_the_wall
[4] https://en.wikipedia.org/wiki/Peter_Riegel
[5] Tanda G. Prediction of marathon performance time on the basis of training indices. J. Hum. Sport Exerc. Vol. 6, No. 3, pp. 511-520, 2011

11 Replies to “Marathon Data Analysis Part 1: Initial Thoughts”

  1. You need to be careful with correlation=causation on some of your observations. e.g. "an increase in average distance run per week seems to have little effect on athletes slower than 3 hours, but causes vast improvements on marathon time for those faster than 3 hours". Causes? Or correlates with runners who are more committed / better at marathoning in the first place? You can say that athletes slower than 3 hours seem to display a broad range of average distances per week, and that athletes faster than 3 hours tend to display a narrower, higher band of average distances per week, but I'm not sure you can necessarily conclude that one causes the other. (Though it isn't an unreasonable hypothesis).

    Also, would it not be useful to show the race data in categories relating to %age of, say, world record pace so that you treat female athletes more fairly. Grouping 3hr women with 3hr men is unfair as their relative levels of performances are quite different.

    There's still plenty of screwy data in there too - I don't believe there are truly any sub 2:30 runners who train less than 20 miles per week, for example. Though I'm sure you're well aware of that!

    Otherwise, a really interesting extraction of data from a source that I suspect has great data mining potential!

    1. Yes good spot, this is purely correlation rather than causation, I've amended the language to reflect this. In future analysis I plan on treating men and women separately, as there is hopefully enough data to still get useful insights. And of course you are right that this is a very noisy data set, although I hope it'll still be possible to extract some of the underlying signal!

  2. Fantastic piece of work buddy!
    It would be really interesting to see the section regarding training paces extended to cover race day performance. It's clear from the above analysis that "quicker" athletes tend to understand the benefits of (and spend more time) running slowly in training. However do these quicker athletes necessarily hold their pace better throughout the event? Addressing the topic of spending enough time at easy / relaxed running pace increasing aerobic efficiency and protecting an athlete from hitting the wall. An extract of splits from strava might make this possible to investigate?

    1. Thanks for the comment, I'm glad you liked it! Studying race splits and the effect of training on this is definitely an interesting idea! I hope I'll find the time to look into it!

  3. An interesting analysis and great work extracting the data. As All says it would be useful to consider the race performance – some of those running 2.45 will have over performed while others have hit the wall – so their training regime hasn’t necessarily been beneficial.

  4. Fascinating and valuable data but I agree totally with Matt L any causal relationship has to be speculative. In my book ’55 Years Running’ I make the point that while training very obviously affects performance so does performance affect training. It’s much easier for a good runner to do more and faster training than a poorer runner. Also the better a runner you are the more you enjoy running, the more highly motivated you are likely to be and the more running you will want to do.

    1. Agreed. I am not suggesting athletes immediately stop their training and go and do X, Y and Z. I merely hope to examine common trends between faster athletes and hopefully inspire people to think more deeply about the best type of training. This is far from a substitute for a knowledgeable coach!

  5. What you have seemingly not acknowledged is that training is relatively slower because race pace is relatively faster.
    The benefit of a faster marathoner running closer to race pace more often is probably negated by the risk of the extra stress ergo they train relatively slower.
    The training Zones for a faster runner are probably much wider than those for a slower runner because of the much higher aerobic capacity.
    There should be no reason to discourage a 2:45 marathoner from running easy with a 2:25 marathoner for this reason.

    1. Yes this is a good point. However, one would imagine that these paces are still reasonable proxies for effort. In an ideal world we would be able to have more detailed information about an athlete such as shorter distance PBs and VO2 max, but after giving it some thought this method appears to be the best approximation to the task at hand.

  6. I do agree with all the ideas you've presented in your post. They are really convincing and will definitely work. Still, the posts are too short for starters. Could you please extend them a little from next time? Thanks for the post.
