The marathon season is well and truly upon us. Whether you are recovering from the Boston Marathon or one of the 40 000 gearing up for London this weekend, it is likely your running shoes are beginning to look somewhat the worse for wear.
Marathon running is not incomprehensibly complicated, and in the age of data it is surprising that no extensive study of the factors affecting performance has been carried out. Strava, a social network for athletes, collates detailed training data from a large number of athletes, and while it provides a summary of some of this data, the insights that can be gained from that summary are limited. The Guardian published a brief list of results that can be obtained from this data.
Knowing which training factors affect performance, and to what extent, is useful for two main reasons. First, it allows training schedules to be designed objectively, producing the best possible performance for an athlete subject to constraints on training time while minimising injury risk. Second, if we can accurately predict an athlete's performance, we can choose an appropriate pacing strategy for race day and limit the chance of hitting the dreaded 'wall'.
Using a web crawler, I have collected the training data of all athletes on Strava who completed the 2016 London Marathon in under 3:30 for the 16 weeks prior to the race. After filtering to remove clearly false data and those who only upload data to Strava sporadically, we can already begin to gain some interesting insights.
Unsurprisingly, those who record faster marathon performances tend to run more, and faster, in training. We can also observe that a significant proportion (13%) of athletes run a marathon in training before the race.
For me, the most interesting observation is that while the average speed (reciprocal of pace) of training increases roughly linearly as marathon time decreases (as seen in the middle left graph), an increase in average distance run per week seems to have little effect on athletes slower than 3 hours, but correlates with vast improvements on marathon time for those faster than 3 hours (as seen in the top right graph).
Running at different speeds produces a varying training stimulus and it is generally agreed that training at a variety of paces is beneficial to performance. However, it is a matter of debate as to what the optimal proportion of running volume at various paces should be.
From an athlete's marathon time, we use the Riegel formula to estimate easy running paces and race paces for a variety of distances. We then plot the volume and proportion of mileage spent in each of a collection of training zones for the fastest, median, and slowest 10 athletes. Each coloured bar in each graph corresponds to a different athlete.
From these graphs, we can observe that the fastest runners not only run greater mileage, but they run a much larger proportion of mileage significantly slower than race pace.
Technical details on data collection and cleaning:
To extract the data from Strava, I used a Scrapy spider to download 20 GB of data, including all the available GPS streams from 16 weeks of training of those athletes on Strava with public profiles who ran 3:30:11 or under at the 2016 London Marathon.
I then parsed this file to extract useful statistics, such as average pace, total mileage etc. for a variety of time periods prior to the race. In doing so, I discarded any data which was clearly an incorrectly tagged or corrupted GPS file (e.g. average speed greater than 10 metres per second). For manual run uploads (those without a GPS trace), I assumed constant speed for the duration of the activity.
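The cleaning rules in the previous paragraph can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the `Activity` record, its field names, and the one-sample-per-second fill for manual uploads are all assumptions made for the example. Only the 10 metres per second threshold and the constant-speed treatment of manual uploads come from the text.

```python
from dataclasses import dataclass

MAX_PLAUSIBLE_SPEED = 10.0  # metres per second, the threshold quoted in the text


@dataclass
class Activity:
    """Hypothetical activity record; the field names are illustrative only."""
    distance_m: float   # total distance in metres
    duration_s: float   # elapsed time in seconds
    has_gps: bool       # False for manual uploads


def average_speed(activity):
    """Average speed in m/s, or None if the duration is not positive."""
    if activity.duration_s <= 0:
        return None
    return activity.distance_m / activity.duration_s


def is_plausible(activity):
    """Reject activities with no usable duration or an implausible average speed."""
    speed = average_speed(activity)
    return speed is not None and speed <= MAX_PLAUSIBLE_SPEED


def speed_series(activity, gps_speeds=None):
    """Return the GPS speed stream if there is one, else a constant-speed fill.

    The fill assumes one sample per second, which is a choice made for this
    sketch rather than something stated in the text.
    """
    if activity.has_gps and gps_speeds is not None:
        return list(gps_speeds)
    return [average_speed(activity)] * int(activity.duration_s)
```

For instance, `is_plausible(Activity(40000, 3000, True))` is `False` because the average speed is above 13 metres per second, which more likely indicates a bike ride tagged as a run.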
The velocity time series obtained from Strava via the GPS trace was still noisier than desirable. After trialling a few methods, I found that a rolling median with a window size of 10 GPS points cleans up the time series nicely.
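A rolling median of this kind can be sketched in a few lines. The version below uses a trailing window and lets the first few points use however many samples are available. Both choices are assumptions, since the text does not say whether the window was trailing or centred or how the edges were handled.

```python
from statistics import median


def rolling_median(values, window=10):
    """Trailing rolling median over `values`.

    Each output point is the median of the current sample and up to
    `window - 1` preceding samples; early points use a shorter window.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(median(values[start:i + 1]))
    return smoothed
```

Unlike a rolling mean, the median ignores isolated spikes, so a single bad GPS fix within the window does not drag the smoothed speed upwards.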
To remove athletes who use Strava to log only some of their training, or who joined Strava relatively close to race day, I required that athletes have a minimum of 16 uploads, with at least one of those in the first 3 weeks of the 16-week block prior to the race.
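As a rough sketch, the two upload criteria could be checked like this. The function name, the use of calendar dates, and the decision to exclude race day itself from the block are assumptions made for the example.

```python
from datetime import date, timedelta

BLOCK_WEEKS = 16   # length of the training block before the race
EARLY_WEEKS = 3    # an upload must fall within this many weeks of the block start
MIN_UPLOADS = 16   # minimum number of uploads within the block


def is_regular_uploader(upload_dates, race_day):
    """True if the athlete meets both upload criteria for the training block."""
    block_start = race_day - timedelta(weeks=BLOCK_WEEKS)
    early_end = block_start + timedelta(weeks=EARLY_WEEKS)
    # Keep only uploads inside the block (race day itself is excluded here).
    in_block = [d for d in upload_dates if block_start <= d < race_day]
    has_early_upload = any(d < early_end for d in in_block)
    return len(in_block) >= MIN_UPLOADS and has_early_upload
```

An athlete with 20 uploads, all in the final three weeks before the race, fails the second criterion and is dropped. This is the intended behaviour for someone who joined Strava late in their build-up.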
I used genderize.io to classify athletes who didn't provide their gender to Strava.
The paces for steady, easy and very easy running are defined by the Riegel pace prediction for 100km, 1000km and 10000km distance respectively. Even though the predictor scales poorly to such distances, it still gives reasonable estimates of steady, easy and very easy paces. We now define the pace zones as follows.
For each distance $d$ (one of 1500m, 3k, 5k, 10k, HM, M, Steady, Easy, Very Easy), let $v_d$ denote the speed corresponding to the Riegel estimate for this distance. The pace zone corresponding to $d$ is the set of speeds that are closer to $v_d$ than to $v_{d'}$ for any other distance $d'$.
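The zone definition above amounts to a nearest-neighbour assignment in speed. The sketch below assumes the usual form of Riegel's formula, $T_2 = T_1 (D_2 / D_1)^{1.06}$, and measures closeness as the absolute difference in speed. The function names and the dictionary layout are illustrative, not taken from the original analysis.

```python
MARATHON_M = 42195.0
RIEGEL_EXPONENT = 1.06  # exponent in the usual form of Riegel's formula

# Reference distances in metres. The last three are the 100 km, 1000 km and
# 10000 km distances that the text uses to define steady, easy and very easy.
ZONE_DISTANCES = {
    "1500m": 1500.0,
    "3k": 3000.0,
    "5k": 5000.0,
    "10k": 10000.0,
    "HM": 21097.5,
    "M": MARATHON_M,
    "Steady": 100e3,
    "Easy": 1000e3,
    "Very Easy": 10000e3,
}


def riegel_time(marathon_time_s, distance_m):
    """Predicted time in seconds over `distance_m`, scaled from a marathon time."""
    return marathon_time_s * (distance_m / MARATHON_M) ** RIEGEL_EXPONENT


def zone_speeds(marathon_time_s):
    """Reference speed in m/s for each zone, given the athlete's marathon time."""
    return {
        name: dist / riegel_time(marathon_time_s, dist)
        for name, dist in ZONE_DISTANCES.items()
    }


def pace_zone(speed, speeds_by_zone):
    """Name of the zone whose reference speed is closest to `speed`."""
    return min(speeds_by_zone, key=lambda name: abs(speed - speeds_by_zone[name]))
```

For a 3-hour marathoner, `zone_speeds(3 * 3600)` gives a marathon reference speed of about 3.9 metres per second, and `pace_zone` then labels each smoothed GPS speed with its nearest zone. Under this definition the zone boundaries fall at the midpoints between adjacent reference speeds.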
Tanda G. Prediction of marathon performance time on the basis of training indices. J. Hum. Sport Exerc. Vol. 6, No. 3, pp. 511-520, 2011.