Marathon Data Analysis Part 2: Testing Tanda

There have been numerous attempts to predict marathon performance based on physiological measurements and training data; Dr. Christof Schweining gives a good overview of these [1]. However, many of these models have been fitted to a very small data sample, and do not generalise well to new data.

Now that we have access to a large data set of marathon runners and their training, we can test these models and quantify their predictive power. Unfortunately, beyond age and gender, we have no access to physiological measurements, so we are restricted to models based on training data. As discussed by Christof, the only such model which stands up to more than basic scrutiny is Tanda's prediction formula [2]. As we will see, however, this model fails to generalise outside of its very small training data set.
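As a rough illustration of what "quantifying predictive power" could involve, the sketch below compares predicted and actual finish times using three common error measures. The function name and the choice of measures are my own assumptions for illustration; the excerpt above does not say which measures the full analysis uses.

```python
from math import sqrt


def prediction_errors(predicted, actual):
    """Summarise how far predicted finish times are from actual ones.

    Both arguments are sequences of times in the same unit (e.g. seconds).
    Hypothetical helper for illustration, not the analysis from the post.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")

    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    return {
        # Positive bias means the model predicts slower times than were run.
        "bias": sum(errors) / n,
        # Mean absolute error: typical size of a miss, ignoring direction.
        "mae": sum(abs(e) for e in errors) / n,
        # Root mean squared error: penalises large misses more heavily.
        "rmse": sqrt(sum(e * e for e in errors) / n),
    }
```

A model that generalises well should show a bias near zero and small errors on runners it was not fitted to, not only on its original sample.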


Marathon Data Analysis Part 1: Initial Thoughts

The marathon season is well and truly upon us. Whether you are recovering from the Boston Marathon or one of the 40 000 gearing up for London this weekend, it is likely your running shoes are beginning to look somewhat the worse for wear.

Marathon running is not incomprehensibly complicated, and in the age of data it is surprising that no extensive study of the factors affecting performance has been carried out. Strava, a social network for athletes, collates detailed training data from a large number of athletes. While it provides a summary of some of this data [1], the insights that can be gained from that summary are limited. The Guardian published a brief list of results that can be obtained from this data [2].

Knowing which training factors affect performance, and to what extent they matter, is useful for two main reasons. First, it allows objective, scientific design of training schedules that produce the best possible performance for an athlete, subject to constraints on training time and while minimising injury risk. Second, if we can form an accurate prediction of an athlete's performance, we can decide on an appropriate pacing strategy for race day and limit the chance of hitting the dreaded 'wall' [3].

Want to run faster? Run more!

There are differing schools of thought when it comes to training for distance running. At one end of the spectrum are the low volume, high intensity advocates, who believe that relatively short repetitions at race pace or faster are the path to success. At the other end are those who believe that total running volume, or mileage, is the most important training factor, regardless of intensity. While the optimal training load will no doubt fall somewhere between these two extremes, I am of the view that it is better to err on the side of volume rather than intensity.

"Mileage - that's the key. To try and get as much mileage as we can." - Lewis Hamilton

Okay, so Lewis Hamilton may not even compete in the correct sport, but I'm not going to let that dissuade me.

Despite much anecdotal evidence, both for and against, there is limited research available on the effect of high volume on distance running performance. The most detailed paper I found was by Tanda (2011), in which he successfully predicts marathon finish time as a function of training volume and average training pace. However, the study was performed on a small sample of just 22 runners.
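To make the shape of such a model concrete, here is a minimal sketch of Tanda's formula as it is commonly quoted: predicted marathon pace depends on mean weekly distance K (km per week) and mean training pace P (seconds per km), both taken over the weeks leading up to the race. The coefficients below are quoted from memory of the published formula and should be checked against the paper before being relied upon; the function names are my own.

```python
import math

MARATHON_KM = 42.195


def tanda_marathon_pace(weekly_km, training_pace_s_per_km):
    """Predicted marathon pace in seconds per km.

    Coefficients are as commonly quoted for Tanda (2011); treat them as an
    assumption and verify against the paper.
    """
    return (
        17.1
        + 140.0 * math.exp(-0.0053 * weekly_km)
        + 0.55 * training_pace_s_per_km
    )


def tanda_marathon_time(weekly_km, training_pace_s_per_km):
    """Predicted marathon finish time in seconds."""
    return MARATHON_KM * tanda_marathon_pace(weekly_km, training_pace_s_per_km)


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Example: a runner averaging 80 km per week at 5:00 per km (300 s/km).
print(format_hms(tanda_marathon_time(80, 300)))
```

Note the form of the volume term: because it decays exponentially, each extra kilometre per week buys a smaller improvement than the last, while the training pace term is linear.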

As is the case in many areas, modern times have seen a vast increase in the quantity of data available, and so we should be able to use this to investigate our question. Strava, "the social network for athletes", provides information about the training of a huge number of athletes.

I decided to analyse the training of athletes in the build-up to the 2015 Leeds Abbey Dash, a 10-kilometre road race. Not only was this one of the largest 10k races in the country, but it doubled as the national championships, ensuring data on athletes of all standards.


'Twas the night before BUCS: An algorithmic approach for predicting cross country performances.

'Twas the night before BUCS, when all through Gloucester
Not an athlete was drinking, not even one beer.
The spikes were stood by the door with care,
In hopes that some medals soon would be theirs.

The greatest day of the year is almost upon us: tomorrow is the British Universities (BUCS) Cross Country Championships. Obviously being too excited to do any work, I decided to see if it were possible to predict the results based on previous performances.

PowerOf10 provides a fantastic source of athletics data and, a few Scrapy spiders later, I had a large dataset to play with.

The race entry lists provided a list of names, and cross-referencing these with PowerOf10 allowed me to obtain a near-complete set of historical race results. Unfortunately, several names were insufficiently unique or misspelt (some universities were more prone to this mistake than others... no comment), which meant obtaining performances for those athletes was impossible.

Based on analysis of every cross country race since 1 January 2015, my algorithm predicted the following top 20 men's team results based on a 6 to run, 4 to score system:
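For readers unfamiliar with the "6 to run, 4 to score" system, the sketch below shows one simple interpretation of it: each team may count up to six runners, and its score is the sum of the finishing positions of its first four, with the lowest score winning. This is an illustrative assumption about the rule, not the prediction algorithm itself; it ignores tie-breaks and any B-team rules, and the function name is hypothetical.

```python
from collections import defaultdict


def team_scores(finishers, to_score=4, to_run=6):
    """Rank teams from a list of team names given in finishing order.

    Returns a list of (team, score) pairs sorted by score, lowest first.
    Teams with fewer than `to_score` finishers are left out.
    """
    positions = defaultdict(list)
    for position, team in enumerate(finishers, start=1):
        # Only the first `to_run` runners of each team are considered,
        # but every finisher still occupies an overall position.
        if len(positions[team]) < to_run:
            positions[team].append(position)

    scores = {
        team: sum(team_positions[:to_score])
        for team, team_positions in positions.items()
        if len(team_positions) >= to_score
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

Given a predicted finishing order for every entered athlete, passing the corresponding list of team names to `team_scores` would produce a predicted team ranking.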