*21.11.2018*

*This article and the accompanying spreadsheet represent a long overdue resource for practitioners in the disciplines of exercise and sport science who want to monitor the progress of their individual athletes, patients, clients or teams in a reasonably rigorous quantitative fashion. In a previous article/slideshow and spreadsheet I focused on quantitative assessment of change since the previous test or measurement, but there has always been a need for quantitative assessment of trend over multiple tests to determine the extent to which the individual is on track for improvement or recovery. Equally, there is a need for assessment of acute deviations from the trend to determine the likelihood of benefit or harm from short-term changes in training or other strategies. This new spreadsheet meets all these needs.*

The previous spreadsheet included a panel for inserting a single measurement to estimate the chances that the individual's true value falls either side of a reference value or threshold, such as a critical blood concentration of vitamin D. The present spreadsheet cannot be used for such calculations, but I will produce a similar upgraded resource, if there is enough demand for it.

The new spreadsheet can be used to track an individual team's scores, but if the data for the team come from the sum of the scores of individual athletes, you should consider analyzing the athletes' data as a time series with either the post-only crossover spreadsheet or mixed modeling. You can then perform the usual sample-based inferential analyses, which are potentially publishable, if you have got some interesting results with the team.

The new spreadsheet comes as a workbook with a second spreadsheet for simulating realistic test scores and thereby investigating error rates with known true values for starting score, trend, changes on chosen time points, and typical error of measurement. Instructions on their use are included at the top of both spreadsheets and are supplemented with comments in some cells.

Both spreadsheets require input of values for the smallest important change and for a target trend. Input of a value for the typical (standard) error of measurement is also desirable. There are numerous articles at this site and elsewhere about the meaning and estimation of these values, which at the request of the reviewer I have summarized in appendices. What follows is a brief description of how the inferences are derived for the three kinds of monitoring: change between measurements, deviation of one measurement or the average of several measurements from the trend, and magnitude of the trend. The operation of the spreadsheet for simulating data is also described.

The probability that the true change is a substantial increase is calculated by combining the observed change, the smallest important change, and the short-term typical error (with its degrees of freedom) into the relevant value of the t statistic. The probability of a substantial decrease is calculated in a similar manner. The probability of a trivial change is the difference between 100% and the sum of the percent probabilities of a substantial increase and decrease.

To the extent that measurements on an individual represent a sample of the individual's true values, it seems reasonable to apply the rules of magnitude-based inference to make conclusions about the individual's true changes. I have chosen the non-clinical version of magnitude-based inference, according to which a change is unclear if the true change *could* be both a substantial increase and a substantial decrease. I have set the default value for *could* to 10%, rather than the usual 5%, which in simulations (described below) does not produce enough clear outcomes for real changes with noisy measures (typical error ³ smallest important change). Unclear changes are indicated with a "?". Clear changes are indicated with «, , and/or ¯ for trivial changes, substantial increases and/or substantial decreases respectively, and very likely changes (those with chances >90%) are indicated with an asterisk (*).

Inserting a value for the short-term typical error is optional. If you do not have a value, the spreadsheet uses the typical error of the estimate from the trend analysis, which will generally overestimate the error of measurement (if the individual made sporadic real changes off the trend) and thereby make it harder to get clear outcomes. Also, you need enough test scores (at least 10) on the individual to get even modest precision for the typical error of the estimate.

A scatterplot of the data allows for easy identification of the scores to include for calculating the trend line. The plot shows the regression line flanked by dashed lines defining a zone of trivial changes plus uncertainty in predicted scores. When a score falls outside the zone defined in this manner, the true difference from the predicted score is approximately likely (>75% chance) to be substantial.

A trend line is fitted to chosen assessments using the Excel functions for simple linear regression, then the difference between the score predicted by the linear trend and the other test scores is analyzed in the same manner as for changes between measurements. (Formulae for the predicted scores and their uncertainty were copied from the spreadsheet for validity.)

When measures are noisy, averaging repeated measurements is the best strategy to reducing the noise. I have therefore included an option for averaging a cluster of values to compare with the values predicted by the trend line.

If you insert a value for a change that the individual wants or needs to show over an extended period (which you also have to insert; e.g., 10 weeks, a season, or a year), the spreadsheet divides the change by the period to give a target trend. This trend is then treated as a smallest important trend, and the spreadsheet combines this value with the slope of the regression line and its uncertainty to give probabilities that the true trend is greater than the positive trend, less than the negative trend, and trivial (falls between ± this trend). Magnitude-based inferences for the trend are provided in the same manner as for changes. (The formula for the uncertainty in the slope was copied from the spreadsheet for validity.)

You will be a better practitioner if you have some idea of the errors you can make when assessing changes and trend in an individual. I have therefore gone to considerable trouble to make a version of the spreadsheet that simulates and analyses test scores similar to yours, so you can see directly how often errors will occur. To generate the score at a given time point, the spreadsheet combines a true starting score with a true trend multiplied by the elapsed time, adds any extra true change at one or more time points, then adds a value of the true typical error multiplied by NORMSINV(RAND()), which simulates the effect of normally distributed random measurement error. In this way the spreadsheet generates a sample of the kind of scores you will get with tests on your individual, if you have inserted realistic values of true typical error, true trend, true extra changes, and of course a true starting score. The spreadsheet then performs inferences on changes and trends in the same manner as for the other spreadsheet, using values for the smallest important change and target trend.

In the absence of an estimate of the typical error from a reliability study, a value can come from an initial analysis of the individual with the first spreadsheet. Find the cell showing the typical error of the estimate from the fitting of the trend line (Cell E81), and use that as the typical error in the simulation sheet. You can also use a typical error obtained by combining the typical errors of the estimate from several individuals, by adding their squares and taking the square root.

To use the simulation spreadsheet, start with a zero true trend and zero extra changes, so you can see how often you get Type-1 (false-positive) errors: that is, how often the inference shows a non-existent substantial trend and non-existent substantial changes. You should play with the number of time points included in the trend analysis, to get some idea of how many you will need with your real data. Then insert a realistic true trend to see how often you get false-negative (Type-2) errors: how often the inference fails to show a substantial trend. Similarly insert substantial changes at individual time points to see how often the inferences fail to show changes from the linear trend and changes from the previous test.

The Type-1 error rate for changes from the previous test depends on the relative magnitudes of typical error and smallest important change. The Type-1 error rate for change from the trend depends additionally on the number of time points included in the trend, which along with the smallest important trend also contributes to the Type-1 error rate for the trend itself. All these factors contribute to the Type-2 error rates, along with the magnitude of the true changes and trend. For noisy measures you will have to learn to live with high error rates.

It is inevitable that the two new spreadsheets have bugs, hopefully only minor formatting errors or ambiguous instructions. Please get back to me by email if you encounter problems or especially if you have any suggestions for improvement.

Buchheit M (2018). Magnitudes matter more than beetroot Juice. Sport Performance & Science Reports 1, https://sportperfsci.com/magnitudes-matter-more-than-beetroot-juice/

Hopkins WG (2010). Linear models and effect magnitudes for research, clinical and practical applications. Sportscience 14, 49-57

Malcata RM, Hopkins WG (2014). Variability of competitive performance of elite athletes: a systematic review. Sports Medicine 44, 1763-1774

Published June 2017

For a description of smallest important effects for all types of outcomes, see Linear Models and Effect Magnitudes (Hopkins, 2010). See also a slideshowfocusing on medal-winning enhancements presented at the performance analysis conference in 2016. For the following summary I have also drawn on Martin Buchheit's recent article on monitoring athletes in the Aspetar Journal (now published here).

For an athlete who is already winning medals regularly, the smallest improvement in performance results in an extra medal every 10 competitions. Assuming little interaction between the athletes in the competition, the change in performance score (time, distance, weight lifted, or judges' ratings) producing the extra medal is 0.3 of the variability that top athletes show from competition to competition. Expressed as a coefficient of variation, the variability ranges from ~0.5% to ~3.0% for top athletes, depending on the sport. You have to take into account a kinetic factor to convert 0.3 of this variability in competition time or distance into the smallest important change in test performance, if the test measure is power output. The test protocol also has to be taken into account, if the test measure is time in an incremental or constant-power test to exhaustion and/or there is a fatiguing pre-load. See Malcata and Hopkins (2014) for a comprehensive review.

Sub-elite athletes generally show more variability in competition than elites do (perhaps by a factor of 1.1-1.4), so the smallest important change for winning medals in local or age-group competitions is correspondingly larger. But a target trend may be more important than winning medals for these athletes. See Appendix 2.

When you do not know exactly how much change in a test or performance indicator is associated with winning an extra match or medal every 10 competitions, standardization is the only way to define smallest important changes. You will need an estimate of the between-subject standard deviation for athletes on a similar level to your athlete, and in the same playing position for team-sport athletes. (You may pool playing positions to get enough players for a reasonably accurate estimate, if you are satisfied that mean test scores are unlikely to differ substantially between the playing positions.) One-fifth (0.2) of this standard deviation is then the smallest important change.

A test unrelated to performance might still be worth administering if it is related to health. In particular, the smallest important change in a test related to risk of injury or illness can be defined as the change or difference associated with the smallest important reduction in risk in a relevant epidemiological study: a risk (hazard) ratio of 0.90.

In the absence of information on risk, standardization is the only approach. Strictly speaking, the between-subject standard deviation for a health indicator should be "pure" or free of short-term typical error of measurement (in contrast to that for measures of performance, which for team-sport athletes should be the observed between-subject standard deviation). The pure standard deviation is Ö(observed SD^{2} – typical error^{2}) and requires a short-term reliability study on subjects similar to your individual for estimation of the SD and error.

Unfortunately, some physiological measures have real differences between individuals that are only a fraction of the typical error (within-subject variation); that is, any observed differences are due almost entirely to day-to-day variation. The variation itself may differ substantially between individuals. Standardization fails for such measures. Martin Buchheit (2018) has suggested using a fraction or multiple of the typical error as the smallest important change for such measures, in the sense that any observed change greater than this amount can be regarded as a change not due simply to error of measurement. This approach amounts to identifying when a change is "unusual" for the individual. With the current spreadsheet, you can identify such changes by specifying what you mean by *unusual* with the cell that defines the meaning of *very unlikely*. The default of 10% is probably about right for most situations; it corresponds to accepting a single change score as a real change when the change is greater than ~2.0 typical errors (t´Ö2´e, where t is the value of the t statistic for a cumulative probability of 0.9), as shown in one of the cells. You then set the smallest important change to zero (actually a very small number, such as 0.001) rather than specifying a fraction or multiple of the typical error. All unusual changes are now marked with either * or ¯*, and all other changes are unclear (marked with ?). Keep in mind that, if the real change is zero, you will see unusual increases 10% of the time and unusual decreases 10% of the time.

Finally, the above advice is based on the assumption that the test measure could have a useful relationship to performance or health. If you have a strong suspicion that the measure has no such relationship, stop measuring it and put your resources into measuring something that is more likely to have a relationship.

For an individual who wants or needs to improve over a period longer than simply the time to the next test, you can decide on a realistic change over a chosen period. The rate of change (the change divided the period) is then the target trend. For a development athlete aiming (eventually) for Olympic selection, work out the change needed over the next season or year to keep the athlete on target. For sub-elite team-sport athletes, the difference between the athlete's test or performance indicator and that of typical athletes in an elite team is a reasonable target; you then have to decide on a reasonable period over which you hope to gain that improvement. Similar reasoning would apply to measures of muscle mass, body fat, or health indices such as blood lipids or vitamin D.

The target trend evaluated over the period between consecutive tests is another approach to estimating the smallest important change. For example, if the target improvement for the coming year is 4%, then with three months between tests, the smallest important change of 1% will keep the athlete on track for the target.

The optional short-term typical or standard error of measurement represents the standard deviation of an individual's repeated measurements in the absence of any real change in the individual. As such, the error can come from a short-term reliability study of individuals similar to your individual. Alternatively, if you have at least 10 or preferably more consecutive repeated tests on the individual, and there is no reason why there should be substantial changes between the tests other than a gradual trend, the standard error of measurement derived from fitting the trend line is a good estimate of the typical error. Hence, if you do not include a value for the typical error, the spreadsheet will use the standard error of the estimate.

You may use the standard error of the estimate from another individual or individuals when you are starting tests on a new individual. You may also average standard errors of the estimate, but do it by averaging their squares and taking the square root. To be more accurate, weight the squares by their degrees of freedom (number of points in their trend line minus 2). The degrees of freedom of the mean is the sum the degrees of freedom.

If the performance test and score are similar to those of the competition (e.g., a time-trial time for a track athlete, a throwing or jumping distance for a field athlete, an ergometer time-trial time for cyclist, rower or kayaker), and if you are satisfied that the athlete performs the test as reliably as a competition, then the typical error of the test is the published variability of top athletes in that sport, as stated in the Malcata and Hopkins (2014) review and in my recent slideshow. Note that in this case the smallest important change is automatically 0.3´ the typical error of the test. You should consider using the smallest important change as calculated in this manner, but make the typical error somewhat larger than the variability in competitions, by a factor of perhaps ~1.3´. Or you can use the value from a reliability study, or if that's not available, let the spreadsheet use the typical error of the estimate from the trend analysis.

If the test scores are counts and proportions of counts (e.g., most performance indicators from the analysis of players' actions in games), the typical error depends on the individual's mean count or proportion, but the estimate of typical error from a reliability study will be an average across all the subjects in the study. It is possible to estimate approximate typical errors for the counts and proportions of the individual by assuming they have Poisson and binomial sampling distributions respectively, but then there is a potential problem of non-randomness of the counts, which usually produces a larger typical error ("overdispersion"). I suggest using the estimate of typical error based on the sampling distribution^{a} until you have enough data to let the spreadsheet use the typical error of the estimate from the trend analysis.