Velocity Variance Pseudoscience

Many software development teams are doing estimation in story points using a technique often referred to as relative estimation. In my article Estimating the Value of Story Points, I pointed out that estimating in story points fails to correlate to cycle time outcomes, meaning that story point estimates are measures of effort, not cycle time. I asked the question, “What is the value of estimating effort if it doesn’t predict delivery outcomes?” I also questioned the value of relative estimation itself. What is the virtue of comparing two things in size and labeling each of them in comparison to one another?

In this article, I will discuss a common defense of story point estimation called Velocity Variance. Below is a column chart showing the velocity of a software delivery team between January and the end of June of a year. Each item is represented as a colored block, and the number on each block is the story point estimate for that work item. The total story points delivered each sprint appears at the top of the column.

This chart appears, on its surface, to be data-driven and important. We can see how many things we delivered during each two-week period (throughput), and we can see the size of each item.

Or can we?

The size numbers are estimates that the team made before performing the work. Those numbers are not the actual sizes of the work completed. They are guesses that the team made about size. Each size assigned to a work item as an estimate has only some probability of matching the actual size, and in software development that probability is generally quite low. This chart is not historical data about sizes; it is historical data about estimates.

The variance is mostly estimation error. That is the argument for attempting velocity variance as an analysis: by measuring the variance in estimates over time, we can supposedly demonstrate that the team is becoming more predictable as they improve. The claim is that the team learns to make better estimates, and as the columns zero in on a consistent number of story points delivered each sprint, this serves as evidence that the estimation process is improving.

Unfortunately, the stabilization seen on the right of the chart does not demonstrate that. It demonstrates that the team has discovered the number to always choose for their total estimate, not that they are improving the system that they are managing. There is also a tendency for the team to converge on a politically acceptable number, not a statistically meaningful one.

There is no evidence here that estimates are improving or stabilizing to provide predictability. Repeated experience with estimates in software development shows that they have a low correlation coefficient with actual outcomes. Over my decades in software development management, many thousands of projects have been managed by me or within my organization. I have observed no correlation between estimates and actual outcomes. In multiple exercises comparing estimates to costs, days to deliver, and variance from scheduled completion, projects only come near their original estimates when a large number of change requests are repeatedly processed to revise scope, schedule, and cost as the delivery date approaches. It is why software development adopted iterations: to make timelines short enough that poor guess accuracy could be mitigated.

No software development project manager would accept a $1,000 bet that they could deliver a multi-month project on time, on scope, and on budget as agreed before work began without any change requests allowed. They would be foolish to do so.

Since estimates correlate poorly to actuals, totaling up the estimates and analyzing them for accuracy without an outcome measure is an expensive activity with no analytical value.

If we remove the story point sizes from the estimates, we get the chart below.

Without the sizes on each of the work items, we are now looking at observable facts: the number of things completed each sprint. Throughput is factual, historical data. It is useful because it is one of the variables in Little's Law, which describes how work-in-process, cycle time, and throughput relate to one another in any process. Throughput is also the input to Monte Carlo simulation, which predicts the probability of completing a number of items by a range of dates. Monte Carlo simulation is well established in queueing theory and reliability engineering and is used across many industries to forecast system performance from historical data.
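The Monte Carlo approach described above can be sketched in a few lines: resample past sprint throughput many times, simulate how long each resampled future takes to burn down a backlog, and read a percentile off the results. The throughput history and backlog size here are hypothetical numbers for illustration.

```python
import random

# Hypothetical historical throughput: items completed in each past sprint.
historical_throughput = [7, 9, 6, 8, 10, 7, 8, 9, 7, 8]

def monte_carlo_forecast(throughputs, backlog_size, trials=10_000):
    """Simulate many possible futures by resampling past throughput and
    count how many sprints each simulated future needs to finish the backlog."""
    sprints_needed = []
    for _ in range(trials):
        remaining, sprints = backlog_size, 0
        while remaining > 0:
            remaining -= random.choice(throughputs)  # resample one past sprint
            sprints += 1
        sprints_needed.append(sprints)
    return sorted(sprints_needed)

results = monte_carlo_forecast(historical_throughput, backlog_size=40)
# The 85th percentile answers: "with 85% confidence, how many sprints?"
p85 = results[int(0.85 * len(results)) - 1]
```

Note that nothing in this forecast requires a size estimate on any item; it runs entirely on the count of items finished per sprint.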

What does this mean? It means that velocity is throughput with a noisy multiplier. Since the sizes were guesses, they are not measurements. They cannot be fed into statistical process control as data and used to forecast future events.
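The "noisy multiplier" claim can be demonstrated with a small simulation. Assume, hypothetically, a team with a perfectly stable process that finishes exactly 8 items every sprint, but whose size guesses vary across the usual Fibonacci values. The velocity numbers still bounce around, even though the underlying process never changed:

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical: throughput is perfectly constant at 8 items per sprint.
ITEMS_PER_SPRINT = 8
fibonacci_points = [1, 2, 3, 5, 8]

velocities = []
for _ in range(12):  # twelve sprints
    # Each completed item carries a guessed size; the guess, not the work, varies.
    velocities.append(sum(random.choice(fibonacci_points)
                          for _ in range(ITEMS_PER_SPRINT)))

# Throughput variance is zero by construction, yet velocity still varies:
# all of the spread comes from the estimate multiplier, not the process.
velocity_variance = statistics.variance(velocities)
```

Analyzing `velocity_variance` here tells you nothing about the delivery system, because the system had no variance at all.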

Champions of velocity attempting to defend relative estimation will argue that it is the variance between the estimates that is important, because it helps a team learn to estimate better by comparing the total estimates completed each sprint. They will say that by measuring the variance in estimates from sprint to sprint, the team can slowly zero in and improve their estimation to become predictable.

They are attempting to apply statistical process control to story points as if they are a historical measure such as dollars spent, hours consumed, days to complete (cycle time), or number of items completed per unit of time (throughput). But story points are not a measurement of something that happened. They were a guess.

To improve the team’s ability to forecast delivery, the points would need to be removed and a cycle time measure added to the throughput chart instead. That chart would look like this:

This chart provides predictability. We can now see historical data about how long it took to deliver each item. Rather than using guesses and treating them as though the guesses were all correct, we now have a collection of measurement data to which we can apply statistical analysis. This chart tells us that historically 95% of work items completed in 19 days or less.
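A percentile statement like the one above falls directly out of the raw cycle time data. Here is a minimal sketch using hypothetical cycle times in days; `statistics.quantiles` with `n=20` yields the 5th through 95th percentile cut points, the last of which is the 95th percentile:

```python
import statistics

# Hypothetical cycle times (days) for twenty completed work items.
cycle_times = [3, 5, 2, 8, 12, 4, 6, 19, 7, 3,
               9, 5, 11, 4, 6, 2, 14, 5, 8, 10]

# n=20 produces nineteen cut points; the last is the 95th percentile.
p95 = statistics.quantiles(cycle_times, n=20)[-1]

# The forecast reads as a service-level expectation:
# "95% of items finish in p95 days or less."
```

Unlike story points, every input here is a measurement of something that actually happened, so the percentile is a legitimate statistical statement about the process.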

What to take away from this:

  • Estimates are not measurements.
  • Variance in guesses is not process variance.
  • Estimates are politically influenced.
  • Measure cycle time and throughput instead.

Over time, estimates are not becoming more accurate. They are becoming more acceptable. Measure cycle time, throughput, and WIP. Estimating in story points, gathering velocity, and attempting analysis on the velocity to validate assumptions is circular reasoning. Velocity variance shows us only one thing: estimates are becoming more politically acceptable over time.

A final thought: What is the virtue of using a pseudoscientific system that looks like math and statistical analysis to improve estimation, when the same goal can be accomplished by simply adopting short iterations to reduce the scale of any variance between estimates and actuals and improve precision? Why spend money and time assigning points, building charts around points, and then analyzing the patterns in the charts when the team can simply be asked, “Can you get this done in 2-3 days? No? How can we make it smaller?”