How to shorten your change lead time and improve your change failure rate
According to the widely referenced book Accelerate, change lead time and change failure rate are two of the most important KPIs for assessing the efficiency of your software development lifecycle and the effectiveness of your DevOps initiatives. (Other useful metrics include ‘time to restore service’ and ‘deployment frequency,’ which I’ll cover in a different article.)
But what do these metrics actually mean, why are they useful, and how are they measured? Let’s discuss:
Change lead time
Lead time is the length of time between first logging an issue and resolving that issue. It is made up of two parts: the time an issue waits before anyone starts working on it, and cycle time, which is the time the issue takes to resolve from the moment work has actually started on it.
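To make the relationship concrete, here is a minimal Python sketch of how the three durations relate. The field names (created, started, resolved) and the sample timestamps are made up for illustration and don't come from any particular issue tracker.

```python
from datetime import datetime, timedelta


def wait_time(created: datetime, started: datetime) -> timedelta:
    """Time the issue sat in the queue before work began."""
    return started - created


def cycle_time(started: datetime, resolved: datetime) -> timedelta:
    """Time from the start of work to resolution."""
    return resolved - started


def lead_time(created: datetime, resolved: datetime) -> timedelta:
    """Time from first logging the issue to resolving it."""
    return resolved - created


# Hypothetical issue: logged on a Friday, picked up on Monday, resolved on Wednesday.
issue = {
    "created": datetime(2024, 3, 1, 9, 0),
    "started": datetime(2024, 3, 4, 9, 0),
    "resolved": datetime(2024, 3, 6, 15, 0),
}

waited = wait_time(issue["created"], issue["started"])
cycle = cycle_time(issue["started"], issue["resolved"])
lead = lead_time(issue["created"], issue["resolved"])

print(f"wait: {waited}, cycle: {cycle}, lead: {lead}")
```

In this example the issue waits three days and takes two days and six hours to resolve, so the lead time is five days and six hours. A long wait with a short cycle time usually points at the queue rather than the work itself.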
Shorter lead times are better; elite performers have lead times of less than one hour, according to the 2018 DORA report. That might be ambitious, so at the very least, your average lead time should be short enough to fit in a sprint.
Why does change lead time matter?
It provides an important way to track whether cycle time is efficient enough to handle an influx of requests; if requests keep coming in and lead time increases significantly, this is a sign that something needs to be done to get it back under control. Otherwise, your team might quickly be buried under a pile of requests they can’t dig out of, leading to long waits for issues to be resolved or damage to the user experience.
It can also be used to estimate how quickly a request can be fulfilled when there are other jobs in the backlog, and how long the issue will be in the queue.
How can change lead times be shortened?
Typically, change lead times are extended by having too many manual processes or dedicating too few resources to completing requests. One way to bring lead times down is introducing more automation, if possible, but anything that increases the efficiency and speed at which the task can be completed will improve lead time. The efficiency of development can also be increased by having more regression unit tests, so any regressions introduced by code changes can be identified as early as possible.
Change failure rate
Change failure rate is the share of deployments in which something goes wrong: the number of failed deployments divided by the total number of deployments in a given period of time. ‘Failure’ includes having to do anything more than simply running the script to start deployment. Ideally, your change failure rate will be low. The 2018 DORA report found that elite DevOps performers had a change failure rate between 0% and 15%.
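As a rough illustration, the calculation could look like the Python sketch below. The function name and the sample deployment log are assumptions for the example; each entry is simply True if that deployment needed anything beyond running the deployment script.

```python
def change_failure_rate(deployment_failed: list[bool]) -> float:
    """Return failed deployments as a fraction of all deployments in the period."""
    if not deployment_failed:
        raise ValueError("no deployments recorded for this period")
    # True counts as 1 and False as 0, so sum() gives the number of failures.
    return sum(deployment_failed) / len(deployment_failed)


# Hypothetical log for one period: 20 deployments, 3 of which needed intervention.
period_log = [False] * 17 + [True] * 3

rate = change_failure_rate(period_log)
print(f"change failure rate: {rate:.0%}")
```

With 3 failures out of 20 deployments the rate is 15%, which sits right at the upper edge of the range reported for elite performers.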
What is change failure rate used for?
The point of change failure rate is to tell you how effective your deployment process is. To adhere to DevOps best practices, this process should be completely automated, consistent, and reliable.
Teams without established processes can still have relatively painless deployments: a small change might need to be made after deployment has started, and it might be quick and easy to do. But every one of these small changes still counts as a failure, so without automated processes, the change failure rate will likely stay high.
How can it be improved?
Once you’ve nailed the automation of your deployment process, change failure rate should quickly fall.
Measuring both of these KPIs
The first step to improving these KPIs is to find out where you stand. The control chart in Jira automatically displays lead time and cycle time, and change failure rate can just be tracked by noting any failures at the end of each deployment. By measuring your current performance, you’ll have a benchmark to track your progress against; the goal is to have continuous improvement, even if it’s done in small steps.