How to Estimate Software Delivery Timelines with AI

Written by: Michael Neale
7 min read

This article is the first in a series of blog posts discussing the future of engineering management. These posts will demonstrate how new technical capabilities will enable engineering leaders to leverage data to improve team performance, optimize engineering resources and more accurately predict software delivery timelines.


Software project timelines have a reputation for being unpredictable, often undeservedly so. Some folks like to draw analogies with construction projects and ask, “why can’t building software be like that?” (If you look closely at construction projects, you will see massive time and cost blowouts and failures, but that doesn’t stop people from making the comparison.) The comparison likely persists because in construction, external stakeholders can see physical progress, which is not the case for the software development process.

“Estimation” is often a dirty word in software circles. Fundamentally, it is hard to estimate the delivery timing of “something that hasn’t been done before” (which, in theory, describes all software development). If it has been done before, then it is a copy/paste, or a library or API you reuse. However, we can make delivery more predictable by relying on data from our engineering teams when making decisions.

By mining our accumulated historical data, we can train machine learning models and use them to make AI-powered predictions about when something may be done and what the outcome is likely to be. Consider a simple example:
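To make this concrete, here is a minimal sketch of the idea using scikit-learn. This is not the actual CloudBees model - the feature names and numbers are invented purely for illustration:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical historical data: one row per completed issue/pull request.
history = pd.DataFrame({
    "files_changed": [1, 3, 12, 2, 25, 4],
    "comment_count": [0, 2, 15, 1, 30, 3],
    "linked_issues": [0, 1, 3, 0, 4, 1],
    "duration_days": [0.5, 2, 14, 1, 30, 3],  # the target: time to done
})

# Train on everything that has already been finished...
model = GradientBoostingRegressor(random_state=0)
model.fit(history.drop(columns="duration_days"), history["duration_days"])

# ...then predict how long a newly opened change is likely to take.
new_work = pd.DataFrame(
    {"files_changed": [8], "comment_count": [5], "linked_issues": [2]}
)
print(model.predict(new_work))  # predicted duration, in days
```

In practice you would retrain on far more history and far richer features, but the shape of the problem is the same: features in, predicted duration out.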

Before going further, it is worth noting that this historical data comes from a GraphQL-powered System of Record data store, built to collect and normalize data from many sources as part of CloudBees Software Delivery Management.

Combining all these data sources in a graph database means we can query for correlations and relationships that would otherwise be hidden, track metrics and create reports on this data, gaining visibility into and across the software delivery process. 

It means that we can extract data that cuts across issues, repos, pipelines, comments, pull requests, commits, and products - and turn it all into a whole lot of features, which we can then use to predict a “target” value (e.g., when will it be done!). GraphQL is especially helpful here, as it lets us follow connections that might otherwise go unexplored. For example, a GitHub pull request belongs to a repository that is used in three separate “products.” Those products also have other repositories, pipelines, and collections of in-progress issues, and of course a whole lot of build and deployment history.
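As a rough sketch of what such a traversal might look like from Python - the endpoint, token, and schema field names below are hypothetical, not the real Software Delivery Management API:

```python
import requests

# A GraphQL query that follows a pull request out to the products it
# affects. All field names here are illustrative assumptions.
QUERY = """
{
  pullRequest(id: "PR-123") {
    title
    repository {
      name
      products {
        name
        issues(state: IN_PROGRESS) { title }
        pipelines { lastBuildStatus }
      }
    }
  }
}
"""

response = requests.post(
    "https://example.com/graphql",  # hypothetical endpoint
    json={"query": QUERY},
    headers={"Authorization": "Bearer <token>"},
)
print(response.json())
```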

What this means is that we use past successful and unsuccessful changes to continuously train models; we can then use those models to predict the likely outcomes of current or upcoming work.

Unsuccessful changes (issues, pull requests) are useful data points as well. The system of record isn’t an exhaustive record of everything people work on - it is only a subset of what goes on - so we need to know both the wins and the fails to make reasonable predictions.

Estimating how long things will take 

In the example above, there are a few changes that are predicted to take a while (this is duration, measured in days) but were opened recently, with a small number of files changed. This is useful to know: if you tackle these early, you can cut down the total time the project takes, or stop it blowing out because of what *may* look like a small change.
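A small sketch of how you might surface these, assuming a table of model predictions (the data and thresholds are made up):

```python
import pandas as pd

# Hypothetical model output: one row per open change.
preds = pd.DataFrame({
    "id": ["PR-7", "PR-9", "PR-12"],
    "age_days": [1, 20, 2],
    "files_changed": [2, 40, 3],
    "predicted_days": [18, 25, 1],
})

# Recently opened, small in scope, but predicted to drag on - worth a
# look before a "small" change quietly blows out the schedule.
risky = preds[(preds.age_days <= 3)
              & (preds.files_changed <= 5)
              & (preds.predicted_days >= 10)]
print(risky)  # -> PR-7
```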

We also focus on the duration of tasks, not the “effort.” The two may be related, but in a distributed world it is often the case that people work on many tasks at once. When a change is blocked on a review, you can move on to another task - that is normal and perfectly fine. What we do need to be mindful of is the total time it takes to complete a task, which is sometimes called “cycle time.”

We can also predict whether an item is likely to be resolved soon or whether more work is required. For example:
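As an illustrative sketch (not the production system), an outcome classifier trained on past pull requests might look like this; the features and data are invented:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical history: features of past pull requests and whether they
# ended in MERGE or CLOSE (rejection).
past = pd.DataFrame({
    "review_comments": [0, 12, 2, 25, 1],
    "days_open": [1, 30, 2, 45, 1],
    "ci_failures": [0, 4, 0, 7, 1],
    "outcome": ["MERGE", "CLOSE", "MERGE", "CLOSE", "MERGE"],
})

clf = RandomForestClassifier(random_state=0)
clf.fit(past.drop(columns="outcome"), past["outcome"])

# Score a currently open pull request.
open_pr = pd.DataFrame(
    {"review_comments": [10], "days_open": [20], "ci_failures": [3]}
)
print(clf.predict(open_pr))        # e.g. ['CLOSE'] - heading for rejection
print(clf.predict_proba(open_pr))  # probabilities for each outcome
```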

Knowing the predicted outcome of a deliverable is useful when deciding what work to prioritise. An important issue that looks like it is heading for a CLOSE (rejection) may be worth your attention. Something marked as MERGE may be a quick win. (We always end up with a backlog of things, and it’s easy to miss the quick wins.) As they say in The Lion King, “There is more to do than can ever be done.”

Combine all of this into a “tell me what to look at next” view, and you can save on standup time where everyone waits their turn to repeat the things they already typed into the computer.
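One toy way to build such a ranking, assuming you already have duration and outcome predictions (the scoring formula and weights are entirely made up):

```python
import pandas as pd

items = pd.DataFrame({
    "id": ["ISSUE-4", "PR-7", "PR-9"],
    "predicted_days": [1, 18, 25],
    "p_close": [0.1, 0.7, 0.4],  # predicted chance of rejection
    "priority": [3, 5, 2],       # business priority, higher = more important
})

# Surface important items that are either heading off the rails (likely
# to be rejected) or quick wins (nearly done).
items["score"] = items.priority * (items.p_close + 1 / (1 + items.predicted_days))
print(items.sort_values("score", ascending=False))
```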


AI can tell me what to work on next 

We can also identify “cycle time bombs” as noted below:

Some of these have already gone off (i.e., they were opened a while ago, and it is too late). In that case, perhaps you can learn from them and see what could be done to prevent similar instances. In other cases, you can jump on them ahead of time and stop them from going off.
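A minimal sketch of spotting both kinds, assuming predicted durations from the model and a made-up threshold for “too long”:

```python
import pandas as pd

# Hypothetical open items with predicted total cycle time (days).
open_items = pd.DataFrame({
    "id": ["PR-3", "PR-8", "ISSUE-2"],
    "age_days": [40, 5, 2],
    "predicted_days": [45, 30, 6],
})

THRESHOLD_DAYS = 21  # invented cut-off for "too long"

bombs = open_items[open_items.predicted_days > THRESHOLD_DAYS].copy()
bombs["already_gone_off"] = bombs.age_days > THRESHOLD_DAYS
print(bombs)
# PR-3 has already gone off (a retrospective lesson); PR-8 is still
# ticking and can be defused ahead of time.
```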

This is all known as predictive analytics, and it can help free up the otherwise boring time spent in standup meetings. (You can still have meetings if you want - talk about COVID or the weather, or ask for help solving an issue instead of staring at tickets.)

Presenting estimates as a histogram gives us an idea of the task breakdown: 
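For example, with matplotlib (the predicted durations below are invented):

```python
import matplotlib.pyplot as plt

# Hypothetical predicted durations (days) for currently open items.
predicted_days = [0, 0, 0, 0.5, 0.5, 1, 1, 2, 3, 5, 8, 14]

plt.hist(predicted_days, bins=range(0, 16))
plt.xlabel("Predicted duration (days)")
plt.ylabel("Number of open items")
plt.title("Estimated task durations")
plt.show()
```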

Happily, in the above case, we have a lot of items in the “0” bucket, meaning many of them are estimated to take less than a day. This is usually a good indicator of continuous delivery practices. It is ok to have some larger items spread out to the right, but if you see too many, you may have an issue and need to break work down into smaller chunks. (Conversely, you may realise you have the capacity to take on some bigger items from the backlog if you like.)

There is a lot more information latent in the system: comments, titles, and descriptions. This text can be measured for mood (emotional content and strength). Whilst the actual values at a given point in time do not matter much, a change over time is interesting: a sudden shift down may indicate a problem or frustration.
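As a sketch, an off-the-shelf analyzer such as NLTK’s VADER can score the emotional content of comments and let you watch for changes over time (the comments below are invented):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download
sia = SentimentIntensityAnalyzer()

comments_by_week = {
    "week 1": ["Looks great, nice work!", "LGTM, merging."],
    "week 2": ["Still broken after the fix.", "Why does CI keep failing?!"],
}

for week, comments in comments_by_week.items():
    scores = [sia.polarity_scores(c)["compound"] for c in comments]
    print(week, sum(scores) / len(scores))
# The absolute scores matter less than the drop from week 1 to week 2.
```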

All of this is powered by the System of Record in CloudBees Software Delivery Management, which does the heavy lifting of importing and cleaning the data and then exposing it via a GraphQL API. I have found that data this clean is very useful for training deep learning models. Writing queries to extract insights is valuable, but looking back at the historical truth of how things actually worked provides the input needed to train deep neural networks. Some have called this “Software 2.0”: instead of hand-coding rules and queries, we use historical precedent to train (in this case) millions of parameters to do it for you (and of course, the models retrain regularly, learning as you ship code).


Sign up for a one-on-one demo if you want to see how CloudBees Engineering Efficiency unlocks engineering productivity data to give you the insights to keep your teams focused on delivering value quickly and predictably. Please make sure you mention this blog post.
