Capers Jones, who has spent most of his career trying to understand the data that we do collect, has this to say: “The software industry lacks standard metric and measurement practices. Almost every software metric has multiple definitions and ambiguous counting rules… The result of metrics problems is a lack of solid empirical data on software costs, effort, schedules, quality, and other tangible matters.” (Strengths and Weaknesses of Software Metrics, 2006)
But how can we get better, or know that we need to get better, or know if we are getting better, if we can’t or don’t measure something? How can we tell if a new approach is helping, or make a business case for more people or a rewrite, or even justify our existing costs in hard times, without proof?
What do you really have to measure?
Different metrics matter to management, the team and the customer. There are metrics that you have to track because of governance or compliance requirements, or because somebody more important than you said so. There are metrics that the team wants to track because they find them useful and want to share. There are stealth metrics that as a manager you would like to track quietly because you think they will give you better insight into the team’s performance and identify problems, but you aren’t sure yet or you don’t want people to adapt and try to game what you are measuring. And most important, there are the measures that the customer actually cares about: are you delivering what they need when they need it, and is the software reliable and usable?
Like most people who manage development shops, I find it much more important to get work done than to collect measurement data. Any data must be easy and cheap to collect and simple to use and understand. If it takes too much work, people won’t be able to keep it up over the long term: they’ll take shortcuts or just stop doing it. The best metrics are the ones that come for free, as part of the work that we are already doing (see http://www.amazon.com/Managing-Design-Factory-Donald-Reinertsen/dp/0684839911). I need a small number of metrics that can do double duty, that can be combined and used in different ways.
In order to compare data over time or between systems and teams, you need a consistent measure of the size of the system: the size of the code base and how this is changing. Counting the Lines of Code, whether it is Source Lines of Code (SLOC) or Effective Lines of Code (ELOC) or Non-Commented Lines of Code (NCLOC), is simple and cheap to measure for a code base and easy to understand (as long as you have a consistent way of dealing with blank lines and other formatting issues), but you can’t use it to compare code written in different languages and on different platforms.
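What a "consistent way of dealing with blank lines" might look like can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_ncloc` is made up, only whole-line `//` comments are recognized, and block comments are ignored entirely. A real counting tool will have its own rules.

```python
def count_ncloc(source: str, comment_prefix: str = "//") -> int:
    """Count lines that are neither blank nor whole-line comments.

    Assumption: a line counts as a comment only if it starts with
    comment_prefix after stripping whitespace. Block comments are not handled.
    """
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(comment_prefix):
            continue  # skip whole-line comments
        count += 1
    return count


sample = """// add two numbers
int add(int a, int b) {

    return a + b;  // a trailing comment still counts as code
}
"""
print(count_ncloc(sample))  # 3
```

Whatever rule you pick matters less than applying the same rule every time you count.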
This is where Function Points come in – a way of standardizing measurements of work for estimation and costing independent of the platform. Unfortunately it’s not easy to understand or measure Function Points – while there are lots of tools that can measure the number of lines of code in a code base and every programmer can see and understand what 100 or 1000 lines of code in a certain language looks like, Function Point calculations need to be done by people who are certified Function Point counters (I’m not kidding), using different techniques depending on which of the 20 different Function Point measurement methods they choose to follow.
Function Points have been around since the late 1970s but this way of measuring hasn’t caught on widely because it isn’t practical – too many programmers and managers don’t understand it and aren’t convinced it is worth the work. See “Function Points are Fantasy Points”.
A rough compromise is to use “backfiring” rules-of-thumb to convert from LOC in each language to Function Points (you’ll need to find a good backfiring conversion table). With backfiring, 1 Function Point is approximately 55 lines of Java (on average), or 58 lines of C#, or 148 lines of C. There are a lot of concerns about creeping inaccuracies using this approach, but for rough comparisons of size between systems in different languages it’s all that we really have for now.
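As a rough sketch, backfiring is just a table lookup and a division. The ratios below are the ones quoted above; the function name `backfire` and the sample code base sizes are made up for illustration, and a real conversion table would cover many more languages.

```python
# Approximate lines of code per Function Point, using the figures quoted above.
LOC_PER_FUNCTION_POINT = {"java": 55, "c#": 58, "c": 148}


def backfire(loc: int, language: str) -> float:
    """Convert a line count to an approximate Function Point count."""
    ratio = LOC_PER_FUNCTION_POINT[language.lower()]
    return loc / ratio


# Hypothetical code bases, only to show the arithmetic.
print(round(backfire(110_000, "Java")))  # 2000
print(round(backfire(148_000, "C")))     # 1000
```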
Where are you spending time and money, and are you doing this responsibly?
Everybody has to track at least some basic cost data: it’s part of running any project or business. What’s interesting is coming up with useful breakdowns of cost data. You want to know how much people are spending on new stuff, on enhancements and other changes, on bug fixing and other rework, on support, on large-scale testing work, on learning, and how much time is being wasted. Whatever buckets you come up with should be coarse-grained: you don’t need detailed, expensive time tracking and other cost data to make management decisions. If everyone is already tracking time for billing purposes or project costing or OPEX/CAPEX financial reporting or some other reason, it’s easy. Otherwise, it’s probably enough to take samples: ask everyone to track time for a while, work with this data, then track time again for a while to see what changes. With a common definition of size across systems (see above), you can compare costs across teams and systems, as well as look at trends over time.
For maintenance teams, Capers Jones recommends that teams track Assignment Scope: the amount of code that one programmer can maintain and support in a year. You can use this for planning purposes (how many people will you need to maintain a system), watch this change over time and compare between teams and systems (based on size again), or even against industry benchmarks. According to Capers Jones’s data, an average programmer should be able to maintain and support around 1,000 Function Points of code per year, depending on the programmer’s experience, the language, quality of code and tools.
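For planning, the arithmetic is simple enough to sketch. This assumes the roughly 1,000 Function Points per programmer per year quoted above as the default; the function name `maintainers_needed` and the 4,500 Function Point system are hypothetical.

```python
import math


def maintainers_needed(size_fp: float, assignment_scope_fp: float = 1000.0) -> int:
    """Estimate how many programmers are needed to maintain a system.

    size_fp: system size in Function Points.
    assignment_scope_fp: Function Points one programmer can maintain per year
    (the default is the rough average quoted above, not a measured value).
    """
    if assignment_scope_fp <= 0:
        raise ValueError("assignment scope must be positive")
    return math.ceil(size_fp / assignment_scope_fp)


print(maintainers_needed(4500))         # 5
print(maintainers_needed(4500, 750.0))  # 6, for a team with a smaller scope
```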
Everybody wants software delivered as fast as possible. Agile teams measure Velocity to see how fast they are delivering at any point in time, but the problem with velocity, as Mike Cohn and others have explained in detail, is that it can’t be compared across teams, or even within the same team over a long period of time, because the definition of how much is being delivered, in Story Points, isn’t stable or standardized – Story Points mean different things to different teams, and even different things for the same team over a long enough period of time (as people enter and leave the team and as they work on different kinds of problems). Velocity is only useful as a short-term/medium-term predictor of one team’s delivery speed.
Teams that are continuously delivering to production can measure speed by the number of changes they are making, and how big the changes are (we’re back to measuring size again). Etsy measures the frequency of changes and the size of the change sets, and correlates this with production incident data (frequency and impact and time to recover) to help understand whether they are changing too much or too quickly to be safe.
Probably the most useful measure of speed comes from Lean and Kanban. To determine if you are keeping up with customer demand, measure the turnaround or cycle time on important changes and fixes: how long it takes from when the customer finds a problem or asks for a change to when they get what they asked for. Turnaround is simple to measure and obvious to the business and to the team, especially if you use something like Cumulative Flow Diagrams to make cycle time visible. It’s easy to see if you are keeping up or falling behind, if you are speeding up or slowing down. Focusing on turnaround ties you directly to what is important to the business and your customers.
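If your tracking system records when a change was requested and when it was delivered, turnaround falls out almost for free. The sketch below assumes those two dates are available; the function name `cycle_times_days` and the sample dates are invented for illustration.

```python
from datetime import date
from statistics import median


def cycle_times_days(items):
    """Return the turnaround in days for each (requested, delivered) pair."""
    return [(delivered - requested).days for requested, delivered in items]


# Made-up sample data: (date requested, date delivered).
changes = [
    (date(2024, 3, 1), date(2024, 3, 4)),
    (date(2024, 3, 2), date(2024, 3, 12)),
    (date(2024, 3, 5), date(2024, 3, 10)),
]

days = cycle_times_days(changes)
print(days)          # [3, 10, 5]
print(median(days))  # 5
```

The median is used here because a single slow change can distort an average, but that is a design choice rather than a rule.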
Reliability and Quality
But speed isn’t everything – there’s no point in delivering as fast as possible if what you are delivering is crap. You also need some basic measures of quality and reliability.
For online businesses, Uptime is the most common and most useful measurement of reliability: track the number of operational problems, the mean time between operational failures (MTBF), and the mean time to recover from each failure (MTTR). Basic Ops stuff.
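These numbers can be derived from nothing more than a list of outage durations. The sketch below assumes a fixed observation window measured in hours; the function name `reliability_summary` and the sample outages are hypothetical.

```python
def reliability_summary(window_hours: float, outage_hours: list) -> dict:
    """Summarize availability, time to recover and time between failures.

    window_hours: length of the observation period.
    outage_hours: duration of each outage within that period.
    """
    failures = len(outage_hours)
    downtime = sum(outage_hours)
    uptime = window_hours - downtime
    return {
        "failures": failures,
        "availability": uptime / window_hours,
        # Mean time to recover: average outage length.
        "mean_time_to_recover_hours": downtime / failures if failures else 0.0,
        # Mean time between failures: uptime divided by the number of failures.
        "mean_time_between_failures_hours": (
            uptime / failures if failures else float("inf")
        ),
    }


# Made-up example: a 30-day month (720 hours) with three outages.
summary = reliability_summary(720.0, [0.5, 1.5, 1.0])
print(summary)
```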
And you need some kind of defect data, which come free if you are using a bug tracking system like Bugzilla or Jira. Identify areas in the code that have the most bugs, measure how long it takes to fix bugs and how many bugs the team can fix each month. Over the longer term, track the bug opened/closed ratio to see if there are more bugs being found than normal, if the team is falling behind on fixing bugs, if they need to slow down and review and fix code rather than attempting to deliver features that aren’t actually ready.
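A minimal sketch of the opened/closed ratio, assuming you can export monthly counts from your bug tracker. The function name `opened_closed_ratio` and the monthly figures are invented, and a ratio above 1.0 is treated as "falling behind" purely for illustration.

```python
def opened_closed_ratio(opened: int, closed: int) -> float:
    """Ratio of bugs opened to bugs closed in a period."""
    if closed == 0:
        return float("inf") if opened else 0.0
    return opened / closed


# Made-up monthly counts: month -> (bugs opened, bugs closed).
monthly = {"Jan": (40, 42), "Feb": (55, 44), "Mar": (60, 40)}

falling_behind = [
    month
    for month, (opened, closed) in monthly.items()
    if opened_closed_ratio(opened, closed) > 1.0
]
print(falling_behind)  # ['Feb', 'Mar']
```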
To compare between teams or systems or to watch trends develop over time, one of the key metrics is defect density: the number of bugs per KLOC (or whatever measure of code size you are using – back to the size measurement again). With enough data, you can even try to use defect density to help determine when code is really ready to be shipped.
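Defect density is a one-line calculation once you have a bug count and a size measure. The sketch below assumes size is measured in lines of code; the function name `defect_density` and the two sample systems are hypothetical.

```python
def defect_density(bugs: int, loc: int) -> float:
    """Bugs per thousand lines of code (KLOC)."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return bugs / (loc / 1000)


# Made-up systems: name -> (bug count, lines of code).
systems = {"billing": (120, 80_000), "reporting": (45, 15_000)}

for name, (bugs, loc) in systems.items():
    print(name, defect_density(bugs, loc))
# billing 1.5
# reporting 3.0
```

In this made-up case the smaller system has fewer bugs in absolute terms but twice the density, which is exactly the kind of comparison that raw bug counts hide.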
How Healthy is the Code?
You also want to measure the health of the code base, how much technical debt you are taking on, using code analysis tools. Code coverage is a useful place to start, especially if you are putting in automated testing, or even if you just want to measure the effectiveness of your manual testing. Tools like EMMA and Clover instrument the code and trace execution as you are testing to give you a picture of what code was covered by testing, and more important, what code wasn’t. It’s easy to drill down and see where the holes in your testing are, and to build trends from this data, to set code coverage targets and watch out for people cutting back on testing under pressure.
Measuring Code Complexity, identifying your most complex code and watching to see if complexity is increasing over time helps you make decisions about where to focus your reviews and testing and refactoring work. If you are measuring both test coverage and complexity, you can correlate the data together to identify high risk code: complex routines that are poorly tested. Code health can also be measured by static analysis checkers like Findbugs and PMD or Coverity or Klocwork – tools that look for coding mistakes and security vulnerabilities and violations of good coding practices. Higher-level code quality management tools like Sonar and CodeExcellence, which consolidate and correlate data from multiple static analysis tools, give you an overall summary picture of the health of all of your code, and let you look at code health over time and across teams and systems.
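Correlating coverage with complexity, as described above, can be as simple as a filter over per-routine data exported from your tools. This sketch assumes you have a complexity score (for example cyclomatic complexity) and a coverage fraction for each routine. The class `RoutineMetrics`, the function `high_risk`, the thresholds and the sample routines are all assumptions for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class RoutineMetrics:
    name: str
    complexity: int  # complexity score reported by an analysis tool
    coverage: float  # fraction of the routine covered by tests, 0.0 to 1.0


def high_risk(routines, min_complexity=10, max_coverage=0.5):
    """Return complex, poorly tested routines, most complex first.

    The thresholds are arbitrary defaults; tune them to your own code base.
    """
    risky = [
        r for r in routines
        if r.complexity >= min_complexity and r.coverage < max_coverage
    ]
    return sorted(risky, key=lambda r: r.complexity, reverse=True)


# Made-up sample data.
routines = [
    RoutineMetrics("parse_order", 25, 0.30),
    RoutineMetrics("format_date", 3, 0.10),
    RoutineMetrics("settle_trades", 40, 0.45),
    RoutineMetrics("validate_user", 18, 0.90),
]

print([r.name for r in high_risk(routines)])  # ['settle_trades', 'parse_order']
```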
Using Metrics in Ways that aren’t Evil
Many programmers think that metrics are useless or evil, especially if they are used by management to evaluate and compare programmer productivity. IBM, for example, apparently uses static measurement data to scorecard application developers and teams in an attempt to quantify programmer performance. Using this kind of data to rate developers for HR purposes can easily lead to abuse and reinforce the wrong behavior. If you are measuring “productivity” by size and simple code quality measures, no programmers will want to take on hard problems or step up to maintain gnarly legacy code because their scores will always look bad. Writing lots of simple code to make a tool happy doesn’t make someone a better programmer.
Only measure things to identify potential problems and negative trends, strong points and weak points in the code and your development process. If you want to set goals for the team to improve speed or cost of quality, let them decide what to measure and how to do it. And measure whatever you need to build a business case – to prove to others that the team is moving forward and doing a good job, or to make a case for change. Make metrics work for the team. You can do all of this without adding much to the cost of development, and without being evil.