August 2009 – Three Kinds of Measurement and Two Ways to Use Them

This article was originally featured in the July/August 2009 issue of Better Software magazine. Read the entire issue or become a subscriber.

People often quote Lord Kelvin: “I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.”But, few note the sentence that precedes the passage: “In physical science the first essential step in the direction of learning any subject is to find principles of numerical reckoning and practicable methods for measuring some quality connected with it.” The missing sentence prompts some questions: Are software development and testing sciences subject to the same kind of numerical measurement that we use in physics? If not, what kinds of measurements should we use? How could we think more usefully about measurement?

Gerald M. (Jerry) Weinberg suggests thinking in terms of three broad categories. First-order measurement, he says, is what we need to get started–“just adequate to the task of getting something built.” First-order measurement tends to be qualitative, fast, and inexpensive; it generally doesn’t require mechanisms or devices to enhance or extend the observation. In a recent conversation, Jerry told me that first-order measurements “are unobtrusive, or minimally obtrusive, and can be used without a whole lot of fuss. They help give you a lot of important information that can lead to other information or, in the best case, to immediate action if needed.”

First-order measurement is what we’re doing most of the time as we’re driving a car. We look through the windows, listen to the engine, and feel the acceleration and deceleration. We make observations and comparisons without getting hung up on quantification. “The road is dry. It’s cloudy. There’s traffic on the right and a car up ahead with its brake lights on.” First-order measurement suggests answers to the questions What seems to be happening? and What should I do now? In this situation, if you feel like you’re driving too fast, you probably are driving too fast. If so, first-order measurement is enough to inform an immediate and appropriate action: slow down.

Because it’s based on ongoing experience and feelings, rather than on careful experiments and controlled data intake, wise use of first-order measurement requires us to consider a number of possible interpretations of the meaning and significance of what we see. Suppose you feel like you’re driving fast, but not too fast. Now you observe a set of red and blue lights on the top of the car ahead. The extra data suddenly prompts you to realize that you’re uncertain about your relationship to the speed limit. The situation and first-order measurement prompt a different response in the form of questions: What else do I need to know? and Where should I look? At this point, you move into second-order measurement and refer to the speedometer.

Second-order measurement, says Jerry, is the kind of measurement that engineers use to tune relatively stable systems, making them cheaper, stronger, lighter, more reliable, faster–or slower, if that’s what’s desired. Second-order measurement focuses on questions like What’s really happening? and How is it changing? tending to be more quantitative, subject to more refined models, and generally busier than first-order measurement. It is often assisted by external instruments to supplement or refine direct sensory intake. In particular, metrics–mathematical functions that relate objects or events to numbers via a model–are second-order measurements.

Back in the car, second-order measurement is the kind of information that you obtain from looking at the dashboard. You note that your speed is forty-three miles per hour and that the posted limit is thirty-five miles per hour. Your quantitative, second-order measurement tells you that you’re above the legal limit. The apprehensive feeling in your gut, triggered by the combination of police car and the second-order measurement, informs a decision to slow down.

What of third-order measurement? That, says Jerry, is the kind of precise, highly quantitative measurement that supports the physicist’s search for new natural laws. It helps us answer the question What happens? in a universal and general sense. But third-order measurement can be precise only because it tends to be about very simple systems (such as two interacting masses) or very simple models of complex systems (in which we choose to ignore many dimensions of the system, but analyze a very small number of dimensions very thoroughly). Perhaps most significantly, third-order measurement emphasizes and depends upon keeping messy human traits–variability, subjectivity, and values–out of the way. As noted in an important paper by Cem Kaner and Walter P. Bond, using metrics and higher-order measurement wisely depends on construct validity–critical rigor in evaluating the models and the functions that form the basis for the measurement.

In Rapid Testing, we define a control metric as any metric that drives a decision. Some development groups standardize the decision to ship the product when it contains a low-enough threshold number of high-severity bugs. Others consider a program adequately tested if there’s one positive and one negative test per “requirement” (meaning per line in a requirements document). Still others deem a test group “successful” if there is a low-enough percentage of rejected bug reports. By contrast, an inquiry metric is one that prompts a question: We have three open high-severity bugs–What’s the story there? Jim and Mark are two days behind where we thought they’d be–Do they need help? The program managers are deferring a lot of problem reports–Are the problems insignificant, or do we need more training because we don’t understand the product?

One of my recent clients rated the quality of its products and customer satisfaction with a basket of five second-order metrics. Each measurement collapsed months of work and tons of data into a single number. “Better” numbers earned praise; “worse” numbers earned a reprimand, so management meetings dragged on while people tried to explain changes from last month’s numbers–especially when things had gotten worse. At this company, schedules frequently slipped and shipments were often delayed. Yet when I asked testers the simple question: What slows you down? I got a wealth of information. They told me about broken and buggy builds, inadequate test environments, excessive emphasis on scripts that were out of date by the time the product arrived, and a lack of information about real customer needs. They also said they were wasting time collecting data that wasn’t being used to help speed up development or testing, and they offered dozens of ways in which the numbers could be gamed.

A different client, also working on one-year project cycles, focused on questions like: What happened this week? What did we get done? What problems did we run into? Managers used personal contact–direct observation of and conversation with people–as their primary approach to assessing the project’s status. They took a good number of quantitative measurements, but used them only as indicators to refine their initial assessments and to inform new first-order questions. The team made rough long-term estimates and more precise short-term estimates, dividing two-week cycles into tasks of two days or less, with clear deliverables that signaled completion. When tasks weren’t finished in the estimated time, no one was punished; instead, everyone considered what he hadn’t understood earlier, what he had learned, and what might inform a better estimate next time. Team members didn’t collect metrics on things that weren’t immediately interesting and important to them. They were interested in understanding the situation and optimizing the quality of the work, not in the appearances afforded by the metrics. They emphasized the game and the season over the box scores. And they consistently shipped high-quality products on time.

They did use one–and only one–control metric. When the amount of open problems exceeded a certain number, they stopped working on new features and fixed problems until the list was comprehensible and manageable again.

Jerry observes that in software engineering we seem obsessed with higher-order measurements. Why? He suggests that decisions about quality are political and emotional, based on discussions and decisions about whose values count and how much they count relative to one another. Such issues are often distasteful to people who want to appear rational and “scientific,” so we try to avoid those issues with appeals to higher-order measurement.

Each new software project involves a human context–interaction between different sets of clients, developers, tasks, and problems to solve, with high variability, contending values, and small sample sizes. In those environments, third-order measurement isn’t achievable; it’s an expensive distraction. That leaves us with cycles of first- and second-order inquiry measurement–not physics, but easily good enough to build and tune our systems.

References:

  • Thomson, William (Lord Kelvin). “Electrical Units of Measurement.” Popular Lectures and Addresses I (London, 1981-94).
  • Weinberg, Gerald M. Quality Software Management, Volume 2: First-Order Measurement. Dorset House Publishing, New York, 1993
  • Weinberg, Gerald M. Personal correspondence with the author, May 18, 2009.
  • Kaner, Cem and Walter P. Bond. “Software Engineering Metrics: What Do They Measure and How Do We Know?.” 10th International Software Metrics Symposium. Chicago, IL, 2004.
  • Weinberg, Gerald M. Quality Software Management, Volume 1: Systems Thinking. Dorset House Publishing, New York, 1991.
Michael Bolton

Michael Bolton provides training and consulting services in software testing and is a co-author with James Bach of Rapid Software Testing, a course and methodology on how to do testing more quickly, less expensively, and with excellent results. Contact Michael at mb@developsense.com.

Michael Bolton
Michael Bolton provides training and consulting services in software testing and is a co-author with James Bach of Rapid Software Testing, a course and methodology on how to do testing more quickly, less expensively, and with excellent results.

The Related Post

I’ve been intending to write a book review of How We Test Software At Microsoft, by Alan Page, Ken Johnston, and Bj Rollison, but for whatever reason I just never found the time, until now. In general, I like this book a lot. It’s a nice blend of the tactical and the strategic, of the ...
Introduction Keyword-driven methodologies like Action Based Testing (ABT) are usually considered to be an Automation technique. They are commonly positioned as an advanced and practical alternative to other techniques like to “record & playback” or “scripting”.
Introduction Keyword-driven testing is a software testing technique that separates much of the programming work of test automation from the actual test design. This allows tests to be developed earlier and makes the tests easier to maintain. Some key concepts in keyword driven testing include:
At VISTACON 2011, Jane sat down with LogiGear Sr. VP, Michael Hackett, to discuss complex systems.
March Issue 2019: Leading the Charge with Better Test Methods
Introduction All too often, senior management judges Software Testing success through the lens of potential cost savings. Test Automation and outsourcing are looked at as simple methods to reduce the costs of Software Testing; but, the sad truth is that simply automating or offshoring for the sake of automating or offshoring will only yield poor ...
This article was developed from concepts in the book Global Software Test Automation: Discussion of Software Testing for Executives. Introduction Many look upon Software Testing as a cost. While it is true that Software Testing does cost money, in many cases significant amounts of money, it is also an activity that helps an organization to ...
Trying to understand why fails, errors, or warnings occur in your automated tests can be quite frustrating. TestArchitect relieves this pain.  Debugging blindly can be tedious work—especially when your test tool does most of its work through the user interface (UI). Moreover, bugs can sometimes be hard to replicate when single-stepping through a test procedure. ...
This article was originally featured in the May/June 2009 issue of Better Software magazine. Read the entire issue or become a subscriber. In my travels, I’ve worked with a number of companies that have attempted to assess the quality of their testing — or worse, their testers — using poorly considered metrics. Sometimes the measurement ...
Please note: This article was adapted from a blog posting in Karen N. Johnson’s blog on July 24, 2007. Introduction The password field is one data entry field that needs special attention when testing an application. The password field can be important (since accessing someone’s account can start a security leak), testers should spend more ...
Introduction This 2 article series describes activities that are central to successfully integrating application performance testing into an Agile process. The activities described here specifically target performance specialists who are new to the practice of fully integrating performance testing into an Agile or other iteratively-based process, though many of the concepts and considerations can be ...
LogiGear Magazine – July 2011 – The Test Methods & Strategies Issue

Leave a Reply

Your email address will not be published. Required fields are marked *

Stay in the loop with the lastest
software testing news

Subscribe