» DEFINITION, OPERATIONAL DEFINITION, AND MEASUREMENT

» LEVEL OF MEASUREMENT

» SOME BASIC MEASURES

Fundamentals of Measurement Theory topics:

It is undisputed that measurement is crucial to the progress of all sciences. Scientific progress is made through observations and generalizations based on data and measurements, the derivation of theories as a result, and, in turn, the confirmation or refutation of theories via hypothesis testing based on further empirical data. As an example, consider the proposition "the more rigorously the front end of the software development process is executed, the better the quality at the back end." To confirm or refute this proposition, we first need to define the key concepts, such as "the front end of the software development process," and distinguish the process steps and activities of the front end from those of the back end. Assume that after the requirements-gathering process, our development process consists of the following phases:

• Design

• Design reviews and inspections

• Code

• Code inspection

• Debug and development tests

• Integration of components and modules to form the product

• Formal machine testing

• Early customer programs

Integration is the development phase during which various parts and components are integrated to form one complete software product. Usually after integration the product is under formal change control. Specifically, after integration every change to the software must have a specific reason (e.g., to fix a bug uncovered during testing) and must be documented and tracked. Therefore, we may want to use integration as the cutoff point: the design, coding, debugging, and integration phases are classified as the front end of the development process, and the formal machine testing and early customer programs constitute the back end.

We then define rigorous implementation both in the general sense and in specific terms as it relates to the front end of the development process. Assuming the development process has been formally documented, we may define rigorous implementation as total adherence to the process: whatever the process documentation says needs to be executed, we execute. However, this general definition is not sufficient for our purpose, which is to gather data to test our proposition. We need to specify the indicator(s) of the definition and to make it (them) operational. For example, suppose the process documentation says all designs and code should be inspected. One operational definition of rigorous implementation may be inspection coverage, expressed in terms of the percentage of the estimated lines of code (LOC) or of the function points (FP) that are actually inspected. Another indicator of good reviews and inspections could be the scoring of each inspection by the inspectors at the end of the inspection, based on a set of criteria. We may want to operationally use a five-point Likert scale to denote the degree of effectiveness (e.g., 5 = very effective, 4 = effective, 3 = somewhat effective, 2 = not effective, 1 = poor inspection). There may also be other indicators.
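The two indicators just described can be sketched as simple functions. This is a minimal illustration, not a definitive implementation; the function names and the example figures are assumptions of this sketch, not taken from any specific process documentation.

```python
def inspection_coverage(loc_inspected, loc_estimated):
    """Inspection coverage: percentage of the estimated LOC (or
    function points) that were actually inspected."""
    return 100.0 * loc_inspected / loc_estimated

def mean_effectiveness(scores):
    """Average of the inspectors' five-point Likert effectiveness
    scores (5 = very effective ... 1 = poor inspection). Note that
    averaging an ordinal scale assumes equal distances between the
    scale points."""
    return sum(scores) / len(scores)
```

For instance, if 6,000 of an estimated 8,000 LOC were inspected, `inspection_coverage(6000, 8000)` gives a coverage of 75 percent.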

In addition to design, design reviews, code implementation, and code inspections, development testing is part of our definition of the front end of the development process. We also need to operationally define "rigorous execution" of this testing. Two indicators that could be used are the percent coverage in terms of instructions executed (as measured by some test coverage measurement tools) and the defect rate expressed in terms of the number of defects removed per thousand lines of source code (KLOC) or per function point. Likewise, we need to operationally define "quality at the back end" and decide which measurement indicators to use. For the sake of simplicity, let us use defects found per KLOC (or defects per function point) during formal machine testing as the indicator of back-end quality. From these metrics, we can formulate several testable hypotheses such as the following:

• For software projects, the higher the percentage of the designs and code that are inspected, the lower the defect rate at the later phase of formal machine testing.

• The more effective the design reviews and the code inspections as scored by the inspection team, the lower the defect rate at the later phase of formal machine testing.

• The more thorough the development testing (in terms of test coverage) before integration, the lower the defect rate at the formal machine testing phase.

With the hypotheses formulated, we can set out to gather data and test them. We also need to determine the unit of analysis for our measurement and data. In this case, it could be at the project level or at the component level of a large project. If we are able to collect a number of data points that form a reasonable sample size (e.g., 35 projects or components), we can perform statistical analysis to test the hypotheses. We can classify projects or components into several groups according to the independent variable of each hypothesis, and then compare the outcome of the dependent variable (defect rate during formal machine testing) across the groups. We can conduct simple correlation analysis, or we can perform more sophisticated statistical analyses. If the hypotheses are substantiated by the data, we confirm the proposition. If they are rejected, we refute the proposition. If we have doubts or unanswered questions during the process (e.g., Are our indicators valid? Are our data reliable? Are there other variables we need to control when we conduct the analysis for hypothesis testing?), then perhaps more research is needed. However, if the hypotheses or the proposition is confirmed, we can use the knowledge thus gained and act accordingly to improve our software development quality.
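The group comparison described above can be sketched as follows. The data points here are fabricated purely for illustration (they are not real project data), and the 70-percent cutoff for "high" inspection coverage is likewise an assumption of this sketch.

```python
# Each pair is (inspection coverage %, defect rate per KLOC during
# formal machine testing). Values are illustrative only.
projects = [
    (90, 1.2), (85, 1.5), (80, 2.0),   # high inspection coverage
    (60, 3.1), (55, 3.5), (40, 4.2),   # low inspection coverage
]

# Classify by the independent variable (coverage), then compare the
# dependent variable (defect rate) across the two groups.
high = [rate for cov, rate in projects if cov >= 70]
low = [rate for cov, rate in projects if cov < 70]

mean_high = sum(high) / len(high)
mean_low = sum(low) / len(low)
```

The first hypothesis predicts `mean_high < mean_low`; with a real sample of reasonable size, this comparison would be backed by a significance test or correlation analysis rather than a raw comparison of means.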

The example demonstrates the importance of measurement and data. Measurement and data really drive the progress of science and engineering. Without empirical verification by data and measurement, theories and propositions remain abstract. The example also illustrates that from theory to testable hypothesis, and likewise from concepts to measurement, there are several steps with levels of abstraction. Simply put, a theory consists of one or more propositional statements that describe the relationships among concepts, usually expressed in terms of cause and effect. From each proposition, one or more empirical hypotheses can be derived. The concepts are then formally defined and operationalized. The operationalization process produces metrics and indicators for which data can be collected. The hypotheses thus can be tested empirically. A hierarchy from theory to hypothesis and from concept to measurement indicators is illustrated in Figure 1 below.

The building blocks of theory are concepts and definitions. In a theoretical definition a concept is defined in terms of other concepts that are already well understood. In the deductive logic system, certain concepts would be taken as undefined: they are the primitives. All other concepts would be defined in terms of the primitive concepts.

Operational definitions, in contrast, are definitions that spell out the metrics and the procedures to be used to obtain data. An operational definition of “body weight” would indicate how the weight of a person is to be measured, the instrument to be used, and the measurement unit to record the results. An operational definition of “software product defect rate” would indicate the formula for defect rate, the defect to be measured (numerator), the denominator (e.g., lines of code count, function point), how to measure, and so forth.

We have seen that from theory to empirical hypothesis, and from theoretically defined concepts to operational definitions, the process is by no means direct. As the example illustrates, when we operationalize a definition and derive measurement indicators, we must consider the scale of measurement. For instance, to measure the quality of software inspection we may use a five-point scale to score the inspection effectiveness, or we may use a percentage to indicate the inspection coverage. For some cases, more than one measurement scale is applicable; for others, the nature of the concept and the resultant operational definition can be measured only with a certain scale. In this section, we briefly discuss the four levels of measurement: nominal scale, ordinal scale, interval scale, and ratio scale.

The simplest operation in science and the lowest level of measurement is classification. In classifying we attempt to sort elements into categories with respect to a certain attribute. For example, if the attribute of interest is religion, we may classify the subjects of the study into Catholics, Protestants, Jews, Buddhists, and so on. If we classify software products by the development process models through which the products were developed, then we may have categories such as waterfall development process and others. In a nominal scale, the two key requirements for the categories are that they be jointly exhaustive and mutually exclusive. Mutually exclusive means a subject can be classified into one and only one category. Jointly exhaustive means that all categories together should cover all possible categories of the attribute. If the attribute has more categories than we are interested in, an "other" category is needed to make the categories jointly exhaustive.

In a nominal scale, the names of the categories and their sequence bear no assumptions about relationships among categories. For instance, we may place the waterfall development process in front of the spiral development process, but we do not imply that one is "better than" or "greater than" the other. As long as the requirements of mutual exclusiveness and joint exhaustiveness are met, we may compare the values of attributes of interest, such as defect rate, cycle time, and requirements defects, across the different categories of software products.
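A nominal-scale classification can be sketched as a simple lookup with an "other" bucket. The category names here are illustrative assumptions; the point is only that every product lands in exactly one category (mutually exclusive) and no product is left unclassified (jointly exhaustive).

```python
# Known process-model categories; anything else falls into "other",
# which keeps the categories jointly exhaustive.
KNOWN_MODELS = {"waterfall", "spiral"}

def classify_process(model):
    """Assign a product's process model to exactly one category."""
    return model if model in KNOWN_MODELS else "other"
```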

Ordinal scale refers to the measurement operations through which the subjects can be compared in order. For example, we may classify families according to socioeconomic status: upper class, middle class, and lower class. We may classify software development projects according to the SEI maturity levels or according to a process rigor scale: totally adheres to process, somewhat adheres to process, does not adhere to process. Our earlier example of inspection effectiveness scoring is an ordinal scale.

The ordinal measurement scale is at a higher level than the nominal scale in the measurement hierarchy. Through it we are able not only to group subjects into categories, but also to order the categories. An ordinal scale is asymmetric in the sense that if A>B is true then B>A is false. It has the transitivity property in that if A>B and B>C, A>C.

We must recognize that an ordinal scale offers no information about the magnitude of the differences between elements. For instance, for the process rigor scale we know only that "totally adheres to process" is better than "somewhat adheres to process" in terms of the quality outcome of the software product, and "somewhat adheres to process" is better than "does not adhere to process." However, we cannot say that the difference between the former pair of categories is the same as that between the latter pair. In customer satisfaction surveys of software products, the five-point Likert scale is often used with 1 = completely dissatisfied, 2 = somewhat dissatisfied, 3 = neutral, 4 = satisfied, and 5 = completely satisfied. We know only that 5 > 4, 4 > 3, 5 > 2, and so forth, but we cannot say how much greater 5 is than 4. Nor can we say that the difference between categories 5 and 4 is equal to that between categories 3 and 2. Indeed, to move customers from satisfied (4) to completely satisfied (5), versus from somewhat dissatisfied (2) to neutral (3), may require very different actions and types of improvements.

Therefore, when we translate order relations into mathematical operations, we cannot use operations such as addition, subtraction, multiplication, and division. We can use "greater than" and "less than." However, in real-world applications, for some specific types of ordinal scales (such as the Likert five-point, seven-point, or ten-point scales), the assumption of equal distance is often made and operations such as averaging are applied to these scales. In such cases, we should be aware that we have deviated from the measurement assumption and should use extreme caution when interpreting the results of the data analysis.
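The distinction can be seen in a few lines of code. The median respects only the ordering of the Likert categories, while the mean silently adds the equal-distance assumption discussed above. The response values here are illustrative.

```python
import statistics

# Illustrative five-point Likert responses from a satisfaction survey.
responses = [5, 4, 4, 3, 5, 2, 4]

# Order-based summary: safe for ordinal data.
median_score = statistics.median(responses)

# Averaging: implicitly assumes equal distances between categories.
mean_score = statistics.mean(responses)
```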

An interval scale indicates the exact differences between measurement points. The mathematical operations of addition and subtraction can be applied to interval scale data. For instance, assuming products A, B, and C are developed in the same language, if the defect rate of software product A is 5 defects per KLOC and product B's rate is 3.5 defects per KLOC, then we can say product A's defect level is 1.5 defects per KLOC higher than product B's defect level. An interval scale of measurement requires a well-defined unit of measurement that can be agreed on as a common standard and that is repeatable. Given a unit of measurement, it is possible to say that the difference between two scores is 15 units, or that one difference is the same as a second. Assuming product C's defect rate is 2 defects per KLOC, we can thus say that the difference in defect rate between products A and B is the same as that between B and C.

When an absolute or nonarbitrary zero point can be located on an interval scale, it becomes a ratio scale. The ratio scale is the highest level of measurement, and all mathematical operations can be applied to it, including division and multiplication. For example, we can say that product A's defect rate is twice as much as product C's because when the defect rate is zero, that means not a single defect exists in the product. Had the zero point been arbitrary, the statement would have been illegitimate. A good example of an interval scale with an arbitrary zero point is the traditional temperature measurement (Fahrenheit and centigrade scales). Thus we say that the difference between the average summer temperature (80°F) and the average winter temperature (16°F) is 64°F, but we do not say that 80°F is five times as hot as 16°F. The Fahrenheit and centigrade temperature scales are interval, not ratio, scales. For this reason, scientists developed the absolute temperature scale (a ratio scale) for use in scientific activities.
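The temperature example can be checked numerically. Because Fahrenheit's zero is arbitrary, the naive ratio 80/16 = 5 is meaningless; converting to Kelvin, an absolute (ratio) scale, shows the legitimate ratio is nowhere near 5.

```python
def fahrenheit_to_kelvin(f):
    """Convert Fahrenheit to Kelvin (via Celsius), whose zero point
    is absolute rather than arbitrary."""
    return (f - 32.0) * 5.0 / 9.0 + 273.15

naive_ratio = 80.0 / 16.0  # 5.0, but NOT "five times as hot"
true_ratio = fahrenheit_to_kelvin(80.0) / fahrenheit_to_kelvin(16.0)
# true_ratio is only about 1.13
```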

Except for a few notable examples, for all practical purposes almost all interval measurement scales are also ratio scales. When the size of the unit is established, it is usually possible to conceive of a zero unit.

For interval and ratio scales, the measurement can be expressed in both integer and noninteger data. Integer data are usually given in terms of frequency counts (e.g., the number of defects customers will encounter for a software product over a specified time length).

We should note that the measurement scales are hierarchical. Each higher-level scale possesses all the properties of the lower ones. The higher the level of measurement, the more powerful the analyses that can be applied to the data. Therefore, in our operationalization process we should devise metrics that take advantage of the highest level of measurement allowed by the nature of the concept and its definition. For example, if the scale is in terms of the actual defect rate, we can always make various types of comparisons; if the scale is instead in terms of excellent, good, average, worse than average, and poor, as compared to an industrial standard, then our ability to perform additional analysis of the data is limited.

Regardless of the measurement scale, when the data are gathered we need to analyze them to extract meaningful information. Various measures and statistics are available for summarizing the raw data and for making comparisons across groups. In this section we discuss some basic measures such as ratio, proportion, percentage, and rate, which are frequently used in our daily lives as well as in various activities associated with software development and software quality. These basic measures, while seemingly easy, are often misused. There are also numerous sophisticated statistical techniques and methodologies that can be employed in data analysis. However, such topics are not within the scope of this discussion.

A ratio results from dividing one quantity by another. The numerator and denominator are from two distinct populations and are mutually exclusive. For example, in demography, the sex ratio is defined as:

            Number of males
Sex ratio = ----------------- × 100
            Number of females

If the ratio is less than 100, there are more females than males; otherwise, there are more males than females.
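The formula above translates directly into code; note that the numerator and denominator count two mutually exclusive populations.

```python
def sex_ratio(males, females):
    """Ratio of two mutually exclusive populations, expressed per
    100 females."""
    return males * 100.0 / females
```

For example, a population with 105 males for every 100 females has a sex ratio of 105.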

Ratios are also used in software metrics. The most often used, perhaps, is the ratio of the number of people in an independent test organization to the number of those in the development group. The test/development head-count ratio could range from 1:1 to 1:10 depending on the management approach to the software development process. For organizations at the 1:10 end of the range, the development group usually is responsible for the complete development (including extensive development tests) of the product, and the test group conducts system-level testing in terms of customer environment verifications. For organizations at the 1:1 end, the independent test group takes the major responsibility for testing (after debugging and code integration) and quality assurance.

Proportion is different from ratio in that the numerator in a proportion is a part of the denominator:

p = a / (a + b)

Proportion also differs from ratio in that the ratio is best used for two groups, whereas proportion is used for multiple categories (or populations) of one group. In other words, the denominator in the preceding formula can be more than just a+b. If

a + b + c + d + e = N

Then we have

a/N + b/N + c/N + d/N + e/N = 1

When the numerator and the denominator are integers and represent counts of certain events, then p is also referred to as a relative frequency. For example, the following gives the proportion of satisfied customers of the total customer set:

Number of satisfied customers
-----------------------------
  Total number of customers

The numerator and the denominator in a proportion need not be integers. They can be frequency counts as well as measurement units on a continuous scale (e.g., height in inches, weight in pounds). When the measurement unit is not an integer, proportions are called fractions.

A proportion or a fraction becomes a percentage when it is expressed in terms of per hundred units (the denominator is normalized to 100). The word percent means per hundred. A proportion p is therefore equal to 100p percent (100p%).
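A small sketch of the definitions above: a proportion's numerator is part of its denominator, the proportions of one group sum to 1, and a percentage is simply a proportion scaled to 100. The example counts are illustrative.

```python
def proportion(part, whole):
    """Proportion: the numerator is a part of the denominator."""
    return part / whole

def percentage(part, whole):
    """A proportion p expressed per hundred units: 100p percent."""
    return 100.0 * part / whole

counts = [30, 50, 100, 20]  # illustrative counts for one group
total = sum(counts)
props = [proportion(c, total) for c in counts]
# props sums to 1 (up to floating-point rounding)
```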

Percentages are frequently used to report results, and as such are frequently misused. First, because percentages represent relative frequencies, it is important that enough contextual information be given, especially the total number of cases, so that the readers can interpret the information correctly. Jones (1992) observes that many reports and presentations in the software industry are careless in using percentages and ratios. He cites this example:

Requirements bugs were 15% of the total, design bugs were 25% of the total, coding bugs were 50% of the total, and other bugs made up 10% of the total.

Had the results been stated as follows, it would have been much more informative:

The project consists of 8 thousand lines of code (KLOC). During its development a total of 200 defects were detected and removed, giving a defect removal rate of 25 defects per KLOC. Of the 200 defects, requirements bugs constituted 15%, design bugs 25%, coding bugs 50%, and other bugs made up 10%.
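The informative version of the report can be reproduced entirely from the absolute numbers, which is exactly why reporting counts and totals matters. The counts below follow from the figures quoted above (200 defects, of which 15% were requirements bugs, 25% design, 50% coding, and 10% other, in an 8 KLOC project).

```python
kloc = 8
defects = {"requirements": 30, "design": 50, "code": 100, "other": 20}

total = sum(defects.values())        # 200 defects in all
removal_rate = total / kloc          # 25 defects removed per KLOC
shares = {kind: 100.0 * n / total for kind, n in defects.items()}
```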

A second important rule of thumb is that the total number of cases must be sufficiently large to warrant the use of percentages. Percentages computed from a small total are not stable; they also convey an impression that a large number of cases are involved. Some writers recommend that the minimum number of cases for which percentages should be calculated is 50. We recommend that, depending on the number of categories, the minimum number be 30, the smallest sample size required for parametric statistics. If the number of cases is too small, then absolute numbers, instead of percentages, should be used. For instance,

Of the total 20 defects for the entire project of 2 KLOC, there were 3 requirements bugs, 5 design bugs, 10 coding bugs, and 2 others.

When results in percentages appear in table format, usually both the percentages and the actual numbers are shown when there is only one variable. When there are more than two groups, as in the example in Table 1, it is better just to show the percentages and the total number of cases (N) for each group. With percentages and N known, one can always reconstruct the frequency distributions. The total of 100.0% should always be shown so that it is clear how the percentages are computed. In a two-way table, the direction in which the percentages are computed depends on the purpose of the comparison. For instance, the percentages in Table 1 are computed vertically (the total of each column is 100.0%), and the purpose is to compare the defect-type profile across projects (e.g., project B proportionally has more requirements defects than project A).

In Table 2, the percentages are computed horizontally. The purpose here is to compare the distribution of defects across projects for each type of defect. The interpretations of the two tables differ. Therefore, it is important to carefully examine percentage tables to determine exactly how the percentages are calculated.

Table 1. Percentage Distributions of Defect Type by Project

Type of Defect | Project A (%) | Project B (%) | Project C (%)
---------------|---------------|---------------|--------------
Requirements   |      15.0     |      41.0     |      20.3
Design         |      25.0     |      21.8     |      22.7
Code           |      50.0     |      28.6     |      36.7
Others         |      10.0     |       8.6     |      20.3
Total          |     100.0     |     100.0     |     100.0
(N)            |     (200)     |     (105)     |     (125)

Table 2. Percentage Distributions of Defects Across Projects by Defect Type

Type of Defect   | Project A | Project B | Project C | Total |  (N)
-----------------|-----------|-----------|-----------|-------|------
Requirements (%) |    30.3   |    43.4   |    26.3   | 100.0 |  (99)
Design (%)       |    49.0   |    22.5   |    28.5   | 100.0 | (102)
Code (%)         |    56.5   |    16.9   |    26.6   | 100.0 | (177)
Others (%)       |    36.4   |    16.4   |    47.2   | 100.0 |  (55)
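As noted earlier, with the percentages and N known one can always reconstruct the frequency distributions. Applying this to the column percentages and totals of Table 1:

```python
# (N, [requirements %, design %, code %, others %]) per project.
table1 = {
    "A": (200, [15.0, 25.0, 50.0, 10.0]),
    "B": (105, [41.0, 21.8, 28.6, 8.6]),
    "C": (125, [20.3, 22.7, 36.7, 20.3]),
}

# Recover the absolute defect counts from percentage and N.
# Because the published percentages are rounded, a reconstructed
# column total can occasionally be off by one.
counts = {proj: [round(n * p / 100.0) for p in pcts]
          for proj, (n, pcts) in table1.items()}
```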

Ratios, proportions, and percentages are static summary measures. They provide a cross-sectional view of the phenomena of interest at a specific time. The concept of rate is associated with the dynamics (change) of the phenomena of interest; generally it can be defined as a measure of change in one quantity (Y) per unit of another quantity (X) on which the former (Y) depends. Usually the X variable is time. It is important that the time unit always be specified when describing a rate associated with time. For instance, in demography the crude birth rate (CBR) is defined as:

Crude birth rate (CBR) = (B / P) × K

where B is the number of live births in a given calendar year, P is the mid-year population, and K is a constant, usually 1,000.
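The CBR formula above in code form, with the illustrative figures chosen here purely for the example:

```python
def crude_birth_rate(births, midyear_population, k=1000):
    """Live births in a calendar year per k persons of the mid-year
    population (k is usually 1,000)."""
    return births * k / midyear_population
```

For example, 14,000 live births in a population of 1,000,000 gives a CBR of 14 per 1,000.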

The concept of exposure to risk is also central to the definition of rate, and it is what distinguishes rate from proportion. Simply stated, all elements or subjects in the denominator have to be at risk of becoming or producing the elements or subjects in the numerator. If we take a second look at the crude birth rate formula, we will note that not everyone in the mid-year population in the denominator is subject to the risk of giving birth. Therefore, the operational definition of CBR is not in compliance with the concept of population at risk, and for this reason, it is a "crude" rate. A better measurement is the general fertility rate, in which the denominator is the number of women of childbearing age, usually defined as ages 15 to 44. In addition, there are other more refined measurements for birth rate.

In the literature on quality, the risk exposure concept is defined as opportunities for error (OFE). The numerator is the number of defects of interest. Therefore,

Defect rate = (Number of defects / OFE) × K

In software, defect rate is usually defined as the number of defects per thousand source lines of code (KLOC or KSLOC) in a given time unit (e.g., one year after the general availability of the product in the marketplace, or for the entire life of the product). Note that this metric, defects per KLOC, is also a crude measure. First, the opportunity for error is not known. Second, while any line of source code may be subject to error, a defect may involve many source lines. Therefore, the metric is only a proxy measure of defect rate, even assuming no other problems. Such limitations should be taken into account when analyzing results or interpreting data pertaining to software quality.
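The software defect-rate formula above can be sketched as follows, with KLOC standing in as the proxy for the unknown OFE; as just noted, this makes the result a crude measure.

```python
def defect_rate_per_kloc(defects, sloc):
    """Defects per thousand source lines of code for a given
    observation period (e.g., one year after general availability).
    KLOC is only a proxy for the true opportunities for error."""
    return defects / (sloc / 1000.0)
```

For instance, 200 defects against 8,000 source lines yields a rate of 25 defects per KLOC.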