ISSN: 2378-315X BBIJ

Biometrics & Biostatistics International Journal
Research Article
Volume 2 Issue 1 - 2014
Diagnostics for Hedonic Models Using an Example for Cars (Hedonic Regression)
Kevin McCormack*
Central Statistics Office/ University College Cork, Ireland
Received: November 25, 2014 | Published: February 26, 2015
*Corresponding author: Kevin McCormack, Central Statistics Office, University College Cork, Central Statistics Office Skehard Road Cork, Ireland, Tel: 00353876780326; Email:
Citation: McCormack K (2015) Diagnostics for Hedonic Models Using an Example for Cars (Hedonic Regression). Biom Biostat Int J 2(1): 00022. DOI: 10.15406/bbij.2014.2.00022

Abstract

This paper provides a detailed account of the steps involved in the development and diagnostics of an OLS regression model (hedonic) using the Irish Central Statistics Office's Consumer Price Index for New Cars as an example. The areas of collinearity (effects on parameter estimates, effects on inference, effects on prediction, and what to do about collinearity) and model diagnostics (residuals, standardized residuals, residual plots, outliers, studentized residuals (t-residuals), influential observations, leverage, Cook's distance, and transformations) are discussed in detail with examples.

Abbreviations

CPI: Consumer Price Indices; HICP: Harmonised Index of Consumer Prices; ECB: European Central Bank; OLS: Ordinary Least Squares

Introduction

Hedonic regression
Hedonic regression is a method used to determine the value of a good or service by breaking it down into its component parts. The value of each component is then determined separately through regression analysis. See Lancaster K [1], Griliches Z [2,3] and Diewert [4-6] for detailed discussions on the development and application of hedonic prices. In addition, Moulton [7] provides information on the importance and expanded use of hedonic methods.

The term “hedonic methods” refers to the use in economic measurement of a “hedonic function,” h( ):

p_i = h(c_i)
where p_i is the price of a variety (or model) i of a good and c_i is a vector of characteristics associated with the variety. One of the more widely recognised examples of hedonic regression is the Consumer Price Index, which examines changes to the value of a basket of goods over time; the hedonic function is used to adjust for differences in characteristics between varieties of the good when calculating its price index. The hedonic function is normally estimated by regression analysis.

Consumer price indices
All the goods and services that consumers purchase have a price, and that price may vary over time; these changes are a measure of inflation. Consumer Price Indices (CPI) are designed to measure such changes. A useful way to understand the nature of these indices is to imagine a very large shopping basket comprising a set, or basket, of fixed composition, quantity and, as far as possible, quality of goods and services bought by a typical private household.

The European CPI is titled the Harmonised Index of Consumer Prices (HICP - Council Regulation (EC) No 2494/95) [8]. The HICP is an important contributor to the European Central Bank's (ECB) monetary policy and is compiled monthly using a Laspeyres index formula; see Allen RGD [9]. The HICP expresses the current cost of a fixed market basket of consumer goods and services as a percentage of the cost of the same identical basket at a base period (normally mid-December of the year previous to the reference date).

The HICP is defined as

HICP_t = Σ_j w_{j,b} (p_{j,t} / p_{j,b})

where w_{j,b} is the weight assigned to item j, determined by the base period consumer expenditure shares, and p_{j,t} refers to the price of item j in period t.
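In code, the Laspeyres form above is simply a weighted mean of price relatives. Below is a minimal sketch using hypothetical weights and prices, not real HICP data:

```python
# Sketch of the Laspeyres formula HICP_t = sum_j w_{j,b} * (p_{j,t} / p_{j,b}),
# with illustrative (hypothetical) weights and prices.

def hicp(weights, base_prices, current_prices):
    """Laspeyres-type index: weighted mean of price relatives p_{j,t}/p_{j,b}."""
    return sum(w * (pt / pb)
               for w, pb, pt in zip(weights, base_prices, current_prices))

# Three items with expenditure-share weights summing to 1.
weights = [0.5, 0.3, 0.2]
base_prices = [100.0, 40.0, 10.0]     # prices in base period b
current_prices = [110.0, 40.0, 9.0]   # prices in period t

index = hicp(weights, base_prices, current_prices)
print(round(100 * index, 1))  # expressed as a percentage of the base-period cost
```

With these made-up figures the basket costs 3% more than in the base period, so the index reads 103.0.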

Whenever a comparison between current and base period products is not possible, or the current period basket reflects new market developments (e.g. quality improvements), an adjustment must be made. This leads to the well-known problem of how to measure price developments when the quality of the underlying goods and services is changing over time.

It is important that appropriate methods are employed to take account of quality change in the HICP, as this index needs to be credible and transparent. In this context, this paper provides a detailed account of the steps involved in the development of an OLS regression model (hedonic) and associated diagnostics, using the Irish Central Statistics Office's Consumer Price Index for New Cars as an example.

This paper is designed to be a reference document for those challenged with using regression analysis to determine the value of each characteristic of a good or service. It is divided into six sections, each dealing with a particular aspect of model development. Section one deals with summarising relationships, section two discusses fitting curves – regression analysis, while section three presents the hedonic regression model. In section four, a test for collinearity within the model is undertaken and a strategy for dealing with collinearity is presented. In section five, the model diagnostics: residuals – standardized residuals, residual plots, outliers, studentized residuals (t-residuals), influential observations, leverage, Cook's distance, and transformations are undertaken and discussed in detail.

Finally, section six provides an illustrative example of how hedonic-based quality adjustment can be applied in a situation where the price of an individual car model was available in January of a particular year but was not available in February of the same year. It is shown that without the application of the quality adjustment method the New Car Price Index would provide an incorrect measurement of the price changes of new cars over the period, which in turn would misinform the ECB's monetary policy.

Summarising Relationships

 Introduction

Statistical analysis is used to document relationships among variables. Relationships that yield dependable predictions can be exploited commercially or used to eliminate waste from processes. A marketing study done to learn the impact of price changes on coffee purchases is a commercial use. A study to document the relationship between the moisture content of raw material and the yield of usable final product in a manufacturing plant can result in finding acceptable limits on moisture content and working with suppliers to provide raw material reliably within these limits. Such efforts can improve the efficiency of a manufacturing process.

We strive to formulate statistical problems in terms of comparisons. For example, the marketing study in the preceding paragraph was conducted by measuring coffee purchases when prices were set at several different levels over a period of time. Similarly, the raw material study was conducted by comparing the yields from batches of raw materials that exhibited different moisture contents.

Scatter plots

Scatter plots display relationships between two metric variables (e.g. price and cc). In this section the details of scatter plotting are presented using the data in (Table 1). The data were collected for use in compiling the New Car Index in the Irish CPI. Scatter plots are used to try to discover a tendency for the plotted variables to be related in a simple way. Thus, the more the scatter plot reminds us of a mathematical curve, the more closely related we infer the variables are. In the scatter plot a direct relationship between the two variables is inferred.

The above graph shows a relatively linear relationship between the two metric variables (price and cc). However, to investigate the relationship between these two variables further, we can apply a logarithmic transformation to the cc variable. This transformation discounts larger values of cc while leaving smaller and intermediate ones intact, and has the effect of increasing the linearity of the relationship (Figure 1).

Correlation coefficient

The descriptive statistic most widely used to summarise a relationship between metric variables is a measure of the degree of linearity in the relationship. It is called the product-moment correlation coefficient, denoted by the symbol r, and it is defined by

r = (1/(n-1)) Σ_{i=1}^{n} [((x_i - x̄)/s_x) ((y_i - ȳ)/s_y)]

where x̄ and s_x are the mean and standard deviation of the x variable, and ȳ and s_y are the mean and standard deviation of the y variable.

The product moment correlation coefficient has many properties, the most important of which are
  1. Its numerical value lies between -1 and +1, inclusive.
  2. If r = 1, then the scatter plot shows that the data lie exactly on a straight line with a positive slope; if r = -1, then the scatter plot shows that the data lie on a straight line with a negative slope.
  3. An r = 0 indicates that there is no linear component in the relationship between the two variables.

These properties emphasise the role of r as a measure of linearity. Essentially, the more the scatter plot looks like a positively sloping straight line, the closer r is to +1, and the more the scatter plot looks like a negatively sloping straight line, the closer r is to -1.

Using the equation above, r is estimated for the relationship shown in (Figure 2) to be 0.92, indicating a strong linear relationship between the price of a new car and cylinder capacity. For the relationship shown in (Figure 1), r is estimated to be 0.93, indicating that the logarithmic transformation does indeed increase the linearity of the relationship between the two metric variables. The LINEST function in EXCEL was used to estimate r in both of the above relationships.
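The definition of r above translates directly into code. The sketch below applies it to the first five cars of Table 1 only, so the resulting value differs from the 0.92 reported for the full data set; it simply illustrates the computation:

```python
import math

def pearson_r(x, y):
    """Product-moment correlation: mean of standardized cross-products, /(n-1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# Cylinder capacity (cc) and price (£) for the first five cars in Table 1.
cc    = [1300, 1600, 1300, 1100, 1600]
price = [13390, 15990, 10780, 9810, 15770]

r = pearson_r(cc, price)
log_r = pearson_r([math.log(c) for c in cc], price)
print(-1.0 <= r <= 1.0)  # property 1: r always lies in [-1, +1]
```

On this subset both r and its log-transformed counterpart come out strongly positive, consistent with the direct relationship seen in the scatter plots.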

Fitting Curves – Regression Analysis

In the sections above we showed how to summarise the relationships between metric variables using correlations. Although correlations are valuable tools, they are not powerful enough to handle many complex problems in practice (Table 2). Correlations have two major limitations:

  1. They summarise only linearity in relationships.
  2. They do not yield models for how one variable influences another.

The tool of regression analysis overcomes these limitations by using mathematical curves to summarise relationships among several variables. A regression model consists of the mathematical curve summarising the relationship together with measures of variation from that curve. Because any type of curve can be used, relationships can be nonlinear. Regression analysis also easily accommodates transformations of variables and categorical variables, and it provides a host of diagnostic statistics that help assess the utility of variables and transformations and the impact of such features as outliers and missing data.

 

| Model | Price £ (p) | CC (cc) | No. of doors (d) | Horse power (ps) | Weight kg (w) | Length cm (l) | Power steering (pst) | ABS (abs) | Air bags (ab) |
|---|---|---|---|---|---|---|---|---|---|
| Toyota Corolla 1.3L Xli Saloon | 13,390 | 1300 | 4 | 78 | 1200 | 400 | 1 | 0 | 0 |
| Toyota Carina 1.6L SLi Saloon | 15,990 | 1600 | 4 | 100 | 1400 | 450 | 1 | 1 | 1 |
| Toyota Starlet 1.3L | 10,780 | 1300 | 3 | 78 | 1000 | 370 | 0 | 0 | 0 |
| Ford Fiesta Classic 1.3L | 9,810 | 1100 | 3 | 60 | 1000 | 370 | 0 | 0 | 1 |
| Ford Mondeo LX 1.6i | 15,770 | 1600 | 4 | 90 | 1400 | 450 | 1 | 1 | 1 |
| Ford Escort CL 1.3i | 12,095 | 1300 | 5 | 75 | 1200 | 400 | 1 | 0 | 0 |
| Mondeo CLX 1.6i | 16,255 | 1600 | 5 | 90 | 1400 | 450 | 1 | 1 | 1 |
| Opel Astra GL X1.4NZ | 12,935 | 1400 | 5 | 60 | 1200 | 400 | 1 | 0 | 0 |
| Opel Corsa City X1.2SZ | 9,885 | 1200 | 3 | 45 | 1000 | 370 | 1 | 0 | 1 |
| Opel Vectra GL X1.6XEL | 16,130 | 1600 | 4 | 100 | 1400 | 450 | 1 | 1 | 1 |
| Nissan Micra 1.0L | 9,780 | 1000 | 3 | 54 | 1000 | 370 | 0 | 0 | 0 |
| Nissan Almera 1.4GX 5sp | 13,445 | 1400 | 5 | 87 | 1200 | 400 | 1 | 0 | 0 |
| Nissan Primera SLX | 16,400 | 1600 | 4 | 100 | 1400 | 450 | 1 | 1 | 1 |
| Fiat Punto 55 SX | 8,790 | 1100 | 3 | 60 | 1000 | 370 | 0 | 0 | 0 |
| VW Golf CL 1.4 | 12,995 | 1400 | 5 | 60 | 1200 | 400 | 1 | 0 | 0 |
| VW Vento CL 1.9D | 15,100 | 1900 | 4 | 64 | 1400 | 450 | 1 | 1 | 1 |
| Mazda 323 LX 1.3 | 12,700 | 1300 | 3 | 75 | 1200 | 400 | 1 | 0 | 0 |
| Mazda 626 GLX 2.0i S/R | 17,970 | 2000 | 5 | 115 | 1400 | 450 | 1 | 1 | 1 |
| Mitsubishi Lancer 1.3 GLX | 13,150 | 1300 | 4 | 74 | 1200 | 400 | 1 | 0 | 1 |
| Mitsubishi Gallant 1.8 GLSi | 16,600 | 1800 | 5 | 115 | 1400 | 450 | 1 | 1 | 1 |
| Peugeot 106 XN 1.1 5sp | 9,795 | 1100 | 5 | 45 | 1000 | 370 | 0 | 0 | 0 |
| Peugeot 306 XN 1.4 DAB | 12,295 | 1400 | 4 | 75 | 1200 | 400 | 1 | 0 | 0 |
| 406 SL 1.8 DAB S/R | 16,495 | 1800 | 4 | 112 | 1400 | 450 | 1 | 1 | 1 |
| Rover 214 Si | 12,895 | 1400 | 3 | 103 | 1200 | 400 | 1 | 0 | 1 |
| Renault Clio 1.2 RN | 10,990 | 1200 | 5 | 60 | 1000 | 370 | 1 | 0 | 1 |
| Renault Laguna | 15,990 | 1800 | 5 | 95 | 1400 | 450 | 1 | 1 | 1 |
| Volvo 440 1.6 Intro Version | 14,575 | 1600 | 5 | 100 | 1400 | 450 | 1 | 0 | 1 |
| Honda Civic 1.4i SRS | 14,485 | 1400 | 4 | 90 | 1200 | 400 | 1 | 0 | 0 |

Table 1: Characteristics of cars used in the Irish New Car Price Index.

Figure 1: Scatter plot of price (£) versus log (cc).

Figure 2: Scatter plot of price (£) versus cylinder capacity (cc).

| No. | CC (cc) | Price £ (p) | Fitted price (fp) | Residual (res = p - fp) |
|---|---|---|---|---|
| 1 | 1300 | 13,390 | 12,150 | 1,240 |
| 2 | 1600 | 15,990 | 14,883 | 1,107 |
| 3 | 1300 | 10,780 | 12,150 | -1,370 |
| 4 | 1100 | 9,810 | 10,328 | -518 |
| 5 | 1600 | 15,770 | 14,883 | 887 |
| 6 | 1300 | 12,095 | 12,150 | -55 |
| 7 | 1600 | 16,255 | 14,883 | 1,372 |
| 8 | 1400 | 12,935 | 13,061 | -126 |
| 9 | 1200 | 9,885 | 11,239 | -1,354 |
| 10 | 1600 | 16,130 | 14,883 | 1,247 |
| 11 | 1000 | 9,780 | 9,417 | 363 |
| 12 | 1400 | 13,445 | 13,061 | 384 |
| 13 | 1600 | 16,400 | 14,883 | 1,517 |
| 14 | 1100 | 8,790 | 10,328 | -1,538 |
| 15 | 1400 | 12,995 | 13,061 | -66 |
| 16 | 1900 | 15,100 | 17,616 | -2,516 |
| 17 | 1300 | 12,700 | 12,150 | 550 |
| 18 | 2000 | 17,970 | 18,527 | -557 |
| 19 | 1300 | 13,150 | 12,150 | 1,000 |
| 20 | 1800 | 16,600 | 16,705 | -105 |
| 21 | 1100 | 9,795 | 10,328 | -533 |
| 22 | 1400 | 12,295 | 13,061 | -766 |
| 23 | 1800 | 16,495 | 16,705 | -210 |
| 24 | 1400 | 12,895 | 13,061 | -166 |
| 25 | 1200 | 10,990 | 11,239 | -249 |
| 26 | 1800 | 15,990 | 16,705 | -715 |
| 27 | 1600 | 14,575 | 14,883 | -308 |
| 28 | 1400 | 14,485 | 13,061 | 1,424 |

Table 2: Fitted values and Residuals for new car data.

Models

A model describes how a process works. For scientific purposes, the most useful models are statements of the form “if certain conditions apply, then certain consequences follow”. The simplest such statements assert that the listed conditions result in a single consequence without fail. For example, we learn in physics that if an object falls toward earth, then it accelerates at about 981 centimetres per second per second.

A less simple statement is one that asserts a tendency: “Loss in competition tends to arouse anger.” While admitting the existence of exceptions, this statement is intended to be universal; that is, anger is the expected response to loss in competition.

To be useful in documenting the behaviour of processes, models must allow for a range of consequences or outcomes. They must also be able to describe a range of conditions, fixed levels of predictor variables (x), for it is impossible to hold conditions constant in practice. When a model describes the range of consequences corresponding to a fixed set of conditions, it describes local behaviour. A summary of the local behaviours for a range of conditions is called global behaviour. Models are most useful if they describe global behaviour over a range of conditions encountered in practice. When they do, they allow us to make predictions about the consequences corresponding to conditions that have not actually been observed. In such cases, the models help us reason about processes despite being unable to observe them in complete detail.

Linear regression

To illustrate this topic, refer back to the sample of cars in (Table 1) and (Figure 2) (the outcome of the scatter plot of price vs. cc) above. Our eyes detect a marked linear trend in the plot. Before reading further, use a straight-edge to draw a line through the points that appears to you to be the best description of the trend. Roughly estimate the co-ordinates of two points (not necessarily points corresponding to data points) that lie on the line. From these two estimated points, estimate the slope and y-intercept of the line as follows:

Let (x_1, y_1) and (x_2, y_2) denote two points, with x_1 ≠ x_2, on a line whose equation is y = a + bx. Then the slope of the line is

b = (difference in y coordinates)/(difference in x coordinates) = (y_1 - y_2)/(x_1 - x_2)

and the y-intercept is

a = (x_1 y_2 - x_2 y_1)/(x_1 - x_2)

Next, describe the manner in which the data points deviate from your estimated line. Finally, suppose you are told that a car has a cylinder capacity of 1600cc, and you are asked to use your model to predict the price of the car.
Give your best guess at the range of plausible market values. If you do all these things, you will have performed the essential operations of a linear regression analysis of the data.
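The two-point recipe for slope and intercept can be checked numerically. The points below are illustrative values read roughly off the price-vs-cc scatter plot, not rows of Table 1:

```python
def line_through(p1, p2):
    """Slope b and intercept a of the line y = a + b*x through two points (x1 != x2)."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y1 - y2) / (x1 - x2)            # slope: rise over run
    a = (x1 * y2 - x2 * y1) / (x1 - x2)  # y-intercept
    return a, b

# Two rough points sketched on the scatter plot (hypothetical coordinates).
a, b = line_through((1000, 9400), (2000, 18500))
print(a, b)  # the fitted line y = a + b*x passes through both points by construction
```

With these particular guesses the "by eye" line comes out close to the OLS line fitted later in the text, which is the point of the exercise.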

If you followed the suggestions in the previous paragraph, you were probably pleased to find that regression analysis is really quite simple. On the other hand, you may not be pleased with the prospect of analysing many large data sets “by eye” or trying to determine a complex model that relates price to cylinder capacity, horse power, weight and length simultaneously. To do any but the most rudimentary analysis, the help of a computer is needed.

Statistical software does regression calculations quickly, reliably, and efficiently. In practice one never has to do more than enter data, manipulate data, issue commands that ask for calculations and graphs, and interpret output. Consequently, computational formulas are not presented here.

The most widely available routines for regression computations use least squares methods. In this section the ideas behind ordinary least squares (OLS) are explained. Ordinary least squares fits a curve to data pairs (x_1, y_1), (x_2, y_2), …, (x_n, y_n) by minimising the sum of the squared vertical distances between the y values and the curve. Ordinary least squares is the fundamental building block of most other fitting methods.

Fitting a line by ordinary least squares

When a computer program (in this case the LINEST function in EXCEL) is asked to fit a straight-line model to the data in (Figure 2) using the method of ordinary least squares, the following equation is obtained

ŷ = 307 + 9.11x

The symbol y stands for a value of price (the response variable), and the symbol ^ over the y indicates that the model gives only an estimated value. The symbol x (the predictor variable) stands for a value of cylinder capacity. This result can be put into the representation

Observation = Fit + Residual

where y is the observation, ŷ is the fitted value and y - ŷ is the residual. Consider car No. 1 in (Table 1), which has a cylinder capacity x = 1300. The corresponding observed price is y = 13,390. The fitted value given by the ordinary least squares line is

ŷ = 307 + 9.11(1300)

= 307 + 11,843

= 12,150

The vertical distance between the actual price and the fitted price is 13,390 - 12,150 = 1,240, which is the residual. The positive sign indicates that the actual price is above the fitted line. If the sign were negative, it would mean the actual price is below the fitted line.
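The fit and residual above can be reproduced with the textbook OLS formulas b = S_xy/S_xx and a = ȳ - b·x̄ applied to the 28 cars of Table 1. This is an illustrative re-computation, not the original LINEST output, so the coefficients agree with the reported 307 and 9.11 only to within rounding:

```python
# OLS straight line price = a + b*cc fitted to the Table 1 data
# via the closed-form formulas b = S_xy / S_xx and a = ybar - b*xbar.

cc = [1300, 1600, 1300, 1100, 1600, 1300, 1600, 1400, 1200, 1600, 1000, 1400,
      1600, 1100, 1400, 1900, 1300, 2000, 1300, 1800, 1100, 1400, 1800, 1400,
      1200, 1800, 1600, 1400]
price = [13390, 15990, 10780, 9810, 15770, 12095, 16255, 12935, 9885, 16130,
         9780, 13445, 16400, 8790, 12995, 15100, 12700, 17970, 13150, 16600,
         9795, 12295, 16495, 12895, 10990, 15990, 14575, 14485]

n = len(cc)
xbar, ybar = sum(cc) / n, sum(price) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(cc, price))
sxx = sum((x - xbar) ** 2 for x in cc)
b = sxy / sxx        # slope: close to the 9.11 quoted in the text
a = ybar - b * xbar  # intercept: close to the quoted 307

fit_1 = a + b * cc[0]          # fitted price for car No. 1 (1300 cc)
residual_1 = price[0] - fit_1  # observation = fit + residual
print(round(b, 2), round(a))
```

The fitted price for car No. 1 lands within a few pounds of the 12,150 in the worked example, and its residual within a few pounds of 1,240.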

Figure 3 below shows a scatter plot of the data with the ordinary least squares line fitted through the points. This plot confirms that the computer can be trained to do the job of fitting a line. The OLS line was fitted using the linear trend line option in WORD for a chart. Another output from the statistical software is a measure of variation: s = 1028. This measure of variation is the standard deviation of the vertical differences between the data points and the fitted line, that is, the standard deviation of the residuals (Figure 4).

An interesting characteristic of the method of least squares is that, for any data set, the residuals from fitting a straight line by the method of OLS sum to zero (assuming the model includes a y-intercept term). Also, because the mean of the OLS residuals is zero, their standard deviation is the square root of the sum of their squares divided by the degrees of freedom. When fitting a straight line by OLS, the number of degrees of freedom is two less than the number of cases, denoted by n - 2, because:

  1. The residuals sum to zero.

  2. The sum of the products of the fitted values and residuals, case by case, is zero.
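Both constraints can be verified numerically. The sketch below fits a line by OLS to a small illustrative data set (any data would do, provided the model includes an intercept):

```python
# Check the two OLS constraints on a tiny illustrative data set:
# residuals sum to zero, and are orthogonal to the fitted values.

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(abs(sum(residuals)) < 1e-9)                                 # constraint 1
print(abs(sum(f * e for f, e in zip(fitted, residuals))) < 1e-9)  # constraint 2
```

Both checks hold up to floating-point rounding, which is exactly why fitting a line costs two degrees of freedom.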

Figure 3: Scatter plot of price (£) versus cylinder capacity and OLS line.
Figure 4: Scatter plot of Residual versus Fitted from new car data.

Analysis of residuals

Two fundamental tests are applied to residuals from a regression analysis: a test for normality and a scatter plot of residuals versus fitted values. The first test can be performed by checking the percentage of residuals within one, two and three standard deviations of their mean, which is zero. The second test gives visual cues of model inadequacy.

What do we look for in a plot of residuals versus fitted values? We look for a plot that suggests random scatter. As we noted above, the residuals satisfy the constraint  

Σ (y - ŷ) ŷ = 0

where the summation is done over all cases in the data set. The constraint, in turn, implies that the product-moment correlation coefficient between the residuals and the fitted values is zero. If the scatter plot is somehow not consistent with this fact because it exhibits a trend or other peculiar behaviour, then we have evidence that the model has not adequately captured the relationship between x and y. This is the primary purpose of residual analysis: to seek evidence of inadequacy.

The scatter plot in (Figure 4) suggests random scatter, and the regression equation above is therefore consistent with the above constraint.

Hedonic Regression Model

The inclusion of additional metric variables

So far the variable used to account for the variation in the price of a new car is a measure of a physical characteristic which is more or less permanent, though cylinder capacity can change with improvements or deteriorations. This variable does not link up directly with economic factors in the market place, however. Regardless of the cylinder capacity of the car, the price of a new car is also related to the horse power, weight, length and number of doors.

Defining the characteristics of a new car as follows 

x_1 = cylinder capacity (cc)

x_2 = number of doors (d)

x_3 = horse power (ps)

We propose to fit a model of the form  

ŷ = b0 + b1x1 + b2x2 + b3x3            (3.1)

When applying regression models (hedonic regression) to a car index it is usual to fit a semi-logarithmic form, as it has been shown to fit the data best. That is

log_e ŷ = b0 + b1x1 + b2x2 + b3x3              (3.2)

This model relates the logarithm of the price of a new car to absolute values of the characteristics. Natural logarithms are used because in such a model a b coefficient, multiplied by one hundred, provides an estimate of the percentage increase in price due to a one unit change in the particular characteristic or "quality", holding the level of the other characteristics constant.

Using the LINEST function in EXCEL (or PROC REG in SAS) the following estimates for the b coefficients are obtained when the above model is applied to the data in (Table 1).

log_e ŷ = 8.43 + 0.000436 x1 + 0.033094 x2 + 0.003605 x3                 (3.3)

The interpretation of the above equation is as follows, keeping the level of the other characteristics constant:

  1. A one unit change in cylinder capacity gives a 0.0436% increase in price.
  2. A one unit change in the number of doors gives a 3.3094% increase in price.
  3. A one unit change in brake horse power gives a 0.3605% increase in price.
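The LINEST/PROC REG fit can be reproduced with any OLS routine. As a sketch, the following fits model (3.2) on hypothetical data generated to follow Equation (3.3) exactly (the actual Table 1 data are not reproduced here), recovering b coefficients whose values, multiplied by 100, give the percentage interpretations above:

```python
import numpy as np

# Hypothetical stand-in for Table 1 (illustrative values, not the CSO data):
cc = np.array([1000, 1200, 1400, 1600, 1800, 2000, 1300, 1500], float)  # cylinder capacity
d  = np.array([3, 5, 5, 4, 5, 5, 3, 4], float)                          # doors
ps = np.array([50, 60, 75, 90, 110, 130, 70, 85], float)                # horse power
# Prices constructed to satisfy Equation (3.3) exactly, for illustration.
price = np.exp(8.43 + 0.000436 * cc + 0.033094 * d + 0.003605 * ps)

# Semi-logarithmic model (3.2): log_e(price) = b0 + b1*cc + b2*d + b3*ps.
X = np.column_stack([np.ones_like(cc), cc, d, ps])
b, *_ = np.linalg.lstsq(X, np.log(price), rcond=None)

# 100*b[1] is then read as the % price change per one-unit change in cc, etc.
print(np.round(b, 6))
```

Because the prices here are generated from the model without noise, least squares recovers the coefficients of Equation (3.3) exactly.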

The inclusion of categorical variables

The next step is to incorporate power steering, an ABS system and air bags into the model. These are categorical (dummy) variables: the numeric values 1 and 0 stand for the inclusion or exclusion of the feature in a car.

The semi-logarithmic form of the model is now:

log_e ŷ = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + b7x7 + b8x8              (3.4)

Where

x4 = weight (w)

x5 = length (l)

x6 = power steering (pst)

x7 = ABS system (abs)

x8 = air bags (ab)

Using the LINEST function in EXCEL (or PROC REG in SAS) the following estimates for the b coefficients are obtained when the above model is applied to the relevant data in (Table 1).

log_e ŷ = 9.37 + 0.000089 x1 + 0.0197 x2 + 0.0023 x3 + 0.0015 x4 - 0.0054 x5 + 0.0649 x6 + 0.113 x7 - 0.0075 x8              (3.5)

The regression coefficients obtained from Equation (3.5) are interpreted as follows, keeping the level of the other characteristics constant:

  1. A one unit change in cylinder capacity gives a 0.0089% increase in price.
  2. A one unit change in the number of doors gives a 1.97% increase in price.
  3. A one unit change in brake horse power gives a 0.23% increase in price.
  4. A one unit change in weight (kg) gives a 0.15% increase in price.
  5. A one unit change in the length (cm) gives a 0.54% decrease in price.
  6. The inclusion of power steering gives a 6.49% increase in price.
  7. The inclusion of an ABS system gives an 11.26% increase in price.
  8. The inclusion of air bags gives a 0.75% decrease in price.
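Note that reading 100·b as a percentage change is an approximation that is accurate only for small coefficients; in a semi-log model the exact effect of a one-unit change (or of switching a dummy from 0 to 1) is 100(e^b - 1). A quick check, using three coefficients from Equation (3.5):

```python
import math

# In a semi-log model, 100*b approximates the % price change per unit change
# in x; the exact figure is 100*(exp(b) - 1). The gap grows with |b|.
for b in (0.0015, 0.0649, 0.113):
    approx = 100 * b
    exact = 100 * (math.exp(b) - 1)
    print(f"b = {b}: approx {approx:.2f}%, exact {exact:.2f}%")
```

For the ABS dummy (b = 0.113) the exact effect is nearly 12%, noticeably above the 11.3% linear reading, while for small coefficients the two are indistinguishable.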

In Section 4 below it is shown that there is strong collinearity between weight and length in the above regression model; length will therefore be omitted from the model.

The regression model now becomes                        

log_e ŷ = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 + b7x7              (3.6)

Where

x4 = weight (w)

x5 = power steering (pst)

x6 = ABS system (abs)

x7 = air bags (ab)

Using the LINEST function in EXCEL (or PROC REG in SAS) the following estimates for the b coefficients are obtained when the above model is applied to the relevant data in (Table 1).

log_e ŷ = 8.43 + 0.00008 x1 + 0.015 x2 + 0.002 x3 + 0.0005 x4 + 0.107 x5 + 0.079 x6 - 0.034 x7              (3.7)

The interpretation of the above equation is as follows, keeping the level of the other characteristics constant:

  1. A one unit change in cylinder capacity gives a 0.008% increase in price.
  2. A one unit change in the number of doors gives a 1.5% increase in price.
  3. A one unit change in brake horse power gives a 0.2% increase in price.
  4. A one unit change in weight (kg) gives a 0.05% increase in price.
  5. The inclusion of power steering gives a 10.7% increase in price.
  6. The inclusion of an ABS system gives a 7.9% increase in price.
  7. The inclusion of an airbag gives a 3.4% decrease in price.

Section 4 below shows that collinearity is not an issue in the regression model described in Equation (3.7).

The output of the regression results for Equation (3.7) is displayed below. All the regression coefficients are significantly different from zero with t statistics (t ratios) greater than 0.8. An R-square of 96% indicates that almost all of the variation in the price of new cars is explained by the selected predictors.

Part of any statistical analysis is to stand back and criticise the regression model and its assumptions. This phase is called model diagnosis. If under close scrutiny the assumptions seem to be approximately satisfied, then the model can be used to predict and to understand the relationship between the response and the predictors.

In Section 5 below the regression model described in Equation (3.7) is shown to be adequate for predicting and understanding the relationship between the response and the predictors for the new car data described in (Table 1).

Classic definition of hedonic regression
As we can see above, the hedonic hypothesis assumes that a commodity (e.g. a new car) can be viewed as a bundle of characteristics or attributes (e.g. cc, horse power, weight, etc.) for which implicit prices can be derived from prices of different versions of the same commodity containing different levels of specific characteristics.

The ability to disaggregate a commodity and price its components facilitates the construction of price indices and the measurement of price change across versions of the same commodity. A number of issues arise when trying to accomplish this.

  1. What are the relevant characteristics of a commodity bundle?
  2. How are the implicit (implied) prices to be estimated from the available data?
  3. How are the resulting estimates to be used to construct price or quality indices for a particular commodity?
  4. What meaning, if any, is to be given to the resulting constructs?
  5. What do such indices measure?
  6. Under what conditions do they measure it unambiguously?

Much criticism of the hedonic approach has focused on the last two questions, pointing out the restrictive nature of the assumptions required to establish the "existence" and meaning of such indices. However, what the hedonic approach attempts to do is provide a tool for estimating "missing" prices: prices of particular bundles not observed in the base or later periods. It does not pretend to settle the questions of whether various observed differences are demand or supply determined, how the observed variety of models in the market is generated, or whether the resulting indices have an unambiguous interpretation for their purpose.

Collinearity

Suppose that in the car data (Table 1) the car weight in pounds, in addition to the car weight in kilograms, is used as a predictor variable. Let x1 denote the weight in kilograms and let x2 denote the weight in pounds. Now since one kilogram is the same as 2.2046 pounds,

β1 x1 + β2 x2 = β1 x1 + β2 (2.2046 x1) = (β1 + 2.2046 β2) x1 = γ x1

with γ = β1 + 2.2046 β2. Here γ represents the "true" regression coefficient associated with the predictor weight when measured in pounds. Regardless of the value of γ, there are infinitely many different values for β1 and β2 that produce the same value for γ. If both x1 and x2 are included in the model, then β1 and β2 cannot be uniquely defined and cannot be estimated from the data.
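This breakdown is easy to demonstrate numerically: with weight entered in both kilograms and pounds, the design matrix loses a column of rank, and different coefficient pairs reproduce identical fitted values. A minimal sketch with made-up weights:

```python
import numpy as np

# Hypothetical car weights in kilograms, and the same weights in pounds.
kg = np.array([950.0, 1100.0, 1250.0, 1400.0, 1600.0])
lb = 2.2046 * kg

# Design matrix with intercept, kg and lb: the two weight columns are exactly
# collinear, so the matrix has rank 2 instead of 3 and the individual
# coefficients cannot be uniquely estimated.
X = np.column_stack([np.ones_like(kg), kg, lb])
print(np.linalg.matrix_rank(X))

# Infinitely many (beta1, beta2) pairs give the same gamma = beta1 + 2.2046*beta2,
# hence identical predictions: e.g. all of the weight effect on kg, or all on lb.
gamma = 0.0005
pred_a = gamma * kg                # beta1 = gamma, beta2 = 0
pred_b = (gamma / 2.2046) * lb    # beta1 = 0,     beta2 = gamma / 2.2046
print(np.allclose(pred_a, pred_b))
```

The data cannot distinguish between these coefficient pairs, which is exactly why β1 and β2 are not estimable when both columns are present.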

The same difficulty occurs if there is a linear relationship among any of the predictor variables. If, for some set of predictor variables x1, x2, …, xm and some set of constants c1, c2, …, c_{m+1}, not all zero,

c1 x1 + c2 x2 + … + cm xm = c_{m+1}            (4.1)

for all values of x1, x2, …, xm in the data set, then the predictors x1, x2, …, xm are said to be collinear. Exact collinearity rarely occurs with actual data, but approximate collinearity occurs when predictors are nearly linearly related. As discussed later, approximate collinearity also causes substantial difficulties in regression analysis. Variables are said to be collinear even if Equation (4.1) holds only approximately. Setting aside for the moment the assessment of the effects of collinearity, how is it detected?

Collinearity between predictor variables is assessed by calculating the correlation coefficients between all pairs of predictor variables and displaying them in a table.

Such a table of correlations, however, covers only pairs of predictors and cannot assess more complicated (near) linear relationships among several predictors, as expressed in Equation (4.1). To do so, the multiple coefficient of determination, R_j^2, obtained from regressing the jth predictor variable on all the other predictor variables, is calculated. That is, x_j is temporarily treated as the response in this regression. The closer this R_j^2 is to 1 (or 100%), the more serious the collinearity problem is with respect to the jth predictor.

Effects on parameter estimates.
The effect of collinearity on the estimates of regression coefficients may be best seen from the expression giving the standard errors of those coefficients. Standard errors give a measure of expected variability for coefficients: the smaller the standard error, the better the coefficient tends to be estimated. It may be shown that the standard error of the jth coefficient, b_j, is given by

se(b_j) = s · sqrt( [1 / (1 - R_j^2)] · [1 / Σ_{i=1}^{n} (x_ij - x̄_j)^2] )            (4.2)

where, as before, R_j^2 is the R^2 value obtained from regressing the jth predictor variable on all the other predictors. Equation (4.2) shows that, with respect to collinearity, the standard error is smallest when R_j^2 is zero, that is, when the jth predictor is not linearly related to the other predictors. Conversely, if R_j^2 is near 1, then the standard error of b_j is large and the estimate is much more likely to be far from the true value of β_j.
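Equation (4.2) can be checked against the usual covariance-matrix formula se(b_j) = sqrt(s^2 [(X'X)^-1]_jj). The sketch below, on simulated data with two moderately collinear predictors, computes the standard error of b1 both ways and confirms they agree:

```python
import numpy as np

# Simulated data: x2 is moderately collinear with x1.
rng = np.random.default_rng(2)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = 0.8 * x1 + rng.normal(0, 1.0, n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 3)          # error variance; p = 3 parameters

# Direct standard error of b1 from the covariance matrix s^2 (X'X)^-1 ...
se_direct = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

# ... equals Equation (4.2), where R_1^2 comes from regressing x1 on the
# other predictor(s).
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
u = x1 - Z @ g
sxx = (x1 - x1.mean()) @ (x1 - x1.mean())
r2 = 1 - (u @ u) / sxx
se_eq42 = np.sqrt(s2) * np.sqrt(1 / (1 - r2) / sxx)

print(np.isclose(se_direct, se_eq42))
```

The agreement makes the inflation mechanism concrete: as R_1^2 approaches 1, the 1/(1 - R_1^2) factor blows up the standard error.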
The quantity

VIF_j = 1 / (1 - R_j^2)            (4.3)

is called the variance inflation factor (VIF). The larger the value of VIF for a predictor x_j, the more severe the collinearity problem. As a guideline, many authors recommend that a VIF greater than 10 suggests a collinearity difficulty worthy of further study. This is equivalent to flagging predictors with R_j^2 greater than 90%.
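The VIF computation in Equation (4.3) is straightforward to implement: regress each predictor on the others, take that regression's R^2, and form 1/(1 - R^2). A sketch on simulated data (the names mirror w, l and d from Table 1, but the values are made up, with l constructed to be nearly linear in w):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of predictor matrix X (Eq. 4.3)."""
    y = X[:, j]                                   # treat x_j as the response
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ b
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

# Hypothetical predictors: l is nearly a linear function of w (as in Table 4),
# while d is roughly independent of both.
rng = np.random.default_rng(1)
w = rng.uniform(900, 1600, 40)                   # weight
l = 2.5 + 0.003 * w + rng.normal(0, 0.01, 40)    # length, near-linear in w
d = rng.integers(3, 6, 40).astype(float)         # doors

X = np.column_stack([w, l, d])
print([round(vif(X, j), 1) for j in range(3)])   # w and l inflated, d is not
```

Under the VIF > 10 guideline, w and l would both be flagged here while d would not, mirroring the pattern in Table 3.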

Table 3 below presents the results of the collinearity diagnostics for the regression model outlined in Equation (3.5) (using PROC REG in SAS). From Table 3 and Table 4 it is obvious that there is a strong linear relationship between the predictor variables w and l in the regression model in Equation (3.5) and that they are collinear. To overcome this collinearity problem the predictor variable l (length) will be omitted from the regression model.

Variable   VIF
cc         7.36028546
d          1.54999525
ps         3.07094954
w          221.98482270
l          246.25396926
pst        4.73427617
abs        7.87431815
ab         3.28577252

Table 3: Variance Inflation Factors (VIF).

       cc        d         ps        w         l         pst       abs
d      0.42754
ps     0.75827   0.23098
w      0.90567   0.42623   0.78991
l      0.91509   0.40526   0.78197   0.98931
pst    0.59539   0.43901   0.48691   0.67540   0.60361
abs    0.82684   0.21290   0.63492   0.80978   0.86680   0.34752
ab     0.55255   0.74120   0.46523   0.52271   0.58908   0.34995   0.64500

Table 4: Correlation table for predictor variables.

 

Obs   Dep Var   Predict    Residual    Standard   t-         Hat Diag   Cook's
      P         Value                  Residual   Residual   h_ii       D
 1    9.5000    9.4673      0.03270     0.772      0.7644    0.1408     0.012
 2    9.6800    9.6865     -0.00647    -0.156     -0.1518    0.1704     0.001
 3    9.2900    9.2477      0.04230     1.207      1.2213    0.4099     0.126
 4    9.1900    9.1546      0.03540     0.981      0.9799    0.3767     0.073
 5    9.6700    9.6622      0.00780     0.188      0.1833    0.1739     0.001
 6    9.4000    9.4753     -0.07530    -1.814     -1.9346    0.1730     0.086
 7    9.7000    9.6775      0.02250     0.554      0.5441    0.2094     0.010
 8    9.4700    9.4466      0.02340     0.571      0.5615    0.1974     0.010
 9    9.2000    9.2328     -0.03280    -1.022     -1.0232    0.5059     0.134
10    9.6900    9.6865      0.00353     0.085      0.0827    0.1704     0.000
11    9.1900    9.1663      0.02370     0.608      0.5978    0.2712     0.017
12    9.5100    9.5122     -0.00216    -0.053     -0.0512    0.1870     0.000
13    9.7100    9.6865      0.02350     0.566      0.5558    0.1704     0.008
14    9.0800    9.1886     -0.10860    -2.706     -3.3131    0.2279     0.270
15    9.4700    9.4466      0.02340     0.571      0.5615    0.1974     0.010
16    9.6200    9.6222     -0.00220    -0.083     -0.0809    0.6615     0.002
17    9.4500    9.4447      0.00528     0.136      0.1321    0.2707     0.001
18    9.8000    9.7690      0.03100     0.877      0.8714    0.4005     0.064
19    9.4800    9.4237      0.05630     1.389      1.4241    0.2106     0.064
20    9.7200    9.7536     -0.03360    -0.830     -0.8228    0.2132     0.023
21    9.1900    9.1828      0.00721     0.207      0.2021    0.4189     0.004
22    9.4200    9.4677     -0.04770    -1.125     -1.1329    0.1367     0.025
23    9.7100    9.7310     -0.02100    -0.503     -0.4938    0.1646     0.006
24    9.4600    9.4864     -0.02640    -0.725     -0.7160    0.3618     0.037
25    9.3000    9.2998      0.000171    0.006      0.0055    0.5679     0.000
26    9.6800    9.7051     -0.02510    -0.592     -0.5821    0.1409     0.007
27    9.5900    9.6226     -0.03260    -1.302     -1.3267    0.6988     0.492
28    9.5800    9.5041      0.07590     1.826      1.9500    0.1723     0.087

Table 5: Diagnostic Statistics for Regression Model.

Variance Inflation Factors

Variable   Variance Inflation Factor
cc           7.28722659
d            1.45581431
ps           3.01093302
w           10.43520369
pst          2.68458544
abs          5.83911068
ab           1.92580528

Table 6: Variance Inflation Factors (VIF).

Effects on inference
If collinearity affects parameter estimates and their standard errors, then it follows that t-ratios will also be affected.

Effects on prediction
The effect of collinearity on prediction depends on the particular values specified for the predictors. If the relationship among the predictors used in fitting the model is preserved in the predictor values used for prediction, then the predictions will be little affected by collinearity. On the other hand, if the specified predictor values are contrary to the observed relationships among the predictors in the model, then the predictions will be poor.
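This point can be sketched numerically. The example below, with invented data, compares the variance of the fitted mean response at a point that respects the near-linear relation between two collinear predictors against the variance at a point that violates it:

```python
import numpy as np

# Two nearly collinear predictors: predictions are stable where the new point
# respects the observed relation between them, unstable where it does not.
# Data and coefficients are invented for illustration.
rng = np.random.default_rng(4)
n = 100
w = rng.normal(size=n)
l = w + 0.02 * rng.normal(size=n)          # l tracks w very closely
y = 1.0 + 2.0 * w + 3.0 * l + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), w, l])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
s2 = np.sum((y - A @ b) ** 2) / (n - 3)
cov = s2 * np.linalg.inv(A.T @ A)          # covariance of the coefficient estimates

x_along = np.array([1.0, 1.0, 1.0])        # w = l = 1: consistent with the data
x_off = np.array([1.0, 1.0, -1.0])         # w = 1, l = -1: contrary to it
var_along = x_along @ cov @ x_along        # variance of the fitted mean response
var_off = x_off @ cov @ x_off
print(var_off > 10 * var_along)            # far noisier off the collinear ridge
```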

What to do about collinearity
The best defence against the problems associated with collinear predictors is to keep the models as simple as possible. Variables that add little to the usefulness of a regression model should be deleted from the model. When collinearity is detected among variables, none of which can reasonably be deleted from a regression model, avoid extrapolation and beware of inference on individual regression coefficients.

Table 6 below presents the results of the collinearity diagnostics for the regression model outlined in Equation 3.7 (using PROC REG in SAS).

Note that the predictor variable w (weight) does not have a VIF value sufficiently greater than 10 to warrant exclusion from the model. Table 4 and Table 6 above indicate that the regression model as described in Equation (3.7) does not have a problem with collinearity among the variables.

Model Diagnostics

All the regression theory and methods presented above rely to a certain extent on the standard regression assumptions. In particular it was assumed that the data were generated by a process that could be modelled according to

    y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + β_k x_ik + e_i    for i = 1, 2, …, n    (5.1)

where the error terms e_1, e_2, …, e_n are independent of one another and are each normally distributed with mean 0 and common standard deviation σ. But in any practical situation, assumptions are always in doubt and can only hold approximately at best. The second part of any statistical analysis is to stand back and criticize the model and its assumptions. This phase is frequently called model diagnosis. If under close scrutiny the assumptions seem to be approximately satisfied, then the model can be used to predict and to understand the relationship between response and predictors. Otherwise, ways to improve the model are sought, once more checking the assumptions of the new model. This process is continued until either a satisfactory model is found or it is determined that none of the models are completely satisfactory. Ideally, the adequacy of the model is assessed by checking it with a new set of data. However, that is a rare luxury; most often diagnostics based on the original set must suffice. The study of diagnostics begins with the important topic of residuals (Model 1).

 

Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Value   Prob > F
Model       7       1.04297        0.14900      71.462     0.0001
Error      20       0.04170        0.00208
C Total    27       1.08467

Root MSE   0.04566   R-square   0.9616
Dep Mean   9.49107   Adj R-sq   0.9481
C.V.       0.48110

Model 1: Analysis of Variance.

 

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob > |T|
Intercep    1    8.425031             0.14860470            56.694              0.0001
CC          1    0.000077042          0.00009113             0.845              0.4079
D           1    0.015308             0.01319688             1.160              0.2597
PS          1    0.002427             0.00073364             3.309              0.0035
W           1    0.000487             0.00017666             2.758              0.0121
PST         1    0.106869             0.03691626             2.895              0.0090
ABS         1    0.079148             0.04351762             1.819              0.0840
AB          1   -0.033942             0.02419823            -1.403              0.1761

Variable   DF   Variance Inflation
Intercep    1    0.00000000
CC          1    7.28722659
D           1    1.45581431
PS          1    3.01093302
W           1   10.43520369
PST         1    2.68458544
ABS         1    5.83911068
AB          1    1.92580528

Model 2: Parameter Estimates.

Residuals – standardized residuals

Most of the regression assumptions apply to the error terms. However, the error terms cannot be obtained, and the assessment of the errors must be based on the residuals, obtained as the actual value minus the fitted value that the model predicts with all unknown parameters estimated (Model 2) for the data. Recall that in symbols the ith residual is

    ê_i = y_i − b_0 − b_1 x_i1 − b_2 x_i2 − … − b_k x_ik    for i = 1, 2, …, n    (5.2)

To analyse residuals (or any other diagnostic statistic), their behaviour when the model assumptions do hold and, if possible, when at least some of the assumptions do not hold must be understood. If the regression assumptions all hold, it may be shown that the residuals have normal distributions with 0 means. It may also be shown that the distribution of the ith residual has standard deviation σ√(1 − h_ii), where h_ii is the ith diagonal element of the “hat matrix” determined by the values of the set of predictor variables (see Appendix I); the particular formula given there is not needed here. In the simple case of a single-predictor model it may be shown that

    h_ii = 1/n + (x_i − x̄)^2 / Σ_{j=1}^n (x_j − x̄)^2    (5.3)

Note in particular that the standard deviation of the distribution of the ith residual is not σ, the standard deviation of the distribution of the ith error term e_i. It may be shown that, in general,

    1/n ≤ h_ii ≤ 1    (5.4)

so that

    0 ≤ σ√(1 − h_ii) ≤ σ√(1 − 1/n) ≤ σ    (5.5)

Note that h_ii is at its minimum value, 1/n, when the predictors are all equal to their mean values. On the other hand, h_ii approaches its maximum value, 1, when the predictors are very far from their mean values. Thus residuals obtained from data points that are far from the centre of the data set will tend to be smaller than the corresponding error terms. Curves fit by least squares will usually fit better at extreme values for the predictors than in the central part of the data.
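Equations (5.3)–(5.5) are easy to verify numerically. The sketch below uses invented single-predictor data, computes the leverages both from Equation (5.3) and from the diagonal of the hat matrix H = A(AᵀA)⁻¹Aᵀ, and confirms that the point furthest from the centre has the largest h_ii:

```python
import numpy as np

# Single-predictor leverages via Equation (5.3), checked against the hat
# matrix diagonal. The data are made up; the last point sits far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(x)

h_formula = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

A = np.column_stack([np.ones(n), x])        # design matrix with intercept
H = A @ np.linalg.inv(A.T @ A) @ A.T
h_hat = np.diag(H)

print(np.round(h_formula, 4))
print(np.allclose(h_formula, h_hat))        # both routes agree
```

Each leverage lies in [1/n, 1] as Equation (5.4) states, and the leverages sum to k + 1 = 2 for this one-predictor model, consistent with Equation (5.10) later in the text.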

Table 5 below displays the h_ii values (along with many other diagnostic statistics that will be discussed) for the regression of log(price) on the seven predictors described above (cc, d, ps, w, pst, abs and ab). To compensate for the differences in dispersion among the distributions of the different residuals, it is usually better to consider the standardized residuals defined by

    ith standardized residual = ê_i / (s√(1 − h_ii))    for i = 1, 2, …, n    (5.6)

Notice that the unknown σ has been estimated by s. If n is large and if the regression assumptions are all approximately satisfied, then the standardized residuals should behave about like standard normal variables. Table 5 also lists the residuals and standardized residuals for all 28 observations.
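A minimal sketch of Equation (5.6), on invented data: fit a single-predictor model by least squares, then scale each residual by s√(1 − h_ii):

```python
import numpy as np

# Standardized residuals e_hat_i / (s * sqrt(1 - h_ii)) for a simple
# single-predictor fit; the data are invented for illustration.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

n, k = x.size, 1
A = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ b

h = np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)   # leverages h_ii
s = np.sqrt(resid @ resid / (n - k - 1))        # root MSE estimates sigma
std_resid = resid / (s * np.sqrt(1.0 - h))

print(np.round(std_resid[:5], 3))
```

Because the fit includes an intercept, the raw residuals sum to zero, which is the source of the slight negative correlation among them noted below.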

Even if all the regression assumptions are met, the residuals (and the standardized residuals) are not independent. For example, the residuals for a model that includes an intercept term always add to zero. This alone implies they are negatively correlated. It may be shown that, in fact, the theoretical correlation coefficient between the ith and jth residuals (or standardized residuals) is

    −h_ij / √((1 − h_ii)(1 − h_jj))    (5.7)

where h_ij is the ijth element of the hat matrix. Again, the general formula for these elements is not needed here. For the simple single-predictor case it may be shown that

    h_ij = 1/n + (x_i − x̄)(x_j − x̄) / Σ_{m=1}^n (x_m − x̄)^2    (5.8)

From Equations (5.3), (5.7) and (5.8) (and in general) we see that the correlations will be small except for small data sets and/or residuals associated with data points very far from the central part of the predictor values. From a practical point of view this small correlation can usually be ignored, and the assumptions on the error terms can be assessed by comparing the properties of the standardized residuals to those of independent, standard normal variables.

Residual plots
Plots of the standardized residuals against other variables are very useful in detecting departures from the standard regression assumptions. Many of the most common problems may be seen by plotting (standardized) residuals against the corresponding fitted values. In this plot, residuals associated with approximately equal-sized fitted values are visually grouped together. In this way it is relatively easy to see if mostly negative (or mostly positive) residuals are associated with the largest and smallest fitted values. Such a plot would indicate curvature that the chosen regression curve did not capture. Figure 5 displays the plot of Standardized Residuals versus fitted values for the new car data.

Another important use for the plot of residuals versus fitted values is to detect lack of common standard deviation among different error terms. Contrary to the assumption of common standard deviation, it is not uncommon for variability to increase as the values for response variables increase. This situation does not occur for the data contained in (Figure 5).

Figure 5: Scatter plot of standardized residuals versus fitted values from the new car data.

Outliers

In regression analysis the model is assumed to be appropriate for all the observations. However, it is not unusual for one or two cases to be inconsistent with the general pattern of the data in one way or another. When a single predictor is used, such cases may be easily spotted in the scatter plot of the data. When several predictors are employed, such cases will be much more difficult to detect. The nonconforming data points are usually called outliers. Sometimes it is possible to retrace the steps leading to the suspect data point and isolate the reason for the outlier. For example, it could be the result of a recording error. If this is the case, the data can be corrected. At other times the outlier may be due to a response obtained when variables not measured were quite different than when the rest of the data were obtained. Regardless of the reason for the outlier, its effect on the regression analysis can be substantial.

Outliers that have unusual response values are the focus here. Unusual responses should be detectable by looking for unusual residuals, preferably by checking for unusually large standardized residuals. If the normality of the error terms is not in question, then a standardized residual larger than 3 in magnitude certainly is unusual, and the corresponding case should be investigated for a special cause for this value.

Studentized residuals (t – residuals)

A difficulty with looking at standardized residuals is that an outlier, if present, will also affect the estimate of σ that enters into the denominator of the standardized residual. Typically, an outlier will inflate s and thus deflate the standardized residual and mask the outlier. One way to circumvent this problem is to estimate the value of σ used in calculating the ith standardized residual from all the data except the ith case. Let s_(i) denote such an estimate, where the subscript (i) indicates that the ith case has been deleted. This leads to the Studentized residual defined by

    ith Studentized residual = ê_i / (s_(i)√(1 − h_ii))    for i = 1, 2, …, n    (5.9)

The next question to be asked is “how do these diagnostic methods work for the new car data?” Table 5 lists diagnostic statistics for the regression model as applied to the new car data. Notice that there is only one case (observation No. 14) where the standardized and Studentized residuals differ appreciably and where the standardized or Studentized residual is above 3 in magnitude. These results indicate that, in general, there are no outlier problems associated with the regression model described in Equation (3.7) above.
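The leave-one-out recipe behind s_(i) can be sketched directly. The data below are invented, with one response deliberately pushed off the trend to act as an outlier; the Studentized residual exposes it cleanly even though the full-fit s is inflated:

```python
import numpy as np

# Studentized residuals: e_hat_i from the full fit, divided by
# s_(i) * sqrt(1 - h_ii), where s_(i) is estimated with case i deleted.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 25)
y = 1.0 + 0.8 * x + rng.normal(scale=0.2, size=x.size)
y[12] += 2.0                                  # planted outlier

n, k = x.size, 1
A = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ b
h = np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)

t_resid = np.empty(n)
for i in range(n):
    Ai, yi = np.delete(A, i, axis=0), np.delete(y, i)
    bi, *_ = np.linalg.lstsq(Ai, yi, rcond=None)
    ri = yi - Ai @ bi
    s_i = np.sqrt(ri @ ri / (n - 1 - k - 1))  # sigma estimated without case i
    t_resid[i] = resid[i] / (s_i * np.sqrt(1.0 - h[i]))

print(int(np.abs(t_resid).argmax()))          # the planted outlier stands out
```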

Influential observations
The principle of ordinary least squares gives equal weight to each case. On the other hand, each case does not have the same effect on the fitted regression curve. For example, observations with extreme predictor values can have substantial influence on the regression analysis. A number of diagnostic statistics have been invented to quantify the amount of influence (or at least potential influence) that individual cases have in a regression analysis. The first measure of influence is provided by the diagonal elements of the hat matrix.

Leverage
When considering the influence of individual cases on the regression analysis, the ith diagonal element of the hat matrix, h_ii, is often called the leverage for the ith case: a measure of the ith data point’s influence in the regression with respect to the predictor variables. In what sense does h_ii measure influence? It may be shown that ŷ_i = h_ii y_i + Σ_{j≠i} h_ij y_j, so that ∂ŷ_i/∂y_i = h_ii; that is, h_ii is the rate of change of the ith fitted value with respect to the ith response value. If h_ii is small, then a small change in the ith response results in a small change in the corresponding fitted value. However, if h_ii is large, then a small change in the ith response produces a large change in the corresponding fitted value ŷ_i.

Further interpretation of h_ii as leverage is based on the discussion in Section 5.1. There it was shown that the standard deviation of the sampling distribution of the ith residual is not σ but σ√(1 − h_ii). Furthermore, h_ii is equal to its smallest value, 1/n, when all the predictors are equal to their mean values. These are the values for the predictors that have the least influence on the regression curve and imply, in general, the largest residuals. On the other hand, if the predictors are far from their means, then h_ii approaches its largest value of 1 and the standard deviations of such residuals are quite small. In turn this implies a tendency for small residuals, and the regression curve is pulled toward these influential observations.

How large might a leverage value be before a case is considered to have large influence? It may be shown algebraically that the average leverage over all cases is (k+1)/n, that is,

    (1/n) Σ_{i=1}^n h_ii = (k+1)/n    (5.10)

where k is the number of predictors in the model. On the basis of this result, many authors suggest flagging cases as influential if their leverage exceeds two or three times (k+1)/n.

For the new car data displayed in Table 5 and using the regression model as described in Equation (3.7), we compute:

  1. k = 7
  2. (k+1)/n = 8/28 = 0.2857
  3. 2 x (k+1)/n = 0.5714
  4. 3 x (k+1)/n = 0.8571
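The threshold arithmetic above is trivial to reproduce:

```python
# Leverage thresholds for the new-car model: k = 7 predictors, n = 28 cases.
k, n = 7, 28
avg_leverage = (k + 1) / n
thresholds = (round(avg_leverage, 4),
              round(2 * avg_leverage, 4),
              round(3 * avg_leverage, 4))
print(thresholds)  # (0.2857, 0.5714, 0.8571)
```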

In Table 5 only two observations (No.’s 16 and 27) are above the 2 x (k+1)/n threshold and none of the observations are above the 3 x (k+1)/n threshold. This result indicates that there are no observations with extreme predictor values impacting on the slope of the fitted values, and hence none of the observations has undue influence on the regression results.

Cook’s distance
As good as large leverage values are in detecting cases influential on the regression analysis, this criterion is not without faults. Leverage values are completely determined by the values of the predictor variables and do not involve the response values at all. A data point that possesses large leverage but also lies close to the trend of the other data will not have undue influence on the regression results.

Several statistics have been proposed to better measure the influence of individual cases. One of the most popular is called Cook’s Distance, which is a measure of a data point’s influence on regression results that considers both the predictor variables and the response variables. The basic idea is to compare the predictions of the model when the ith case is and is not included in the calculations.

In particular, Cook’s Distance D_i for the ith case is defined to be

$$D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{(k+1)s^2} \qquad (5.11)$$

where $\hat{y}_{j(i)}$ is the predicted or fitted value for case j using the regression curve obtained when case i is omitted. Large values of $D_i$ indicate that case i has a large influence on the regression results, since then $\hat{y}_j$ and $\hat{y}_{j(i)}$ differ substantially for many cases. The deletion of a case with a large value of $D_i$ will alter the conclusions substantially.
If $D_i$ is not large, the regression results will not change dramatically even if the leverage for the ith case is large. In general, if the largest value of $D_i$ is substantially less than 1, then no cases are especially influential. On the other hand, cases with $D_i$ greater than 1 should certainly be investigated further to assess their influence on the regression analysis results more carefully.

In Table 5, observation No. 27 has the largest Cook's Distance, 0.722, and the largest leverage value, 0.6988. However, neither value is high enough to unduly influence the regression analysis results.
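The deletion-based definition in Equation (5.11) can be computed directly by refitting the model with each case left out in turn. The sketch below does exactly that on synthetic data (not the car data of Table 5):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's Distance for each case, by direct leave-one-out refitting.

    Implements Equation (5.11): D_i = sum_j (yhat_j - yhat_j(i))^2 / ((k+1) s^2),
    where X includes an intercept column, so p = k + 1.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    resid = y - yhat
    s2 = resid @ resid / (n - p)        # usual unbiased estimate of sigma^2

    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        yhat_i = X @ beta_i             # predictions with case i deleted
        D[i] = ((yhat - yhat_i) ** 2).sum() / (p * s2)
    return D

# Synthetic illustration with the same dimensions as the car data (n = 28, k = 7).
rng = np.random.default_rng(1)
n, k = 28, 7
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(scale=0.5, size=n)

D = cooks_distance(X, y)
print(D.max())   # a maximum well below 1 suggests no single influential case
```

In practice the leave-one-out fits need not be done explicitly, since the algebraically equivalent shortcut $D_i = e_i^2 h_{ii} / \big((k+1)s^2(1-h_{ii})^2\big)$ uses only the residuals and leverages from the full fit.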

What next, once influential observations have been detected? If the influential observation is due to incorrect recording of the data point, an attempt should be made to correct that observation and the regression analysis rerun. If the data point is known to be faulty but cannot be corrected, then that observation should be excluded from the data set. If it is determined that the influential data point is indeed accurate, it is likely that the proposed regression model is not appropriate for the problem at hand. Perhaps an important predictor variable has been neglected, or the form of the regression curve is not adequate.

Transformations
So far a variety of methods for detecting the failure of some of the underlying assumptions of regression analysis have been discussed. Transformations of the data, either of the response and/or the predictor variables, provide a powerful method for turning marginally useful regression models into quite valuable models in which the assumptions are much more credible and hence the predictions much more reliable. Some of the most common and most useful transformations include logarithms, square roots, and reciprocals. Careful consideration of various transformations for data can clarify and simplify the structure of relationships among variables.

Sometimes transformations occur “naturally” in the ordinary reporting of data. As an example, consider a bicycle computer that displays, among other things, the current speed of the bicycle in miles per hour. What is really measured is the time it takes for each revolution of the wheel. Since the exact circumference of the tire is stored in the computer, the reported speed is calculated as a constant divided by the measured time per revolution of the wheel. The speed reported is basically a reciprocal transformation of the measured variable.

As a second example, consider petrol consumption in a car. Usually these values are reported in miles per gallon. However, they are obtained by measuring the fuel consumption on a test drive of fixed distance. Miles per gallon are then calculated by computing the reciprocal of the gallons per mile figure.

A very common transformation is the logarithm. It may be shown that a logarithm transformation will tend to correct the problem of non-constant standard deviation when the standard deviation of $e_i$ is proportional to the mean of $y_i$: if the mean of y doubles, then so does the standard deviation of e, and so forth.
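A small simulation illustrates the point. The data below are synthetic: the error standard deviation grows in proportion to the mean, and taking logarithms makes the spread roughly constant across the range of the predictor:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 500)
mean = np.exp(0.2 * x)                          # mean of y grows with x
y = mean * (1 + 0.1 * rng.normal(size=x.size))  # sd proportional to the mean

# Compare the spread of y in the first and last fifth of the data:
# on the raw scale the scales differ by several-fold.
raw_ratio = y[-100:].std() / y[:100].std()

# After the logarithm transformation the spread is roughly constant.
ly = np.log(y)
log_ratio = ly[-100:].std() / ly[:100].std()

print(raw_ratio)   # substantially greater than 1
print(log_ratio)   # close to 1
```

The same comparison on real data, e.g. a residual plot before and after logging the response, is the usual way to judge whether the transformation is warranted.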

Model Application

Once a regression model has successfully passed the set of diagnostic tests presented in Section 5, one can apply the model with confidence in the construction of a price index, the new car price index in this paper. The prices constructed from the Hedonic Regression Model are referred to as the predicted prices.

An illustrative example follows of how hedonic-based quality adjustment can be applied when the price of an individual car model (Toyota Corolla 1.3L saloon) was available in January of a particular year but not in February of the same year. A replacement car model is available that is close in quality but has three changes in specification: an increase in cylinder capacity from 1,300cc to 1,400cc, and the inclusion of ABS and air bags.

The first stage of applying the model to the price index is the calculation of predicted old and new prices. The second stage is to adjust the base price to reflect the new features (Model 3).

Model 3: Scatter plot of price (£) versus cylinder capacity and OLS line.

Change to the January price due to the change in quality

= Predicted price of new model / Predicted price of old model = 13,494 / 12,594 = 1.0715

New base price = base price of old model × quality change = 13,390 × 1.0715 = 14,347

The third stage is to compare the current price with the new base price:

New Car Index = (14,000 / 14,347) × 100 = 97.6

The value of 97.6 indicates that there has been a reduction in the price index for new cars over the one-month period, January to February.
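The three stages above can be sketched in a few lines, using the prices of the worked example:

```python
# Prices from the worked example in the text.
pred_new, pred_old = 13_494, 12_594   # predicted prices from the hedonic model
base_old = 13_390                     # January base price of the old model
current = 14_000                      # February price of the replacement model

# Stage 1-2: quality change ratio and quality-adjusted base price.
quality_change = pred_new / pred_old  # ≈ 1.0715
new_base = base_old * quality_change  # ≈ 14,347

# Stage 3: index of the current price against the adjusted base.
index = current / new_base * 100      # ≈ 97.6

print(round(quality_change, 4), round(new_base), round(index, 1))
```

Only the quality-change ratio comes from the hedonic model; the base and current prices are observed, so the adjustment isolates the pure price movement from the specification change.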

Conclusion

The unadjusted index is 104.6 (Appendix I), indicating a price increase which, if used, would provide an incorrect measurement of the price changes of new cars over the period.

Unadjusted New Car Index = (14,000 / 13,390) × 100 = 104.6

References

  1. Lancaster K (1966) A New Approach to Consumer Theory. The Journal of Political Economy 74(2): 132-157.
  2. Griliches Z (1961) Hedonic price indexes for automobiles: An econometric analysis of quality change. In The Price Statistics of the Federal Government. National Bureau of Economic Research, New York, USA, pp. 173-196.
  3. Griliches Z (1971) Hedonic Price Indexes Revisited. In: Griliches et al. (Eds), Price Indexes and Quality Change. Harvard University Press, Cambridge, Massachusetts, USA, p. 3-15.
  4. Diewert WE (2001) Hedonic Prices: A consumer theory approach, Department of Economics, University of British Columbia, Vancouver, Canada, p. 1-12.
  5. Diewert WE (2002) Harmonized Indexes of Consumer Prices: Their conceptual foundations. Working Paper, European Central Bank, p. 1-89.
  6. Diewert WE (2002) Hedonic Producer Price Indexes and Quality Adjustment. Department of Economics, University of British Columbia, Vancouver, Canada, p. 2-14.
  7. Moulton BR (2001) The Expanding Role of Hedonic Methods in the Official Statistics of the United States. Bureau of Economic Analysis, US Department of Commerce, Washington DC, USA, p. 1-16.
  8. Council Regulation (EC) (1995) No 2494/95 of 23 October 1995, concerning harmonized indices of consumer prices, Official Journal L 257, p. 0001-0004.
  9. Allen RGD (1975) Index Numbers in Theory and Practice. The Macmillan Press Ltd., Basingstoke, UK, p. 25.