Tuesday, 9 July 2013

Assessment of Scale Reliability and Validity

OUTLINE
1-Concept of the reliability of a scale
2-Sources of error
3-Estimates (Cronbach) of various types of reliability
4-Internal consistency / coefficient alpha
5-Test-retest reliability, split-half reliability
6-Uses of the reliability coefficient
7-Analysis of variance (ANOVA) approach to reliability
8-Generalizability theory
9-Scale validity: basic concepts and general considerations
10-Types of validity
11-Explication of constructs
12-Issues concerning validity (relations among the various types, nomenclature/different names, and the place of factor analysis)
NATURE OF RELIABILITY
RELIABILITY refers to the consistency of measurement, that is, how consistent test scores or other assessment results are from one measurement to another.
Any particular instrument may have a number of different reliabilities.
Assessment results are not reliable in general; they are reliable over different periods of time, over different samples of tasks, and over different raters.
Reliability is a necessary but not sufficient condition for validity.
Reliability is assessed primarily with statistical indicators; it is typically determined by correlation methods.
Sources of errors in reliability and validity
(A) Systematic variance common with other tests.
(B) Systematic variance specific only for the given test.
(C) Error variance (random).
Reliability = A + B (systematic variance in opposition to the random variance).
Validity = A (common variance in opposition to the unique test variance = B + C).
Sources of variance in reliability and validity
Reliability refers to the systematic variance in opposition to the random variance.
Validity refers to the common variance in opposition to the unique variance (unexplained variance: random error, plus systematic variance specific to the test).
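As a hypothetical numeric illustration of this decomposition: suppose the total score variance is 100, with A = 50 (systematic variance shared with other tests), B = 20 (systematic variance specific to the test), and C = 30 (random error variance). Reliability is then (A + B) / total = 70/100 = .70, while only A / total = 50/100 = .50 of the variance is available to support validity; in these variance terms, validity can never exceed reliability.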
Extraversion and neuroticism in Big Five inventories
Neuroticism:
Anxiety
Angry hostility
Depression
Impulsiveness
Vulnerability
Self-consciousness
Extraversion:
Gregariousness
Warmth
Assertiveness
Activity
Excitement-seeking
Positive emotions
Standard error of measurement
The amount of variation in the scores is directly related to the reliability of the assessment procedure. It is possible to estimate the amount of variation to be expected in the scores; this estimate is called the standard error of measurement.
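The standard error of measurement can be computed from the reliability coefficient as SEM = SD × √(1 − reliability), where SD is the standard deviation of the observed scores. A minimal sketch in SAS, using made-up values:

data _null_;
  sd  = 10;                  /* standard deviation of the observed scores (made-up value) */
  rel = 0.91;                /* reliability coefficient (made-up value) */
  sem = sd * sqrt(1 - rel);  /* standard error of measurement */
  put sem=;                  /* writes sem=3 to the log */
run;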

FACTORS THAT MAY INFLUENCE ASSESSMENT RESULTS
1* Close succession (a shorter time between tests)
2* A longer time between tests
3* A different sample of tasks in the second assessment
4* Errors arising under different conditions
Estimates of various types of reliability

Methods of estimating reliability (method, type of reliability measure, and procedure):

Test-retest (measure of stability): Give the same test twice to the same group, with some time interval between tests, from several minutes to several years.

Equivalent forms (measure of equivalence): Give two forms of the test to the same group in close succession.

Test-retest with equivalent forms (measure of stability and equivalence): Give two forms of the test to the same group with an increased time interval between forms.

Split-half (measure of internal consistency): Give the test once; score two equivalent halves of the test and correct the correlation between the halves to fit the whole test with the Spearman-Brown formula.

Kuder-Richardson and coefficient alpha (measure of internal consistency): Give the test once; score the total test and apply the Kuder-Richardson formula.

Interrater (measure of consistency of ratings): Give a set of student responses requiring judgmental scoring to two or more raters and have them independently score the responses.

1.                  Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.  The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Test-retest reliability is described as the correlation between the distribution of scores on one administration and the distribution of scores on a subsequent administration. Test-retest reliability is also an important factor in some experimental designs in which the treatment group is administered a pretest and posttest with treatment in between and the control group only receives the pretest and the posttest. Any analysis of the difference noted in the results of the posttest (compared to the pretest) of the treatment group is confounded unless there is a strong reliability between the pretest and posttest of the control group.
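As a minimal sketch of the computation in SAS (the data set SCORES and the variables TIME1 and TIME2 are hypothetical names, with one row per examinee):

proc corr data=scores;
  var time1 time2;   /* the Pearson correlation is the test-retest reliability estimate */
run;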

2.                  Equivalent forms method: The equivalent forms method for estimating reliability uses two different but equivalent forms of an assessment. Equivalent forms are built to the same set of specifications but are constructed independently. The two forms of the assessment are administered to the same group of students in close succession, and the resulting assessment scores are correlated. The equivalent forms method of estimating reliability is widely used in standardized testing.
3.                  Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.
4.                  Split-half reliability
This is a type of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
Split half reliability is similar to parallel forms except that the two forms are both incorporated into one test. After the test is administered, the scores are divided into the two forms and the correlation between the two distributions of scores is calculated.
Reliability of full assessment = (2 × correlation between half assessments) / (1 + correlation between half assessments)
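This is the Spearman-Brown correction for double length. A minimal sketch in SAS, assuming a made-up half-test correlation of .60:

data _null_;
  r_half   = 0.60;                         /* correlation between the two half-tests (made-up value) */
  rel_full = (2 * r_half) / (1 + r_half);  /* Spearman-Brown correction */
  put rel_full=;                           /* writes rel_full=0.75 to the log */
run;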

5.                  Kuder-Richardson Method and Coefficient Alpha
Another method of estimating the reliability of assessment scores from a single administration is by means of formulas such as those developed by Kuder and Richardson. As with the split-half method, these formulas provide an index of internal consistency but do not require splitting the assessment in half for scoring purposes.
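For reference, for a test of k dichotomous items the Kuder-Richardson formula 20 is KR-20 = [k / (k - 1)] × [1 - Σ p_i q_i / σ²], where p_i is the proportion of examinees answering item i correctly, q_i = 1 - p_i, and σ² is the variance of the total scores. Coefficient alpha generalizes this to items scored on any scale by replacing Σ p_i q_i with the sum of the item variances.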
6.                  Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
 Inter-rater reliability is increased if the observers have appropriate training. The training should focus on what exactly is meant to be observed. The raters need to be given a clear description of the event to be observed. The classroom observers would need to know what is and is not appropriate behavior.
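As a minimal sketch of one way to quantify agreement between two raters in SAS (the data set RATINGS and the variables RATER1 and RATER2 are hypothetical names), the AGREE option of PROC FREQ reports Cohen's kappa:

proc freq data=ratings;
  tables rater1*rater2 / agree;   /* AGREE requests Cohen's kappa for the two raters */
run;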
TYPE OF CONSISTENCY INDICATED BY EACH OF THE METHODS FOR ESTIMATING RELIABILITY

Test-retest (immediate): consistency of the testing procedure.

Test-retest (time interval): consistency of the testing procedure; constancy of student characteristics.

Equivalent forms (immediate): consistency of the testing procedure; consistency over different samples of items.

Equivalent forms (time interval): consistency of the testing procedure; constancy of student characteristics; consistency over different samples of items.

Split-half: consistency of the testing procedure; consistency over different samples of items.

Kuder-Richardson (coefficient alpha): consistency of the testing procedure; consistency over different samples of items.

Interrater: consistency of judgmental scores.

Uses of the reliability coefficient
Summated scales are often used in survey instruments to probe underlying constructs that the researcher wants to measure. These may consist of indexed responses to dichotomous or multi-point questionnaires, which are later summed to arrive at a resultant score associated with a particular respondent. Usually, development of such scales is not the end of the research itself, but rather a means to gather predictor variables for use in objective models. However, the question of reliability arises as the function of scales is stretched to encompass the realm of prediction. One of the most popular reliability statistics in use today is Cronbach's alpha (Cronbach, 1951). Cronbach's alpha determines the internal consistency, or average correlation, of items in a survey instrument in order to gauge its reliability. The discussion below illustrates the use of the ALPHA option of the PROC CORR procedure in SAS® to assess and improve upon the reliability of variables derived from summated scales.
If you were giving an evaluation survey, would it not be nice to know that the instrument you are using will always elicit consistent and reliable responses, even if questions were replaced with other, similar questions? When you have a variable generated from such a set of questions that returns a stable response, your variable is said to be reliable. Cronbach's alpha is an index of reliability associated with the variation accounted for by the true score of the "underlying construct." A construct is the hypothetical variable that is being measured (Hatcher, 1994).
The alpha coefficient ranges in value from 0 to 1 and may be used to describe the reliability of factors extracted from dichotomous (that is, questions with two possible answers) and/or multi-point questionnaires or scales (e.g., a rating scale where 1 = poor and 5 = excellent). The higher the score, the more reliable the generated scale. Nunnally (1978) indicated 0.7 to be an acceptable reliability coefficient, but lower thresholds are sometimes used in the literature.
For this demonstration, the observed variables associated with a latent construct earlier labeled "REGULATE" were submitted to a Cronbach's alpha analysis. The following SAS statements initiated the procedure:
PROC CORR ALPHA NOMISS;
VAR SB2 SB3 SB4 SB8 SF1 SF2 SG2;
RUN;
Where:

SB2 ==> Continuation of conservation benefits
SB3 ==> Continuation of government regulation on water quality
SB4 ==> Require farmers to plant grass strips
SB8 ==> Require farmers to keep pesticide application records
SF1 ==> Storage & cooking instructions for meat products
SF2 ==> Strengthen food inspection
SG2 ==> More nutritional information on food label
The first statement invoked the procedure PROC CORR, using the ALPHA option to perform Cronbach's alpha analysis on all observations with no missing values (dictated by the NOMISS option). The VAR statement lists all the variables to be processed in the analysis. Incidentally, the listed variables, except SB8, were the ones that loaded high (i.e., showed high positive correlation) in factor analysis. The output from the analysis is shown in Table 1.
Table 1
Output of alpha analysis for the items included in the "REGULATE" construct
                 Correlation Analysis

              Cronbach Coefficient Alpha
          for RAW variables         :  0.76729
          for STANDARDIZED variables:  0.77102

          Raw Variables              Std. Variables

Deleted    Correlation              Correlation
Variable   with Total      Alpha    with Total      Alpha
---------------------------------------------------------
SB2        0.365790     0.764471    0.358869     0.772209
SB3        0.356596     0.765262    0.350085     0.772623
SB4        0.444259     0.779964    0.434180     0.781626
SB8        0.185652     0.808962    0.176243     0.816080
SF1        0.426663     0.761443    0.443533     0.769178
SF2        0.401001     0.763201    0.418211     0.773390
SG2        0.419384     0.762229    0.434247     0.770623
The raw variable columns were used instead of the standardized columns since the variances showed a limited spread (data not shown). Had there been a mixture of dichotomous and multi-point scales in the survey, the variances would have been relatively heterogeneous, in which case the use of standardized variables would have been more appropriate. As it is, the procedure output shows an overall raw alpha of .77 (rounded from the .76729 at the top of the table), which is good considering that .70 is the cutoff value for acceptability.
The printed output facilitates the identification of dispensable variables by listing the deleted variables in the first column together with the expected resulting alpha in the same row of the third column. For this example, the table indicates that if SB8 were deleted, the raw alpha would increase from the current .77 to .81. Note that the same variable has the lowest item-total correlation (.185652). This indicates that SB8 is not measuring the same construct as the rest of the items in the scale. Through this process alone, the author was able not only to compute the reliability index of the "REGULATE" construct but also to improve it: removing SB8 from the scale will make the construct more reliable for use as a predictor variable.
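As a minimal follow-up sketch (assuming the same survey data set is still the most recently created one in the SAS session), rerunning the procedure without SB8 should reproduce the improved alpha of about .81:

PROC CORR ALPHA NOMISS;
VAR SB2 SB3 SB4 SF1 SF2 SG2;   /* SB8 dropped from the scale */
RUN;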

Analysis of variance (ANOVA) approach to reliability

Analysis of variance (ANOVA) refers to the procedure of partitioning the variability of a data set in order to conduct various significance tests. In experiments where only a single factor is investigated, the analysis of variance is referred to as one-way ANOVA. The basic assumption in applying ANOVA is that the response is normally distributed. The variance of the response is divided into the variance that can be attributed to the investigated characteristic (or factor) and the variance that can be attributed to the randomness that occurs naturally in the response. The former is referred to as the treatment mean square, MSTR, and the latter as the error mean square, MSE. A ratio of the two terms is used to conduct the F test.
F = MSTR / MSE        (1)
The ratio in Eqn. (1) is referred to as the F ratio. If the investigated factor does not affect the response, MSTR will not be significantly different from MSE; in that case, the F ratio will be close to a value of 1 and will follow the F distribution. On the other hand, if the investigated factor does affect the response, the F ratio will not follow the F distribution, and the p value corresponding to the F ratio will indicate this.
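As a minimal sketch of how ANOVA connects to reliability (the data set RATINGS and the variables SUBJECT, ITEM, and SCORE are hypothetical names), a persons-by-items two-way ANOVA can be run in SAS, after which Hoyt's reliability coefficient may be taken from the resulting mean squares as (MS_subject - MS_error) / MS_subject:

proc anova data=ratings;
  class subject item;
  model score = subject item;   /* persons-by-items design; partitions variance into subject, item, and error mean squares */
run;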


References
1.                  American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
2.                  Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
3.                  Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10). [Available online:

