Tuesday, 9 July 2013

Assessment of Scale Reliability and Validity

OUTLINE
1-Concept of the reliability of a scale
2-Sources of error
3-Estimates (Cronbach) of various types of reliability
4-Internal consistency / coefficient alpha
5-Test-retest reliability, split-half reliability
6-Uses of the reliability coefficient
7-Analysis of variance (ANOVA) approach to reliability
8-Generalizability theory
9-Scale validity: basic concepts and general considerations
10-Types of validity
11-Explication of constructs
12-Issues concerning validity (relations among the various types, nomenclature/different names, and the place of factor analysis)
NATURE OF RELIABILITY
RELIABILITY refers to the consistency of measurement, that is, how consistent test scores or other assessment results are from one measurement to another.
Any particular instrument may have a number of different reliabilities.
Assessment results are not reliable in general; they are reliable over different periods of time, over different samples of tasks, and over different raters.
Reliability is a necessary but not sufficient condition for validity.
Reliability is assessed primarily with statistical indicators; it is typically determined by correlation methods.
Sources of errors in reliability and validity
(A) Systematic variance common with other tests.
(B) Systematic variance specific only for the given test.
(C) Error variance (random).
Reliability = A + B (systematic variance in opposition to the random variance).
Validity = A (common variance in opposition to the unique test variance = B + C).
Sources of variance in reliability and validity
Reliability refers to the systematic variance in opposition to the random variance.
Validity refers to the common variance in opposition to the unique variance (unexplained variance: random error, plus systematic variance specific to the test).
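As a hypothetical numeric illustration of this decomposition: suppose the total score variance is 100, with A = 50 (systematic variance shared with other tests), B = 20 (systematic variance specific to the test), and C = 30 (random error variance). Reliability is then (A + B) / total = 70/100 = .70, while only A / total = 50/100 = .50 of the variance is available to support validity; in these variance terms, validity can never exceed reliability.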
Extraversion and neuroticism in Big Five inventories
Neuroticism:
Anxiety
Angry hostility
Depression
Impulsiveness
Vulnerability
Self-consciousness
Extraversion:
Gregariousness
Warmth
Assertiveness
Activity
Excitement-seeking
Positive emotions
Standard error of measurement
The amount of variation in the scores is directly related to the reliability of the assessment procedure. It is possible to estimate the amount of variation to be expected in the scores; this estimate is called the standard error of measurement.
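The standard error of measurement can be computed from the reliability coefficient as SEM = SD × √(1 − reliability), where SD is the standard deviation of the observed scores. A minimal sketch in SAS, using made-up values:

data _null_;
  sd  = 10;                  /* standard deviation of the observed scores (made-up value) */
  rel = 0.91;                /* reliability coefficient (made-up value) */
  sem = sd * sqrt(1 - rel);  /* standard error of measurement */
  put sem=;                  /* writes sem=3 to the log */
run;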

FACTORS THAT MAY INFLUENCE ASSESSMENT RESULTS
1* Close succession (a shorter time between tests)
2* A longer time between tests
3* A different sample of tasks in the second assessment
4* Errors arising under different conditions
Estimates of various types of reliability

Methods of estimating reliability (method, type of reliability measure, and procedure):

Test-retest (measure of stability): Give the same test twice to the same group, with some time interval between tests, from several minutes to several years.

Equivalent forms (measure of equivalence): Give two forms of the test to the same group in close succession.

Test-retest with equivalent forms (measure of stability and equivalence): Give two forms of the test to the same group with an increased time interval between forms.

Split-half (measure of internal consistency): Give the test once; score two equivalent halves of the test and correct the correlation between the halves to fit the whole test with the Spearman-Brown formula.

Kuder-Richardson and coefficient alpha (measure of internal consistency): Give the test once; score the total test and apply the Kuder-Richardson formula.

Interrater (measure of consistency of ratings): Give a set of student responses requiring judgmental scoring to two or more raters and have them independently score the responses.

1.                  Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.  The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Test-retest reliability is described as the correlation between the distribution of scores on one administration and the distribution of scores on a subsequent administration. Test-retest reliability is also an important factor in some experimental designs in which the treatment group is administered a pretest and posttest with treatment in between and the control group only receives the pretest and the posttest. Any analysis of the difference noted in the results of the posttest (compared to the pretest) of the treatment group is confounded unless there is a strong reliability between the pretest and posttest of the control group.
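As a minimal sketch of the computation in SAS (the data set SCORES and the variables TIME1 and TIME2 are hypothetical names, with one row per examinee):

proc corr data=scores;
  var time1 time2;   /* the Pearson correlation is the test-retest reliability estimate */
run;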

2.                  Equivalent forms method: The equivalent forms method for estimating reliability uses two different but equivalent forms of an assessment. Equivalent forms are built to the same set of specifications but are constructed independently. The two forms of the assessment are administered to the same group of students in close succession, and the resulting assessment scores are correlated. The equivalent forms method of estimating reliability is widely used in standardized testing.
3.                  Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.
4.                  Split-half reliability
This is a type of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
Split half reliability is similar to parallel forms except that the two forms are both incorporated into one test. After the test is administered, the scores are divided into the two forms and the correlation between the two distributions of scores is calculated.
Reliability of full assessment = (2 × correlation between half assessments) / (1 + correlation between half assessments)
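This is the Spearman-Brown correction for double length. A minimal sketch in SAS, assuming a made-up half-test correlation of .60:

data _null_;
  r_half   = 0.60;                         /* correlation between the two half-tests (made-up value) */
  rel_full = (2 * r_half) / (1 + r_half);  /* Spearman-Brown correction */
  put rel_full=;                           /* writes rel_full=0.75 to the log */
run;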

5.                  Kuder-Richardson Method and Coefficient Alpha
Another method of estimating the reliability of assessment scores from a single administration is by means of formulas such as those developed by Kuder and Richardson. As with the split-half method, these formulas provide an index of internal consistency but do not require splitting the assessment in half for scoring purposes.
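For reference, for a test of k dichotomous items the Kuder-Richardson formula 20 is KR-20 = [k / (k - 1)] × [1 - Σ p_i q_i / σ²], where p_i is the proportion of examinees answering item i correctly, q_i = 1 - p_i, and σ² is the variance of the total scores. Coefficient alpha generalizes this to items scored on any scale by replacing Σ p_i q_i with the sum of the item variances.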
6.                  Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
 Inter-rater reliability is increased if the observers have appropriate training. The training should focus on what exactly is meant to be observed. The raters need to be given a clear description of the event to be observed. The classroom observers would need to know what is and is not appropriate behavior.
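As a minimal sketch of one way to quantify agreement between two raters in SAS (the data set RATINGS and the variables RATER1 and RATER2 are hypothetical names), the AGREE option of PROC FREQ reports Cohen's kappa:

proc freq data=ratings;
  tables rater1*rater2 / agree;   /* AGREE requests Cohen's kappa for the two raters */
run;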
TYPE OF CONSISTENCY INDICATED BY EACH OF THE METHODS FOR ESTIMATING RELIABILITY

Test-retest (immediate): consistency of the testing procedure.

Test-retest (time interval): consistency of the testing procedure; constancy of student characteristics.

Equivalent forms (immediate): consistency of the testing procedure; consistency over different samples of items.

Equivalent forms (time interval): consistency of the testing procedure; constancy of student characteristics; consistency over different samples of items.

Split-half: consistency of the testing procedure; consistency over different samples of items.

Kuder-Richardson (coefficient alpha): consistency of the testing procedure; consistency over different samples of items.

Interrater: consistency of judgmental scores.

Uses of the reliability coefficient
Summated scales are often used in survey instruments to probe underlying constructs that the researcher wants to measure. These may consist of indexed responses to dichotomous or multi-point questionnaires, which are later summed to arrive at a resultant score associated with a particular respondent. Usually, development of such scales is not the end of the research itself, but rather a means to gather predictor variables for use in objective models. However, the question of reliability arises as the function of scales is stretched to encompass the realm of prediction. One of the most popular reliability statistics in use today is Cronbach's alpha (Cronbach, 1951). Cronbach's alpha determines the internal consistency, or average correlation, of items in a survey instrument in order to gauge its reliability. The discussion below illustrates the use of the ALPHA option of the PROC CORR procedure in SAS® to assess and improve upon the reliability of variables derived from summated scales.
If you were giving an evaluation survey, would it not be nice to know that the instrument you are using will always elicit consistent and reliable responses, even if questions were replaced with other, similar questions? When you have a variable generated from such a set of questions that returns a stable response, your variable is said to be reliable. Cronbach's alpha is an index of reliability associated with the variation accounted for by the true score of the "underlying construct." A construct is the hypothetical variable that is being measured (Hatcher, 1994).
The alpha coefficient ranges in value from 0 to 1 and may be used to describe the reliability of factors extracted from dichotomous (that is, questions with two possible answers) and/or multi-point questionnaires or scales (e.g., a rating scale where 1 = poor and 5 = excellent). The higher the score, the more reliable the generated scale. Nunnally (1978) indicated 0.7 to be an acceptable reliability coefficient, but lower thresholds are sometimes used in the literature.
For this demonstration, the observed variables associated with a latent construct earlier labeled "REGULATE" were submitted to a Cronbach's alpha analysis. The following SAS statements initiated the procedure:
PROC CORR ALPHA NOMISS;
VAR SB2 SB3 SB4 SB8 SF1 SF2 SG2;
RUN;
Where:

SB2 ==> Continuation of conservation benefits
SB3 ==> Continuation of government regulation on water quality
SB4 ==> Require farmers to plant grass strips
SB8 ==> Require farmers to keep pesticide application records
SF1 ==> Storage & cooking instructions for meat products
SF2 ==> Strengthen food inspection
SG2 ==> More nutritional information on food label
The first statement invoked the procedure PROC CORR, using the ALPHA option to perform Cronbach's alpha analysis on all observations with no missing values (dictated by the NOMISS option). The VAR statement lists all the variables to be processed in the analysis. Incidentally, the listed variables, except SB8, were the ones that loaded high (i.e., showed high positive correlation) in factor analysis. The output from the analysis is shown in Table 1.
Table 1
Output of alpha analysis for the items included in the "REGULATE" construct
                 Correlation Analysis

              Cronbach Coefficient Alpha
          for RAW variables         :  0.76729
          for STANDARDIZED variables:  0.77102

          Raw Variables              Std. Variables

Deleted    Correlation              Correlation
Variable   with Total      Alpha    with Total      Alpha
---------------------------------------------------------
SB2        0.365790     0.764471    0.358869     0.772209
SB3        0.356596     0.765262    0.350085     0.772623
SB4        0.444259     0.779964    0.434180     0.781626
SB8        0.185652     0.808962    0.176243     0.816080
SF1        0.426663     0.761443    0.443533     0.769178
SF2        0.401001     0.763201    0.418211     0.773390
SG2        0.419384     0.762229    0.434247     0.770623
The raw variable columns were used instead of the standardized columns since the variances showed a limited spread (data not shown). Had there been a mixture of dichotomous and multi-point scales in the survey, the variances would have been relatively heterogeneous, in which case the use of standardized variables would have been more appropriate. As it is, the procedure output shows an overall raw alpha of .77 (rounded from the .76729 at the top of the table), which is good considering that .70 is the cutoff value for acceptability.
The printed output facilitates the identification of dispensable variables by listing the deleted variables in the first column together with the expected resulting alpha in the same row of the third column. For this example, the table indicates that if SB8 were deleted, the raw alpha would increase from the current .77 to .81. Note that the same variable has the lowest item-total correlation (.185652). This indicates that SB8 is not measuring the same construct as the rest of the items in the scale. Through this process alone, the author was able not only to compute the reliability index of the "REGULATE" construct but also to improve it: removing SB8 from the scale will make the construct more reliable for use as a predictor variable.
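As a minimal follow-up sketch (assuming the same survey data set is still the most recently created one in the SAS session), rerunning the procedure without SB8 should reproduce the improved alpha of about .81:

PROC CORR ALPHA NOMISS;
VAR SB2 SB3 SB4 SF1 SF2 SG2;   /* SB8 dropped from the scale */
RUN;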

Analysis of variance (ANOVA) approach to reliability

Analysis of variance (ANOVA) refers to the procedure of partitioning the variability of a data set in order to conduct various significance tests. In experiments where only a single factor is investigated, the analysis of variance is referred to as one-way ANOVA. The basic assumption in applying ANOVA is that the response is normally distributed. The variance of the response is divided into the variance that can be attributed to the investigated characteristic (or factor) and the variance that can be attributed to the randomness that occurs naturally in the response. The former is referred to as the treatment mean square, MSTR, and the latter as the error mean square, MSE. A ratio of the two terms is used to conduct the F test.
F = MSTR / MSE        (1)
The ratio in Eqn. (1) is referred to as the F ratio. If the investigated factor does not affect the response, MSTR will not be significantly different from MSE; in that case, the F ratio will be close to a value of 1 and will follow the F distribution. On the other hand, if the investigated factor does affect the response, the F ratio will not follow the F distribution, and the p value corresponding to the F ratio will indicate this.
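As a minimal sketch of how ANOVA connects to reliability (the data set RATINGS and the variables SUBJECT, ITEM, and SCORE are hypothetical names), a persons-by-items two-way ANOVA can be run in SAS, after which Hoyt's reliability coefficient may be taken from the resulting mean squares as (MS_subject - MS_error) / MS_subject:

proc anova data=ratings;
  class subject item;
  model score = subject item;   /* persons-by-items design; partitions variance into subject, item, and error mean squares */
run;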


References
1.                  American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
2.                  Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
3.                  Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10). [Available online:

