OUTLINE
1. Concept of the reliability of a scale
2. Sources of error
3. Estimates (Cronbach) of various types of reliability
4. Internal consistency / coefficient alpha
5. Test-retest reliability; split-half reliability
6. Uses of the reliability coefficient
7. Analysis of variance (ANOVA) approach to reliability
8. Generalizability theory
9. Scale validity: basic concepts and general considerations
10. Types of validity
11. Explication of constructs
12. Issues concerning validity (relations among the various types, nomenclature/different names, and the place of factor analysis)
NATURE OF RELIABILITY
RELIABILITY refers to the consistency of measurement, that is, how consistent test scores or other assessment results are from one measurement to another.
Any particular instrument may have a number of different reliabilities.
Assessment results are not reliable in general; they are reliable over different periods of time, over different samples of tasks, or over different raters.
Reliability is a necessary but not sufficient condition for validity.
Reliability is assessed primarily with statistical indicators, typically by correlational methods.
Sources of error in reliability and validity
The variance of test scores can be partitioned into three components:
(A) Systematic variance shared in common with other tests.
(B) Systematic variance specific to the given test.
(C) Error variance (random).
Reliability = A + B (systematic variance, as opposed to the random variance).
Validity = A (common variance, as opposed to the variance unique to the test = B + C).
Sources of variance in reliability and validity
Reliability refers to the systematic variance as opposed to the random variance.
Validity refers to the common variance as opposed to the unique variance (unexplained; both random and systematic, but specific to the test).
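As a worked illustration of this partitioning (the numbers here are hypothetical, not taken from any study): suppose a test's total score variance is 100, with common variance A = 60, test-specific systematic variance B = 20, and random error variance C = 20. Then reliability = (A + B)/100 = .80, while the variance available for validity is at most A/100 = .60. This shows concretely why reliability is necessary but not sufficient for validity: a score cannot correlate with an external criterion through its random or test-specific components.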
Extraversion and neuroticism in Big Five inventories
Neuroticism:
Anxiety
Angry hostility
Depression
Impulsiveness
Vulnerability
Self-consciousness
Extraversion:
Gregariousness
Warmth
Assertiveness
Activity
Excitement-seeking
Positive emotions
Standard error of measurement
The amount of variation to be expected in a person's scores across repeated measurements is directly related to the reliability of the assessment procedure: the lower the reliability, the greater the expected variation. It is possible to estimate the amount of variation to be expected in the scores. This estimate is called the standard error of measurement.
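The usual formula from classical test theory (standard, though not printed in the original text) is SEM = s × sqrt(1 − r), where s is the standard deviation of the test scores and r is the reliability coefficient. A minimal SAS data-step sketch, with hypothetical values:

DATA SEM;
   SD  = 10;                     /* hypothetical standard deviation of test scores */
   RXX = 0.91;                   /* hypothetical reliability coefficient */
   SEM = SD * SQRT(1 - RXX);     /* classical standard error of measurement */
   PUT 'Standard error of measurement = ' SEM;
RUN;

With these values the SEM is 3, so an observed score of, say, 75 would be expected to vary by roughly ±3 points (one SEM) across repeated testings.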
FACTORS THAT MAY INFLUENCE ASSESSMENT RESULTS
1. Retesting in close succession (a shorter time between tests)
2. A longer time interval between tests
3. A different sample of tasks in the second assessment
4. Errors arising under different administration conditions
Estimates of various types of reliability
Methods of estimating reliability

Method | Type of reliability measure | Procedure
Test-retest | Measure of stability | Give the same test twice to the same group, with a time interval between administrations ranging from several minutes to several years.
Equivalent forms | Measure of equivalence | Give two forms of the test to the same group in close succession.
Test-retest with equivalent forms | Measure of stability and equivalence | Give two forms of the test to the same group with an increased time interval between forms.
Split-half | Measure of internal consistency | Give the test once; score two equivalent halves of the test; correct the correlation between halves to fit the whole test using the Spearman-Brown formula.
Kuder-Richardson and coefficient alpha | Measure of internal consistency | Give the test once; score the total test and apply the Kuder-Richardson formula.
Interrater | Measure of consistency of ratings | Give a set of student responses requiring judgmental scoring to two or more raters and have them independently score the responses.
1. Test-retest reliability is a measure of reliability obtained by administering the same test twice, over a period of time, to a group of individuals. The scores from Time 1 and Time 2 can then be correlated to evaluate the test's stability over time. Test-retest reliability is thus the correlation between the distribution of scores on one administration and the distribution of scores on a subsequent administration. Test-retest reliability is also an important factor in some experimental designs in which the treatment group receives a pretest and a posttest with treatment in between, while the control group receives only the pretest and the posttest: any analysis of the difference in the treatment group's posttest results (compared to the pretest) is confounded unless there is strong reliability between the pretest and posttest of the control group.
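A minimal SAS sketch of the computation (the dataset and variable names here are hypothetical): with one row per examinee, correlate the scores from the two administrations.

PROC CORR DATA=RETEST;     /* hypothetical dataset: one row per examinee */
   VAR TIME1 TIME2;        /* scores from the first and second administrations */
RUN;

The Pearson correlation between TIME1 and TIME2 is the test-retest reliability coefficient.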
2. Equivalent forms method: The equivalent forms method for estimating reliability uses two different but equivalent forms of an assessment. Equivalent forms are built to the same set of specifications but are constructed independently. The two forms of the assessment are administered to the same group of students in close succession, and the resulting assessment scores are correlated. The equivalent forms method of estimating reliability is widely used in standardized testing.
3.
Internal
consistency reliability
is a measure of reliability used to evaluate the degree to which different test
items that probe the same construct produce similar results.
4.
Split-half
reliability
This is a type of
internal consistency reliability. The
process of obtaining split-half reliability is begun by “splitting in
half” all items of a test that are intended to probe the same area of knowledge
(e.g., World War II) in order to form two “sets” of items. The entire
test is administered to a group of individuals, the total score for each “set”
is computed, and finally the split-half reliability is obtained by determining the
correlation between the two total “set” scores.
Split
half reliability is similar to parallel forms except that the two forms are
both incorporated into one test. After the test is administered, the scores are
divided into the two forms and the correlation between the two distributions of
scores is calculated.
Because each half is only half as long as the full test, the half-test correlation understates the reliability of the full assessment and is stepped up with the Spearman-Brown formula:

Reliability of full assessment = (2 × correlation between halves) / (1 + correlation between halves)
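A minimal SAS sketch (dataset and variable names hypothetical): correlate scores on the two halves, then apply the Spearman-Brown step-up above.

PROC CORR DATA=TEST_SCORES OUTP=HALFCORR NOPRINT;   /* hypothetical dataset of half scores */
   VAR ODD_HALF EVEN_HALF;     /* total scores on the two halves of the test */
RUN;

DATA SPEARMAN_BROWN;
   SET HALFCORR;
   WHERE _TYPE_ = 'CORR' AND UPCASE(_NAME_) = 'ODD_HALF';
   R_HALF = EVEN_HALF;                    /* correlation between the two halves */
   R_FULL = (2 * R_HALF) / (1 + R_HALF);  /* Spearman-Brown corrected reliability */
   PUT 'Split-half reliability = ' R_FULL;
RUN;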
5. Kuder-Richardson method and coefficient alpha
Another method of estimating the reliability of assessment scores from a single administration is by means of formulas such as those developed by Kuder and Richardson. As with the split-half method, these formulas provide an index of internal consistency but do not require splitting the assessment in half for scoring purposes.
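For reference, the KR-20 formula (standard in the literature, though not printed in the original text) is:

KR-20 = (k / (k − 1)) × (1 − Σ p_i q_i / s²)

where k is the number of items, p_i is the proportion of examinees answering item i correctly, q_i = 1 − p_i, and s² is the variance of total test scores. Coefficient alpha generalizes KR-20 to items that are not scored dichotomously by replacing Σ p_i q_i with the sum of the item variances.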
6. Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
Inter-rater reliability is increased if the observers have appropriate training. The training should focus on exactly what is meant to be observed, and the raters need to be given a clear description of the event to be observed; classroom observers, for example, would need to know what is and is not appropriate behavior.
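One common quantitative index of inter-rater agreement is Cohen's kappa. A minimal SAS sketch (dataset and variable names hypothetical; one row per scored response, with each rater's category assignment):

PROC FREQ DATA=RATINGS;           /* hypothetical dataset of paired ratings */
   TABLES RATER1*RATER2 / AGREE;  /* AGREE requests kappa and agreement statistics */
RUN;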
TYPE OF CONSISTENCY INDICATED BY EACH OF THE METHODS FOR ESTIMATING RELIABILITY

Method of estimating reliability | Consistency of testing procedure | Constancy of student characteristics | Consistency over different samples of items | Consistency of judgmental scoring
Test-retest (immediate) | × | | |
Test-retest (time interval) | × | × | |
Equivalent forms (immediate) | × | | × |
Equivalent forms (time interval) | × | × | × |
Split-half | × | | × |
Kuder-Richardson (coefficient alpha) | × | | × |
Interrater | | | | ×
Uses of the reliability coefficient
Summated scales are often used in survey instruments to probe underlying constructs that the researcher wants to measure. These may consist of indexed responses to dichotomous or multi-point questionnaires, which are later summed to arrive at a resultant score associated with a particular respondent. Usually, development of such scales is not the end of the research itself, but rather a means to gather predictor variables for use in objective models. However, the question of reliability arises as the function of scales is stretched to encompass the realm of prediction. One of the most popular reliability statistics in use today is Cronbach's alpha (Cronbach, 1951). Cronbach's alpha measures the internal consistency, or average correlation, of items in a survey instrument to gauge its reliability. This section illustrates the use of the ALPHA option of the PROC CORR procedure in SAS(R) to assess and improve the reliability of variables derived from summated scales.
If you were giving an evaluation survey, would it not be nice to know that the instrument you are using will always elicit consistent and reliable responses, even if questions were replaced with other similar questions? When you have a variable generated from such a set of questions that returns a stable response, then your variable is said to be reliable. Cronbach's alpha is an index of reliability associated with the variation accounted for by the true score of the "underlying construct." A construct is the hypothetical variable that is being measured (Hatcher, 1994).
The alpha coefficient ranges in value from 0 to 1 and may be used to describe the reliability of factors extracted from dichotomous (that is, questions with two possible answers) and/or multi-point formatted questionnaires or scales (e.g., rating scale: 1 = poor, 5 = excellent). The higher the score, the more reliable the generated scale is. Nunnally (1978) indicated 0.7 to be an acceptable reliability coefficient, but lower thresholds are sometimes used in the literature.
For this demonstration, the observed variables that loaded on a latent construct earlier labeled "REGULATE" were submitted to Cronbach's alpha analysis. The following SAS statements initiated the procedure:

PROC CORR ALPHA NOMISS;   /* ALPHA requests Cronbach's alpha; NOMISS drops observations with missing values */
   VAR SB2 SB3 SB4 SB8 SF1 SF2 SG2;   /* the items making up the summated scale */
RUN;
Where:

Label | Description
SB2 | Continuation of conservation benefits
SB3 | Continuation of government regulation on water quality
SB4 | Require farmers to plant grass strips
SB8 | Require farmers to keep pesticide application records
SF1 | Storage & cooking instructions for meat products
SF2 | Strengthen food inspection
SG2 | More nutritional information on food labels
The first statement invokes the PROC CORR procedure with the ALPHA option to perform Cronbach's alpha analysis on all observations with no missing values (dictated by the NOMISS option). The VAR statement lists all the variables to be processed in the analysis. Incidentally, the listed variables, except SB8, were the ones that loaded high (i.e., showed high positive correlation) in factor analysis. The output from the analysis is shown in Table 1.
Table 1
Output of alpha analysis for the items included in the "REGULATE" construct

Correlation Analysis

Cronbach Coefficient Alpha
for RAW variables:          0.76729
for STANDARDIZED variables: 0.77102

                Raw Variables           Std. Variables
Deleted    Correlation              Correlation
Variable   with Total     Alpha     with Total     Alpha
---------------------------------------------------------
SB2        0.365790     0.764471    0.358869     0.772209
SB3        0.356596     0.765262    0.350085     0.772623
SB4        0.444259     0.779964    0.434180     0.781626
SB8        0.185652     0.808962    0.176243     0.816080
SF1        0.426663     0.761443    0.443533     0.769178
SF2        0.401001     0.763201    0.418211     0.773390
SG2        0.419384     0.762229    0.434247     0.770623
The raw variable columns were used instead of the standardized columns since the variances showed a limited spread (data not shown). Had there been a mixture of dichotomous and multi-point scales in the survey, we would have had relatively heterogeneous variances, in which case the use of standardized variables would have been more appropriate. As it is, the procedure output shows an overall raw alpha of .77 (rounded from the .76729 at the top of the table), which is good considering that .70 is the conventional cutoff for acceptability.
The printed output facilitates the identification of dispensable variables by listing each deleted variable in the first column together with the expected resultant alpha in the same row of the third column. For this example, the table indicates that if SB8 were deleted, the value of raw alpha would increase from the current .77 to .81. Note that the same variable has the lowest item-total correlation (.185652). This indicates that SB8 is not measuring the same construct as the rest of the items in the scale. With this process alone, not only was the author able to arrive at a reliability index for the "REGULATE" construct, but he also managed to improve on it: removing SB8 from the scale will make the construct more reliable for use as a predictor variable, as the rerun below illustrates.
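To confirm the improvement, the analysis can simply be rerun without SB8 (a sketch; the expected alpha of roughly .81 comes from the deleted-variable column in Table 1):

PROC CORR ALPHA NOMISS;
   VAR SB2 SB3 SB4 SF1 SF2 SG2;   /* the REGULATE items with SB8 removed */
RUN;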
ANALYSIS OF VARIANCE (ANOVA) APPROACH TO RELIABILITY
Analysis of variance (ANOVA) refers to the procedure of partitioning the variability of a data set in order to conduct various significance tests. In experiments where only a single factor is investigated, the analysis of variance is referred to as one-way ANOVA. The basic assumption in applying ANOVA is that the response is normally distributed. The variance of the response is divided into the variance that can be attributed to the investigated characteristic (or factor) and the variance that can be attributed to the randomness that occurs naturally in the response. The former is referred to as the treatment mean square, MSTR, while the latter is referred to as the error mean square, MSE. A ratio of the two terms is used to conduct the F test.
F = MSTR / MSE    (1)
The ratio in Eqn. (1) is referred to as the F ratio. If the investigated factor does not affect the response, then MSTR will not be significantly different from MSE. In such cases, the F ratio will be close to a value of 1 and will follow the F distribution. On the other hand, if the investigated factor does affect the response, then the F ratio will not follow the F distribution, and the p value corresponding to the F ratio will indicate this.
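A minimal SAS sketch of a one-way ANOVA (dataset and variable names hypothetical): each scored response is a row and rater is the investigated factor; the output reports MSTR, MSE, and the F ratio of Eqn. (1).

PROC GLM DATA=RATINGS;    /* hypothetical dataset: one row per scored response */
   CLASS RATER;           /* the investigated factor */
   MODEL SCORE = RATER;   /* partitions variance into MSTR (rater) and MSE (error) */
RUN;
QUIT;

In the reliability context, the same partitioning applied to a persons-by-items layout yields Hoyt's ANOVA-based reliability estimate, 1 − MS(residual) / MS(persons), which is equivalent to coefficient alpha.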
References
1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
2. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
3. Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10).