Tuesday, 9 July 2013

Assessment of Scale Reliability and Validity

OUTLINE
1-Concept of the reliability of a scale
2-Sources of error
3-Estimates (Cronbach) of various types of reliability
4-Internal consistency / coefficient alpha
5-Test-retest reliability, split-half reliability
6-Uses of the reliability coefficient
7-Analysis of variance (ANOVA) approach to reliability
8-Generalizability theory
9-Scale validity – basic concepts and general considerations
10-Types of validity
11-Explication of construct
12-Issues concerning validity (relations among various types, nomenclature / different names, and place of factor analysis)
NATURE OF RELIABILITY
RELIABILITY refers to the consistency of measurement, that is, how consistent test scores or other assessment results are from one measurement to another.
Any particular instrument may have a number of different reliabilities
Assessment results are not reliable in general; they are reliable over specific conditions, such as different periods of time, different samples of tasks, or different raters.
Reliability is a necessary but not sufficient condition for validity
Reliability is assessed primarily with statistical indicators; it is typically determined by correlation methods.
Sources of errors in reliability and validity
(A) Systematic variance common with other tests.
(B) Systematic variance specific only for the given test.
(C) Error variance (random).
Reliability = A + B (systematic variance in opposition to the random variance).
Validity = A (common variance in opposition to the unique test variance = B + C).
Sources of variance in reliability and validity
Reliability refers to the systematic variance in opposition to the random variance.
Validity refers to the common variance in opposition to the unique variance (unexplained, random and systematic, but specific for the test).
Extraversion and neuroticism in Big Five inventories
Neuroticism:
Anxiety
Angry hostility
Depression
Impulsiveness
Vulnerability
Self-consciousness
Extraversion:
Gregariousness
Warmth
Assertiveness
Activity
Excitement-seeking
Positive emotions
Standard error of measurement
The amount of variation in a set of scores is directly related to the reliability of the assessment procedure. It is possible to estimate the amount of variation to be expected in the scores; this estimate is called the standard error of measurement.
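A common estimate is SEM = SD × √(1 − reliability). The sketch below illustrates the mechanics with invented values (the SD of 15 and reliability of 0.91 are assumptions, not figures from the text):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability): the expected spread of
    observed scores around a person's true score."""
    return sd * math.sqrt(1 - reliability)

# Illustrative values: score SD of 15, reliability of 0.91
sem = standard_error_of_measurement(15, 0.91)
print(round(sem, 2))  # 4.5
```

The higher the reliability, the smaller the SEM, so a confidence band drawn around an observed score narrows as the assessment becomes more consistent.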

FACTORS THAT MAY INFLUENCE ASSESSMENT RESULTS
1* close succession (shorter time between tests)
2* longer time between tests
3* different sample of tasks in the second assessment
4* error under different conditions
Estimate of various types of reliability
Methods of estimating reliability

1. Test-retest (measure of stability): Give the same test twice to the same group, with some time interval between tests, from several minutes to several years.
2. Equivalent forms (measure of equivalence): Give two forms of the test to the same group in close succession.
3. Test-retest with equivalent forms (measure of stability and equivalence): Give two forms of the test to the same group with an increased time interval between forms.
4. Split-half (measure of internal consistency): Give the test once; score two equivalent halves of the test, then correct the correlation between the halves to fit the whole test by the Spearman-Brown formula.
5. Kuder-Richardson and coefficient alpha (measure of internal consistency): Give the test once; score the total test and apply the Kuder-Richardson (or coefficient alpha) formula.
6. Interrater (measure of consistency of ratings): Give a set of student responses requiring judgmental scoring to two or more raters and have them independently score the responses.

1.                  Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals.  The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Test-retest reliability is described as the correlation between the distribution of scores on one administration and the distribution of scores on a subsequent administration. Test-retest reliability is also an important factor in some experimental designs in which the treatment group is administered a pretest and posttest with treatment in between and the control group only receives the pretest and the posttest. Any analysis of the difference noted in the results of the posttest (compared to the pretest) of the treatment group is confounded unless there is a strong reliability between the pretest and posttest of the control group.
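The Time 1 / Time 2 correlation described above can be computed directly. The scores below are hypothetical and only illustrate the mechanics:

```python
def pearson_r(x, y):
    """Pearson correlation between two distributions of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five examinees at Time 1 and Time 2
time1 = [82, 90, 75, 68, 88]
time2 = [80, 92, 78, 70, 85]
print(round(pearson_r(time1, time2), 2))  # a high r indicates stable scores
```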

2.                  Equivalent forms method: The equivalent forms method for estimating reliability uses two different but equivalent forms of an assessment. Equivalent forms are built to the same set of specifications but are constructed independently. The two forms of the assessment are administered to the same group of students in close succession, and the resulting assessment scores are correlated. The equivalent forms method of estimating reliability is widely used in standardized testing.
3.                  Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. 
4.                  Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results. 
5.                  Split-half reliability
This is a type of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
Split half reliability is similar to parallel forms except that the two forms are both incorporated into one test. After the test is administered, the scores are divided into the two forms and the correlation between the two distributions of scores is calculated.
Reliability of full assessment = (2 × correlation between half assessments) / (1 + correlation between half assessments)
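The Spearman-Brown step-up from a half-test correlation to full-test reliability can be sketched as:

```python
def spearman_brown(r_half):
    """Full-test reliability from the correlation between two halves:
    (2 * r) / (1 + r)."""
    return (2 * r_half) / (1 + r_half)

# A half-test correlation of 0.60 steps up to 0.75 for the full test
print(round(spearman_brown(0.60), 2))  # 0.75
```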

6.                  Kuder-Richardson Method and Coefficient Alpha
Another method of estimating the reliability of assessment scores from a single administration is by means of formulas such as those developed by Kuder and Richardson. As with the split-half method, these formulas provide an index of internal consistency but do not require splitting the assessment in half for scoring purposes.
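For dichotomous (right/wrong) items, Kuder-Richardson formula 20 can be sketched as follows; the response matrix is invented for illustration:

```python
def kr20(items):
    """Kuder-Richardson formula 20 for 0/1 items.
    `items` holds one list per item, one 0/1 entry per examinee."""
    k, n = len(items), len(items[0])
    # sum of item variances: p*q for each dichotomous item
    pq = sum((sum(it) / n) * (1 - sum(it) / n) for it in items)
    # variance of examinees' total scores (population variance)
    totals = [sum(it[p] for it in items) for p in range(n)]
    m = sum(totals) / n
    var_total = sum((t - m) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical 4-item test taken by five examinees (1 = correct)
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
]
print(round(kr20(responses), 2))
```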
7.                  Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions.  Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
 Inter-rater reliability is increased if the observers have appropriate training. The training should focus on what exactly is meant to be observed. The raters need to be given a clear description of the event to be observed. The classroom observers would need to know what is and is not appropriate behavior.
TYPE OF CONSISTENCY INDICATED BY EACH OF THE METHODS FOR ESTIMATING RELIABILITY

Test-retest (immediate): consistency of testing procedure.
Test-retest (time interval): consistency of testing procedure; constancy of student characteristics.
Equivalent forms (immediate): consistency of testing procedure; consistency over different samples of items.
Equivalent forms (time interval): consistency of testing procedure; constancy of student characteristics; consistency over different samples of items.
Split-half: consistency of testing procedure; consistency over different samples of items.
Kuder-Richardson (coefficient alpha): consistency of testing procedure; consistency over different samples of items.
Interrater: consistency of judgmental scores.

Uses of the reliability coefficient
Summated scales are often used in survey instruments to probe underlying constructs that the researcher wants to measure. These may consist of indexed responses to dichotomous or multi-point questionnaires, which are later summed to arrive at a resultant score associated with a particular respondent. Usually, development of such scales is not the end of the research itself, but rather a means to gather predictor variables for use in objective models. However, the question of reliability arises as the function of scales is stretched to encompass the realm of prediction. One of the most popular reliability statistics in use today is Cronbach's alpha (Cronbach, 1951). Cronbach's alpha determines the internal consistency or average correlation of items in a survey instrument to gauge its reliability. The following example illustrates the use of the ALPHA option of the PROC CORR procedure from SAS(R) to assess and improve upon the reliability of variables derived from summated scales.
If you were giving an evaluation survey, would it not be nice to know that the instrument you are using will always elicit consistent and reliable responses even if questions were replaced with other similar questions? When you have a variable generated from such a set of questions that return a stable response, then your variable is said to be reliable. Cronbach's alpha is an index of reliability associated with the variation accounted for by the true score of the "underlying construct." A construct is the hypothetical variable that is being measured (Hatcher, 1994).
The alpha coefficient ranges in value from 0 to 1 and may be used to describe the reliability of factors extracted from dichotomous (that is, questions with two possible answers) and/or multi-point formatted questionnaires or scales (e.g., rating scale: 1 = poor, 5 = excellent). The higher the score, the more reliable the generated scale is. Nunnally (1978) has indicated 0.7 to be an acceptable reliability coefficient, but lower thresholds are sometimes used in the literature.
For this demonstration, the observed variables that loaded on a latent construct earlier labeled "REGULATE" were submitted to Cronbach's alpha analysis. The following SAS statements initiated the procedure:
PROC CORR ALPHA NOMISS;
VAR SB2 SB3 SB4 SB8 SF1 SF2 SG2;
RUN;
Where:

SB2 ==> Continuation of conservation benefits
SB3 ==> Continuation of government regulation on water quality
SB4 ==> Require farmers to plant grass strips
SB8 ==> Require farmers to keep pesticide application records
SF1 ==> Storage & cooking instructions for meat products
SF2 ==> Strengthen food inspection
SG2 ==> More nutritional information on food label
The first statement invoked the procedure PROC CORR, which implements the option ALPHA to do Cronbach's alpha analysis on all observations with no missing values (dictated by the NOMISS option). The VAR statement lists all the variables to be processed for the analysis. Incidentally, the listed variables, except SB8, were the ones that loaded high (i.e., showed high positive correlation) in factor analysis. The output from the analysis is shown in Table 1.
Table 1
Output of alpha analysis for the items included in the "REGULATE" construct
                 Correlation Analysis

              Cronbach Coefficient Alpha
          for RAW variables         :  0.76729
          for STANDARDIZED variables:  0.77102

          Raw Variables              Std. Variables

Deleted    Correlation              Correlation
Variable   with Total      Alpha    with Total      Alpha
---------------------------------------------------------
SB2        0.365790     0.764471    0.358869     0.772209
SB3        0.356596     0.765262    0.350085     0.772623
SB4        0.444259     0.779964    0.434180     0.781626
SB8        0.185652     0.808962    0.176243     0.816080
SF1        0.426663     0.761443    0.443533     0.769178
SF2        0.401001     0.763201    0.418211     0.773390
SG2        0.419384     0.762229    0.434247     0.770623
The raw variable columns were used instead of the standardized columns since the variances showed a limited spread (data not shown). Had there been a mixture of dichotomous and multi-point scales in the survey, we would have had relatively heterogeneous variances in which case the use of standardized variables would have been more appropriate. As it is, the procedure output has an overall raw alpha of .77 (rounded from .76729 from the top of table) which is good considering that .70 is the cutoff value for being acceptable.
The printed output facilitates the identification of dispensable variable(s) by listing the deleted variables in the first column together with the expected resultant alpha in the same row in the third column. For this example, the table indicates that if SB8 were deleted, the value of raw alpha would increase from the current .77 to .81. Note that the same variable has the lowest item-total correlation value (.185652). This indicates that SB8 is not measuring the same construct as the rest of the items in the scale. With this process alone, not only was the author able to come up with the reliability index of the "REGULATE" construct, but he also managed to improve on it. Removal of SB8 from the scale will make the construct more reliable for use as a predictor variable.
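The same diagnostic can be reproduced outside SAS. The sketch below (illustrative data, population variances) computes coefficient alpha and the "alpha if item deleted" values the table reports:

```python
def cronbach_alpha(items):
    """Coefficient alpha from a list of per-item score lists
    (one list per item, one entry per respondent)."""
    k, n = len(items), len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(it[p] for it in items) for p in range(n)]
    return (k / (k - 1)) * (1 - sum(var(it) for it in items) / var(totals))

def alpha_if_deleted(items):
    """Alpha for the scale with each item removed in turn,
    mirroring the 'alpha if deleted' column of the SAS output."""
    return [cronbach_alpha(items[:i] + items[i + 1:])
            for i in range(len(items))]

# Hypothetical responses: items 1-3 hang together, item 4 does not
items = [
    [1, 2, 3, 4, 5, 6],
    [2, 2, 3, 4, 5, 5],
    [1, 2, 2, 4, 4, 6],
    [3, 1, 4, 2, 3, 2],
]
print(round(cronbach_alpha(items), 2))           # overall alpha
print([round(a, 2) for a in alpha_if_deleted(items)])
```

Dropping the weak fourth item raises alpha above the full-scale value, which is exactly the pattern the SAS output revealed for SB8.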

Analysis of variance (ANOVA) approach to reliability
Analysis of Variance (ANOVA) refers to the procedure of partitioning the variability of a data set to conduct various significance tests. In experiments where only a single factor is investigated, the analysis of variance is referred to as one-way ANOVA. The basic assumption in applying ANOVA is that the response is normally distributed. The variance of the response is divided into the variance that can be attributed to the investigated characteristic (or factor) and the variance that can be attributed to the randomness that is seen to occur naturally for the response. The former is referred to as the treatment mean square, MSTR, while the latter is referred to as the error mean square, MSE. A ratio of the two terms is used to conduct the F test.
F = MSTR / MSE        (1)
The ratio of Eqn. (1) is referred to as the F ratio. It is assumed that if the investigated factor does not affect the response, then MSTR will not be significantly different from MSE. In such cases, the F ratio will be close to a value of 1 and will follow the F distribution. On the other hand, if the investigated factor does affect the response, then the F ratio will not follow the F distribution and the p value corresponding to the F ratio will indicate this.
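A minimal sketch of the partition (the two groups below are illustrative data): compute MSTR, MSE, and their ratio.

```python
def one_way_anova_f(groups):
    """Partition variability for a one-way layout and return
    (MSTR, MSE, F ratio)."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    # treatment sum of squares: group sizes times squared mean offsets
    sstr = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # error sum of squares: spread of observations within their group
    sse = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    mstr, mse = sstr / (k - 1), sse / (n - k)
    return mstr, mse, mstr / mse

mstr, mse, f = one_way_anova_f([[4, 5, 6], [7, 8, 9]])
print(mstr, mse, f)  # 13.5 1.0 13.5
```

The large F here reflects group means that differ far more than the within-group scatter; the p value would come from comparing F against the F distribution with (k − 1, n − k) degrees of freedom.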


References
1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
2. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
3. Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10).


An Overview of Psychometric Properties of a Scale

Psychometrics:
Psychometrics is the field of study concerned with the theory and technique of psychological measurement which includes the measurement of knowledge, abilities, attitudes, personality traits, and educational measurement. The field is primarily concerned with the construction and validation of measurement instruments such as questionnaires, tests, and personality assessments. It involves two major research tasks, namely: (i) the construction of instruments and procedures for measurement; and (ii) the development and refinement of theoretical approaches to measurement.
Psychometric Properties:
 The psychometric properties of a psychological test relate to the data that has been collected on the test to determine how well it measures the construct of interest. In order to develop a good psychological test, the new test is subjected to statistical analyses to ensure that it has good psychometric properties.
There are two broad types of psychometric properties that a test must have in order to be considered a good measure of a particular construct.
1-Reliability: This is the test's ability to measure the construct of interest consistently.
       Types of Reliability:
1- Test-Retest Reliability: give the same test twice to a group and correlate the two sets of scores.
2- Equivalent-Forms Reliability: correlation between scores on two forms of the same test, taken concurrently or at different times.
3- Split-Half Reliability: a measure of the internal consistency of a test; the correlation between scores on the odd-numbered and even-numbered items of a single test.
4- Kuder-Richardson: measures internal consistency; used to estimate test reliability from one administration of a test.
5- Scorer Reliability: compares the two scores assigned to the same test by two independent scorers.
2-Validity: Validity refers to how well the test accurately measures the construct of interest. It is an agreement/relationship between a test score and the quality it is intended to measure.
 Types of validity:
1- Face Validity: the "appearance" of validity. If all you have is face validity, no inferences or generalizations can be made about actual behavior; you need to compare test responses to actual behavior (e.g., aggressiveness in school). Face validity is assessed by giving the test items to a panel of experts and asking for their judgment on the proximity of the test content to the construct of the test.
2- Content-Related Validity: a test possesses content validity to the extent that it provides an adequate representation of the construct you are trying to measure.
3- Criterion-Related Validity: how well the test can predict relevant aspects of the future behavior you are interested in. Two types of criterion-related validity:
(a) Predictive Validity: the "forecasting" function of a test. How well can the test predict future performance? The predictor is the score you obtain on the test.
(b) Concurrent Validity: you want to know how the person is behaving now. Test and criterion are taken at the same time, to assess the person's current performance (the simultaneous relationship between test and criterion).
4- Construct Validity: construct validity looks into the agreement between a theoretical concept and a specific measuring procedure. For example, where a researcher invents a new instrument that intends to measure IQ, the researcher might need to spend time attempting to "define" intelligence in order to achieve an acceptable level of construct validity. Construct validity can further be subdivided into convergent validity and discriminant validity.
Convergent validity is a general agreement between measures where theoretically they should be related. On the other hand, discriminant validity is a general disagreement between measures where theoretically they should not be related [16]. Both convergent and discriminant validities will be examined by using the item-scale correlations; convergent validity indicates correlation between an item and its own scale, while discriminant validity indicates correlation between an item and any of the other scales [17].
Construct validity is assessed when you first create a test to measure a construct of interest. Two processes are used to determine construct validity:
a) Convergent Validity: the new test needs to correlate well with other tests believed to be measuring the same construct. This is assessed by administering the test along with other established tests developed on theoretically similar constructs and examining the correlation between the two.
b) Discriminant Validity (Divergent Validity): you need to show that your test measures something a bit different from other tests that purport to measure the same construct (the uniqueness of the test); why devise a new test if there is already one around? This is assessed by administering the test along with theoretically opposite tests and exploring the correlation.
3-Internal consistency:
It indicates the extent to which items on a test measure the same construct. A high internal consistency reliability coefficient indicates that the items on the test are very similar to each other in content (homogeneous). It is important to note that the length of a test also affects internal consistency: a very long test can have a spuriously inflated reliability coefficient. Internal consistency is commonly measured as Cronbach's Alpha, which ranges between 0 (low) and 1 (high).
4-Dimensionality:
A scale's dimensionality, or factor structure, refers to the number and nature of the variables reflected in its items. First, researchers must understand the number of psychological variables, or dimensions, reflected in the items. A scale's items might be unidimensional, all reflecting a single common psychological variable, or they might be multidimensional, reflecting two or more psychological variables. The second core dimensionality issue is, if a scale is multidimensional, whether the dimensions are correlated with each other. The third dimensionality issue is, again if a scale is multidimensional, the psychological meaning of the dimensions: researchers must identify the nature of the psychological variables reflected by the dimensions.



Psychometric Properties Evaluation
Psychometric properties are defined as the elements that contribute to the statistical adequacy of the instrument in terms of reliability and validity [23]. When both validity and reliability analyses produce reasonably good results, the translated questionnaire can be declared to have acceptable psychometric properties.

5- Test length
Test construction is not just a simple matter of throwing any kind of item into a main batch. It is crucial to decide what type of item format is required.
The length of the test should be suitable for that particular group. It would be a waste of time to test someone with a major depressive disorder on a test which requires 3 hours of heavy concentration. In addition, test items can come in different formats and styles such as multiple-choice questionnaires, true-false items, forced-choice, and closed/open-response items.
Validation: evidence for the validity of a test comes from demonstrating relationship/correlation between the test and other attributes it purports to measure. Three types of validity that need to be demonstrated ;content, criterion & construct. We must have convincing proof that there is a relationship between our test and what it claims to measure before we are justified in saying there is this connection and that this test is a valid measure of what we are interested in.
6-Standardization:
Item selection is never a perfect system; it always involves some item measurement error in assessment tests. That is why careful consideration is applied to the planning and implementation of item selection from the beginning to the end stages, to minimize measurement error.
1.                  Item analysis phase: The item analysis phase involves 3 different types of item statistics: 1. item difficulty value, 2. item discrimination value, 3. item-total correlation. Item statistics help the researcher choose the most suitable items. It is always wise to try to adapt tests on a homogeneous basis, which means taking into account many different demographic features of the test person, such as age, sex, social/economic background, educational status and, most important, cultural differences.

References:

1. http://bpd.about.com/od/glossary/g/Psychometric-Properties.htm
2. http://courseweb.edteched.uottawa.ca/PSY1102C/Lectures/psychometric.htm


Paradigms of Approaches to Scales Development

1.                  Definition of Paradigm
1.                  Example, pattern; especially an outstandingly clear or typical example or archetype
2.                  a philosophical and theoretical framework of a scientific school or discipline within which theories, laws, and generalizations and the experiments performed in support of them are formulated; broadly : a philosophical or theoretical framework of any kind
3.                  A typical example or pattern of something; a model.
4.                  A worldview underlying the theories and methodology of a particular scientific subject.
2.                  Definition of approach
1.                  A way of dealing with a situation or problem
2.                  Start to deal with (a situation or problem) in a certain way
3.                  Measurement Scales
Scaling defined:
Procedures for assigning numbers (or other symbols) to properties of an object in order to impart some numerical characteristics to the properties in question.
4.                  Scaling Approaches:
1.                  Unidimensional:
Measures only one attribute of a concept, respondent, or object.
2.                  Multidimensional:
Measures several dimensions of a concept, respondent, or object.
5.                  Types of Scales:
1.                  Noncomparative Scale:
Scales in which judgment is made without reference to another object, concept, or person.
2.                  Comparative Scale:
Scales in which one object, concept, or person is compared with another on a scale.
6.                  Churchill’s paradigms
Here is Figure from "A Paradigm for Developing Better Measures for Marketing Constructs". Gilbert A. Churchill, Jr., Journal of Marketing Research, 16:1. (Feb., 1979)

(Churchill, 1979)(A Paradigm for Developing Better Measures of marketing Constructs)
1.                  Introduction
1.                  Measurements are “rules for assigning numbers to objects to represent qualities of attributes”.
2.                  What is measured? ATTRIBUTES of objects. NOT objects themselves.
3.                  What is the goal? To have measures that are RELIABLE and VALID
4.                  Construct
1.                  Construct, e.g. customer satisfaction
2.                  True level of satisfaction (True score) denoted  Xt
3.                  Observed score X0, rarely similar to Xt due to differences in stable characteristics, transient  personal factors, situational factors etc.
4.                  Validity and Reliability
X0 = Xt + Xs + Xr, where
1.                  Xs – systematic source of error
2.                  Xr – random source of error
Validity: X0 = Xt
Perfect reliability: Xr = 0
1.                  Validity => Reliability
2.                  Reliability is necessary but not sufficient for Validity
3.                  Validity and Reliability (2)
1.                  Objective: find X0 that approximate Xt
2.                  Measures are inferences, their “goodness” is supported by the evidence, that is based on reliability or validity index
3.                  Reliability forms: split-half, test-retest etc.
4.                  Validity forms: face, content, predictive, concurrent, pragmatic, construct, convergent, discriminant.
5.                  Specify domain of the construct
1.                  Exactly defining what is included in the definition and what is excluded
2.                  Consulting the literature
3.                  Widely varying definitions should be avoided
4.                  Example: to measure customer satisfaction
1.                  Measure both expectations at the time of purchase and reactions at some time after the purchase
2.                  Expectations: cost, durability, quality, operating performance, aesthetic features, sales assistance, advertising, availability of competitors' alternatives.
3.                  Generate sample of items
1.                  Literature searches
2.                  Experience surveys
3.                  Insight-stimulating examples
4.                  Critical incidents and focus groups
5.                  Purify the measure
1.                  Domain sampling model: purpose of any particular measurement is to estimate the score that would be obtained if all the items in the domain were used
2.                  In practice use of SAMPLE of items
3.                  Measurement error due to inadequate sampling
4.                  Correlation matrix of the items in the domain
1.                  Average correlation in the matrix
2.                  Dispersion of the correlation about the average
1.                  Assumption: all items, “if they belong to the domain of the concept, have an equal amount of common core”
2.                  Coefficient Alpha
1.                  Measure of internal consistency of a set of items
2.                  Low coefficient alpha indicates that the sample of items badly describes the construct which motivated the measure
3.                  Procedure when alpha is low: some items should be eliminated.
1.                  Calculate correlation of each item with total score
2.                  Plot the correlations by decreasing order of magnitude
3.                  Items with correlations near zero should be eliminated
4.                  Items of substantial drop in the item-to-total correlations also deleted
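The elimination steps above can be sketched as code; the data and the 0.30 cutoff below are illustrative assumptions, not fixed rules:

```python
def item_total_correlations(items):
    """Correlate each item (a list of respondent scores) with the
    total score across all items."""
    n = len(items[0])
    totals = [sum(it[p] for it in items) for p in range(n)]

    def corr(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    return [corr(it, totals) for it in items]

def purify(items, cutoff=0.30):
    """Drop items whose item-total correlation is near zero."""
    rs = item_total_correlations(items)
    return [it for it, r in zip(items, rs) if r >= cutoff]

# Illustrative data: items 1-3 share a common core, item 4 does not
items = [
    [1, 2, 3, 4, 5, 6],
    [2, 2, 3, 4, 5, 5],
    [1, 2, 2, 4, 4, 6],
    [3, 1, 4, 2, 3, 2],
]
print(len(purify(items)))  # item 4 is eliminated, 3 items remain
```

In practice one would also plot the correlations in decreasing order and look for a substantial drop, as the steps describe, rather than relying on a single fixed cutoff.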
1.                  It is a mistake to use split-half reliability at this stage
2.                  Purify the measure (2)
1.                  Desirable outcome: high coefficient alpha, dimensions agree with those conceptualized. Then, additional testing with a new sample of data.
2.                  Second outcome: Factor analysis suggests the overlapping dimensions. Items with pure loadings on the new factor are retained, new alpha calculated.
3.                  Non-desirable outcome: alpha coefficient is low and restructuring of items forming each dimension is unproductive. Loop back to 1. and 2.
4.                  Assess Reliability with new Data
1.                  Source of error within a test or measure is the sampling of items.
2.                  Coefficient alpha is the basic statistic for determining the reliability of a measure based on internal consistency, but it does not estimate errors external to the instrument.
3.                  Collect additional data to rule out the chance possibility of previous findings
4.                  Do not use test-retest reliability
5.                  Assess Construct Validity
1.                  Face or content valid measure has an appropriate sample
2.                  To establish construct validity
1.                  Determine the extent to which the measure correlates with other measures designed to measure the same thing
2.                  Determine whether the measure behaves as expected
3.                  Correlations with Other Measures
1.                  Any construct or trait should be measurable by at least two different methods
2.                  Convergent validity – extent to which it correlates highly with other methods designed to measure the same construct
3.                  Discriminant validity – the extent to which the measure is novel and does not simply reflect some other construct
4.                  Multitrait-multimethod matrix: methods and  traits generating it should be as independent as possible
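A toy multitrait-multimethod matrix can be simulated to show the expected pattern; the data below (two traits, two methods, error levels) are invented for illustration, not drawn from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t1 = rng.normal(size=n)  # latent trait 1
t2 = rng.normal(size=n)  # latent trait 2, independent of trait 1

# Column order: T1-by-M1, T2-by-M1, T1-by-M2, T2-by-M2;
# each measurement adds its own error term.
scores = np.column_stack([
    t1 + 0.5 * rng.normal(size=n),
    t2 + 0.5 * rng.normal(size=n),
    t1 + 0.5 * rng.normal(size=n),
    t2 + 0.5 * rng.normal(size=n),
])
mtmm = np.corrcoef(scores, rowvar=False)

convergent = mtmm[0, 2]    # same trait, different methods: should be high
discriminant = mtmm[0, 1]  # different traits, same method: should be low
```

In a real study the matrix is built from observed measures, and shared method variance can inflate the same-method correlations, which is why the methods generating the matrix should be as independent as possible.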
6.                  Does the measure behave as expected?
1.                  Internal consistency is a necessary but not a sufficient condition for construct validity
2.                  Assess whether scale correctly predicts criterion measure (criterion validity)
1.                  The constructs job satisfaction (A) and likelihood of quitting the job (B) are related.
2.                  The scale X provides a measure of A.
3.                  Y provides a measure of B.
4.                  X and Y correlate positively.
1.                  Establish the validity by relating the measure to a number of other constructs and  not only one
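The job-satisfaction example above can be sketched with simulated data; the effect sizes here are invented, and the point is only that a theorized A-B link should reappear as an X-Y correlation if X is a valid measure of A:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)            # construct A (e.g., job satisfaction)
b = a + rng.normal(size=n)        # construct B, related to A by theory
x = a + 0.5 * rng.normal(size=n)  # scale X, intended to measure A
y = b + 0.5 * rng.normal(size=n)  # measure Y of construct B

# If X validly measures A, the theoretical A-B relationship
# should show up as a nonzero X-Y correlation.
r_xy = np.corrcoef(x, y)[0, 1]
```

As the surrounding notes say, one such correlation is weak evidence on its own; validity is better established by relating the measure to several other constructs, not only one.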
2.                  Developing Norms
1.                  The position of an individual on the characteristic is assessed by comparing the person’s score with the scores achieved by other people
2.                  Norm quality depends on both the number of cases on which the average is based and their representativeness
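A minimal sketch of norm-referenced interpretation, assuming a normative sample of raw scores (the helper name is hypothetical):

```python
import numpy as np

def locate_score(norm_scores, raw_score):
    """Compare one raw score with a normative sample: z-score and percentile rank."""
    norm_scores = np.asarray(norm_scores, dtype=float)
    z = (raw_score - norm_scores.mean()) / norm_scores.std(ddof=1)
    percentile = 100.0 * (norm_scores < raw_score).mean()
    return z, percentile
```

Both numbers are only as trustworthy as the normative sample itself: the number of cases and their representativeness.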
3.                  Churchill’s (1979) and Gerbing and Anderson’s (1988) paradigms
The purpose of measurement in theory testing and development research is to provide an empirical estimate of each theoretical construct of interest. Because of the limitations inherent in single-item measures (cf. Churchill 1979), respondents usually are administered two or more measures, often referred to as a scale, that are intended to be alternative indicators of the same underlying construct. A composite score defined by the respondent's scores on these measures, generally calculated as an unweighted sum, provides an estimate of the corresponding construct. Our central thesis is that the computation of this composite score is meaningful only if each of the measures is acceptably unidimensional. Unidimensionality refers to the existence of a single trait or construct underlying a set of measures (Hattie 1985; McDonald 1981). The importance of unidimensionality has been stated succinctly by Hattie (1985, p. 49): "That a set of items forming an instrument all measure just one thing in common is a most critical and basic assumption of measurement theory."
Because the meaning of a measure intended by the researcher may not be the same as the meaning imputed to it by the respondents, the scale development process must include an assessment of whether the multiple measures that define a scale can be acceptably regarded as alternative indicators of the same construct. Building on the earlier work of Churchill (1979) and Peter (1979, 1981), we outline an updated paradigm for scale development that incorporates confirmatory factor analysis (cf. Bentler 1985; Joreskog and Sorbom 1984) for the evaluation of unidimensionality. The key aspect of this updated paradigm is that confirmatory factor analysis affords a stricter interpretation of unidimensionality than can be provided by more traditional methods such as coefficient alpha, item-total correlations, and exploratory factor analysis and thus generally will provide different conclusions about the acceptability of a scale.
Contributing to the tradition of articles by Churchill (1979) and Peter (1979, 1981), we outline an updated paradigm for scale development that incorporates a more recent methodological development: confirmatory factor analysis. In doing so, we attempt to provide a better understanding of the concept of unidimensional measurement and the ways in which it can be assessed and, in particular, to demonstrate that an explicit evaluation of unidimensionality is accomplished with a confirmatory factor analysis of the individual measures as specified by a multiple-indicator measurement model. Coefficient alpha is important in the assessment of reliability, but it does not assess dimensionality. Though item-total correlations and exploratory factor analysis can provide useful preliminary analyses, particularly in the absence of sufficiently detailed theory, they do not directly assess unidimensionality. The reason is that a confirmatory factor analysis makes possible an assessment of the internal consistency and external consistency criteria of unidimensionality implied by the multiple-indicator measurement model.
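As a rough preliminary screen of the kind described above (exploratory only; it does not replace a confirmatory factor analysis), one can inspect the eigenvalues of the item correlation matrix: a single dominant eigenvalue is consistent with, but does not establish, unidimensionality. A sketch, assuming NumPy:

```python
import numpy as np

def correlation_eigenvalues(items):
    """Eigenvalues of the item correlation matrix, largest first.
    One dominant eigenvalue suggests (but does not prove) a single factor."""
    corr = np.corrcoef(np.asarray(items, dtype=float), rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]
```

For a genuinely one-factor item set, the first eigenvalue is large and the remainder are small; a second sizable eigenvalue signals overlapping dimensions that a confirmatory model would need to represent explicitly.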
Following the paradigm of scale development outlined here, after the unidimensionality of a set of scales has been acceptably established, one would assess its reliability. Even a perfectly unidimensional scale will not be useful in practice if the resultant scale score has unacceptably low reliability. Because most measures in marketing are administered at a single point in time, coefficient alpha or some other coefficient of equivalence reliability would probably be used for this assessment.
The goal of most research projects is not just to develop unidimensional and reliable measurement scales, but to build and test theory. Essential to this undertaking is the assessment of construct validity. A construct achieves its meaning in two ways (Anderson 1987; Cronbach and Meehl 1955): (1) through observed indicators for which it is posited to be causally antecedent (and through observed measures for which it is not) and (2) through the set of relationships of the construct with other constructs as specified by some theory (the nomological network). Unidimensionality, then, is necessary but not sufficient for construct validity. Not only should all the indicators that define a scale provide estimates of exactly one factor, but the meaning of the underlying factors should correspond to the construct of interest.
The nomological network can be explored within the context of the full structural equation model. One means for accomplishing this is the approach developed by Anderson and Gerbing (1988) that allows an assessment of nomological validity that is asymptotically independent of the assessment of the measurement model. It is called a "two-step" approach because the measurement model first is developed and evaluated separately from the full structural equation model that simultaneously models measurement and structural relations. The measurement model in conjunction with the structural model makes possible a comprehensive confirmatory assessment of construct validity (Bentler 1978). Hence, the assessment of unidimensionality provided by a confirmatory factor analysis represents but a first step in the establishment of meaning for the estimated factors.
(Gerbing & Anderson, 1988)
4.                  Loewenthal (1996)  Approach
1.                  Features of Good Psychological Measures
1.                  A statement of what the scale measures;
2.                  Justification for the scale: its uses and advantages over existing measures
3.                  A description of how the pool of items was drawn up
4.                  A description of the sample used for testing
5.                  An indication of the population  (kind of people) for whom the measure would be appropriate
6.                  Descriptive statistics (norms): means, standard deviations, ranges, different subscales
7.                  Reliability statistics
8.                  Validity statistics
9.                  The scale itself (introduction, items or examples of items)
10.              Writing
1.                  Defining what you want to measure
The first and very important step is to work out and then write down exactly what you want to measure
2.                  Collecting items
3.                  Producing the preliminary questions or test
4.                  Testing
1.                  Deciding on a sample and reducing sample bias
2.                  Reducing methods
3.                  Testing
4.                  Data and preliminary analysis
1.                  Coding, scoring and data entry
2.                  Selecting reliable items
3.                  Descriptive statistics (norms) for final scale
4.                  Steps for data entry and reliability
5.                  Factor and principal factor analysis
6.                  The final scale and its validation
1.                  Descriptive statistics (norms)
2.                  Validity
3.                  Presenting the scale
9. Eclectic Approach
1. Definition
Selecting or choosing from various systems, methodologies, etc.; not following any one system.
Made up of elements selected from various sources: an eclectic philosophy.
2. Eclecticism is a conceptual approach that does not hold rigidly to a single paradigm or set of assumptions, but instead draws upon multiple theories, styles, or ideas to gain complementary insights into a subject, or applies different theories in particular cases. It can sometimes seem inelegant or lacking in simplicity, and eclectics are sometimes criticized for lack of consistency in their thinking. It is, however, common in many fields of study.
3.                  The theory of internalization itself is based on transaction cost theory. This theory says that transactions are made within an institution if the transaction costs on the free market are higher than the internal costs. This process is called internalization.
4.                  Theory
The idea behind the Eclectic Paradigm is to merge several isolated theories of international economics into one approach.
5.                  Mixed-Method Design for Scale Development
Scale development guidelines by Churchill (1979) and DeVellis (2003) clearly involve two phases: an exploratory phase (quantitative and qualitative data gathering), followed by a confirmatory phase (quantitative) involving purification and confirmation of the scale. The mixed-method approach is a suitable research design to be applied in scale development research, especially when the objective is to discover in-depth knowledge of a complex phenomenon in a social context (Creswell, 2008; Bryman, 2008) and to test hypotheses (Creswell, 2008). In exploratory MMR, two distinct types of qualitative methods are employed.
i. The first are studies that review the literature to gather dimensions or indicators of service quality and present them to focus groups to revise the existing instruments.
ii. The second is the use of qualitative data gathering through focus group  interviews to discover new indicators or dimensions of service quality. 
1.                  References
1.                  Churchill, G. A., Jr., (1979). A paradigm for developing better measures of marketing constructs. Journal of Marketing Research, 16(February), 64-73.
2.                  Gerbing, D. W., & Anderson, J. C. (1988). An updated paradigm for scale development incorporating unidimensionality and its assessment. Journal of Marketing Research, 25(2), 186-192.
3.                  Loewenthal, K. M. (2001). An introduction to psychological tests and scales (2nd ed.). Hove, East Sussex: Psychology Press.
4.                  http://www.rasch.org/rmt/rmt222c.htm April 16, 2013
5.                  http://en.wikipedia.org/wiki/Eclectic_paradigm April 16, 2013
6.                  www.psypress.co.uk April 16, 2013
7.                  http://oxforddictionaries.com/definition/english/paradigm April 23, 2013
8.                  http://www.thefreedictionary.com/approach April 23, 2013
9.                  http://www.thefreedictionary.com/eclectic April 23, 2013 

10.              http://en.wikipedia.org/wiki/Eclecticism April 23, 2013