Contrasting Ways to Appraise Improvement in a Writing Course:
Paired Comparison and Holistic

Richard H. Haswell
Department of English
Washington State University

Talk given at the
College Composition and Communication Conference
April 1988, St. Louis, Missouri

ERIC Document Reproduction Service, ED 294 215

“The altering eye alters all”--William Blake

Blake’s old truth, that the way we go about seeing something alters what we see of it, is one we know full well but forget full often. It is a truth that those involved in evaluation of writing must think about constantly. As new methods of assessment slowly and surely proliferate, we ought to be comparing the different images of writing that the different methods produce.

It is one such comparison that I will present here. I will describe two different formal assessments of one sample of pre/post university-freshman, composition-course writing, and will compare the results. Of pragmatic interest is how the two methods compare in terms of cost, reliability, and validity. Of theoretical interest is how the two distinct methods of evaluation generate different pictures of the same pieces of writing. Each picture records some outcomes which the other misses.

The general aim of this report is to introduce a new procedure for assessing the effect of a writing course on students--a procedure I will call intra-subject paired comparison. The report, both the study and its findings, should be considered preliminary.

Background

Since one of the two assessment procedures--the holistic--is familiar, I will start by describing the other. I should note at the outset that, new as this procedure is, the circumstances that prompted it are familiar, namely English department paranoia--that periodic crisis formed of mutterings from department members that the freshman writing course is a waste of resources, of committee-room complaints from the agronomy faculty that English is not teaching their students to write, of informal concerns expressed formally by deans, and so on. In sum, this new procedure grew out of a political need to show that students in an English department’s one-semester freshman writing course (at a large, land-grant state university) really did improve their writing.

At the time I was aware that holistic methods of assessing writing improvement sometimes had failed to detect that improvement in similar courses in the past, or had placed improvement not very convincingly within the parameters of acceptable statistical confidence. I feared, as others had, that holistic scales were not fine enough to capture the small but meaningful progress which many teachers see their students achieving during the course. I also knew that a major problem with holistic information is that it provides, finally, little that is diagnostically useful to teachers. Compared to first-semester essays, end-of-semester essays may rise a statistically significant 1.2 points on a 6-point scale, but that tells us nothing about what students have learned or not learned (Edward M. White summarizes these problems well in Chapter 2 of Teaching and Assessing Writing, San Francisco: Jossey-Bass, 1986).

So I devised the following scheme. It aims to detect even small increments of individual improvement by forcing a comparison between two essays, one composed early and the other late in the course. It departs from the holistic method by making this comparison analytically, between separate writing subskills. It also departs from the holistic in that it directly compares the writings of the same student, not of different students. For that reason I will call the method “intra-subject paired comparison,” or IPC for short.

The Intra-Subject Paired Comparison Method (IPC)

The IPC evaluator rates essays in batches of two. The rater knows that each pair of essays was written by the same student, pre and post compositions written on switched topics, but does not know which is pre and which post. On the scoring sheet for each pair of essays (Figure 1), then, will be recorded the rater’s impressions not of one essay but of two essays. The rater’s first task is to compare the companion essays in terms of “Ideas” (or content). If the left-hand essay (position is assigned by chance) is greatly better in terms of its ideas, the rater marks the left box “GB”; if just obviously better, the left box “OB”; if only a little better, even if only a minim better, the left box “LB”. If, on the other hand, the right-hand essay is a touch better, or obviously so, or greatly so, then the appropriate boxes on the right are marked. Note that the two essays are ranked, then, in terms of ideas. Also note that the “Little Better” box fits the situation where a rater can only intuit a difference. There is no box letting the rater off the hook by declaring that no difference exists between the two essays in any aspect.

Such forced-choice comparisons have been used before--for instance in Andrew Kerek, Donald A. Daiker, and Max Morenberg’s 1976-7 University of Miami study (see Sentence Combining and College Composition, Monograph Supplement 1-V51 of Perceptual and Motor Skills, 1980, pp. 1109, 1117-1119)--but not, as far as I know, between subskills of pieces of writing produced by the same writer.

The rater continues the assessment by making a similar comparison in terms of Support, then Organization, Diction (or word choice), Sentence Structure (or syntax), and Mechanics (or surface error). This categorization of subskills, incidentally, is based in part on Paul B. Diederich’s factoring of teacher responses to writing (there is a convenient summary in Chapter 2 of Measuring Growth in English, Urbana: National Council of Teachers of English, 1974). My first four subskills repeat his factors. But in place of what he calls “flavor,” I have Sentence Structure and Mechanics, as categories more easily distinguished by raters and more useful for teachers.
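
To make the structure of the scoring sheet concrete, here is a minimal sketch in Python. All names are mine and hypothetical; the study itself used paper sheets, so this is illustration, not the original instrument.

    # One rater's IPC sheet for one pair of essays: for each subskill,
    # a forced choice of side (left or right essay) and strength.
    SUBSKILLS = ["Ideas", "Support", "Organization",
                 "Diction", "Sentences", "Mechanics"]
    SIDES = ("left", "right")        # position of pre/post assigned by chance
    STRENGTHS = ("LB", "OB", "GB")   # Little / Obviously / Greatly Better

    def record_choice(sheet, subskill, side, strength):
        """Record one forced choice; there is no 'no difference' option."""
        assert subskill in SUBSKILLS and side in SIDES and strength in STRENGTHS
        sheet[subskill] = (side, strength)

    sheet = {}
    record_choice(sheet, "Ideas", "left", "OB")   # left essay obviously better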

The rater ends by making two more comparisons. One is in terms of overall quality of the essay. This is an acknowledgment of the holistic premise that the artistic whole of an essay may be greater than the sum of its writerly parts, or different from an averaging of these parts. Finally, the rater judges, separately for each essay, whether the essay is of passing quality for the course. This is an acknowledgment that the IPC scheme fails, rather blatantly, to be criterion referenced. It judges whether a sample of one student’s writing has improved or regressed from an earlier sample, and roughly how much it has improved or regressed. But it does not judge the quality of either sample relative to any outside standard. Take one rater’s assessment of one pair of essays (Figure 2). The writing here may represent a student progressing from F to D work, or from B to A work. In our original evaluation using the IPC method, each pair of essays was assessed this way by three trained raters, working independently. Figure 3 shows a scoring sheet combining these three assessments and determining the final assessment for this pair. (The samples of writing, it should be understood, were unrehearsed, 50-minute, in-class essays.) The unadorned X’s represent the independent judgments, the circled X the final decision. Notice that here, with two raters selecting Essay B as a little better organized than Essay A, but another rater selecting Essay A as obviously better, a fourth rater was needed. Essays were re-read when there was one Obviously Better or Greatly Better mark on one side with other marks on the other side. Essay B, in fact, is the post essay--a fact the teacher in us is happy to see.
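
The re-reading rule just described can be stated exactly. Below is a minimal sketch, continuing the hypothetical encoding above; “A” and “B” stand for the two essays of a pair.

    def needs_fourth_rater(choices):
        """choices: the three raters' (essay, strength) picks for one
        subskill. Re-read when the raters split between the two essays
        and at least one pick is 'OB' or 'GB'."""
        essays = {essay for essay, _ in choices}
        strong = any(strength in ("OB", "GB") for _, strength in choices)
        return len(essays) > 1 and strong

    # The Organization decision of Figure 3: two raters pick Essay B a
    # little better, one picks Essay A obviously better -- re-read.
    print(needs_fourth_rater([("B", "LB"), ("B", "LB"), ("A", "OB")]))  # True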

It may help to visualize the IPC procedure by comparing its assessment of the two essays rated in Figure 3 with the assessment recorded by an independent holistic procedure (a general comparison of these two assessments will be taken up below). When these two essays were read holistically, independently of each other, they achieved the identical summed score, a score in the lowest category of the holistic scale. So the holistic shows what this part of the IPC could not, that these are two very poor essays; and the IPC shows what the holistic could have shown but did not, that in some ways the end-of-the-semester essay is an improvement over the beginning essay. That the holistic rating finds it hard to distinguish differences among very poor essays is a point I will return to. For the moment, notice another difference here. Since the lowest category of the holistic assessment was defined as “failing,” the raters looking at these two essays through the holistic method must have seen pieces of writing that would not pass the course. But the raters looking at them through the IPC method saw only one as failing, the other as passing. (The student, incidentally, got a C+ for the semester.)

Only 12 of the 40 students we assessed showed, as in Figure 3, course improvement in all 6 subskills. Another pair of essays is more typical (Figure 4). Essay B is the end-of-the-semester piece. The holistic rated it a complete failure, all 4 raters giving it the bottom rate (summed score = 4). They rated Essay A here as weak but passing, giving it a score 2 1/2 times as high as Essay B (summed score = 10). The IPC evaluation, on the other hand, saw Essay B as better in the majority of subskills and, perhaps consequently, as the better essay overall.

Results of the IPC Assessment

Before proceeding to a full comparison of the holistic and the IPC methods, it is worth showing how the IPC assessed the composition course as a whole. We randomly selected 40 students from a number of sections of the course to write on pre/post switched topics. More precisely, all students in the freshman course wrote on the pre topic during the second meeting of class, but only four sections were selected, toward the end of the semester, to write on the post topic. The choice of sections was random, but it was checked to make sure we had a reasonable representation in terms of teacher experience and skill. Nineteen percent of the students in these sections failed to take the post-test. It can be argued that this attrition may have helped the ultimate finding of writing improvement in the course, but it should be remembered that generally the poorest students progress the most during a writing course.

Figure 5 presents the IPC summary data. Pre essays are on the left, post on the right. The percentages here show how often a category was picked among the 40 pairs of essays, so rows add up to 100%. To test statistically for overall improvement, at least two procedures are appropriate and familiar. Individual preferences for pre or post in all categories might be summed and a sign test administered: here the IPC found 156 out of 240 choices (65%) favoring post essays (p < .001). Or a chi square could be run, which here produced a chi-square value of 26.61 (p < .05). For individual subskills the less powerful statistic, the sign test, was applied here; it simply uses the number of times a post essay was judged an improvement at all (collapsing the LB, OB, and GB distinctions) over a pre essay, in each category. The sign test has the advantage of being easily grasped and easily computed (one hardly needs even a hand calculator). As Figure 5 indicates, 4 of the 6 categories show significant improvement.
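
For readers who wish to verify the arithmetic, the sign test is simply an exact binomial test against a chance rate of one half. A minimal sketch (Python standard library only; the doubling of the upper tail assumes, as here, that more than half the choices favor the post essays):

    from math import comb

    def sign_test_p(successes, n):
        """Exact two-sided binomial test against p = 0.5."""
        upper_tail = sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n
        return min(1.0, 2 * upper_tail)

    print(sign_test_p(156, 240))   # roughly 3e-6, i.e. p < .001 as reported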

For teachers of the course, the diagnostic information here was enlightening. In two areas where they expected to get improvement--in organization and in support--they did not; but in areas where they thought they had little effect, they did--in ideas, vocabulary, and syntax. Added support for the course may be seen in the “Passing Level” decisions. Nearly half of the pre essays, in contrast to only 15% of the post essays, earned one or more failing rates from the three raters. There is one other result that does not show up on the summary sheet in Figure 5: the “Overall” decision in effect was superfluous. Only once among the 40 pairs of essays did the “overall” judgment reverse the trend of the categories. It may be pointed out too that the “Greatly Better” category did not add much information. It was picked only 38 times out of 737 individual rater choices (5%), and ended as a final decision only 2 times out of 280. It was, however, a category that evaluators said they appreciated having available.

Results of a Holistic Assessment of the Same Essays

Later, these 80 essays were rated by a formal holistic assessment. Raters, none of whom had participated in the IPC assessment, were trained on a holistic ranking with 4 basic categories (low or failing, medium low, medium high, and high), each category divided in two to make an 8-point scale (Figure 6). Essays, of course, were not read in pairs but individually. Each essay received 4 independent ratings, generating a possible range of scores from 4 to 32. Inter-rater reliability was .88 (Cronbach’s alpha); the average correlation between two raters was .64.
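
The alpha figure can be recomputed from the score matrix by the usual Cronbach formula (and .88 agrees with what the Spearman-Brown relation would predict from four raters averaging r = .64). A minimal sketch, assuming--hypothetically--that the ratings are arranged as one list per rater, aligned by essay:

    from statistics import variance

    def cronbach_alpha(scores):
        """scores: one list of essay scores per rater, aligned by essay."""
        k = len(scores)                                  # number of raters
        rater_vars = sum(variance(r) for r in scores)    # per-rater variances
        totals = [sum(col) for col in zip(*scores)]      # summed score per essay
        return (k / (k - 1)) * (1 - rater_vars / variance(totals))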

Now if we use these holistic scores to assess improvement in the course, we in fact get it. Pre essays averaged 16.2 on the scale, post essays 19.4. A correlated t-test finds this difference significant at the .01 level of confidence (p = .005, df = 39, t = -3.01). I think it should be pointed out that this happy result is achieved in part by using 4 independent raters and an 8-point scale, which produced a wide spread of data points. If we reconvert the data to the more customary 3-rater and 4-point-scale system, the difference between pre and post essays is much less convincing (7.0 and 7.9), a difference that barely scrapes by under the .05 level of confidence (p = .039, df = 39, t = -2.14). Here the degree of improvement recorded during a one-semester freshman course and the results from statistical confidence-testing are comparable to similar holistic assessments in the past.
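
The correlated t-test itself is nothing more than a one-sample t on the per-student differences. A standard-library sketch (the negative t reported above presumably reflects differencing in the other direction, pre minus post):

    from statistics import mean, stdev
    from math import sqrt

    def paired_t(pre, post):
        """Correlated t on per-student gains; returns (t, df)."""
        d = [b - a for a, b in zip(pre, post)]   # post minus pre
        return mean(d) / (stdev(d) / sqrt(len(d))), len(d) - 1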

Comparison of the IPC and the Holistic Assessments

We are now in a position to compare these two systems of assessment. In terms of cost, the holistic was more expensive, 44 person-hours as opposed to 33 for the IPC. The holistic used 8 readers requiring 3 hours of training and 2.5 hours to rate the 80 essays (I am not calculating the time spent developing the anchor essays). The IPC with 6 readers took 1.5 hours of training and 4 hours rating. Readers averaged 2 minutes assessing each essay holistically (about average, according to White), and 4 minutes for each essay by forced comparison. The time saved for the IPC was in the training, where suitable concordance was achieved in comparing a subskill of the two essays much more quickly than in placing an essay in a holistic category.
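
The person-hour figures follow directly from the staffing arithmetic; a one-line check:

    # Person-hours: readers x (training hours + rating hours)
    print(8 * (3 + 2.5), 6 * (1.5 + 4))   # 44.0 33.0 -- holistic vs. IPC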

Perhaps, however, this training should have been more extensive. It looks as though rater reliability was lower for the IPC. If we treat the six IPC choices as ranks (Pre “GB” as lowest, Post “GB” as highest), just as in a holistic ranking, and then run correlations between pairs of raters, the median correlation hovers around .50 (compared to .65 for the holistic). One reason this is so low is that very rarely were the two extreme ranks (GB) chosen, and it is difficult to get a high correlation on what is effectively a scale of only four. On the other hand, with 3 raters, a choice had to be submitted to a fourth reader only 9.6% of the time, which would be, according to White, a reasonable inter-rater reliability on the holistic. This meant that 15 out of the 40 pairs of essays had to be re-read (often to decide on only one or two subskills, of course). In terms of categories, something of the relative difficulty in getting raters to agree can be seen by looking at how many instances of each subskill required a re-reading: Support 9 times, Organization 6, Ideas and Sentences 5 each, Mechanics 2, Diction and Overall none. I would hazard that the IPC rater reliability can be raised considerably with a better organized training session, in particular with the distinctions among the ranks (LB, OB, GB) more precisely made.
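
The coding behind these correlations can be made explicit: each choice maps onto a six-point scale running from “pre essay greatly better” to “post essay greatly better.” A minimal sketch (Python 3.10+ for statistics.correlation; the data layout is hypothetical):

    from statistics import correlation   # Pearson r; Python 3.10+

    RANK = {("pre", "GB"): 1, ("pre", "OB"): 2, ("pre", "LB"): 3,
            ("post", "LB"): 4, ("post", "OB"): 5, ("post", "GB"): 6}

    def rater_agreement(a, b):
        """a, b: parallel lists of (side, strength) choices from two
        raters, sides already resolved to 'pre' or 'post'."""
        return correlation([RANK[c] for c in a], [RANK[c] for c in b])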

Incidentally, it can be argued that handwriting must influence the IPC decisions much less than holistic decisions, at least to the extent that the holistic is norm and criterion referenced. Since the IPC measures value distance between two writing samples from the same student, handwriting effects should balance out. But with the holistic, handwriting will influence where a particular essay stands in relation to other essays in the sample. This perhaps will balance out in calculating individual student progress by comparing the pre and post holistic scores, but may affect how these essays stand in relation to any outside criterion (as where, for example, a score of 2 reflects passing level).

The two methods seem equally sensitive in detecting overall improvement in the course. Both methods found exactly the same percent of students advancing from pre to post: 68%, or 27 out of 40 (a gain comparable to other assessments of freshman composition). Individually, however, the IPC method recorded more improvement, in part because it rated subskills one by one instead of all together. So whereas the holistic found 9 students regressing from pre to post (4 others earned the same post holistic summed score as pre), the IPC recorded only 3 students regressing in all six categories, the other 37 students showing improvement in some aspect of their writing.

The IPC also may be more sensitive to improvement at the ends of the quality spectrum, with the very poor and very good writers. We have already seen several examples of poor essays showing little difference holistically but a substantial difference by forced comparison. The same seems to be true at the upper end. Table 1 compares the two methods, dividing the 40 students into quartiles by initial writing ability (as judged on the holistic).

_______________________________________________________________________

Table 1: Course Gain by Initial Writing Performance, with Holistic and IPC

Initial holistic summed    Mean pre-post difference   Mean pre-post difference
scores (4 raters)          on the holistic scores     on the IPC scores

7-12  (N = 11)                     +5.82                      +2.27
13-16 (N = 9)                      +5.56                      +4.56
17-20 (N = 11)                     +3.27                      -0.18
21-28 (N = 9)                      -3.11                      +2.11

_______________________________________________________________________

To compute course progress in Table 1, for the holistic the sum of holistic rates was used; and for the IPC the summed accomplishment on all six of the subskills, with a count of 1 awarded for a decision of LB, 2 for OB, and 3 for GB. The holistic pattern--where the worse the writer stands initially, the more improvement that writer records--is a common finding in such evaluations. But the IPC shows the top quartile of students gaining as much as the bottom quartile, and the medium high or “B” student most unlikely to record gain. The two methods obviously are discovering improvement--or more basically, visualizing quality in writing--in some importantly different ways.
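
Since Table 1 shows negative means, the IPC gain evidently counts decisions favoring the pre essay negatively; the sketch below makes that assumption explicit (hypothetical encoding as before):

    WEIGHT = {"LB": 1, "OB": 2, "GB": 3}

    def ipc_gain(decisions):
        """decisions: the six final (side, strength) subskill choices
        for one student, sides resolved to 'pre' or 'post'."""
        return sum(WEIGHT[s] if side == "post" else -WEIGHT[s]
                   for side, s in decisions)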

Some of these differences become obvious with a look at individual cases. For one difference, the holistic raters seem to have been more influenced by mere number of words in an essay. There were 12 pairs of essays where one essay is conspicuously shorter--over half a hand-written page shorter--than its companion. The holistic gave the longer essay a better rate in every case, by an average of 6.6 points (which is considerable, considering there were only 24 points on the scale). The IPC also preferred the longer essay, in 47 out of 72 subskill category choices, but also discovered some better writing qualities in the shorter piece 25 out of the 72 times. For 4 pairs out of the 12, the IPC awarded preference to the shorter essay in the majority of the categories, and for 1 pair in an equal number of categories. Since 8 of the 12 shorter essays were written at the beginning of the semester, this is a situation where the holistic may be more likely to record course progress than will the IPC, but it is a likelihood achieved, perhaps, at the expense of validity.

For another difference, the holistic method seems to put more weight on the subskills of Support and Mechanics than does the IPC. In those essay pairs where the holistic records substantial quality gain and the IPC little gain, it is most frequently those two subskills which the IPC finds gain in. Even more common are the essay pairs where the holistic recorded little difference but the IPC substantial improvement. In seven instances, the IPC showed gain in all subskills except Support. Generally it seems that without a strength in Support, holistic raters have trouble seeing other strengths (cf. Sarah W. Freedman, “How Characteristics of Student Essays Influence Teachers’ Evaluations,” Journal of Educational Psychology, 71, June 1979, pp. 328-338).

The holistic failed 7 essays and the IPC 6, so the two systems seem equivalent in lenience. But only two of these failing essays were the same--the systems disagreed on 11. Here, a couple of patterns are clear. The IPC failed an essay which the holistic passed when the essay did not surpass its companion on any or on only one of the subskills. And the holistic failed essays that the IPC passed when the essay was comparatively weak in only one or two subskills, usually Support or Ideas. In applying that outside criterion of “passing quality,” the IPC method is obviously affected by the conscious and direct comparison with the companion essay, swayed by the number of subskills showing comparative success or failure. The holistic, on the other hand--limited to a comparison, perhaps an intuitive comparison, of internal traits within one essay--seems especially swayed by the opposite situation, the powerful halo effect of one or two subskills.

One final comparison, I think, affords a special insight into the two methods. There were 15 students for whom the two assessment techniques disagreed on whether there was improvement or regression during the semester. Twelve of those 15 cases involved instances where the independent holistic rates showed the greatest variance, that is, where the four holistic raters had the greatest trouble agreeing among themselves. Typically involved are essays that the IPC shows radically divided in subskill strengths, with a few skills showing strong improvement and a few showing strong regression. The implication is that the uneven essay, which students produce especially toward the beginning of the course, is more difficult for the holistic scheme to handle and easier for the IPC.

Procedural Strategies of Raters

It is instructive to consider these differences in light of the distinct ways the two assessments have raters proceed. The IPC raters must take up, consider, and rate subskills one by one until the six are completed. Holistic raters supposedly weigh all major factors of writing in their judgment, but since this is not done systematically they may be more susceptible to the halo effect of one or more strong factors. This is not to say that the halo effect cannot occur in the IPC, and indeed the set order of taking up the six categories may well have produced a systematic halo effect of the first categories. One reason I consider the findings in the present study preliminary is that I see a need to test this particular IPC method by having different raters for each category, which should reduce halo effects.

A second essential difference in rater procedure lies in the constant comparison with another essay written by the same writer. The peculiar influence of this method is totally unknown. Certainly the strain is less on the rater, partly because comparison is made with an actual essay before the eye--an essay whose features are more nearly akin, and one endowed with a directly meaningful relationship to its companion (two products of the same person). The idealism of the holistic method is hard on raters, especially in trying to reduce an uneven essay to an abstract holistic ranking with its neat hierarchy of categories (scale points). Operating here is the fact that the holistic usually is at once partly norm referenced and partly criterion referenced. The holistic rater then operates by setting up an individual piece of writing against at least two abstractions, the hierarchy of writerly values in the pre-set holistic scale cross-referenced with “course standards” at the lower categories of the scale. One reason why holistic raters are required to make decisions rapidly is that this procedure becomes more problematic the longer it is indulged in. The IPC, on the other hand, is neither norm nor criterion referenced (that is, if we disregard the “Overall” and “Passing Level” decisions, which I consider non-essential to the procedure). If anything, it is self referenced. One piece of writing is set up against another.

A third difference has to do with the fact that the holistic rater must function, as the name says, holistically, while the IPC rater functions analytically. In the subskill decisions, the IPC rater has no need to weigh factors, factors sometimes quite removed one from another, to arrive at a summative judgment, but rather takes up each factor one at a time. Herein lies an advantage of the holistic, which may allow exceptional or original papers to succeed because, despite the exception or the originality, the whole works. One quality--say a heavy focus on particulars--is allowed full rein and the assessment as a whole does not pay for it. So the holistic allows in the halo effect along with the individual or eccentric performance. It is interesting to note that primary trait or performative assessment, which adds diagnostic information by concentrating on isolated traits, may punish exceptional works. Criteria are defined so precisely that an eccentric piece (say, one with a one-sentence introduction) will get marked down. The halo effect is decreased and the diagnosis is improved, but individuality perhaps suffers. The IPC method lies somewhere between. It adds some diagnostic information, gives partial reward to exceptional effects, and dampens the halo effect.

Summary: IPC and Holistic

The methodology of the IPC recommends it for particular evaluative uses, which in turn suggest how the present preliminary form of it might be adapted and developed. Just as the holistic, with its testing of individual performances against an absolute system of writing values, seems best fit to compare groups (e.g., 14-year-olds against 17-year-olds) or to rank an individual within a group (e.g., placing an entering freshman), so the IPC, with its comparison of two performances of the same writer, seems best fit to assess the achievement of an individual within a course of instruction. This assumes--and it is not an assumption all teachers and administrators necessarily hold--that the essential function of a writing course is to foster improvement in writing. The IPC may stand as a method of assessment most amenable to writing teachers who are concerned less about the level of skill a student has on entering their course or the level or grade that student has earned on leaving it, and more about how much the student has progressed during it. Not only may administrators assess whole sections of a particular course by the IPC method (as was attempted in this study), but individual teachers can assess their particular section. I have overseen some individual section-testing by means of the IPC, with beginning teaching assistants, and the results ranged from around 90% of the students achieving pre-post improvement in impromptu writing down almost to chance (50%). Teachers, of course, can also compare beginning and end-of-the-semester performance of their section through a holistic assessment, but certain questions will still remain unanswered. Was the holistic sensitive enough to record a semester’s progress? Was the students’ entering level of accomplishment higher or lower than normal? And diagnostically, where lie the qualities of writing the course seems to have affected? In particular the holistic will leave unanswered a question that seems both germane and pressing where the teaching of novice writers is involved, namely whether any advance in writing may have been unilateral, whether indeed some aspects of writing may have progressed more rapidly not only in relation to others but also at the expense or to the detriment of others.

The sensitivity and diagnostic specificity of the IPC suggest that it could not only helpfully assess course instruction as a whole, but also test particular components of the course, even particular lessons. One remembers that the essential method of the IPC, direct comparison of pre and post writing, has been used to evaluate instructional intervention under research conditions, for these same reasons of sensitivity and diagnosis. All this suggests one similarity of the IPC to the holistic, that both seem quite adaptable to particular circumstances. Just as the holistic can be modified along the lines of performative or primary-trait motives, so can the IPC. The number of comparative ranks could be modified (I suggest reducing the three here to two: an obvious difference in quality, and an intuited difference--but other situations might allow for an even more refined system than three). And the selection of subskills could be altered to match the intentions of teachers or researchers or assignments, especially to answer questions about the interactions among instruction and various writing skills (e.g., does improvement in organization help or hinder improvement in support?).

I hope this discussion has made it clear that I am not arguing that either of the two evaluation systems is absolutely better. Compared to the holistic, the IPC does seem as capable--perhaps more readily capable--of generating a convincing argument, certainly a more concrete argument, that a writing course fosters writing improvement. The IPC can do this with a small number of essays and be as cost effective as the holistic. But the holistic has its own virtues, and I am not recommending that the IPC or any other scheme replace it. My comparison here tends to support the argument that there are vital differences in what an analytic approach, such as the IPC, and a holistic approach will see. “The altering eye alters all.” Both, I think, deserve use because they are less systems of rating and more systems of appraisal, in that word’s root sense of “moving toward praise.” Both the holistic and the IPC, that is, help us see virtues--different virtues--in student writing where we are otherwise apt to see defects, and I am for any method of evaluation which does that.

Uploaded May 17, 2005
Richard H. Haswell
Texas A&M University, Corpus Christi