I have testing stories from several universities--but I think my favourite comes from my experience at Guelph.

I was hired as the English Language Services Co-ordinator at Guelph to co-ordinate all the testing and development of language services there--everything (including developing a WAC programme) that the English department didn't want to do (including teaching their own students how to teach writing and then administering and supervising them). Fun stuff--sort of like that Greek stable that never got cleaned.

Anyway -- the test. The test was designed by OISE and included reading comprehension, grammar and the essay. I had little use for the reading and grammar parts of the test--the reading was based, for the most part, on a good understanding of economics (why? I don't know. Ask OISE) and the grammar part was plain silly--very, very minor rule infractions, some of which were under dispute anyway.

But I thought the essay part was ok. After all, it was scored holistically, the graders were trained (Princeton scoring method), and each paper was marked by two different graders. So what could go wrong?

Well, you see (this gets complicated), each marker was given copies of the same ten papers within the first set of 100 papers that they graded. These papers created a numerical weight for each grader. So the computer would figure out that the graders gave paper #3, say, an average grade of 4 out of 10. Markers below the average were deemed harsh and their scores were raised in the process of producing the final score, and markers above the average were deemed too soft and their scores were reduced.
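For anyone who wants to see the arithmetic, here is a minimal sketch of that kind of marker weighting. It is only an illustration: OISE left us no documentation, so the additive offset, the function names (`marker_offsets`, `adjust`) and the sample scores are all assumptions, not the actual program.

```python
from statistics import mean

def marker_offsets(calibration):
    """calibration maps marker name -> scores on the same shared papers.

    Hypothetical rule: a marker's offset is the average gap between the
    all-marker average for each shared paper and that marker's own score.
    """
    n_papers = len(next(iter(calibration.values())))
    # Average score per shared paper, taken across all markers.
    consensus = [
        mean(scores[i] for scores in calibration.values())
        for i in range(n_papers)
    ]
    # Positive offset = harsh marker (scores raised);
    # negative offset = "soft" marker (scores reduced).
    return {
        marker: mean(c - s for c, s in zip(consensus, scores))
        for marker, scores in calibration.items()
    }

def adjust(raw_score, marker, offsets):
    """Apply the marker's offset to a raw score (assumed to be additive)."""
    return raw_score + offsets[marker]

# Made-up scores on two shared papers, out of 10.
calibration = {"harsh": [3, 4], "soft": [5, 6]}
offsets = marker_offsets(calibration)
adjusted = adjust(6, "soft", offsets)
```

With these made-up numbers the "soft" marker sits one point above the average, so a raw 6 from that marker goes into the final score as a 5.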

For the first two years that I ran the test, things seemed ok. Of course, the numbers of students who took it were enormous, and the administration was hellish. But if a student came in and complained, I could always pull out their paper and justify the score. It all seemed reasonable.

Then, around the middle of the second year, I got some complaints that I couldn't explain. The markers--sometimes both markers--had passed the paper, but the final score was a fail. And I agreed with the markers!! These were pass papers.

So I knew I had both validity and reliability problems.

It took me almost 6 months to figure out what was wrong--because we had no documentation from OISE as to the algorithm that constructed the scores, and no one on campus wanted to touch the test because it was so tied in to the registrar's system.

But finally, with the Vice-President's approval, I was able to hire a couple of computer gurus to take apart the program and see how it worked.

Well -- unknown to anybody, there was a bell curve in the programme that automatically failed 20% of the students who wrote the test!! Talk about a shock. Further, the students who tended to fail were borderline writers marked by my best, most experienced markers. The computer tended to rate those markers as too "soft", so their scores were lowered and then the bell curve kicked in.
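To show why a curve like that is dangerous, here is a small sketch of a forced-failure rule. Again, this is an assumption-laden illustration and not the recovered program: the function name `curve_failures`, the pass mark of 6 out of 10 and the sample scores are all hypothetical. Only the 20% figure comes from what the gurus found.

```python
import math

PASS_MARK = 6.0  # hypothetical pass mark on the 10-point scale

def curve_failures(final_scores, fail_fraction=0.20):
    """Return the paper ids that fail purely because of their rank.

    final_scores maps paper id -> final (already adjusted) score.
    The lowest `fail_fraction` of papers fail, whatever they scored.
    """
    n_fail = math.floor(len(final_scores) * fail_fraction)
    # Sort paper ids from lowest to highest final score.
    ranked = sorted(final_scores, key=final_scores.get)
    return set(ranked[:n_fail])

# Five made-up papers, every one of them at or above the pass mark.
final_scores = {"a": 7.0, "b": 6.5, "c": 8.0, "d": 6.0, "e": 9.0}
failed = curve_failures(final_scores)
wrongly_failed = {p for p in failed if final_scores[p] >= PASS_MARK}
```

In this example every paper meets the pass mark, yet paper "d" still fails because somebody has to land in the bottom fifth. Lower a borderline score first because its marker was rated "soft", and that paper becomes the most likely one to land there.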

What did I do? First, I documented the whole problem with lots of bells and whistles so that nobody in the administration could miss it.

Then, of course, for as long as I was stuck with the test, my best markers and I re-read every single borderline test to make sure it really was a fail according to the criteria we had established.

What a nightmare.

Next time, I might write about the writing courses that resulted -- also not good.  The test ate up all the funding for anything productive that might have been done.

I am now deeply dubious about any kind of mass testing like this.

Cathy Schryer