Help: SIMILARITY CHECKER (Scheck.exe)
-------------------------

!!!!!!!!!!!!!!
If your course files (.DAT, .ANS) are not on the same path as the
programs, the first screen allows setting the path to these files.
The Windows paste function can be used to enter a complicated path.
Example: c:\mydirectory\myfiles\
!!!!!!!!!!!!!!

This choice can be bypassed by clicking 'OK' or 'Cancel', or pressing .

1) FILES USED BY THE EXCESSIVE SIMILARITY CHECKER
----------------------------------------------
A .TXT file specifies correct answers and wrong answers without
identifying the students.
A .DAT file includes student numbers and names as well as the .TXT
information.
An .ANS file is the list of correct answers in a single column.

The .TXT file is essential for running the similarity checker.
The .DAT file allows identification of suspected pairs by name.
If a .DAT file is present, a .TXT file will automatically be created.
For full functioning of the program, the .DAT file is necessary.
If a .ANS file is present, the answers will be listed in the .NAM
output file.

Select either a .TXT file or a .DAT file from the menu. If your file
is in another directory, select 'Cancel' and then fill in the complete
path when asked.
Example: c:\myfiles\mycourse
DO NOT include the .DAT or .TXT extension.

2) NUMBER OF STUDENTS AND NUMBER OF QUESTIONS
------------------------------------------
The program tells you the number of students and the number of
questions that it found in the exam. You can change the number of
students to a number smaller than the one given, but this is not
generally a useful feature. Changing the number of questions is quite
useful, because you will subsequently be given the choice to start at
a question other than question 1. This means that either the first or
the last question can be omitted if these are used for purposes such
as version identification. SAMPLE0.DAT is a sample file where the
first question was used for version identification and should be
omitted.
Also, if a question block is different (for example, the first 20
questions are true-false), these can be tested separately. Note,
however, that this method, unlike some others, can deal simultaneously
with questions that are true-false, four-choice, five-choice, etc.

3) IDENTIFICATION OF STUDENTS
--------------------------
This program is designed so that files (.TXT) that do not identify
students by name or number can be used. This could be useful if
studies on classes are carried out by persons who do not need to know
the identities of students. The .TXT files are subsets of the .DAT
files, which also have student names and numbers. The program will run
with either .DAT files or .TXT files, creating the latter if only the
former is present.

If the .DAT file is present, the program can create a detailed
similarity report, with names of suspected pairs; this will be in a
.NAM file. Also, there will be an offer to provide a complete list of
names and marks, and a list of responses. These are optionally
included in the .NAM file.

4) EXCEL FILE
----------
There is an option to create a tab-delimited file with an .XLS suffix.
This file can be double-clicked to open it directly in EXCEL. It
contains the marks and responses for each student as well as some
simple statistical summaries. To use this option, one must elect to
identify the students in a previous option screen.

5) IDENTIFICATION OF PRE-SELECTED PAIRS
------------------------------------
As is explained in the JAS paper by GOW, the standard of evidence for
pairs suspected on other grounds (invigilator reports, for example) is
lower than for pairs identified only by the 'data mining' of the
program.

If the .DAT file is present, then a menu will allow selection of
student pairs by name and number. If only the .TXT file is present,
one can force the program to give a report on any pair by entering the
sequential numbers for the pair.
The sequential numbers are simply the positions of these individuals
in the list in the .DAT file.

6) BOUND ON SIGNIFICANCE
---------------------
This topic is explained in the JAS paper by GOW. A default level of .1
means that fewer than 10% of classes will contain a falsely selected
pair with that value of Zb or above. The actual level of significance
is given for each pair and is usually much higher than the default
level.

7) OPTIMIZING T
-------------
The T parameter fine-tunes the shape of the probability function
describing the probability of a correct answer. It is set to .13 as a
default. If you select 'yes' to optimize on T, this may consume a
couple of minutes or more, depending on the size of the class and the
number of questions. For more information open the Acrobat file T.pdf.

8) QQ PLOT DIAGRAM
---------------
The diagram that appears on the screen after the program is run is an
approximate Q-Q plot, and is a useful diagnostic. The diagram plots
the equivalent normal z's against the calculated z's. If the line is
approximately straight, this indicates normality. If the slope of the
plotted points is steeper than the diagonal straight line, then the
standard deviation of the Z's is less than one. Note that the Z's are
not independent.

The vertical dotted line on the right is the cutoff. If the class is
"clean" and the default cutoff is used, there should be a gap between
the last of the points on the right and this vertical line.

For small numbers of questions (<20) the plotted points may not form a
straight line because the normal approximation loses accuracy. Unusual
patterns on the left side of the plotted points may indicate some
students with many unanswered questions.

The diagram will disappear if the mouse is clicked while the pointer
is on this screen. Note that the diagram is automatically written to
the clipboard and hence is available for insertion into a word
processing document.
However, a mouseclick on the diagram will activate an option screen
that will allow saving the diagram to a BMP file.

9) INTERPRETATION
---------------
The program will open output files .out and .nam (optional) and list
them in NOTEPAD windows. These contain various forms of analysis.

SAMPLE:
_______________________________________________________________
** pair = 17 93 **
Harpp-Hogan stat = #wr.mat/#diff = 1.875
Zb = 5.032 'equivalent' z from the BVP model
Significance of Zb on a pre-selected pair = 2.42E-07
Significance bound (Bonferroni) on program selected pairs = 1.11E-02
#matches = 42 | 50 (mu,s)=( 25.485 3.310)
prop. right for 17 = 0.640 prop. right for 93 = 0.560
Quest. range = [ 1 50 ] #students = 303
----------------------------------------------------------------
STUDENT 17 6003317 AHAOR
.2........ 1.2..2.... ..1...1... .3.3.21544 .2..31.44.
-----------------------------------------------------------------
STUDENT 93 6445908 HITOP
.2...1..1. 1.2...2... ..1.5.24.. .3.3.21544 .2..31.54.
-----------------------------------------------------------------
_________________________________________________________________

n = #students
m = #matches in answers

Zb is the standardized normal statistic equivalent derived from the
number of matches, student performance, and question difficulty. It
measures the degree of similarity between the answers of two students.
Positive values mean above average similarity.

The Harpp-Hogan statistic is an empirically justified statistic. It is
the ratio of exact wrong matches over the number of differences.
Values > 1 are very suspicious. This statistic, however, is used in
conjunction with another Harpp-Hogan statistic, called SIGMA. It is
not reliable by itself. See the Harpp-Hogan papers for interpretation.
It is presented here only as "a second opinion".
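
As a rough illustration (not the program's own code), the counts in the
SAMPLE report can be recomputed from the two response lines. The sketch
assumes that '.' marks a correct answer, that a digit is the wrong choice
selected, and that the spaces only group the questions in tens; these
assumptions agree with the figures printed in the SAMPLE. The function
name compare_responses is hypothetical.

```python
def compare_responses(a: str, b: str) -> dict:
    """Compare two response lines laid out as in the SAMPLE report.

    Assumption: '.' is a correct answer, a digit is the wrong choice
    selected, and spaces are only visual grouping.
    """
    a = a.replace(" ", "")
    b = b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("response lines must cover the same questions")

    # A match is any question where both students gave the same response.
    matches = sum(x == y for x, y in zip(a, b))
    # An exact wrong match is the same wrong choice on the same question.
    wrong_matches = sum(x == y and x != "." for x, y in zip(a, b))
    differences = len(a) - matches

    return {
        "questions": len(a),
        "matches": matches,
        "differences": differences,
        "wrong_matches": wrong_matches,
        # Ratio of exact wrong matches to differences (#wr.mat/#diff).
        "harpp_hogan": (wrong_matches / differences
                        if differences else float("inf")),
        "prop_right_a": a.count(".") / len(a),
        "prop_right_b": b.count(".") / len(b),
    }


s17 = ".2........ 1.2..2.... ..1...1... .3.3.21544 .2..31.44."
s93 = ".2...1..1. 1.2...2... ..1.5.24.. .3.3.21544 .2..31.54."
result = compare_responses(s17, s93)
print(result["matches"], result["questions"])            # 42 50
print(result["wrong_matches"], result["differences"])    # 15 8
print(result["harpp_hogan"])                             # 1.875
```

The 15 exact wrong matches over 8 differences give the 1.875 shown in
the report, and the proportions right come out as 0.64 and 0.56.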

Significance = Prob(number of matches is >= m) = Prob(Z >= Zb)
= probability that a pre-selected pair will be falsely accused if the
Zb observed were to be used as a cutoff. This is the relevant
significance if there is some prior reason for suspecting the pair. In
this version of the program, this probability is calculated using
probabilities for 'Bernoulli trials with varying success
probabilities' (BVP), and not the normal approximation.

Bonferroni bound = upper bound on the probability that a class will
have a falsely accused pair(s) if Zb is used as a cutoff. This is
relevant if there was no prior reason to suspect the pair. This bound
is from the Bonferroni inequality. It is also calculated using BVP
(compound binomial) probabilities.

Reference:
George O. Wesolowsky, 'Detecting Excessive Similarity in Answers on
Multiple Choice Exams', Journal of Applied Statistics, Vol. 27, No. 7,
2000, pp. 909-921.
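
The two significance figures defined in section 9 can be illustrated
with a short sketch. It is not the program's code. The upper tail for
Bernoulli trials with varying success probabilities is computed by a
standard recursion over the questions; the per-question match
probabilities come from the model in the paper and are NOT reproduced
here, so the values passed in below are placeholders. The Bonferroni
step assumes the bound is the pairwise significance multiplied by the
number of pairs n(n-1)/2, which agrees with the SAMPLE figures
(303 students, 2.42E-07, 1.11E-02). The function names are hypothetical.

```python
from math import comb


def bvp_upper_tail(probs, m):
    """P(number of successes >= m) for independent Bernoulli trials
    whose success probabilities (probs) may all differ."""
    dist = [1.0]  # dist[k] = P(exactly k successes so far)
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this trial fails
            nxt[k + 1] += mass * p       # this trial succeeds
        dist = nxt
    return sum(dist[m:])


def bonferroni_bound(pair_significance, n_students):
    """Assumed form of the bound: number of pairs times the pairwise
    significance, capped at 1."""
    return min(1.0, comb(n_students, 2) * pair_significance)


# Placeholder probabilities, only to show the recursion on a tiny case.
tail = bvp_upper_tail([0.5, 0.5, 0.5], 2)
print(tail)  # 0.5

# Figures taken from the SAMPLE report.
bound = bonferroni_bound(2.42e-07, 303)
print(f"{bound:.2E}")  # 1.11E-02
```

With 303 students there are 45753 possible pairs, which is why a
pairwise significance of 2.42E-07 corresponds to a class-level bound of
about 1.11E-02.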