Agenda
Source code plagiarism
Previous tools
CodeMatch
    Statement Matching
    Comment Matching
    Identifier Matching
    Partial Identifier Matching
    Instruction Sequence Matching
Conclusion
Universities, Corporations, Internet, Search engines, Open source movements, Mobile employees
Reasons
Plague: Algorithm
Geoff Whale, University of New South Wales
Three phases:
1. Create a sequence of tokens and a list of metrics to describe each program.
2. Compare the structure metrics of files to find similar code structures.
3. Compare token sequences within similar source code structures.
Plague: Example
    if (x == 5)
    {
        // Loop on j here
        for (j = 0; j < Index; j++)
            printf("x = %i", j);
    }
    else
        while (i < 5)
            i++;

    CONDITIONAL_BEGIN
        LOOP_BEGIN
            DISPLAY
        LOOP_END
    CONDITIONAL_END
    CONDITIONAL_BEGIN
        LOOP_BEGIN
            ARITHMETIC
        LOOP_END
    CONDITIONAL_END
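As a rough illustration of the tokenizing phase, the sketch below reduces a C fragment to a stream of structure tokens. It is a deliberate simplification and not Plague's implementation: the `TOKEN_MAP` table and the function name `tokenize` are hypothetical, and the sketch does not emit the paired `_BEGIN`/`_END` tokens shown on the slide. It does show why renaming identifiers or editing comments leaves the token stream unchanged.

```python
import re

# Hypothetical keyword-to-token table; the slides do not list Plague's real token set.
TOKEN_MAP = {
    "if": "CONDITIONAL",
    "else": "CONDITIONAL",
    "for": "LOOP",
    "while": "LOOP",
    "printf": "DISPLAY",
}

# Matches string literals, // comments, and /* */ comments in one left-to-right pass.
STRINGS_AND_COMMENTS = re.compile(r'"(?:\\.|[^"\\])*"|//[^\n]*|/\*.*?\*/', re.S)


def tokenize(source):
    """Return the structure tokens of a C fragment, ignoring comments and names."""
    stripped = STRINGS_AND_COMMENTS.sub(" ", source)
    words = re.findall(r"[A-Za-z_]\w*", stripped)
    # Identifiers that are not in the table are simply discarded.
    return [TOKEN_MAP[word] for word in words if word in TOKEN_MAP]


SAMPLE = '''
if (x == 5)
{
    // Loop on j here
    for (j = 0; j < Index; j++)
        printf("x = %i", j);
}
else
    while (i < 5)
        i++;
'''

# Same code with every identifier renamed and the comment reworded.
RENAMED = '''
if (count == 5)
{
    // Walk over k
    for (k = 0; k < Limit; k++)
        printf("count = %i", k);
}
else
    while (n < 5)
        n++;
'''
```

Calling `tokenize(SAMPLE)` and `tokenize(RENAMED)` yields the same list, which is the property a token-based comparison relies on.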
Plague: Problems
Hard to adapt to new programming languages.
The output needs interpretation.
Uses slow UNIX shell tools for processing.
Vulnerable to changing the order of code lines in the source code.
Throws out useful information when it discards comments, variable names, function names, and other identifiers.
1. Remove whitespace, comments, and identifier names; replace language statements with tokens.
2. Compare pairs of token files.
JPlag: Algorithm
Lutz Prechelt and Guido Malpohl, University of Karlsruhe
Michael Philippsen, University of Erlangen-Nuremberg
Phases:
1. Remove whitespace, comments, and identifier names; replace language statements with tokens.
2. Compare tokens in different files.
MOSS: Algorithm
Alex Aiken, Stanford University
Phases:
1. Remove all whitespace and punctuation, and convert characters to lower case.
2. Divide the remaining characters into k-grams, which are contiguous substrings of length k, by sliding a window of size k through the file.
3. Hash each k-gram and select a subset of all k-grams to be the fingerprints of the file.
4. Compare file fingerprints to find similar files.
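The first two phases can be sketched in a few lines. This is a minimal illustration, not MOSS's own code: the function names `normalize` and `kgrams` are hypothetical, and treating every non-alphanumeric character as punctuation is an assumption, since the slide does not define the term.

```python
def normalize(text):
    """Drop whitespace and punctuation and lower-case the rest.

    Assumption: anything that is not a letter or digit counts as punctuation.
    """
    return "".join(ch.lower() for ch in text if ch.isalnum())


def kgrams(text, k):
    """Slide a window of size k over the normalized text."""
    cleaned = normalize(text)
    return [cleaned[i:i + k] for i in range(len(cleaned) - k + 1)]
```

For the sentence "She loves you yeah, yeah, yeah." and k = 5, `kgrams` returns 19 substrings beginning with `shelo`, `helov`, `elove`, `loves`.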
MOSS: Example
Some text:
    She loves you yeah, yeah, yeah.
5-grams:
    shelo helov elove loves ovesy vesyo esyou syouy youye ouyea uyeah yeahy eahye ahyea hyeah yeahy eahye ahyea hyeah
Hypothetical hash:
    77 72 42 17 98 50 23 55 6 66 34 24 39 11 84 24 39 11 84
Fingerprint:
    72 24 84 24 84
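The slide does not say how the fingerprint was chosen from the hypothetical hashes. One rule that happens to reproduce it is to keep every hash that is 0 mod 4, so the sketch below uses that rule as an assumption. The names `select_fingerprint` and `shared_fingerprints` are hypothetical, and the set intersection is only a stand-in for the final comparison phase.

```python
# Hypothetical hash values copied from the example slide.
HASHES = [77, 72, 42, 17, 98, 50, 23, 55, 6, 66, 34, 24, 39, 11, 84, 24, 39, 11, 84]


def select_fingerprint(hashes, p=4):
    """Keep only the hashes divisible by p (assumed selection rule)."""
    return [h for h in hashes if h % p == 0]


def shared_fingerprints(first, second):
    """Return the hash values that appear in both fingerprints, sorted."""
    return sorted(set(first) & set(second))
```

With the values above, `select_fingerprint(HASHES)` keeps 5 of the 19 hashes, which illustrates how much of the file is never compared.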
MOSS: Problems
Structural information is lost (e.g., whitespace, punctuation, uppercase characters, nonalphanumeric symbols).
Larger k-grams decrease execution time, but decrease sensitivity.
Most k-grams are also thrown out for faster processing, reducing accuracy.
CodeMatch: Algorithms
Statement Matching
Comment Matching
Identifier Matching
Partial Identifier Matching
Instruction Sequence Matching
j = strlen(fname);
Matching comment lines:
Longest matching semantic sequence:
Matching words: stdio WiNDIS windows
Matching partial words: 0x windows
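The slides name the CodeMatch algorithms without defining them. One plausible reading of identifier matching and partial identifier matching is sketched below, purely as an assumption: exact matches are identifiers that occur in both files, and partial matches are pairs where one identifier contains the other. The function names, the small `KEYWORDS` set, and the `min_len` threshold are all hypothetical.

```python
import re

# Hypothetical, incomplete keyword list used to keep language words out of the results.
KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return"}

# Matches string literals, // comments, and /* */ comments in one left-to-right pass.
STRINGS_AND_COMMENTS = re.compile(r'"(?:\\.|[^"\\])*"|//[^\n]*|/\*.*?\*/', re.S)


def identifiers(source):
    """Collect identifier-like words outside strings and comments."""
    stripped = STRINGS_AND_COMMENTS.sub(" ", source)
    return set(re.findall(r"[A-Za-z_]\w*", stripped)) - KEYWORDS


def identifier_matches(first, second):
    """Identifiers that appear, spelled identically, in both sources."""
    return identifiers(first) & identifiers(second)


def partial_matches(first, second, min_len=3):
    """Pairs (a, b) of different identifiers where one contains the other."""
    pairs = set()
    for a in identifiers(first):
        for b in identifiers(second):
            if a == b:
                continue  # exact matches are reported separately
            short, long_ = sorted((a, b), key=len)
            # Very short names match almost anything, so they are skipped.
            if len(short) >= min_len and short.lower() in long_.lower():
                pairs.add((a, b))
    return pairs
```

For `int fname_len = strlen(fname);` and `int n = strlen(file_fname);`, the exact match is `strlen` and the only partial match is the pair `fname` and `file_fname`.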
Less than 100 lines
Between 100 and 1000 lines
Greater than 1000 lines
Modify files:
    Remove all comments
    Rename all identifiers
    Rearrange routines within the file
    Rearrange lines of code within routines in the file
    Do all of the above
    Remove all the code but leave the comments
    Create one file that has exactly one routine from each of the other files in the same category
Conclusion
Previous tools miss important matches
CodeMatch
    Statement Matching
    Comment Matching
    Identifier Matching
    Partial Identifier Matching
    Instruction Sequence Matching
www.ZeidmanConsulting.com