
Detecting Source Code Plagiarism with CodeMatch

Bob Zeidman
Zeidman Consulting

Agenda
Source code plagiarism
Previous tools
CodeMatch
    Statement Matching
    Comment Matching
    Identifier Matching
    Partial Identifier Matching
    Instruction Sequence Matching

Conclusion

Source Code Plagiarism


Entities

Universities
Corporations
Internet
Search engines
Open source movements
Mobile employees

Reasons

Plague: Algorithm
Geoff Whale, University of New South Wales

Three phases:

Create a sequence of tokens and a list of metrics to describe each program.
Compare the structure metrics of files to find similar code structures.
Compare token sequences within similar source code structures.

Plague: Example
    if (x == 5)
    {
        // Loop on j here
        for (j = 0; j < Index; j++)
            printf("x = %i", j);
    }
    else
        while (i < 5)
            i++;

    CONDITIONAL_BEGIN
    LOOP_BEGIN
    DISPLAY
    LOOP_END
    CONDITIONAL_END
    CONDITIONAL_BEGIN
    LOOP_BEGIN
    ARITHMETIC
    LOOP_END
    CONDITIONAL_END

Plague: Problems
Hard to adapt to new programming languages.
The output needs interpretation.
Uses slow UNIX shell tools for processing.
Vulnerable to changing the order of code lines in the source code.
Throws out useful information when it discards comments, variable names, function names, and other identifiers.

YAP, YAP2, YAP3: Algorithm


Michael Wise, University of Sydney, Australia

Two phases:

Remove whitespace, comments, and identifier names; replace language statements with tokens.
Compare pairs of token files.
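
The two phases can be illustrated with a rough sketch. This is not YAP's actual tokenizer or comparison algorithm: the keyword list is a hypothetical, deliberately small one, the token names (`ID`, `NUM`, `STR`) are made up for illustration, and Python's `difflib.SequenceMatcher` stands in for the real token-file comparison.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately small keyword list; a real tool needs the full set.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

# String literals, identifiers, numbers, then any single punctuation character.
LEXEME = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|[^\s\w]')

def tokenize(source: str) -> list[str]:
    """Drop comments and whitespace; replace names, numbers, and strings with generic tokens."""
    # Naive comment removal: it would misfire on "//" or "/*" inside a string literal.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    tokens = []
    for lexeme in LEXEME.findall(source):
        if lexeme.startswith('"'):
            tokens.append("STR")
        elif lexeme in C_KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")
        elif lexeme.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of matching tokens between two source fragments."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()

original = 'for (j = 0; j < Index; j++) printf("x = %i", j); // loop on j'
renamed = 'for (k = 0; k < Limit; k++)\n    printf("x = %i", k); /* count */'
print(similarity(original, renamed))
```

Because names and comments are discarded before comparison, the renamed fragment produces the same token sequence as the original. The same step is what loses the identifier and comment evidence.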

JPlag: Algorithm
Lutz Prechelt and Guido Malpohl, University of Karlsruhe
Michael Philippsen, University of Erlangen-Nuremberg

Phases:

Remove whitespace, comments, and identifier names; replace language statements with tokens.
Compare tokens in different files.

YAP, JPlag: Problems


To decrease the run time, uses hashing and only considers matches of strings of a minimal length.
Tokens are still dependent on knowledge of the programming language.
Although less so than Plague, still vulnerable to changing the order of code lines.
Throws out useful information when it discards comments, variable names, function names, and other identifiers.

MOSS: Algorithm
Alex Aiken, Stanford University

Phases:

Remove all whitespace and punctuation, convert characters to lower case.
Divide remaining characters into k-grams, which are contiguous substrings of length k, by sliding a window of size k through the file.
Hash each k-gram and select a subset of all k-grams to be the fingerprints of the file.
Compare file fingerprints to find similar files.

MOSS: Example
Some text:
She loves you yeah, yeah, yeah.

5-grams:
shelo helov elove loves ovesy vesyo esyou syouy youye ouyea uyeah yeahy eahye ahyea hyeah yeahy eahye ahyea hyeah

Hypothetical hash:
77 72 42 17 98 50 23 55 6 66 34 24 39 11 84 24 39 11 84

Fingerprint:
72 24 84 24 84
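
The example can be reproduced with a short sketch. The hash values are the hypothetical ones from the slide, not the output of a real hash function. The slide does not state how the fingerprint subset is chosen; the selected values are exactly the hashes divisible by 4, so the sketch assumes that rule. The function names are illustrative only.

```python
import re

def kgrams(text: str, k: int = 5) -> list[str]:
    """Lowercase, drop whitespace and punctuation, then slide a window of size k."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return [cleaned[i:i + k] for i in range(len(cleaned) - k + 1)]

def select_fingerprint(hashes: list[int], modulus: int = 4) -> list[int]:
    """Keep only hashes divisible by `modulus` (an assumed selection rule)."""
    return [h for h in hashes if h % modulus == 0]

grams = kgrams("She loves you yeah, yeah, yeah.")

# Hypothetical hash values copied from the slide, one per 5-gram.
slide_hashes = [77, 72, 42, 17, 98, 50, 23, 55, 6, 66,
                34, 24, 39, 11, 84, 24, 39, 11, 84]
fingerprint = select_fingerprint(slide_hashes)

print(grams)
print(fingerprint)
```

Only 5 of the 19 hashes survive selection, which shows how much of the file is ignored when fingerprints are compared.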

MOSS: Problems
Structural information is lost (e.g., whitespace, punctuation, uppercase characters, non-alphanumeric symbols).
Larger k-grams decrease execution time, but decrease sensitivity.
Most k-grams are also thrown out for faster processing, reducing accuracy.

CodeMatch: Algorithms
Statement Matching
Comment Matching
Identifier Matching
Partial Identifier Matching
Instruction Sequence Matching

CodeMatch: Statement Matching


File 1

    /* begin routine */
    void fdiv(
        char *fname, // file name
        char *path) /* path */
    {
        int Index1, j;
        // find the file extension
        j = strlen(fname);
        while (1)

File 2

    /* find the file extension */
    void file_divide(
        char *fname,
        char *path)
    {
        int i, j; // begin routine
        j = strlen(fname);
        while (1) // loop here
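
The slide shows only the two files, not the matching procedure. As a minimal sketch, assuming that statement matching means comparing source lines after comments and extra whitespace are removed, the lines shared by the two files could be found as follows. The helper names are hypothetical, and the order of lines within each file does not affect the result.

```python
import re

FILE_1 = """/* begin routine */
void fdiv(
    char *fname, // file name
    char *path) /* path */
{
    int Index1, j;
    // find the file extension
    j = strlen(fname);
    while (1)
"""

FILE_2 = """/* find the file extension */
void file_divide(
    char *fname,
    char *path)
{
    int i, j; // begin routine
    j = strlen(fname);
    while (1) // loop here
"""

def statements(source: str) -> list[str]:
    """Return non-empty lines with comments removed and whitespace collapsed."""
    # Naive comment removal; comment markers inside string literals are not handled.
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    lines = (" ".join(line.split()) for line in source.splitlines())
    return [line for line in lines if line]

def matching_statements(a: str, b: str) -> list[str]:
    """Statements that appear in both files, regardless of position."""
    return sorted(set(statements(a)) & set(statements(b)))

print(matching_statements(FILE_1, FILE_2))
```

Under this assumption the parameter declarations, the opening brace, `j = strlen(fname);`, and `while (1)` match, while the renamed routine and the renamed local variable do not.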

CodeMatch: Comment Matching


File 1

    /* begin routine */
    void fdiv(
        char *fname, // file name
        char *path) /* path */
    {
        int Index1, j;
        // find the file extension
        j = strlen(fname);
        while (1)

File 2

    /* find the file extension */
    void file_divide(
        char *fname,
        char *path)
    {
        int i, j; // begin routine
        j = strlen(fname);
        while (1) // loop here
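
The slide does not spell out how comments are compared. One plausible reading, sketched here as an assumption, is that the comment text is compared independently of the delimiter style (`//` versus `/* */`) and of where the comment sits in the file. The function and constant names are hypothetical.

```python
import re

FILE_1 = """/* begin routine */
void fdiv(
    char *fname, // file name
    char *path) /* path */
{
    int Index1, j;
    // find the file extension
    j = strlen(fname);
    while (1)
"""

FILE_2 = """/* find the file extension */
void file_divide(
    char *fname,
    char *path)
{
    int i, j; // begin routine
    j = strlen(fname);
    while (1) // loop here
"""

# Group 1 captures block-comment text, group 2 captures line-comment text.
COMMENT = re.compile(r"/\*(.*?)\*/|//([^\n]*)", re.DOTALL)

def comments(source: str) -> set[str]:
    """Extract comment text, ignoring delimiter style and surrounding whitespace."""
    found = set()
    for block, line in COMMENT.findall(source):
        text = " ".join((block or line).split())
        if text:
            found.add(text)
    return found

matching = comments(FILE_1) & comments(FILE_2)
print(sorted(matching))
```

With the two files from the slide, "begin routine" and "find the file extension" match even though each file uses a different comment style for them.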

CodeMatch: Identifier Matching


Counts the number of matching words that are not programming language keywords.
Requires a list of keywords to exclude.
Matching numerals are given less weight than matching alphabetic words.
Finds matching identifiers: routines, variables, constants, etc.
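
A minimal sketch of this idea follows. The keyword list is a hypothetical partial one, the weights (1.0 for alphabetic words, 0.5 for numerals) are invented for illustration since the slide gives no values, and the input is assumed to have had its comments removed already.

```python
import re

# Hypothetical partial keyword list; a real tool needs the full set for the language.
C_KEYWORDS = {"char", "int", "void", "while", "if", "else", "for", "return"}

# Assumed weights: the slide says numerals count less but gives no numbers.
WORD_WEIGHT = 1.0
NUMERAL_WEIGHT = 0.5

def words(source: str) -> set[str]:
    """All identifiers and numerals in the source, minus language keywords."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", source)) - C_KEYWORDS

def identifier_score(a: str, b: str) -> float:
    """Weighted count of words that appear in both sources."""
    common = words(a) & words(b)
    return sum(NUMERAL_WEIGHT if w.isdigit() else WORD_WEIGHT for w in common)

file_1 = "void fdiv(char *fname, char *path) { int Index1, j; j = strlen(fname); while (1)"
file_2 = "void file_divide(char *fname, char *path) { int i, j; j = strlen(fname); while (1)"

print(sorted(words(file_1) & words(file_2)))
print(identifier_score(file_1, file_2))
```

Here `fname`, `path`, `j`, and `strlen` match as words and `1` matches as a numeral, while the renamed routine (`fdiv` versus `file_divide`) does not.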

CodeMatch: Partial Identifier Matching


Counts the number of partially matching words that are not programming language keywords.
Requires a list of keywords to exclude.
Matching numerals are given less weight than matching alphabetic words.
Finds disguised identifiers: routines, variables, constants, etc.
For example, abc partially matches abc1 and xxxabc.
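
The example suggests a containment test: one word appears inside the other. The sketch below assumes exactly that and nothing more. The minimum length of 3 is a hypothetical guard against single-letter names such as `i` matching almost everything, not something stated on the slide.

```python
def partially_matches(a: str, b: str) -> bool:
    """True when one word contains the other and the two are not identical."""
    return a != b and (a in b or b in a)

def partial_matches(words_a: set[str], words_b: set[str], min_len: int = 3) -> list[tuple[str, str]]:
    """All (word_a, word_b) pairs that partially match, ignoring very short words."""
    return sorted(
        (a, b)
        for a in words_a
        for b in words_b
        if min(len(a), len(b)) >= min_len and partially_matches(a, b)
    )

print(partial_matches({"abc", "count"}, {"abc1", "xxxabc", "total"}))
```

An exact match is excluded here because it would already be counted by identifier matching.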

CodeMatch: Instruction Sequence Matching


File 1

    /* begin routine */
    void fdiv(
        char *fname, // file name
        char *path) /* path */
    {
        int Index1, j;
        // find the file extension
        j = strlen(fname);
        while (1)

File 2

    /* find the file extension */
    void file_divide(
        char *fname,
        char *path)
    {
        int i, j; // begin routine
        j = strlen(fname);
        while (1) // loop here

CodeMatch: Total Match Score

t = k_w w + k_p p + k_s s + k_c c + k_q q
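
The total is a weighted sum of five individual scores. The slide does not say which letter belongs to which of the five algorithms, and it gives no weight values, so the numbers below are purely illustrative.

```python
TERMS = ("w", "p", "s", "c", "q")

def total_match_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Compute t = k_w*w + k_p*p + k_s*s + k_c*c + k_q*q."""
    return sum(weights[name] * scores[name] for name in TERMS)

# Hypothetical per-algorithm scores and weights, chosen only to show the arithmetic.
scores = {"w": 10, "p": 4, "s": 6, "c": 2, "q": 3}
weights = {"w": 1.0, "p": 0.5, "s": 2.0, "c": 2.0, "q": 3.0}

print(total_match_score(scores, weights))  # 10 + 2 + 12 + 4 + 9 = 37.0
```

Raising one weight makes the corresponding kind of evidence count for more in the ranking of file pairs.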

CodeMatch Basic Report


Comparing files in folder D:\CodeMatch\Code Development\test\C test 2\files 1
To files in folder D:\CodeMatch\Code Development\test\C test 2\files 2

D:\CodeMatch\Code Development\test\C test 2\files 1\bpf_dump.c

Match Score   Compared To File
2910          D:\CodeMatch\Code Development\test\C test 2\files 2\bpf_dump.c
374           D:\CodeMatch\Code Development\test\C test 2\files 2\W32NReg.c
374           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (variable names changed).c
374           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (no comments).c

D:\CodeMatch\Code Development\test\C test 2\files 1\bpf_filter.c

Match Score   Compared To File
606           D:\CodeMatch\Code Development\test\C test 2\files 2\W32NReg.c
606           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (no comments).c
572           D:\CodeMatch\Code Development\test\C test 2\files 2\test\W32NReg (variable names changed).c
398           D:\CodeMatch\Code Development\test\C test 2\files 2\bpf_dump.c

CodeMatch Detailed Report


Comparing file 1: D:\CodeMatch\\test\C test 2\files 1\bpf_dump.c
To file 2: D:\CodeMatch\C test 2\files 2\test\W32Nreg.c

Matching source lines:

File 1 Line #   File 2 Line #   Source line
21              1               #include <windows.h>
22              3               #include <stdio.h>
24              7               #include "WiNDIS.h"

Matching comment lines:

File 1 Line #   File 2 Line #   Comment line
3               3               * The Regents of the University of California. All rights reserved.
10              5               * Redistribution and use in source and binary forms, with or without

Longest matching semantic sequence:

File 1 Line #   File 2 Line #   Number of matching lines
21              1               3

Matching words:
stdio WiNDIS windows

Matching partial words:
0x windows

Competitive Evaluation: Test


GNU C compiler GCC version 3.3.2

Less than 100 lines
Between 100 and 1000 lines
Greater than 1000 lines

Modify files

Remove all comments
Rename all identifiers
Rearrange routines within the file
Rearrange lines of code within routines in the file
Do all of the above
Remove all the code but leave the comments
Create one file that has exactly one routine from each of the other files in the same category

Competitive Evaluation: Accuracy


Program     Copied files found
CodeMatch   95% (200 of 210)
JPlag       80% (169 of 210)
MOSS        70% (146 of 210)
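
The percentages follow from the raw counts, rounded to the nearest whole percent. A quick check, using only the numbers in the table:

```python
TOTAL_COPIED_FILES = 210
found = {"CodeMatch": 200, "JPlag": 169, "MOSS": 146}

# Detection rate per tool, rounded to the nearest whole percent.
percentages = {tool: round(100 * count / TOTAL_COPIED_FILES) for tool, count in found.items()}

for tool, pct in percentages.items():
    print(f"{tool}: {pct}% ({found[tool]} of {TOTAL_COPIED_FILES})")
```

In absolute terms, CodeMatch found 31 more copied files than JPlag and 54 more than MOSS in this test.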

Conclusion
Previous tools miss important matches
CodeMatch
    Statement Matching
    Comment Matching
    Identifier Matching
    Partial Identifier Matching
    Instruction Sequence Matching

More accurate than other tools

Download CodeMatch For Free

www.ZeidmanConsulting.com
