
CSCLab

BibPro: A Citation Parser System

Introduction
Integrating bibliographical information

Metadata
  author, title of the article, title of the book containing the paper, journal name, month and year of publication, etc.
Citation string
  Thousands of variations
  More than 2,000 formats in EndNote

Citation Parsing Problem

automatically recognize individual fields from a given citation string

A template-based citation parser

www.iis.sinica.edu.tw

Goal
Citation
Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.

BibPro

MetaData
Author: Chomsky, Noam
Title: Three models for the description of language
Venue: IRE Transactions on Information Theory
Volume: 2
Issue: 3
Page: 113-124
Date: 1956

Machine Learning
Conditional Random Field
  F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.

Support Vector Machine
  Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 37-48.

Hidden Markov Model
  K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42.
  Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 49-60.
  Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.


Knowledge Base
A tree-like knowledge representation scheme that organizes the knowledge of reference concepts in a hierarchical fashion

Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.

A knowledge base automatically constructed from an existing set of sample metadata records of a given area

E. Cortez, A. S. da Silva, M. A. Gonçalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM.


Template Base
Keep citation style as a template

ParaCite http://paracite.eprints.org/
I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.
Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops, 2008, pp. 1175-1180.


Key Idea
Encode a citation string into a template for BLAST

Keep the citation style information in a protein sequence

Utilize bioinformatics sequence tools
  BLAST

Use domain knowledge
  Reserved words
  Knowledge database (optional)
  Blocking rules (common sense knowledge)


Question
How many symbols can be used in a protein sequence?
  23 symbols are used in BLAST
How many fields should be extracted from a citation?
  Choose the most commonly used fields
Which punctuation marks are treated as partition marks?
  Based on domain knowledge
How do we transform a citation into a protein sequence and retain its structural features?
  Define an encoding table

Encoding Procedure
Encoding Citation
Regular Expression Extraction
  Http, ISBN

Keyword Encoding
  Author, Venue, Editor, Volume, Page, Date, Publisher, Institution

Database Encoding (optional)
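
The regular expression extraction step can be illustrated with a minimal Python sketch. The two patterns and the function name `extract_by_regex` are assumptions made for illustration; the slides only state that Http and ISBN items are pulled out by regular expressions and do not give the actual patterns.

```python
import re

# Hypothetical patterns: the slides do not specify the real ones.
URL_RE = re.compile(r"https?://\S+")
ISBN_RE = re.compile(r"\bISBN[:\s]*([0-9Xx][0-9Xx\- ]{8,15}[0-9Xx])\b")


def extract_by_regex(citation):
    """Return the URL and ISBN items found in a citation string,
    plus the citation text with those items removed."""
    found = {
        "url": URL_RE.findall(citation),
        "isbn": ISBN_RE.findall(citation),
    }
    rest = ISBN_RE.sub(" ", URL_RE.sub(" ", citation))
    return found, " ".join(rest.split())


found, rest = extract_by_regex(
    "J. Doe. A Book. ISBN 0-306-40615-2. http://example.org/book"
)
print(found)
```

Removing these items first keeps their dots, slashes, and digits from being encoded as partition marks or numerals in the later keyword encoding step.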


Encoding Table
Classification                     Symbol   Representation
Extracted Field Contents           A        Author
                                   T        Title
                                   L        Venue (Journal, BookTitle, Technical Report)
                                   V        Volume
                                   W        Issue
                                   P        Page
                                   Y        Date (Year, Month)
Unknown Field                      X        Single Unknown
                                   B        Continuous Unknown
                                   N        Numeral
Partition Mark (Punctuation Mark)  R        ,
                                   D        .
                                   G        "
                                   E        '
                                   C        :
                                   Z        ;
                                   H        -
Partition Mark (Brackets)          I        ( [ < {
                                   K        ) ] > }
Partition Mark (Misc)              Q        / _ ! @ # $ % ^ & * + = \ | ? ~
Others                             F        Editor
                                   S        Institution
                                   M        Publisher
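
As a rough illustration, the encoding table can be held in a plain Python dictionary. The names `ENCODE_TABLE` and `PUNCT_TO_SYMBOL` are hypothetical, and the `-` entry for `H` is inferred from the tokenization example rather than stated in the table. The 23 symbols are exactly the 20 standard amino acid letters plus `B`, `Z`, and `X`, which is why an encoded citation can be handled as a protein sequence.

```python
# Symbol -> meaning, transcribed from the encoding table.
# Assumption: "-" for H is inferred from the tokenization example.
ENCODE_TABLE = {
    "A": "Author", "T": "Title", "L": "Venue", "V": "Volume",
    "W": "Issue", "P": "Page", "Y": "Date",
    "X": "Single Unknown", "B": "Continuous Unknown", "N": "Numeral",
    "R": ",", "D": ".", "G": '"', "E": "'", "C": ":", "Z": ";", "H": "-",
    "I": "([<{", "K": ")]>}",
    "Q": "/_!@#$%^&*+=\\|?~",
    "F": "Editor", "S": "Institution", "M": "Publisher",
}

# Reverse lookup for partition marks: character -> symbol.
PUNCT_TO_SYMBOL = {
    ch: sym
    for sym, chars in ENCODE_TABLE.items()
    if sym in "RDGECZHIKQ"
    for ch in chars
}

# The protein alphabet: 20 standard amino acids plus B, Z and X.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY") | set("BZX")

print(len(ENCODE_TABLE), set(ENCODE_TABLE) == PROTEIN_ALPHABET)
```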


Encoding Knowledge
A [AUTHOR]
  Name abbreviation

T [TITLE]
  Length of blocking

L [VENUE]
  booktitle (conference): Proceedings Proc Workshop Conf Conference Symposium Sympos Symp International Intern Annual Annu
  journaltitle: Transactions Trans Journal
  techtitle (thesis): Tech rep Rpt TR Master Masters Ph PhD Thesis thesis Dissertation dissertation

V [VOLUME]
  volume: Volume volume Vol vol Vo vo
  issue: Number number Nr nr No no NO Nos

P [PAGE]
  page: pp page pages PP Page Pages pg PG

Encoding Knowledge
Y [DATE]
  month: January February March April May June July August September October November December Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec Sept
  year: 1900-2010

F [EDITOR]
  editor: eds Eds editors Editors editor ED Ed ed edited

S [INSTITUTION]
  institution: University Univ Department Dept Corporation

M [PUBLISHER]
  publisher: Press Pub Publishers Inc Publications

Tokenizing and Encoding Citation

Tokenized citation:

M . Bianchini , P . Frasconi , and M . Gori , " Learning in multilayered networks used as autoassociators , " IEEE Transactions on Neural Networks , vol . 6 , pp . 512 - 515 , March 1995 .

Encoded tokens (token/symbol):

M/A ./D Bianchini/X ,/R P/A ./D Frasconi/X ,/R and/X M/A ./D Gori/X ,/R "/G
Learning/X in/X multilayered/X networks/X used/X as/X autoassociators/X ,/R "/G
IEEE/X Transactions/L on/X Neural/X Networks/X ,/R
vol/V ./D 6/N ,/R pp/P ./D 512/N -/H 515/N ,/R March/Y 1995/Y ./D
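
The tokenizing and encoding step shown above can be sketched in Python as follows. This is a simplified illustration, not BibPro's implementation: the keyword lists are abridged from the Encoding Knowledge slides, the function names are hypothetical, and treating a single capital letter as a name initial is an assumption about how the "name abbreviation" rule works.

```python
import re

# Keyword lists abridged from the "Encoding Knowledge" slides.
KEYWORDS = {
    "L": {"Proceedings", "Proc", "Workshop", "Conference", "Symposium",
          "Transactions", "Trans", "Journal"},
    "V": {"Volume", "volume", "Vol", "vol"},
    "W": {"Number", "number", "Nr", "nr", "No", "no"},
    "P": {"pp", "page", "pages", "Page", "Pages", "pg"},
    "Y": {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"},
    "F": {"eds", "Eds", "editor", "editors", "ed", "Ed", "edited"},
    "S": {"University", "Univ", "Department", "Dept", "Corporation"},
    "M": {"Press", "Pub", "Publishers", "Inc", "Publications"},
}
PUNCTUATION = {",": "R", ".": "D", '"': "G", "'": "E",
               ":": "C", ";": "Z", "-": "H"}


def tokenize(citation):
    """Split into words, digit runs, and single punctuation characters."""
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", citation)


def encode_token(token):
    """Map one token to one symbol of the encoding table."""
    if token in PUNCTUATION:
        return PUNCTUATION[token]
    if token in "([<{":
        return "I"
    if token in ")]>}":
        return "K"
    for symbol, words in KEYWORDS.items():
        if token in words:
            return symbol
    if token.isdecimal():
        # Years in the range listed on the slide count as dates.
        return "Y" if 1900 <= int(token) <= 2010 else "N"
    if len(token) == 1 and token.isupper():
        return "A"  # assumption: a single capital letter is a name initial
    if not token.isalnum():
        return "Q"  # any other punctuation is "Misc"
    return "X"


def encode(citation):
    return "".join(encode_token(t) for t in tokenize(citation))


EXAMPLE = ('M. Bianchini, P. Frasconi, and M. Gori, "Learning in multilayered '
           'networks used as autoassociators," IEEE Transactions on Neural '
           'Networks, vol. 6, pp. 512-515, March 1995.')
print(encode(EXAMPLE))
```

Each token becomes exactly one symbol, so a position in the encoded string is also a token index, which is what the blocking step relies on when it records start and end positions.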


Blocking Mechanism
After encoding, the semi-structured nature of a citation shows up as recurring patterns in the symbol sequence.

Blocking rules merge each such pattern into a single unit, e.g. ADXRA -> A


Blocking(1/2)
Example citation after blocking (each group of tokens is merged into one block):

Block (tokens)                                               Symbol
M . Bianchini , P . Frasconi , and M . Gori                  A
,                                                            R
"                                                            G
Learning in multilayered networks used as autoassociators    B
,                                                            R
"                                                            G
IEEE Transactions on Neural Networks                         L
,                                                            R
vol . 6                                                      V
,                                                            R
pp . 512 - 515                                               P
,                                                            R
March 1995                                                   Y
.                                                            D


Blocking(2/2)

Index Form (one symbol per block of the example citation)

ARGBRGLRVRPRYD

Each block keeps its position information (start position and end position), e.g. A start:0 end:11
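
A minimal sketch of blocking, assuming a small set of regular-expression rules over the encoded symbol string. The rules below are hypothetical stand-ins chosen to reproduce the example; the slides do not list BibPro's actual blocking rules. The input string is the token-level encoding of the Bianchini citation.

```python
import re

# Hypothetical blocking rules, tried in this order at each position.
BLOCK_RULES = re.compile(
    r"(?P<A>^(?=[DXR]*A)(?:[ADX]|R(?=[AX]))+)"  # leading run of author names
    r"|(?P<L>X*L[XL]*)"                         # venue keyword and its neighbours
    r"|(?P<V>VD?N)"                             # "vol . 6"
    r"|(?P<P>PD?N(?:HN)?)"                      # "pp . 512 - 515"
    r"|(?P<Y>Y+)"                               # "March 1995"
    r"|(?P<B>X{2,})"                            # continuous unknown words
    r"|(?P<other>.)"                            # anything else stays as it is
)


def block(encoded):
    """Merge symbol patterns into blocks.

    Returns the index form and, for every block, a tuple
    (symbol, start, end) with inclusive token positions.
    """
    index_form, spans = [], []
    for m in BLOCK_RULES.finditer(encoded):
        symbol = m.group() if m.lastgroup == "other" else m.lastgroup
        index_form.append(symbol)
        spans.append((symbol, m.start(), m.end() - 1))
    return "".join(index_form), spans


ENCODED = "ADXRADXRXADXRGXXXXXXXRGXLXXXRVDNRPDNHNRYYD"
index_form, spans = block(ENCODED)
print(index_form)
print(spans[0])
```

Keeping the spans alongside the index form is what later allows a matched template symbol to be mapped back to the original tokens.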


Template Database
A record in the Template Database

  A citation item with both citation string and metadata
  Style Form
  Index Form

Once the template database has been constructed, BibPro can provide the citation parsing service on-the-fly
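
One way to picture a template record is the small Python sketch below. The class and method names are hypothetical, and the in-memory dictionary stands in for whatever storage BibPro actually uses; only the three record parts (citation item, Style Form, Index Form) come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TemplateRecord:
    citation: str      # original citation string
    metadata: dict     # known field values for that citation
    style_form: str    # symbol sequence with known field labels
    index_form: str    # symbol sequence as produced without the answer


class TemplateDatabase:
    """In-memory stand-in for the template database (illustration only)."""

    def __init__(self):
        self._by_index = defaultdict(list)

    def add(self, record):
        self._by_index[record.index_form].append(record)

    def exact(self, index_form):
        """Records whose index form is identical to the query's."""
        return list(self._by_index.get(index_form, []))


db = TemplateDatabase()
db.add(TemplateRecord(
    citation="M. Bianchini, P. Frasconi, and M. Gori, ...",
    metadata={"Title": "Learning in multilayered networks used as autoassociators"},
    style_form="ARGTRGLRVRPRYD",
    index_form="ARGBRGLRVRPRYD",
))
print(db.exact("ARGBRGLRVRPRYD")[0].style_form)
```

An exact lookup is only the simplest case; similar but not identical index forms are what the BLAST search is for.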


Citation Style Template


Index Form (Unknown Answer)
  The title block is still the unknown symbol B

  ARGBRGLRVRPRYD

Style Form (Known Answer)
  The title block carries its field symbol T

  ARGTRGLRVRPRYD


Parsing (Template Matching)

System Preprocess
  The template database stores pairs of STYLE FORM and INDEX FORM

Online parsing
  Citation String -> INDEX FORM -> Search Tool (BLAST) over the stored INDEX FORMs


Finding Citation Style Templates


Using Score mechanism

Finding similar citation style templates (BLAST score matrix)
  Which fields exist in the query citation
  The order of partition marks represents a citation style
  Choose by Index Form

Aligning the query citation with a citation style template (score matrix, dynamic programming)
  Content symbols map to content symbols
  Partition marks map to partition marks
  Choose the most suitable citation style template according to the alignment between Index Form and Style Form
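
The dynamic-programming alignment can be sketched with a plain Needleman-Wunsch global alignment. This sketch does not call BLAST, and the score values (+2 identical, +1 content symbol against content symbol, -1 otherwise, -2 per gap) are assumptions chosen only to express the two mapping rules above; BibPro's real score matrix is not given in the slides.

```python
CONTENT = set("ATLVWPYFSMXBN")  # field, unknown and numeral symbols


def score(a, b):
    """Assumed scores: content pairs with content, marks with identical marks."""
    if a == b:
        return 2
    if a in CONTENT and b in CONTENT:
        return 1
    return -1


def align(query, template, gap=-2):
    """Needleman-Wunsch global alignment.

    Returns (alignment score, list of aligned index pairs).
    """
    n, m = len(query), len(template)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(query[i - 1], template[j - 1]),
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover aligned positions.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score(query[i - 1], template[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


def best_template(query_index_form, style_forms):
    """Pick the style form with the highest alignment score."""
    return max(style_forms, key=lambda s: align(query_index_form, s)[0])


print(best_template("ARGBRGLRVRPRYD", ["ADYDTDLRVRPD", "ARGTRGLRVRPRYD"]))
```

Because an unknown block B is a content symbol, it can pair with T in a template at only a small penalty, while a mismatch between a content symbol and a partition mark is discouraged.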



Parsing (Alignment Extraction)


(Query) Index Form       ARGBRGLRVRPRYD
                         |||:||||||||||
(Template) Style Form    ARGTRGLRVRPRYD

The aligned Style Form labels each block of the query: the unknown block B (Learning in multilayered networks used as autoassociators) is extracted as T (Title).
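
The extraction step can be sketched as a lookup from aligned template symbols back to token positions. The function name, the `FIELD_NAMES` mapping, and the hard-coded block spans are illustrative assumptions; the spans correspond to the blocks of the Bianchini example, and the alignment here is simply position by position because both forms have the same length.

```python
FIELD_NAMES = {"A": "Author", "T": "Title", "L": "Venue", "V": "Volume",
               "W": "Issue", "P": "Page", "Y": "Date"}


def extract_fields(tokens, spans, style_form, pairs):
    """Label query blocks with the field symbols of the aligned template.

    tokens: tokenized citation
    spans:  (start, end) inclusive token positions, one per query block
    pairs:  (query block index, template symbol index) from the alignment
    """
    fields = {}
    for q_idx, t_idx in pairs:
        label = style_form[t_idx]
        if label in FIELD_NAMES:
            start, end = spans[q_idx]
            fields[FIELD_NAMES[label]] = " ".join(tokens[start:end + 1])
    return fields


tokens = ('M . Bianchini , P . Frasconi , and M . Gori , " Learning in '
          'multilayered networks used as autoassociators , " IEEE Transactions '
          'on Neural Networks , vol . 6 , pp . 512 - 515 , March 1995 .').split()
spans = [(0, 11), (12, 12), (13, 13), (14, 20), (21, 21), (22, 22), (23, 27),
         (28, 28), (29, 31), (32, 32), (33, 37), (38, 38), (39, 40), (41, 41)]
pairs = [(k, k) for k in range(14)]

fields = extract_fields(tokens, spans, "ARGTRGLRVRPRYD", pairs)
for name, value in fields.items():
    print(f"{name}: {value}")
```

The extracted values still contain keywords and spacing such as "vol . 6"; cleaning those up would be the job of a post-processing step.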


System Architecture
Template Generating System
  BibTeX / Search Engine (CiteSeer) -> Citations + MetaData -> Encoding Citation -> PreProcessing (Blocking) -> INDEX FORM + STYLE FORM -> Template DataBase

Parsing System
  Query Citation -> Citation PreProcessing -> Encoding Citation -> PreProcessing (Blocking) -> INDEX FORM -> Template Filter / Template Matching (BLAST) -> STYLE FORM -> Alignment Extraction -> Post Processing -> MetaData

Shared resources
  Encode Table (DataBase), Blocking Rule, Template DataBase


Experiment
Dataset

INFOMAP Dataset
  A total of 160,000 citation records were collected from digital libraries on the Web
  Citation string data was generated for each of the six citation styles (APA, IEEE, ACM, MISQ, JMIS, and ISR)
Cora Dataset
  500 records with diverse citation styles
Flux-CIM Dataset
  2000 HS-domain records
  300 CS-domain records

Experiment
Evaluation

Token-Level
  A is the number of true positive tokens
  B is the number of false negative tokens
  C is the number of false positive tokens
  D is the number of true negative tokens

\[ \text{Precision} = \frac{A}{A + C}, \qquad \text{Recall} = \frac{A}{A + B} \]

\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Field-Level

\[ \text{Accuracy} = \frac{\text{Number of correctly extracted fields}}{\text{Total number of fields}} \]
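
The evaluation measures translate directly into code. The sketch below uses the slide's A, B, C counts; the function names and the sample counts are made up for illustration and are not results from the experiments.

```python
def token_metrics(a, b, c):
    """Token-level precision, recall and F-measure.

    a: true positive tokens, b: false negative tokens,
    c: false positive tokens (the slide's A, B, C).
    """
    precision = a / (a + c) if a + c else 0.0
    recall = a / (a + b) if a + b else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def field_accuracy(correct_fields, total_fields):
    """Field-level accuracy: correctly extracted fields over all fields."""
    return correct_fields / total_fields if total_fields else 0.0


# Made-up counts, only to show the call.
p, r, f = token_metrics(a=90, b=10, c=5)
print(f"precision={p:.4f} recall={r:.4f} f={f:.4f}")
print(f"accuracy={field_accuracy(47, 50):.2f}")
```

Token-level scores reward partially correct fields, while field-level accuracy counts a field only when it is extracted correctly as a whole, which is why the accuracy column is the lowest one in the result tables.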


INFOMAP Dataset Result


Field     Token-Level Precision   Token-Level Recall   Token-Level F-Measure   Field-Level Accuracy
Author    99.38%                  99.37%               99.38%                  98.14%
Title     99.58%                  97.23%               98.39%                  95.02%
Venue     98.00%                  97.85%               97.92%                  96.36%
Volume    96.44%                  97.83%               98.90%                  97.36%
Issue     98.89%                  95.58%               97.21%                  94.04%
Page      99.58%                  98.93%               99.25%                  98.16%
Date      99.29%                  99.59%               99.44%                  99.16%
Average   98.74%                  98.06%               98.64%                  96.89%

Cora Dataset Result


Field     Token-Level Precision   Token-Level Recall   Token-Level F-Measure   Field-Level Accuracy
Author    95.87%                  97.75%               96.79%                  90.99%
Title     97.45%                  95.15%               96.29%                  90.46%
Venue     91.77%                  91.21%               91.47%                  79.77%
Volume    87.57%                  87.94%               87.72%                  85.64%
Issue     79.96%                  93.85%               86.22%                  77.51%
Page      97.15%                  97.01%               97.07%                  95.78%
Date      96.91%                  97.16%               97.03%                  94.04%
Average   92.38%                  94.30%               93.22%                  87.74%

Flux-CIM HS Domain Dataset


Field     Token-Level Precision   Token-Level Recall   Token-Level F-Measure   Field-Level Accuracy
Author    93.45%                  99.74%               96.49%                  93.30%
Title     97.45%                  95.11%               96.26%                  92.09%
Venue     96.21%                  99.29%               97.71%                  98.65%
Volume    99.70%                  98.64%               99.17%                  98.85%
Issue     n/a                     n/a                  n/a                     n/a
Page      100.00%                 97.07%               98.51%                  96.99%
Date      98.38%                  100.00%              99.18%                  99.00%
Average   97.53%                  98.31%               97.89%                  96.48%

Flux-CIM CS Domain Dataset


Field     Token-Level Precision   Token-Level Recall   Token-Level F-Measure   Field-Level Accuracy
Author    97.89%                  98.26%               98.05%                  96.64%
Title     99.30%                  98.12%               98.70%                  96.32%
Venue     98.40%                  98.80%               98.60%                  90.89%
Volume    83.71%                  85.77%               84.71%                  82.13%
Issue     89.01%                  88.56%               86.90%                  87.75%
Page      99.47%                  96.91%               98.17%                  96.17%
Date      91.74%                  98.71%               95.07%                  89.59%
Average   94.22%                  95.02%               94.31%                  91.35%

Conclusion
Citation parsing is a challenging problem
  Diversity in citation formats

We present a template-based parser
  Parser System: http://csclws.iis.sinica.edu.tw:8080/input.jsp
  Template Generator System: http://csclws.iis.sinica.edu.tw:8080/tpin.jsp


Reference
F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.
Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 37-48.
K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42.
Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, 49-60.
Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.
E. Cortez, A. S. da Silva, M. A. Gonçalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM.
I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.
Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops, 2008, pp. 1175-1180.
Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.
S. F. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman. A basic local alignment search tool. J. Mol. Biol., 215, 1990, 403-410.
Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 1970, 443-453.
