Introduction
Integrating bibliographical information
Metadata: author, title of the article, title of the book containing the paper, journal name, month and year of publication, etc.
Citation string: thousands of variations
www.iis.sinica.edu.tw
Goal
Citation
Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.
BibPro
MetaData
Author: Chomsky, Noam
Title: Three models for the description of language
Venue: IRE Transactions on Information Theory
Volume: 2
Issue: 3
Page: 113-124
Date: 1956
Machine Learning
Conditional Random Field
F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.
Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48.
K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42.
Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60.
Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.
Knowledge Base
A tree-like knowledge representation scheme that organizes the knowledge of reference concepts in a hierarchical fashion
Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.
A knowledge base automatically constructed from an existing set of sample metadata records of a given area
E. Cortez, A. S. da Silva, M. A. Gonçalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM.
Template Base
Keep citation style as a template
ParaCite http://paracite.eprints.org/
I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.
Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops (AINAW), 2008, pp. 1175-1180.
Key Idea
Encode a citation string into a template for BLAST
  Keeping citation style information in a protein sequence
  Utilizing bioinformatics sequence tools
    BLAST
Using Domain Knowledge
  Reserved words
  Knowledge database (optional)
  Blocking rules (common sense knowledge)
Question
How many symbols can be used in a protein sequence?
  23 symbols are used in BLAST
How many fields should be extracted from a citation?
  Choose the most commonly used fields
Which punctuation marks are treated as partition marks?
  Based on domain knowledge
How do we transform a citation into a protein sequence and retain its structural features?
  Define an encoding table
Encoding Procedure
Encoding Citation
Regular Expression Extraction: Http, ISBN
Keyword Encoding: Author, Venue, Editor, Volume, Page, Date, Publisher, Institution
Encoding Table
Classification              Symbol  Representation
Extracted Field Contents    A       Author
                            T       Title
                            L       Venue (Journal, BookTitle, Technical Report)
                            V       Volume
                            W       Issue
                            P       Page
                            Y       Date (Year, Month)
Unknown Field               X       Single Unknown
                            B       Continuous Unknown
                            N       Numeral
Partition Mark
  Punctuation Mark          R       ,
                            D       .
                            G       "
                            E       '
                            C       :
                            Z       ;
                            H       -
  Brackets                  I       ( [ < {
                            K       ) ] > }
  Misc                      Q       / _ ! @ # $ % ^ & * + = \ | ? ~
Others                      F       Editor
                            S       Institution
                            M       Publisher
Encoding Knowledge
A [AUTHOR]: name abbreviation
T [TITLE]: length of blocking
L [VENUE]
  booktitle (conference): Proceedings, Proc, Workshop, Conf, Conference, Symposium, Sympos, Symp, International, Intern, Annual, Annu
  journaltitle: Transactions, Trans, Journal
  techtitle (thesis): Tech, rep, Rpt, TR, Master, Masters, Ph, PhD, Thesis, thesis, Dissertation, dissertation
V [VOLUME]
  volume: Volume, volume, Vol, vol, Vo, vo
  issue: Number, number, Nr, nr, No, no, NO, Nos
P [PAGE]
  page: pp, page, pages, PP, Page, Pages, pg, PG
Encoding Knowledge
Y [DATE]
  month: January, February, March, April, May, June, July, August, September, October, November, December, Jan, Feb, Mar, Apr, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Sept
  year: 1900-2010
F [EDITOR]
  editor: eds, Eds, editors, Editors, editor, ED, Ed, ed, edited
S [INSTITUTION]
M [PUBLISHER]
Example citation (tokenized):
M . Bianchini , P . Frasconi , and M . Gori , " Learning in multilayered networks used as autoassociators , " IEEE Transactions on Neural Networks , vol . 6 , pp . 512 - 515 , March 1995 .
Token encoding (token symbol):
M A   . D   Bianchini X   , R
P A   . D   Frasconi X   , R
and X   M A   . D   Gori X   , R   " G
Learning X   in X   multilayered X   networks X   used X   as X   autoassociators X   , R   " G
IEEE X   Transactions L   on X   Neural X   Networks X   , R
vol V   . D   6 N   , R
pp P   . D   512 N   - H   515 N   , R
March Y   1995 Y   . D
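The encoding step can be sketched in a few lines of Python. This is a minimal illustration, not BibPro's implementation: the tokenizer regex, the function name `encode`, the reduced reserved-word lists, and the reading of H as the hyphen are assumptions made for this example, and the regular-expression pre-extraction of Http and ISBN is omitted.

```python
import re

# Reserved words: a subset of the "Encoding Knowledge" lists (assumed sufficient here).
RESERVED = {
    "L": {"Proceedings", "Proc", "Workshop", "Conference", "Conf", "Symposium",
          "Transactions", "Trans", "Journal"},
    "V": {"Volume", "volume", "Vol", "vol"},
    "W": {"Number", "number", "Nr", "No"},
    "P": {"pp", "page", "pages", "Page", "Pages"},
    "Y": {"January", "February", "March", "April", "June", "July", "August",
          "September", "October", "November", "December"},
    "F": {"eds", "Eds", "editor", "editors", "Editors", "edited"},
}
# Partition marks from the encoding table; H as "-" is an assumption.
PUNCT = {",": "R", ".": "D", '"': "G", "'": "E", ":": "C", ";": "Z", "-": "H"}


def encode_token(token: str) -> str:
    """Map one token to a single encoding symbol."""
    if token in PUNCT:
        return PUNCT[token]
    if token in "([<{":
        return "I"
    if token in ")]>}":
        return "K"
    if token.isdigit():
        # Years in the 1900-2010 range count as dates, other numbers as numerals.
        return "Y" if 1900 <= int(token) <= 2010 else "N"
    for symbol, words in RESERVED.items():
        if token in words:
            return symbol
    if len(token) == 1 and token.isupper():
        return "A"  # name abbreviation such as "M" in "M. Bianchini"
    if token.isalpha():
        return "X"  # single unknown word
    return "Q"      # miscellaneous character


def encode(citation: str) -> str:
    """Tokenize a citation string and encode it as a protein-like sequence."""
    tokens = re.findall(r"[A-Za-z]+|\d+|\S", citation)
    return "".join(encode_token(t) for t in tokens)


CITATION = ('M. Bianchini, P. Frasconi, and M. Gori, "Learning in multilayered '
            'networks used as autoassociators," IEEE Transactions on Neural '
            'Networks, vol. 6, pp. 512-515, March 1995.')
print(encode(CITATION))
```

Each token becomes exactly one symbol, so a position in the encoded sequence can be mapped back to the token it came from.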
Blocking Mechanism
After encoding the citation, we can exploit the semi-structured nature of citations through special patterns in the encoded sequence
Blocking rules merge each special pattern into a single unit, e.g. ADXRA → A
Blocking (1/2)
[Figure: the encoded example citation before and after the blocking rules are applied]
Blocking (2/2)
[Figure: the blocked example citation]
Index Form
ARGBRGLRVRPRYD
(Keep the blocking area information, i.e. the start and end position of each block, e.g. A start: 0, end: 11)
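As a rough sketch of how blocking could produce the Index Form, the Python below merges runs of symbols into single units and records each block's start and end position. The regular expressions in `BLOCK_RULES` are hypothetical rules written to fit this one example; the slides do not list the actual blocking rules, and the encoded input string is the token-by-token encoding of the example citation under the encoding table.

```python
import re

# Hypothetical blocking rules (assumptions), tried in this order at each position.
BLOCK_RULES = [
    ("A", r"ADX(?:RX?ADX)*"),  # author list: initial, dot, surname, repeated
    ("L", r"X*LX*"),           # venue: reserved word with surrounding unknowns
    ("V", r"VD?N"),            # volume keyword followed by a numeral
    ("P", r"PD?N(?:HN)?"),     # page keyword followed by a numeral or a range
    ("Y", r"Y+"),              # consecutive date tokens
    ("B", r"X{2,}"),           # continuous unknown words
]
PATTERN = re.compile(
    "|".join(f"(?P<{name}>{rule})" for name, rule in BLOCK_RULES) + "|(?P<single>.)"
)


def block(encoded: str):
    """Return (index_form, blocks), where blocks are (symbol, start, end) triples."""
    blocks = []
    for match in PATTERN.finditer(encoded):
        # Symbols that match no rule pass through unchanged.
        symbol = match.group() if match.lastgroup == "single" else match.lastgroup
        blocks.append((symbol, match.start(), match.end() - 1))
    index_form = "".join(symbol for symbol, _, _ in blocks)
    return index_form, blocks


ENCODED = "ADXRADXRXADXRGXXXXXXXRGXLXXXRVDNRPDNHNRYYD"
index_form, blocks = block(ENCODED)
print(index_form)   # ARGBRGLRVRPRYD
print(blocks[0])    # ('A', 0, 11)
```

The start and end positions are token indices, so a block can later be turned back into the text of the field it covers.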
Template Database
A record in the Template Database
  A citation item with both citation string and metadata
  Style Form
  Index Form
Once the template database has been constructed, BibPro can provide the citation parsing service on-the-fly
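One way to picture a template record and its lookup is the sketch below. The names `TemplateRecord` and `TemplateDatabase`, the dictionary keyed by Index Form, and the exact-match lookup are all assumptions for illustration; the slides only state what a record contains.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TemplateRecord:
    """One template: a citation item plus its two encoded forms."""
    citation: str
    metadata: dict
    style_form: str
    index_form: str


class TemplateDatabase:
    """Hypothetical in-memory store that groups records by Index Form."""

    def __init__(self):
        self._by_index_form = defaultdict(list)

    def add(self, record: TemplateRecord) -> None:
        self._by_index_form[record.index_form].append(record)

    def candidates(self, index_form: str) -> list:
        # Exact-match lookup is an assumption; a real system may be more lenient.
        return list(self._by_index_form.get(index_form, []))


db = TemplateDatabase()
db.add(TemplateRecord(
    citation='M. Bianchini, P. Frasconi, and M. Gori, "Learning in multilayered '
             'networks used as autoassociators," IEEE Transactions on Neural '
             'Networks, vol. 6, pp. 512-515, March 1995.',
    metadata={"venue": "IEEE Transactions on Neural Networks", "volume": "6"},
    style_form="ARGTRGLRVRPRYD",
    index_form="ARGBRGLRVRPRYD",
))
print(len(db.candidates("ARGBRGLRVRPRYD")))
```

Building the store once up front is what lets the online step reduce to an encode, block, and lookup sequence.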
System Preprocess
  Template records, each with a STYLE FORM and an INDEX FORM
Online parsing
  Citation String → INDEX FORM
Choose candidate templates by IndexForm
  Which fields exist in the query citation
  The order of partition marks represents a citation style
Align the query citation with the citation style templates
  Score matrix (dynamic programming)
  Content symbols map to content symbols
  Partition marks map to partition marks
Choose the most suitable citation style template according to the alignment between IndexForm and StyleForm
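The alignment step can be illustrated with a small global alignment (Needleman-Wunsch style) between a query IndexForm and candidate StyleForms. The scoring values, the gap penalty, and the function names below are assumptions chosen so that content symbols prefer content symbols and partition marks prefer partition marks; they are not the score matrix used by BibPro.

```python
CONTENT = set("ATLVWPYXBNFSM")  # content symbols; everything else is a partition mark
GAP = -1                        # assumed gap penalty


def pair_score(a: str, b: str) -> int:
    """Assumed scores: identical > same class > different class."""
    if a == b:
        return 2
    a_is_content, b_is_content = a in CONTENT, b in CONTENT
    if a_is_content and b_is_content:
        return 1    # e.g. unknown block B aligned with title T
    if a_is_content != b_is_content:
        return -2   # content symbol against partition mark
    return -1       # two different partition marks


def align(index_form: str, style_form: str):
    """Global alignment by dynamic programming; returns (score, aligned pairs)."""
    n, m = len(index_form), len(style_form)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(index_form[i - 1], style_form[j - 1]),
                score[i - 1][j] + GAP,
                score[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right corner; None marks a gap.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(
                index_form[i - 1], style_form[j - 1]):
            pairs.append((index_form[i - 1], style_form[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            pairs.append((index_form[i - 1], None))
            i -= 1
        else:
            pairs.append((None, style_form[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


def best_template(index_form: str, style_forms: list) -> str:
    """Pick the style form with the highest alignment score."""
    return max(style_forms, key=lambda style: align(index_form, style)[0])


total, pairs = align("ARGBRGLRVRPRYD", "ARGTRGLRVRPRYD")
print(total, pairs[3])
print(best_template("ARGBRGLRVRPRYD", ["ARTRLRYD", "ARGTRGLRVRPRYD"]))
```

In this example the unknown block B lines up with T in the template, which is how an unlabeled block in the query can be assigned to the title field.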
Style Form
ARGTRGLRVRPRYD
System Architecture
Template Generating System: BibTex → Search Engine (CiteSeer) → Citations and MetaData → Encoding Citation → PreProcessing (Blocking) → INDEX FORM and STYLE FORM → Template DataBase
Parsing System: Query Citation → Citation PreProcessing → Encoding Citation → PreProcessing (Blocking, using the Blocking Rule) → INDEX FORM → Alignment Extraction (against the STYLE FORM from the Template DataBase) → Post Processing → MetaData
Experiment
Dataset
INFOMAP Dataset
  A total of 160,000 citation records were collected from digital libraries on the Web
  Citation string data was generated for each of the six citation styles (APA, IEEE, ACM, MISQ, JMIS, and ISR)
Cora Dataset
  500 records with diverse citation styles
Flux-CIM Dataset
  2000 HS-domain records
  300 CS-domain records
Experiment
Evaluation
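Token-Level
  A is the number of true positive tokens
  B is the number of false negative tokens
  C is the number of false positive tokens
  D is the number of true negative tokens
  $\text{Precision} = \frac{A}{A + C}$
  $\text{Recall} = \frac{A}{A + B}$
  $\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$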
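A short Python sketch of the token-level measures follows, using the A, B, C counts defined above. The function names are made up for this example, the F-measure is taken to be the usual harmonic mean of precision and recall, and the counts in the first call are invented purely to show the arithmetic.

```python
def precision(a: int, c: int) -> float:
    """A / (A + C): true positives over all tokens labeled positive."""
    return a / (a + c) if a + c else 0.0


def recall(a: int, b: int) -> float:
    """A / (A + B): true positives over all tokens that should be positive."""
    return a / (a + b) if a + b else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Made-up counts: 90 true positives, 10 false negatives, 30 false positives.
p = precision(90, 30)
r = recall(90, 10)
print(p, r, f_measure(p, r))

# Works on percentages as well, e.g. a precision of 97.45 and a recall of 95.15.
print(round(f_measure(97.45, 95.15), 2))
```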
Field-Level
Field     Precision  Recall   F-measure  Field-level
Average   98.74%     98.06%   98.64%     96.89%
Field     Precision  Recall   F-measure  Field-level
Title     97.45%     95.15%   96.29%     90.46%
Venue     91.77%     91.21%   91.47%     79.77%
Volume    87.57%     87.94%   87.72%     85.64%
Issue     79.96%     93.85%   86.22%     77.51%
Page      97.15%     97.01%   97.07%     95.78%
Date      96.91%     97.16%   97.03%     94.04%
Average   92.38%     94.30%   93.22%     87.74%
Field     Precision  Recall   F-measure  Field-level
Title     97.45%     95.11%   96.26%     92.09%
Venue     96.21%     99.29%   97.71%     98.65%
Volume    99.70%     98.64%   99.17%     98.85%
Issue     n/a        n/a      n/a        n/a
Page      100.00%    97.07%   98.51%     96.99%
Date      98.38%     100.00%  99.18%     99.00%
Average   97.53%     98.31%   97.89%     96.48%
Field     Precision  Recall   F-measure  Field-level
Author    97.89%     98.26%   98.05%     96.64%
Title     99.30%     98.12%   98.70%     96.32%
Venue     98.40%     98.80%   98.60%     90.89%
Volume    83.71%     85.77%   84.71%     82.13%
Issue     89.01%     88.56%   86.90%     87.75%
Page      99.47%     96.91%   98.17%     96.17%
Date      91.74%     98.71%   95.07%     89.59%
Average   94.22%     95.02%   94.31%     91.35%
Conclusion
Citation parsing is a challenging problem
Reference
F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.
Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48.
K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42.
Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60.
Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.
E. Cortez, A. S. da Silva, M. A. Gonçalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM.
I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.
Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops (AINAW), 2008, pp. 1175-1180.
Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.
S. F. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman. A basic local alignment search tool. J. Mol. Biol., 215, 1990, 403-410.
Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 1970, 443-453.
CSCLab