
Getting Started with NLTK

An Introduction to NLTK
Sreejith S
srssreejith@gmail.com
@tweet2sree
FOSSMeet 2011, NIC Calicut
06 February 2011
Just a word about me!
Working in Natural Language Processing (NLP), Machine Learning, Text Mining
Active member of ilugcbe, http://ilugcbe.techstud.org
Works for 365Media Pvt. Ltd., Coimbatore, India
@tweet2sree, srssreejith@gmail.com
Introduction - NLP
Natural Language Processing
NLP is an inter-disciplinary subject
Computer Science
Linguistics
Statistics etc...
NLP is a subfield of Artificial Intelligence
NLP - Any kind of computer manipulation of natural language
It is a rapidly developing field of study
Everyday applications of NLP
Handwriting recognition, Machine translation, Question-answering systems, Spell checkers, Grammar checkers etc...
Natural Language Toolkit (NLTK)
A collection of Python programs, modules, data sets and tutorials to support research and development in Natural Language Processing (NLP)
Written by Steven Bird, Edward Loper and Ewan Klein
NLTK is
Free and Open source
Easy to use
Modular
Well documented
Simple and extensible
http://www.nltk.org
What You Will Learn
How simple programs can help you manipulate and analyze language
data, and how to write these programs
How key concepts from NLP and linguistics are used to describe and
analyze language
How data structures and algorithms are used in NLP
How language data is stored in standard formats, and how data can
be used to evaluate the performance of NLP techniques
Installation of NLTK
Make sure that Python 2.4, 2.5 or 2.6 is available on your system
Install the Python Tkinter package
Install NumPy, Matplotlib, Prover9, MaltParser and MegaM
Download NLTK and install it
If you are installing NLTK from source, download
http://nltk.googlecode.com/files/nltk-2.0b9.zip
Unzip it; this will create nltk-2.0b9
Open a terminal, cd into this folder, become super user and run python setup.py install
To install the data
Start the Python interpreter
>>> import nltk
>>> nltk.download()
Now you are ready to play with NLTK !!!
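Aside: nltk.download() with no arguments opens a graphical downloader; it also accepts a package or collection identifier, which is handy on a machine without a display. A minimal sketch, assuming the 'book' collection (the data used in the NLTK book examples) is available in the data index:
>>> import nltk
>>> nltk.download('book')   # fetch the corpora used in the book examples, no GUI needed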
NLTK Modules
NLTK Modules                 Functionality
nltk.corpus                  Corpora
nltk.tokenize, nltk.stem     Tokenizers, stemmers
nltk.collocations            t-test, chi-squared, mutual-info
nltk.tag                     n-gram, backoff, Brill, HMM, TnT
nltk.classify, nltk.cluster  Decision tree, Naive Bayes, K-means
nltk.chunk                   Regex, n-gram, named entity
nltk.parse                   Parsing
nltk.sem, nltk.inference     Semantic interpretation
nltk.metrics                 Evaluation metrics
nltk.probability             Probability & estimation
nltk.app, nltk.chat          Applications
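As a small taste of how these modules combine (a minimal sketch; the example sentence is made up for illustration), tokenize a string with nltk.tokenize and stem each token with nltk.stem:
>>> from nltk.tokenize import word_tokenize
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> tokens = word_tokenize("NLTK provides tokenizers and stemmers")
>>> [stemmer.stem(t) for t in tokens]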
Let us start the game
To access data for working out the examples in the book
Start the Python interpreter
Some basic workouts from the book
Concordance
>>> from nltk.book import *
>>> text1.concordance("monstrous")
Similar
>>> text1.similar("monstrous")
Dispersion plot - Positional information
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>> text4.dispersion_plot(["and", "to", "of", "with", "the"])
What is it? Why?
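Two more exploratory calls worth trying on the same texts (a quick sketch; the exact output depends on the loaded corpus):
>>> text2.common_contexts(["monstrous", "very"])
>>> text2.similar("monstrous")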
Continued...
Some basic workouts from the book
Generate
>>> text3.generate()
Counting Vocabulary
>>> len(text3)
List of distinct words, sorted in dictionary order
>>> sorted(set(text3))
Count occurrences of a particular word in a text
>>> text3.count("and")
What percentage of the text is taken up by a specific word
>>> 100 * text3.count("and") / len(text3)
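Note: under Python 2 the division above truncates to a whole number; a future import restores true division. A short sketch of that workaround, together with a book-style lexical diversity measure:
>>> from __future__ import division
>>> 100 * text3.count("and") / len(text3)
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
...
>>> lexical_diversity(text3)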
Collocation & Bigram
Collocation
A collocation is a sequence of words that occur together unusually often
e.g. red wine, strong tea
But "strong computer" is not a collocation
>>> text4.collocations()
Bigrams
List of word pairs
>>> text = "sreejith is talking about NLTK"
>>> wordlist = text.split()
>>> bigrams(wordlist)
What will happen if I do it like this?
>>> bigrams(text)
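(A hint for the question above, as a rough sketch: a Python string is itself a sequence of characters, so passing the raw string gives character pairs rather than word pairs.)
>>> list(bigrams(text))[:3]   # pairs of letters such as ('s', 'r') and ('r', 'e')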
Work with our own data
Populate our own corpora with NLTK and analyse them
>>> from nltk.corpus import PlaintextCorpusReader as ptr
>>> corpus = '/home/developer/Desktop/Sreejith'
>>> wordlist = ptr(corpus, '.*')
>>> wordlist.fileids()
Let us try to find out how to count the number of characters, words and sentences in the corpus
>>> for fid in wordlist.fileids():
...     print len(wordlist.raw(fid))
>>> for fid in wordlist.fileids():
...     print len(wordlist.words(fid))
>>> for fid in wordlist.fileids():
...     print len(wordlist.sents(fid))
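Building on the three loops above, a small sketch (assuming the wordlist reader defined earlier and Python 2's print statement) that reports average word and sentence length per file:
>>> from __future__ import division
>>> for fid in wordlist.fileids():
...     nchars = len(wordlist.raw(fid))
...     nwords = len(wordlist.words(fid))
...     nsents = len(wordlist.sents(fid))
...     print fid, nchars / nwords, nwords / nsents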
Continued...
Plotting a conditional frequency distribution
>>> text = "sreejith is talking about NLTK"
>>> words = text.split()
>>> big = bigrams(words)
>>> gd = nltk.ConditionalFreqDist(big)
>>> gd.plot()
Tabulate a CFD
>>> gd.tabulate()
Plot a frequency distribution
>>> fdist = FreqDist(text1)
>>> fdist.plot(50, cumulative=True)
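The same ConditionalFreqDist idea scales from a toy sentence to a real corpus; a sketch in the style of the book, conditioning Brown corpus words on their genre:
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> cfd.tabulate(conditions=['news', 'romance'],
...              samples=['the', 'could', 'love'])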
Normalizing Text
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form
>>> porter = nltk.PorterStemmer()
>>> word = 'running'
>>> porter.stem(word)
>>> lancaster = nltk.LancasterStemmer()
>>> lancaster.stem(word)
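The two stemmers often disagree; a quick comparison over a handful of sample words (the word list is just for illustration):
>>> words = ['running', 'flies', 'maximum', 'presumably']
>>> [porter.stem(w) for w in words]
>>> [lancaster.stem(w) for w in words]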
Normalizing Text
Lemmatization
Stemming + making sure that the resulting form is a known word in a dictionary
>>> wnl = nltk.WordNetLemmatizer()
>>> wnl.lemmatize(word)
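The WordNet lemmatizer treats its input as a noun unless told otherwise; passing a part of speech can change the result. A small sketch:
>>> wnl.lemmatize('running')           # treated as a noun, so left as 'running'
>>> wnl.lemmatize('running', pos='v')  # treated as a verb, reduced to 'run'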
POS Tagging
POS Tagging
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, or POS tagging
>>> text = nltk.word_tokenize("we are attending FOSS meet at NIC calicut")
>>> nltk.pos_tag(text)
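pos_tag returns a list of (word, tag) tuples in the Penn Treebank tagset, and the meaning of any tag can be looked up from within NLTK. A quick sketch:
>>> tagged = nltk.pos_tag(text)
>>> tagged[:3]                     # e.g. [('we', 'PRP'), ('are', 'VBP'), ...]
>>> nltk.help.upenn_tagset('VBP')  # description and examples for the VBP tag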
Parsing
Sentence Parsing
Analyzing sentence structure and creating a parse tree
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print result
>>> result.draw()
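The chunker also works on a freshly tagged sentence, so tokenizing, tagging and chunking can be chained; a sketch assuming the cp parser defined above:
>>> raw = "the little yellow dog barked at the cat"
>>> tagged = nltk.pos_tag(nltk.word_tokenize(raw))
>>> tree = cp.parse(tagged)
>>> print tree
>>> tree.draw()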
Machine Translation
Babelizer Shell
Translating a sentence from its source language to a specified language.
NLTK provides a babelize shell
>>> babelize_shell()
Babel> hello how are you?
Babel> german
Babel> run
Also try Google Translate and Yahoo! Babel Fish
What you can do
Contribute to NLTK
GSoC
NLP training
Real-time research
Reference
Steven Bird, Edward Loper and Ewan Klein
Natural Language Processing with Python
Jacob Perkins
Python Text Processing with NLTK 2.0 Cookbook
http://www.nltk.org
Questions
And finally...
Sreejith.S
