Key points of this chapter:
- 1. Accessing text corpora
- 1.1 Gutenberg corpus
- 1.2 Web and chat text
- 1.3 Brown corpus
- 1.4 Reuters corpus
- 1.5 Inaugural Address corpus
- 1.6 Annotated text corpora
- 1.7 Loading your own corpus
- 2. Conditional frequency distribution
- 3. Dictionary resources
import nltk
from nltk.book import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which hosts about 25,000 (now 36,000) free e-books.
from nltk.corpus import gutenberg
gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
emma = gutenberg.words('austen-emma.txt')
len(emma)
Text retrieval: concordance()
emma = nltk.Text(emma)
emma.concordance("surprize")
Displaying 25 of 25 matches: her father , was sometimes taken by surprize at his being still able to pity ` p them do the other any good ." " You surprize me ! Emma must do Harriet good : an Knightley actually looked red with surprize and displeasure , as he stood up , Mr . Elton , and found to his great surprize , that Mr . Elton was actually on h nd aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great , father was quite taken up with the surprize of so sudden a journey , and his fe cy , in all the favouring warmth of surprize and conjecture . She was , moreover she appeared , to have her share of surprize , introduction , and pleasure . The eir plans ; and it was an agreeable surprize to her , therefore , to perceive th talking aunt had taken me quite by surprize , it must have been the death of me of all the dialogue which ensued of surprize , and inquiry , and congratulations e the present . They might chuse to surprize her ." Mrs . Cole had many to agree the mode of it , the mystery , the surprize , is more like a young woman ' s sc t to her song took her agreeably by surprize -- a second , slightly but correctl ." " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; a nt to be considered . Emma ' s only surprize was that Jane Fairfax should accept of your admiration may take you by surprize some day or other ." Mr . Knightley ration for her will ever take me by surprize .-- I never had a thought of her in h expected by the best judges , for surprize -- but there was great joy . Mr . W e sound of at first , without great surprize . " So unreasonably early !" she wa ed Frank Churchill , with a look of surprize and displeasure .-- " That is easy ; and Emma could imagine with what surprize and mortification she must be retur ttled that Jane should go . Quite a surprize to me ! I had not the least idea !- d . It is impossible to express our surprize . He came to speak to his father on ng engaged !" 
Emma even jumped with surprize ;-- and , horror - struck , exclaim
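To make clear what concordance() reports, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name kwic and its width parameter are hypothetical, and the token list is a short hand-typed sample rather than the full novel.

```python
def kwic(tokens, target, width=4):
    """Return (left, word, right) context triples for each match of target.

    Hypothetical helper: a simplified stand-in for nltk.Text.concordance().
    """
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hand-typed sample tokens, standing in for gutenberg.words('austen-emma.txt').
sample = "You surprize me ! Emma must do Harriet good".split()
for left, word, right in kwic(sample, "surprize"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} {word} {right}")
```

The real method works on character widths rather than token counts, which is why its output lines are cut mid-word.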
NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:
from nltk.corpus import webtext
There is also a corpus of instant messaging chat sessions. Each file contains several hundred posts collected on a given date from an age-specific chat room (teens, 20s, 30s, 40s, plus a generic adults chat room). The file name contains the date, the chat room, and the number of posts. For example, 10-19-20s_706posts.xml contains 706 posts collected from the 20s chat room on October 19, 2006.
from nltk.corpus import nps_chat
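Since the file name encodes the date, room, and post count, it can be taken apart with a regular expression. This is a rough sketch that assumes names of the form 10-19-20s_706posts.xml; the pattern is inferred from that single example, and parse_chat_fileid is a hypothetical helper, not part of NLTK.

```python
import re

# Pattern inferred from one example file name (an assumption, not an NLTK rule).
NAME_RE = re.compile(r"(\d{2})-(\d{2})-([^_]+)_(\d+)posts\.xml")


def parse_chat_fileid(fileid):
    """Split a chat file name into month, day, room, and post count."""
    m = NAME_RE.fullmatch(fileid)
    if m is None:
        raise ValueError(f"unexpected file name: {fileid!r}")
    month, day, room, count = m.groups()
    return {"month": int(month), "day": int(day), "room": room, "posts": int(count)}


info = parse_chat_fileid("10-19-20s_706posts.xml")
print(info)
```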
The Brown corpus was the first million-word electronic corpus of English, created at Brown University in 1961. It contains 500 texts from different sources, categorized by genre, such as news, editorial, and so on.
from nltk.corpus import brown
The Reuters corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and split into two groups, "training" and "test". A document with a fileid such as "test/14826" therefore belongs to the test group.
from nltk.corpus import reuters
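Because the group is the part of the fileid before the slash, separating the two groups is a simple string operation. The sketch below uses a short illustrative list of fileids instead of the real reuters.fileids() result, and split_by_group is a hypothetical helper.

```python
# Illustrative file ids in the "group/number" form described above;
# in practice the list would come from reuters.fileids().
fileids = ["test/14826", "test/14828", "training/9865", "training/9880"]


def split_by_group(ids):
    """Group file ids by the part before the slash ('training' or 'test')."""
    groups = {}
    for fid in ids:
        group, _, _doc = fid.partition("/")
        groups.setdefault(group, []).append(fid)
    return groups


groups = split_by_group(fileids)
print(groups["test"])
```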
The Inaugural Address corpus is a collection of 55 texts, one for each presidential inaugural address.
from nltk.corpus import inaugural
Many text corpora contain linguistic annotations, such as part-of-speech tags, named entities, syntactic structures, and semantic roles. NLTK provides convenient ways to access several of these corpora, and it offers data packages containing corpora and corpus samples that can be downloaded free of charge for teaching and research.
If you have your own collection of text files and want to access them with the methods discussed earlier, you can load them with NLTK's PlaintextCorpusReader.
from nltk.corpus import PlaintextCorpusReader

# Assume the corpus files live in this directory
corpus_root = '/usr/share/dict'
# The arguments are the corpus directory and a regular expression matching the file names
wordlists = PlaintextCorpusReader(corpus_root, '.*')
Loading a corpus with BracketParseCorpusReader
from nltk.corpus import BracketParseCorpusReader

# Assume the corpus is stored at this location
corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
# The file pattern matches the files inside the subfolders
file_pattern = r".*/wsj_.*\.mrg"
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
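To see which files a pattern like this would select, it can be tried against a few candidate paths with the re module. This sketch assumes the reader matches the pattern against each file's path relative to corpus_root, with forward slashes, and requires the whole path to match. That is my understanding of the behaviour, not something shown above, and the candidate paths are hypothetical.

```python
import re

file_pattern = r".*/wsj_.*\.mrg"

# Hypothetical paths relative to corpus_root, written with forward slashes.
candidates = ["00/wsj_0001.mrg", "23/wsj_2300.mrg", "00/readme.txt", "wsj_0001.mrg"]

# Assumption: the pattern must match the whole relative path.
selected = [p for p in candidates if re.fullmatch(file_pattern, p)]
print(selected)
```

Note that a file sitting directly in corpus_root is not selected, because the pattern requires a slash before wsj_.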
A frequency distribution counts observed events, such as the words that appear in a text. A conditional frequency distribution keeps a separate count for each condition, so the text has to be processed into a sequence of pairs of the form (condition, event).
For example, processing the whole Brown corpus by genre gives 15 conditions (each genre is a condition) and 1,161,192 events (each word is an event).
from nltk.corpus import brown

# The outer loop, for genre in brown.categories(), takes one genre (condition) at a time
# The inner loop, for word in brown.words(categories=genre), takes one word (event) at a time
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
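The idea behind the (condition, event) pairs can be sketched without the corpus, using only the standard library. The code below is a plain-Python stand-in for nltk.ConditionalFreqDist, not its actual implementation; the conditional_freq helper is hypothetical and the pairs are toy data.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Count events separately for each condition."""
    cfd = defaultdict(Counter)
    for condition, event in pairs:
        cfd[condition][event] += 1
    return cfd


# Toy (genre, word) pairs, standing in for the pairs built from the Brown corpus.
pairs = [("news", "the"), ("news", "jury"), ("news", "the"),
         ("romance", "the"), ("romance", "could")]

cfd_toy = conditional_freq(pairs)
print(sorted(cfd_toy))          # the conditions
print(cfd_toy["news"]["the"])   # count of 'the' under the 'news' condition
```

Each condition maps to its own counter, which is what makes it possible to compare how often the same word occurs in different genres.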
In the next example the conditions are the words america and citizen, and the plotted counts are the number of times each word appears in a particular speech. The code uses the first four characters of each speech's file name (the year, as in 1865-Lincoln.txt) as the event. It generates a pair ('america', '1865') for every word in the file 1865-Lincoln.txt whose lowercased form begins with america.
# Plot the conditional frequency distribution
from nltk.corpus import inaugural

# fileid[:4], the year, is the event
# target, one of the keywords ['america', 'citizen'], is the condition
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
- string.endswith(str1): tests whether the string ends with str1
- string.startswith(str2): tests whether the string starts with str2
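The pair generation can be traced on a small scale. In this sketch the speeches dictionary is made-up data that only mimics the file-name convention of the inaugural corpus (a four-digit year followed by the president's name); the tokens are not taken from the real speeches.

```python
# Made-up tokens keyed by file name, standing in for the inaugural corpus.
speeches = {
    "1865-Lincoln.txt": ["American", "citizens", "and", "America"],
    "2009-Obama.txt": ["My", "fellow", "citizens", "of", "America"],
}
targets = ["america", "citizen"]

pairs = [
    (target, fileid[:4])            # (condition, event) = (keyword, year)
    for fileid, words in speeches.items()
    for w in words
    for target in targets
    if w.lower().startswith(target)
]
print(pairs)

# endswith works the same way, testing the end of the string instead.
print("citizens".endswith("s"))
```

Lowercasing before the startswith test is what lets American, America, and america all count toward the same condition.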
The Words corpus is the /usr/share/dict/words file from Unix, which is used by some spell checkers.
Besides this corpus, there is also a corpus of stop words, which is used frequently in NLP. High-frequency words with little lexical content, such as "the" and "to", often get in the way of an algorithm, so they are usually filtered out with a stop word list before the algorithm is applied.
# Load the stop words corpus
from nltk.corpus import stopwords
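As an illustration of the filtering step, the sketch below computes the fraction of tokens that survive stop word removal. The helper name content_fraction is hypothetical, and the small hand-picked set stands in for the much longer list returned by stopwords.words('english').

```python
def content_fraction(words, stop_words):
    """Fraction of tokens that are not in the stop word list."""
    if not words:
        return 0.0
    content = [w for w in words if w.lower() not in stop_words]
    return len(content) / len(words)


# Small hand-picked set, standing in for stopwords.words('english').
toy_stopwords = {"the", "of", "to", "a", "and", "in"}
tokens = "The cat sat in the corner of a room".split()
print(content_fraction(tokens, toy_stopwords))
```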
A slightly richer kind of dictionary resource is a table (or spreadsheet) that contains a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary, available through nltk.corpus.cmudict.entries(), which was designed for use by speech synthesizers.
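Each row of such a table can be handled as a (word, properties) tuple. The sketch below uses a few hand-written rows in the (word, list of phones) shape that I believe cmudict.entries() returns; the pronunciations helper is hypothetical, and the real dictionary is far larger.

```python
# Hand-written sample rows in the (word, pronunciation) shape;
# a word may appear more than once if it has several pronunciations.
entries = [
    ("fir", ["F", "ER1"]),
    ("fire", ["F", "AY1", "ER0"]),
    ("fire", ["F", "AY1", "R"]),
]


def pronunciations(entries, word):
    """Collect every listed pronunciation for a word."""
    return [phones for w, phones in entries if w == word]


print(pronunciations(entries, "fire"))
```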
Another example of a tabular dictionary is a comparative wordlist. NLTK includes the so-called Swadesh wordlists, lists of about 200 common words in several languages. The languages are identified by their two-letter ISO 639 codes.
You can use from nltk.corpus import swadesh to load.
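A comparative wordlist can be turned into a simple translation dictionary. The pairs below are hand-written samples in the shape that swadesh.entries(['fr', 'en']) returns (one tuple per row, one word per language); the real list has about 200 rows.

```python
# Hand-written (French, English) sample pairs, standing in for
# swadesh.entries(['fr', 'en']).
fr2en = [("chien", "dog"), ("nuit", "night"), ("eau", "water")]

# A dict built from the pairs maps each French word to its English cognate row.
translate = dict(fr2en)
print(translate["chien"])

# Swapping each pair gives the reverse direction.
en2fr = {en: fr for fr, en in fr2en}
print(en2fr["water"])
```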
WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets (synsets). A natural starting point is to look up synonyms and see how they are accessed in WordNet.
It is currently one of the best-known lexical knowledge bases.
Bibliography: Natural Language Processing with Python (Bird, Klein, and Loper)
Reference blog: https://www.jianshu.com/p/7401d220a095