2, NLTK corpus

Key points of this chapter:

1. Access to text corpus

import nltk
from nltk.book import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

1.1 Gutenberg corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which hosts about 25,000 (now 36,000) free e-books.

from nltk.corpus import gutenberg
gutenberg.fileids()
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
emma = gutenberg.words('austen-emma.txt')
len(emma)
192427

Text retrieval: concordance()

emma = nltk.Text(emma)
emma.concordance("surprize")
Displaying 25 of 25 matches:
her father , was sometimes taken by surprize at his being still able to pity ` p
them do the other any good ." " You surprize me ! Emma must do Harriet good : an
 Knightley actually looked red with surprize and displeasure , as he stood up , 
Mr . Elton , and found to his great surprize , that Mr . Elton was actually on h
nd aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great , 
 father was quite taken up with the surprize of so sudden a journey , and his fe
cy , in all the favouring warmth of surprize and conjecture . She was , moreover
she appeared , to have her share of surprize , introduction , and pleasure . The
eir plans ; and it was an agreeable surprize to her , therefore , to perceive th
 talking aunt had taken me quite by surprize , it must have been the death of me
of all the dialogue which ensued of surprize , and inquiry , and congratulations
e the present . They might chuse to surprize her ." Mrs . Cole had many to agree
 the mode of it , the mystery , the surprize , is more like a young woman ' s sc
t to her song took her agreeably by surprize -- a second , slightly but correctl
." " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; a
nt to be considered . Emma ' s only surprize was that Jane Fairfax should accept
 of your admiration may take you by surprize some day or other ." Mr . Knightley
ration for her will ever take me by surprize .-- I never had a thought of her in
h expected by the best judges , for surprize -- but there was great joy . Mr . W
e sound of at first , without great surprize . " So unreasonably early !" she wa
ed Frank Churchill , with a look of surprize and displeasure .-- " That is easy 
 ; and Emma could imagine with what surprize and mortification she must be retur
ttled that Jane should go . Quite a surprize to me ! I had not the least idea !-
d . It is impossible to express our surprize . He came to speak to his father on
ng engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclaim
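To make clearer what a concordance view shows, here is a minimal keyword-in-context sketch over a plain token list, using only the standard library. The function name kwic, the window parameter, and the sample sentence are illustrative assumptions; NLTK's own concordance() is implemented differently (for example, it trims context by character width rather than by token count).

```python
def kwic(tokens, keyword, window=4):
    """Return one (left, match, right) triple per case-insensitive hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample tokens, not taken from the corpus.
tokens = "It was a great surprize to her and no surprize to him".split()
for left, match, right in kwic(tokens, "surprize"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20} {match} {right}")
```

Aligning the keyword in a fixed column is what makes recurring neighbours easy to spot when scanning the matches.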

1.2 Internet and chat text

NLTK's collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

from nltk.corpus import webtext

There is also a corpus of instant messaging chat sessions. Each file contains several hundred posts collected on a given date from an age-specific chat room (teens, 20s, 30s, 40s, plus a generic adults chat room). The file name encodes the date, the chat room, and the number of posts. For example, 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on October 19, 2006.

from nltk.corpus import nps_chat
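Because the file name carries the date, room, and post count, those fields can be recovered with a regular expression. The sketch below assumes names of the form 10-19-20s_706posts.xml; the pattern NAME_RE and the helper parse_chat_fileid are hypothetical names, not part of NLTK.

```python
import re

# Assumed layout: <month>-<day>-<room>_<count>posts.xml
NAME_RE = re.compile(
    r"(?P<month>\d{2})-(?P<day>\d{2})-(?P<room>\w+?)_(?P<count>\d+)posts\.xml"
)

def parse_chat_fileid(fileid):
    """Split a chat file name into (month, day, room, post count)."""
    m = NAME_RE.fullmatch(fileid)
    if m is None:
        raise ValueError(f"unexpected file name: {fileid}")
    return int(m["month"]), int(m["day"]), m["room"], int(m["count"])

print(parse_chat_fileid("10-19-20s_706posts.xml"))  # (10, 19, '20s', 706)
```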

1.3 Brown corpus

The Brown Corpus was the first million-word electronic corpus of English, created at Brown University in 1961. It contains text from 500 sources, categorized by genre, such as news and editorial.

from nltk.corpus import brown

1.4 Reuters corpus

The Reuters corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and split into two sets, "training" and "test". A document with a fileid such as "test/14826" therefore belongs to the test set.

from nltk.corpus import reuters
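Since the set is encoded as the part of the fileid before the slash, the split can be recovered from the fileid strings alone. A minimal sketch, assuming fileids shaped like "test/14826"; the list below is illustrative (only "test/14826" comes from the text) and split_by_group is a hypothetical helper.

```python
# Illustrative file ids; only "test/14826" appears in the text above.
fileids = ["test/14826", "test/14828", "training/9865", "training/9880"]

def split_by_group(fileids):
    """Group document ids by the prefix before the slash."""
    groups = {}
    for fileid in fileids:
        group, _, doc_id = fileid.partition("/")
        groups.setdefault(group, []).append(doc_id)
    return groups

groups = split_by_group(fileids)
print(groups["test"])      # ['14826', '14828']
print(groups["training"])  # ['9865', '9880']
```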

1.5 Inaugural address corpus

This corpus is a collection of 55 texts, each of which is a presidential inaugural address.

from nltk.corpus import inaugural

1.6 Annotated text corpora

Many text corpora contain linguistic annotations, such as part-of-speech tags, named entities, syntactic structures, and semantic roles. NLTK provides convenient ways to access several of these corpora, and offers data packages containing corpora and corpus samples that can be downloaded free of charge for teaching and research.

1.7 Loading your own corpus

If you have your own collection of text files and want to access them with the methods discussed above, you can load them with NLTK's PlaintextCorpusReader.

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'  # assume the corpus files live in this directory
# PlaintextCorpusReader takes the corpus root directory and a regular expression
# selecting the files to include; '.*' matches every file
wordlists = PlaintextCorpusReader(corpus_root, '.*')

Loading a parsed corpus with BracketParseCorpusReader

from nltk.corpus import BracketParseCorpusReader
# Assume the corpus is stored in this directory
corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
# The file pattern selects the matching files inside the root's subfolders
file_pattern = r".*/wsj_.*\.mrg"
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
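To see which files a pattern like this would select, it can be tried against a few sample paths with the re module. This sketch assumes the pattern is tested against each file's path relative to the corpus root, written with forward slashes, and that it must match the whole path; the candidate paths are hypothetical.

```python
import re

file_pattern = r".*/wsj_.*\.mrg"

# Hypothetical relative paths below a corpus root.
candidates = ["00/wsj_0001.mrg", "00/wsj_0002.mrg", "00/README", "wsj_0003.mrg"]

# Keep only the paths that the whole pattern matches.
matched = [path for path in candidates if re.fullmatch(file_pattern, path)]
print(matched)  # ['00/wsj_0001.mrg', '00/wsj_0002.mrg']
```

Note that wsj_0003.mrg is rejected because the pattern requires a subfolder component before the file name.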

2. Conditional frequency distribution

2.1 Conditions and events

A frequency distribution counts observable events, such as the words that appear in a text. A conditional frequency distribution additionally pairs each event with a condition, so the text has to be processed as a sequence of pairs: (condition, event).

For example, processing the whole Brown Corpus by genre gives 15 conditions (one per genre) and 1,161,192 events (one per word).

from nltk.corpus import brown
# The generator's outer loop, for genre in brown.categories(), yields one genre (condition) at a time
# Its inner loop, for word in brown.words(categories=genre), yields one word (event) at a time
cfd = nltk.ConditionalFreqDist((genre, word)
                               for genre in brown.categories()
                               for word in brown.words(categories=genre))
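The counting behind a conditional frequency distribution can be sketched with the standard library: one counter of events per condition. This is a stand-in for illustration, not NLTK's implementation, and toy_corpus is a hypothetical miniature corpus.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "corpus": genre -> list of words.
toy_corpus = {
    "news": ["the", "jury", "said", "the", "election"],
    "romance": ["she", "said", "the", "word"],
}

# Same shape as the generator above: one (condition, event) pair per word.
pairs = [(genre, word) for genre, words in toy_corpus.items() for word in words]

# One Counter of events per condition.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

print(len(pairs))           # 9 events in total
print(sorted(cfd))          # ['news', 'romance'], the conditions
print(cfd["news"]["the"])   # 2
```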

2.2 Plotting and tabulating distributions

The conditions are the words america and citizen, and the plotted counts are the number of times each word occurs in a given speech. The code relies on the fact that each speech's file name, for example 1865-Lincoln.txt, starts with the four-digit year, which serves as the event. It therefore generates the pair ('america', '1865') for every word in 1865-Lincoln.txt whose lowercased form starts with america, such as Americans.

# Plot the distribution (cfd.tabulate() prints the same counts as a table)
from nltk.corpus import inaugural

# fileid[:4], the year, is the event
# target, one of the keywords ['america', 'citizen'], is the condition
cfd = nltk.ConditionalFreqDist((target, fileid[:4])
                              for fileid in inaugural.fileids()
                              for w in inaugural.words(fileid)
                              for target in ['america', 'citizen']
                              if w.lower().startswith(target))
cfd.plot()

  • str.endswith(str1) tests whether the string ends with str1
  • str.startswith(str2) tests whether the string starts with str2
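The pair-generating expression can be tried without the corpus by substituting a small dictionary for inaugural.words(fileid). In this sketch toy_speeches and its token lists are hypothetical; only the comprehension mirrors the code above.

```python
# Hypothetical stand-in for inaugural.words(fileid): file id -> tokens.
toy_speeches = {
    "1865-Lincoln.txt": ["Americans", "and", "citizens", "of", "America"],
    "2009-Obama.txt": ["My", "fellow", "citizens"],
}

pairs = [
    (target, fileid[:4])                  # (condition, event)
    for fileid, words in toy_speeches.items()
    for w in words
    for target in ["america", "citizen"]
    if w.lower().startswith(target)       # prefix match on the lowercased word
]
print(pairs)

# endswith works the same way at the other end of the string.
print("1865-Lincoln.txt".endswith(".txt"))  # True
```

Lowercasing before the prefix test is what lets Americans and America both count toward the america condition.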

3. Dictionary resources

3.1 Wordlist corpora

The Words corpus is the /usr/dict/words file from Unix, which is used by some spell checkers.
Besides this wordlist, there is also a stopwords corpus, which is used frequently in NLP. High-frequency words that carry little meaning on their own, such as "the" in English or "de" in Chinese, often interfere with an algorithm's judgment, so they are usually filtered out with a stopword list before the algorithm is applied.

# Load the stopwords corpus
from nltk.corpus import stopwords
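Stopword filtering itself is a simple membership test. The sketch below uses a hypothetical hand-written stop list so that it runs without any corpus data; in practice the list would come from a stopwords corpus. The helper content_fraction is an illustrative name.

```python
# Hypothetical mini stop list, standing in for a real stopwords corpus.
stop_words = {"the", "of", "and", "a", "to", "in"}

def content_fraction(tokens, stop_words):
    """Return the non-stopword tokens and their share of all tokens."""
    content = [t for t in tokens if t.lower() not in stop_words]
    return content, len(content) / len(tokens)

tokens = "The cat sat in the shade of a tree".split()
content, fraction = content_fraction(tokens, stop_words)
print(content)             # ['cat', 'sat', 'shade', 'tree']
print(round(fraction, 2))  # 0.44
```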

3.2 Pronouncing dictionary

A slightly richer kind of lexical resource is a table (or spreadsheet) that holds one word per row together with some of its properties. NLTK includes the CMU Pronouncing Dictionary, available through nltk.corpus.cmudict.entries(), which was designed for use by speech synthesizers.
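Such a table can be turned into a lookup structure by grouping rows by word. The sketch assumes each entry is a (word, list of phoneme codes) pair; the three entries are hand-typed for illustration, so treat the exact phoneme codes as assumptions rather than authoritative data.

```python
from collections import defaultdict

# Hand-typed sample rows in the assumed (word, phoneme list) shape.
entries = [
    ("fir", ["F", "ER1"]),
    ("fire", ["F", "AY1", "ER0"]),
    ("fire", ["F", "AY1", "R"]),
]

# Group pronunciations by word; a word may have more than one.
prons = defaultdict(list)
for word, phonemes in entries:
    prons[word].append(phonemes)

print(len(prons["fire"]))  # 2
print(prons["fir"])        # [['F', 'ER1']]
```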

3.3 Comparative wordlists

Another example of a tabular lexicon is the comparative wordlist. NLTK includes the so-called Swadesh wordlists, lists of about 200 common words in several languages. The languages are identified by two-letter ISO 639 codes.
You can use from nltk.corpus import swadesh to load.
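A comparative wordlist pairs up the same concept across languages, so two columns of it can act as a small translation dictionary. The pairs below are hypothetical French-English rows typed in for illustration, not read from the corpus.

```python
# Hypothetical (French, English) rows from a comparative wordlist.
fr_en_pairs = [("je", "I"), ("nous", "we"), ("chien", "dog")]

# A dict built from the pairs maps each French word to its English cognate row.
translate = dict(fr_en_pairs)
print(translate["chien"])  # dog

# Swapping each pair gives the reverse direction.
reverse = {en: fr for fr, en in fr_en_pairs}
print(reverse["we"])  # nous
```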

4. WordNet

WordNet is a semantically oriented English dictionary, similar to a traditional dictionary but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets (synsets). A natural starting point is looking up synonyms and seeing how they are accessed in WordNet.
It is currently the best-known lexical knowledge base.

Bibliography: python natural language processing
Reference blog: https://www.jianshu.com/p/7401d220a095



Posted on Sat, 07 Mar 2020 04:34:06 -0800 by ekosoftco