NLTK – Part 3

Ch. 7 Extracting Information from Text

Information Extraction

structured data – a regular and predictable organization of entities and relationships

ex: looking at a relation between companies and locations, if we were given a location, we would like to discover which companies do business in that location

Screen Shot 2015-06-12 at 12.38.47 PM

above is example code that demonstrates how the relations work. the form (element1, relation, element2) is key for locating key words as seen in line 6 of the code. query is assigned to the for-loop used to look for companies in Atlanta.

named entity detection – search for mentions of potentially interesting entities in each sentence

relation detection – search for likely relations between different entities in the text



chunking – the basic technique used for entity detection. segments and labels multi-token sequences. chunking usually selects a subset of tokens and the pieces produced by a “chunker” do not overlap in source text

chunk – the larger boxes (ex: NP)


noun phrase chunking – also known as NP-chunking; searching for chunks corresponding to individual noun phrases. it is useful for getting POS tag information

Screen Shot 2015-06-12 at 12.48.23 PM

above is an example of noun phrase chunking where the brackets indicate the noun chunks

chunk grammar – rules that indicate how sentences should be chunked (rules are specific to cases, there is not a general set.)

ex: single regular-expression rule – an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).


above is an example of a simple regular expression base on the NP chunker. it classifies every word’s POS.

Tag Patterns

tag patterns – a sequence of POS tags delimited using angle brackets. (ex:?*. they are similar to regular expressions. the example tag pattern will chunk any sequence of tokens starting with an optional determiner, followed by zero or more adjectives, followed by zero or more nouns.

if a tag pattern matches at overlapping locations then the leftmost match takes precedence

Screen Shot 2015-06-15 at 10.36.11 AM

nltk.RegexParser uses a set of regular expression patterns to specify the behavior of the parser

in the above example, we are assigning a list of 3 tuples to the variable noun. the grammar variable shows how we are taking leftmost precedence of the nouns. when printed, all three will show, but the ones in the statement NP money/NN market/NN) are the two that are consecutive since they are the first two out of three.


chink – a sequence of tokens that is not included in a chunk.

Screen Shot 2015-06-15 at 10.50.05 AM

in the above example, the output will not have brackets around anything that we are chinking.

Screen Shot 2015-06-15 at 10.53.07 AM

the first statement shows you how to chunk everything and the second statement shows you have to chink specific words/phrases.

*remember chinking: “} – {” vs. chunking: “{ – }”

IOB tags – tag a chunk structure after parsing with either I (inside) if the token is inside a chunk, O (outside) if the token is outside a chunk, or B (begin) if the token marks the beginning of a chunk.

the chunk_types argument selects a specific type of argument in a text


Screen Shot 2015-06-15 at 11.18.00 AM

in the above example, all the nouns were not chinked while everything else was. (the nouns are still in parentheses)

Evaluation Chunkers

we can evaluate chunkers and how precise they are by using the cp.evaluateargument.

Screen Shot 2015-06-15 at 11.22.44 AM

Recursion in Linguistic Structure

most of the chunk structures we have worked with have been relatively flat, however, it is possible to build chunk structures with more depth by creating chunk grammar containing recursive rules.

Screen Shot 2015-06-15 at 11.43.09 AM

above is an example of a chunk structure with more depth. the more rules, the more depth. keep in mind that there is a shortcoming in this code — the word “saw” is missing the solution to this and other patterns like this is to get the chunker to loop over its own patterns/repeat the process.


tree – a set of connected labeled nodes, each reachable by a unique path from a distinguished root node.

its best to use a recursive function to traverse through a tree


Named Entity Recognition

Named Entities – definite noun phrases that refer to a specific type of individual

ex:  Screen Shot 2015-06-16 at 10.43.17 AM

Named Entity Recognition – a system whose goal is to recognize any mention of any named entity. it is useful for identifying relations in Information Extraction, and Question Answering by improving the precision of Information Retrieval.

gazetteer – a geographical dictionary that is useful for identifying named entities although is very error prone since it sometimes finds incorrect locations on words that aren’t meant to be located and some terms are ambiguous .

ex: ../images/locations.png

incorrect locations: Sanchez in the Dominican Republic, On in Vietnam, Of in Turkey, etc.

nltk.ne_chunk() is a classifier that NTLK has already provided to be able to recognize named entities. (if we set the parameter binary=True then the named entities will be tagged as NE, or else the classfier will label such as PERSON, ORGANIZATION, and GPE.

Screen Shot 2015-06-16 at 11.00.43 AM

look at the difference in the above example when you have binary=True in the parameters versus when you don’t; some things are classified differently.

Relation Extraction

after we identify named entities between text, we want to extract the relations that exist between them. normally this is in the form (X, a, Y) where ‘X’ and ‘Y’ are named entities like an ORGANIZATION, or LOCATION, and ‘a’ is the relation term that we are looking for.

Ch. 8 Analyzing Sentence Structure

generative grammar is a language that is made up of a collection of grammatical sentences.

ex:  a. I went shopping at the mall.

b. I got a new shirt.

c. My mom said I went shopping at the mall and I got a new shirt.

sentence c is made up of both sentence a and b; an example of how generative grammar works.

this chapter talks about grammar and parsing, patterns of form in a sequence of words and how to appropriately respond to phrases without being mislead by the sequence of words (an ambiguous sentence).

Screenshot from 2015-06-29 14:26:27

in the above picture code is shown to represent the two trees that show how the sentence “I shot an elephant in my pajamas” are analyzed. one way with the elephant in the pajamas and the other with the person in their pajamas shooting the elephant.

coordinate structure – two phrases are joined by a coordinating conjunction such as and, but or or

** If A and B are both phrases of grammatical category X, then A and B is also a phrase of category X. Meaning that we can combine two of the same types of phrases (noun phrases, adjective phrases, etc.) and have the sentence still make sense. However we cannot combine two different types of phrases in a sentence.

constituent structure – the observation that words combine with other words to form units aka taking a well-formed long sentence and replacing it with a shorter one without altering the meaning of the sentence


Screenshot from 2015-06-29 10:59:27

above in the green box we define the word bank(also known as a grammar production) we can use in our sentences and in the blue box we create our sentence and use a for loop to go through each word and assign what grammar type goes with each word as defined in the green box.

Recursion in Syntactic Structure

recursive grammar – when a category on the left hand side of a production is also on the right hand side of a production



there are nested sentences within this tree with “chatterer said”, then “Buster thought”, and lastly “the tree was tall”. they have the same sentence structure within one another.

Parsing with Context Free Grammar

parser – processes input sentences according to the productions of a grammar. there are two type of parsing algorithms — recursive descent parsing and shift-reduce parsing.

Recursive Descent Parsing – breaking a high-level goal into several lower-level subgoals. It builds a parse tree with the initial goal to be the root node. (ie. if your goal is to find S, the S -> NP VP production replaces the goal with two subgoals — one to find NP and one to find VP. in this case, S would be the root node and NP and VP would be the children).


NLTK provides a recursive descent parser by using nltk.RecursiveDescentParser

here is an example of how it can be used:

Screenshot from 2015-06-30 11:11:36

Shift-Reduce Parser – tries to find a sequence of words and phrases that correspond to the RIGHT HAND side of a grammar production, and replace them will the left hand side until the sentence is reduced to an S.


NLTK also provides a shift reducing parser by using nltk.ShiftReduceParser.

here is an example of how it is used:

Screenshot from 2015-06-30 11:23:53

left-corner parser- a hybrid between the bottom-up and top-down parsing approaches we have seen. it is a top-down parser recursive productions (unlike a recursive descending parser).

well-formed substring table is a table that neatly displays the grammar type of each word in a table.


Screenshot from 2015-06-30 11:33:10

in this example we create out sentence and split it, and then display the tokens in a nice neat table, displayed in the same order the sentence is in. so “I” is a noun phrase (NP), “shot” is a verb (V), “an” is a determiner (Det), and so on.

Dependencies and Dependency Grammar

dependency grammar focuses on how words relate to other words. it is a binary asymmetric relation between a head and its dependents

here is an example of a dependency graph:../images/depgraph0.png

a dependency graph is said to be projective if all the words are written in linear order and the arrows can be drawn above the words without crossing — a word and its descendants form a contiguous sequence of words within the sentence.

Ch. 9 Building Feature Based Grammars

in this chapter, we are going to learn how to declare the features of words and phrases. here is an example:

Screenshot from 2015-06-30 17:28:05

the features are to the left of the colons and the values are to the right. (CAT is the grammatical category, ORTH is the spelling, REF is the referent, and REL is the relation).

feature structures – the pairing of features and values. they contain various kinds of information about grammatical entities for example, what “semantic role” is played by a POS. (is it a subject or object?)

Screenshot from 2015-06-30 18:14:43

in the example a. the dog runs and b. the dog run, we can see the verb co-vary with the syntactic properties of the subject noun phrase (also called an agreement)

below is an example of how syntactic agreement works — grammatical word sequences. there is a singular and a plural version

Screen Shot 2015-07-06 at 9.47.16 AM


Screenshot from 2015-06-30 18:27:13

in ex (5), there is only one subject, noun phrase, verb phrase, etc. (singular) however in ex (6), using the same template, there is both a singular version and a plural version as well. every production in (5) has two counterparts in (6) this is helpful with a large set of grammar that covers a good section of the english language. (this production can be inefficient in that it comes at a cost of “blowing up” the number of productions very quickly)

Using Attributes and Constraints

linguistic categories have properties (ex: a noun has a property of being plural). an example of this notation for this is n[NUM=pl]

this is helpful in that we can allow variables to replace feature values (like templates). here is an example:

Screen Shot 2015-07-06 at 10.04.22 AM

here we allow the “?” to be a variable that can be either singular (sg) or plural (pl); that way we only have to write one production and not two!

Screen Shot 2015-07-07 at 9.34.17 AM

Screen Shot 2015-07-07 at 9.34.04 AM

think of these productions like trees — 10(a) and 11(a) can be combined in a production like 10(b) and 11(b) as well because the values of Det and N must be the same. however 10(a) and 11(b) cannot be combined.