The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of
text processing libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
NLTK is free, open source, and community driven. It is available for Windows, Linux, and Mac OS X. A hands-on guide that introduces
programming fundamentals alongside topics in computational linguistics, together with its API documentation, makes the toolkit a good companion
for engineers, students, linguists, educators, researchers, and industry users.
NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”
Natural Language Processing with Python is a practical guide to programming for language processing. It takes the reader through the
fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.
NLTK provides basic data classes for representing context free grammars. A ‘grammar’ specifies which trees can represent the structure of a
given text. Each of these trees is called a ‘parse tree’ for the text, or simply a ‘parse’.
In a ‘context free’ grammar, the set of parse trees for any piece of a text depends only on that piece,
and not on the rest of the text (its context). Context free grammars are often used to find possible syntactic structures for sentences.
In that setting, the leaves of a parse tree are word tokens, and the node values are phrasal categories such as NP and VP.
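As a small illustration of leaves versus node values, the sketch below builds one parse tree by hand with NLTK's Tree class. The sentence and its bracketing are made up for the example and do not come from any corpus.

```python
from nltk import Tree

# A hand-written parse tree for an illustrative sentence ("the dog barks").
parse = Tree.fromstring("(S (NP (Det the) (N dog)) (VP (V barks)))")

# The leaves of the parse tree are the word tokens of the text.
leaves = parse.leaves()

# The node values are categories such as S, NP and VP.
labels = [subtree.label() for subtree in parse.subtrees()]

print(leaves)
print(labels)
```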
Context free grammars are encoded by the CFG class. Each CFG consists of a start symbol and a set of productions. The start symbol specifies the
root node value for parse trees. For example, the start symbol for syntactic parsing is usually S. Start symbols are encoded using the Nonterminal
class.
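A minimal sketch of a CFG with its start symbol and productions might look like the following. The toy grammar and the name toy_grammar are assumptions made for illustration; CFG.fromstring takes the left hand side of the first rule as the start symbol.

```python
from nltk import CFG, ChartParser
from nltk.grammar import Nonterminal

# Hypothetical toy grammar; terminals (words) are written in quotes.
toy_grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V
    Det -> 'the'
    N -> 'dog'
    V -> 'barks'
""")

# The start symbol is a Nonterminal object, not a plain string.
start = toy_grammar.start()

# Every parse tree licensed by the grammar is rooted in the start symbol.
parses = list(ChartParser(toy_grammar).parse(["the", "dog", "barks"]))

print(start, len(toy_grammar.productions()))
```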
A grammar’s ‘productions’ specify which parent-child relationships a parse tree can contain. Each production specifies that a particular node
can be the parent of a particular set of children. For example, the production S -> NP VP specifies that an S node can be the parent of an NP node and a VP node.
Grammar productions are implemented by the Production class. Each production has a left hand side and a right hand side. The left hand side
is a Nonterminal that specifies the node type for a potential parent; the right hand side is a list that specifies the
allowable children for that parent. This list consists of nonterminals and text types: each nonterminal indicates that the corresponding child may
be a tree token with the specified node type, and each text type indicates that the corresponding child may be a token with that type.
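The two sides of a production can be inspected directly. The sketch below builds two Production objects by hand; the variable names and the example word are illustrative only.

```python
from nltk.grammar import Nonterminal, Production

S, NP, VP, V = (Nonterminal(name) for name in ("S", "NP", "VP", "V"))

# Left hand side: the parent node type. Right hand side: its allowed children.
sentence_rule = Production(S, [NP, VP])

# A production whose right hand side contains a word (a terminal).
lexical_rule = Production(V, ["barks"])

print(sentence_rule.lhs(), sentence_rule.rhs())
print(lexical_rule.lhs(), lexical_rule.rhs())
```

Here rhs() returns the children as a tuple, mixing Nonterminal objects and plain strings as needed.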
The Nonterminal class is used to distinguish node values from leaf values. This prevents a grammar from accidentally using a leaf value
(such as the word ‘A’) as the node of a subtree. Within a CFG, all node values are wrapped in the Nonterminal class. Note, however, that the trees
specified by the grammar do not include these Nonterminal wrappers.
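The difference between a wrapped node value and a leaf value can be seen in a short sketch. The grammar below is a made-up example in which the word 'A' appears as a terminal; the names category_a and word_a are hypothetical.

```python
from nltk import CFG, ChartParser
from nltk.grammar import Nonterminal, is_nonterminal

# A Nonterminal wrapping the symbol "A" is not the same thing as the word "A".
category_a = Nonterminal("A")
word_a = "A"

# Hypothetical toy grammar that uses the word 'A' as a leaf value.
grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> 'barks'
    Det -> 'A'
    N -> 'dog'
""")

tree = next(ChartParser(grammar).parse(["A", "dog", "barks"]))

# Inside the grammar the start symbol is wrapped; in the resulting tree the
# node label is the plain value without the wrapper.
print(type(grammar.start()).__name__, type(tree.label()).__name__)
```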
Grammars can also be given a more procedural interpretation. Under this interpretation, a grammar specifies any tree structure that can be
produced by the following procedure:
1. Set tree to the start symbol.
2. Choose a production prod whose left hand side lhs is a nonterminal leaf of tree.
3. Replace the nonterminal leaf with a subtree whose node value is the value wrapped by the nonterminal lhs and whose children are the right hand side of
prod. Repeat steps 2 and 3 until tree contains no more nonterminal leaves.
The operation of replacing the left hand side (lhs) of a production with the right hand side (rhs) in a tree is known as “expanding” lhs to rhs in tree.
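The expansion procedure can be sketched as a small recursive function. This is an illustration rather than NLTK's own implementation: the helper name expand, the toy grammar, and the random choice among competing productions are all assumptions, and the grammar is deliberately non-recursive so that expansion always terminates.

```python
import random

from nltk import CFG, Tree
from nltk.grammar import Nonterminal


def expand(grammar, symbol, rng):
    """Expand one nonterminal into a subtree by applying a production."""
    # Choose a production whose left hand side is the nonterminal to expand.
    prod = rng.choice(grammar.productions(lhs=symbol))
    children = []
    for item in prod.rhs():
        if isinstance(item, Nonterminal):
            # A nonterminal child is expanded in turn until only words remain.
            children.append(expand(grammar, item, rng))
        else:
            # A terminal (a word) becomes a leaf as it is.
            children.append(item)
    # The node value is the plain symbol wrapped by the Nonterminal.
    return Tree(symbol.symbol(), children)


# Hypothetical non-recursive toy grammar.
demo_grammar = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V
    Det -> 'the' | 'a'
    N -> 'dog' | 'cat'
    V -> 'barks' | 'sleeps'
""")

generated = expand(demo_grammar, demo_grammar.start(), random.Random(0))
print(generated)
```

The recursion expands each nonterminal leaf as soon as it is created, which yields the same set of trees as repeatedly picking any nonterminal leaf from the whole tree.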
NLTK also provides a collection of methods for tree (grammar) transformations used in parsing natural language. Although many of these methods are
technically grammar transformations (e.g. Chomsky Normal Form), when working with treebanks it is much more natural to visualize these modifications
in a tree structure. Hence, all transformations are applied directly to the tree itself. Transforming the tree directly also allows parent annotation.
A grammar can then simply be induced from the modified tree.
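Inducing a grammar from trees can be sketched as follows. The two-tree "treebank" is invented for the example, and induce_pcfg is used here as one way to turn the collected productions into a (probabilistic) grammar; a real treebank would supply far more trees.

```python
from nltk import Tree
from nltk.grammar import Nonterminal, induce_pcfg

# Hypothetical miniature treebank.
treebank = [
    Tree.fromstring("(S (NP (Det the) (N dog)) (VP (V barks)))"),
    Tree.fromstring("(S (NP (Det the) (N cat)) (VP (V sleeps)))"),
]

# Every subtree contributes one production: its label is the left hand side
# and its children are the right hand side.
productions = [prod for tree in treebank for prod in tree.productions()]

# Rule probabilities are estimated from how often each production occurs.
pcfg = induce_pcfg(Nonterminal("S"), productions)

for rule in pcfg.productions(lhs=Nonterminal("N")):
    print(rule)
```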
The following is a short tutorial on the available transformations.
Chomsky Normal Form (Binarization) :
Any grammar has an equivalent grammar in Chomsky Normal Form (CNF), where CNF is defined by every production having either two nonterminals or one terminal on its right
hand side. When the data is hierarchically structured, as in a treebank, it is natural to view it in terms of productions: the root of every subtree is
the head (left hand side) of the production, and all of its children are the right hand side constituents.
To convert a tree into CNF, one needs to ensure that every subtree has either two subtrees as children (binarization) or one leaf node. To binarize a
subtree with more than two children, artificial nodes must be introduced.
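Binarization can be tried out on a single flat subtree. The sketch below uses the chomsky_normal_form and un_chomsky_normal_form methods of NLTK's Tree class; the noun phrase is a made-up example, and right factoring is chosen only for illustration.

```python
from nltk import Tree

# Hypothetical flat noun phrase with four children.
flat = Tree.fromstring("(NP (Det the) (Adj big) (Adj red) (N dog))")

# chomsky_normal_form() modifies the tree in place, so work on a deep copy.
binarized = flat.copy(deep=True)
binarized.chomsky_normal_form(factor="right")

# After binarization no subtree has more than two children.
max_children = max(len(subtree) for subtree in binarized.subtrees())

# The artificial nodes can be removed again to recover the original tree.
restored = binarized.copy(deep=True)
restored.un_chomsky_normal_form()

print(binarized)
print(max_children)
```

The artificial nodes are recognizable by the "|" character in their labels, which is how they can be stripped out again afterwards.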
Parent Annotation :
Apart from binarizing the tree, there are two standard modifications to node labels that can be made in the same traversal: parent annotation and Markov order-N smoothing
(or sibling smoothing).
The purpose of parent annotation is to refine the probabilities of productions by adding a small amount of context. With this simple addition, a CYK
(inside-outside, dynamic programming chart parse) parser can improve from 74% to 79% accuracy. A natural generalization of parent annotation is grandparent annotation
and beyond.
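Parent annotation can be sketched with the same chomsky_normal_form method by setting its vertMarkov parameter. The sentence is invented for the example, and the snippet only shows how labels change; it makes no attempt to reproduce the accuracy figures quoted above.

```python
from nltk import Tree

# Hypothetical example tree.
plain = Tree.fromstring("(S (NP (Det the) (N dog)) (VP (V barks)))")

annotated = plain.copy(deep=True)
# vertMarkov=1 requests parent annotation; 2 would add the grandparent too.
annotated.chomsky_normal_form(vertMarkov=1)

# Phrasal nodes below the root now carry their parent's label as context.
annotated_labels = [subtree.label() for subtree in annotated.subtrees()]

print(annotated)
```

An NP under S and an NP under VP therefore end up with different labels, so a grammar induced from the annotated trees can assign them different production probabilities.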