Today in class we got set up with the tools needed to work with the Natural Language Toolkit (NLTK).
NLTK is the preeminent technology to work with if you want to write computer programs that do natural language processing (NLP), which is a core topic for this class this semester. By natural language here, we mean spoken, human languages, as opposed to computer programming languages, the languages of various network protocols, or other symbolic languages not spoken by people.
NLTK is a library for Python, meaning that it is a collection of functionality distributed together, which you can install on your computer, and then use from within your Python programs. NLTK is free, freely available, open source software.
NLTK supports work, research, and teaching in various fields including linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. It supports NLP tasks such as classification, tokenization, stemming, tagging, parsing, and semantic reasoning (this list is from Wikipedia). We'll talk about what some of those tasks mean in the coming weeks.
There are five main steps:
I. Python. To work with NLTK, you'll need Python. Your system probably already has Python installed on it. The question is whether your system has Python version 2 or version 3. In class we determined that everyone in our class has Python 3 on their system.
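If you want to double-check the version yourself, typing python3 --version on the command line will report it. The same check can be done from inside Python. What follows is only a minimal sketch: the function name describe_python is made up for illustration, and you would need to save it to a file and run it the way step V describes.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a short message saying whether this is Python 2 or Python 3."""
    major, minor = version_info[0], version_info[1]
    if major >= 3:
        return "Python {}.{}: good, this is Python 3.".format(major, minor)
    return "Python {}.{}: this is Python 2, so look for a python3 command.".format(major, minor)


if __name__ == "__main__":
    # With no argument, the function reports on the Python that is running it.
    print(describe_python())
```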
II. Text editor. To create Python computer programs, we'll use a text editor. This semester, I'll recommend that you use Atom. This is an open source, freely available text editor with lots of great, helpful features to help you learn how to code. It is easy to download and install. I presume some of you might have used Atom before. If you would like to use a different text editor of your own choosing this semester, that's great, but I will not be able to offer detailed tech support for anything other than Atom.
A text editor is one part of any development environment. Some development environments, like P5js or Microsoft Visual Studio, are called integrated development environments (IDEs) because all the various development tools are integrated into one seamless application. Notice how in P5js this semester, you were able to write code, use code editing helpers like syntax highlighting, run your programs, and see the result, all in one web interface.
In working with NLTK and Python, we'll be working with a development environment that is not really "integrated" in that sense, but rather is a set of a few tools that work together well. This might be a little trickier to learn, but will be more powerful, will give you deeper knowledge of how these tools work and how they work on your computer. Working this way will give you more flexibility in the days to come to work with these tools, and it will provide valuable skills to take out of this class as you encounter other development tools and practices in the future.
III. The command line. To run NLTK Python programs, and to manage the source code that you'll be working on, we will be using a tool called the command line interface (CLI). So far in your life you have probably mostly interacted with computers using graphical user interfaces (GUIs): all the graphical, window-based, user-friendly interfaces that we use to run our apps and other programs.
The command line is a different paradigm for interacting with computers. Instead of clicking on graphical icons, you type text-based commands. All of the same functionality is present in both modes. The command line offers some advantages for the type of work that you'll be doing with Python and NLTK, mainly because developing graphical user interfaces can be time-consuming. For your NLTK work, I'd like you to be able to focus on the principles of writing programs that process text, and not get distracted designing and developing user interfaces. (Those are great things to learn! But they would be better served in a class that focused on them.)
The CLI also automatically gives access to other text-processing tools that we can use together with the Python programs that you're going to write. To do this, we'll be using the command-line technique called the pipe, signified with the vertical bar symbol: |. This allows you to send the output from one program directly into the input of another. This is mentioned in the reading, and we'll talk about it in the coming days.
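As a preview, here is a rough sketch of a Python program that can sit on the receiving end of a pipe. The file name count_words.py and the function name count_words are made up for illustration. The key idea is that whatever the program to the left of the | prints out arrives in Python as sys.stdin.

```python
import sys


def count_words(lines):
    """Count whitespace-separated words across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)


if __name__ == "__main__":
    if sys.stdin.isatty():
        # Nothing was piped in, so explain how the program is meant to be used.
        print("Pipe some text into this program, e.g.: echo hello | python3 count_words.py")
    else:
        # sys.stdin holds whatever the program to the left of the | printed.
        print(count_words(sys.stdin))
```

Assuming the file is saved as count_words.py, typing echo "the quick brown fox" | python3 count_words.py on the command line should print 4.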
If you are on Mac or Linux, the command line is already installed on your system and is already easily accessible. Simply find and open Terminal. If you are on Windows, you should install a tool called Cygwin.
IV. NLTK. To actually install NLTK, we'll use a Python installer tool called pip (if you are using Python 3, make sure to type pip3).
Open Terminal, and in the command line type:
pip3 install --user -U nltk
That should display a bunch of text, and somewhere near the end you should see something that says "Success". If you see that, it means you have now installed a whole host of natural language processing tools on your computer for use from within Python, all bundled together in the library called nltk. If this doesn't work for you, ask me for help.
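If you'd like a second way to confirm the install, the following is a small sketch that asks Python whether it can find a library by name, without actually loading it. The function name is_installed is my own, not part of NLTK or pip, and it is only meant for simple top-level names like nltk.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("nltk"):
        print("nltk is installed")
    else:
        print("nltk was not found; try the pip3 command again")
```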
V. Your workspace. Make a folder on your computer called "Coding Natural Language" for this whole semester. Inside that, make a folder called "Week 7". Open the Terminal, type cd followed by a space (note the space), then drag the "Week 7" folder into Terminal and press <ENTER>. You have now changed directories. In command line parlance, a directory is synonymous with a folder in the GUI.
Now, drag the "Week 7" folder into Atom. Create a new file (in the menu click File > New File), and type the following two lines:
import nltk
print("Hello")
Make sure to save the file. Important: Make sure that the file is named main.py (no spaces, all lowercase, no other punctuation), and make sure that it is saved into the folder that you called "Week 7".
Now, you should be able to type the following command on the command line:
python3 main.py
It might take a little while to run because your program is loading all the nltk tools. But ultimately you should see a simple printout that says "Hello". This means that everything is working for you.
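Now that you can run a program, here is an optional sketch that shows the link between the command line and your folders. A program started from the command line runs "in" whatever directory you last changed into with cd. The function name folder_contents is made up for illustration.

```python
from pathlib import Path


def folder_contents(folder):
    """Return the sorted names of everything directly inside a folder."""
    return sorted(entry.name for entry in Path(folder).iterdir())


if __name__ == "__main__":
    here = Path.cwd()  # the directory this program was started from
    print("Current directory:", here)
    print("Contents:", folder_contents(here))
```

If you run this from inside "Week 7", the list of contents should include main.py, the same file you see in that folder in the GUI.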
Get started with some NLTK experiments. Have a look at NLTK.org and read the four paragraphs of overview there.
Copy and paste the sample code there into your main.py file to try to get those various features to work: "Tokenize and tag some text", "Identify named entities", and "Display a parse tree". You should ignore all the >>> characters and the ... characters, and make some other adjustments. Your main.py file should look like this: sample-main.py. Make sure to save that! Then go back to the command line, and again run python3 main.py. You should see some output similar to what is shown on that page. Try running this with different sentences and see what you can figure out.
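The cleanup step described above (ignoring the >>> and ... characters) can also be done by a program, which makes a nice first text-processing exercise. This is only a sketch under some assumptions: the function name strip_prompts is my own, and it assumes that every line of code in the pasted sample starts with ">>> " or "... " and that any other line is output that Python printed.

```python
def strip_prompts(pasted):
    """Remove interactive-prompt markers from pasted sample code.

    Lines beginning with '>>> ' or '... ' are code, so the marker is removed
    and the rest of the line is kept. Any other line is treated as output
    that the interpreter printed, so it is dropped.
    """
    code_lines = []
    for line in pasted.splitlines():
        if line.startswith(">>> ") or line.startswith("... "):
            code_lines.append(line[4:])
        elif line.strip() in (">>>", "..."):
            # A prompt with nothing after it becomes a blank line of code.
            code_lines.append("")
    return "\n".join(code_lines)


if __name__ == "__main__":
    sample = ">>> import nltk\n>>> tokens = nltk.word_tokenize(sentence)\n>>> tokens\n['At', 'eight']"
    print(strip_prompts(sample))
```

One of the "other adjustments" is worth knowing about: in a saved program, a line that is just a name, like tokens, displays nothing. To see the value you would write print(tokens) instead.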