Today we talked about using the data structure that we built in Unit 2, Lesson 3 as the basis from which to actually generate text.
We started with the code from last week that reads a corpus and generates a Markov model data structure from it. That data structure is a dictionary of lists: the keys of the dictionary are each unique word in the corpus, and the values are lists of all the words that follow that word, including repetitions if the same word follows it multiple times.
If you are still confused about this, please study section II.a and section II.b from last week's class notes, which illustrate a very simple corpus and what the data model will look like for it.
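To make this concrete, here is a tiny made-up corpus (not the actual example from the class notes) and the model that the same kind of loop builds from it:

```python
# A tiny made-up corpus for illustration (not the one from the class notes):
corpus = "the cat sat on the mat the cat ran"
words = corpus.split()

markov = {}
for i in range(len(words) - 1):
    word = words[i]
    next_word = words[i + 1]
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]

print(markov["the"])
# → ['cat', 'mat', 'cat'] -- "cat" appears twice in the list
# because "the cat" occurs twice in the corpus
```

Notice that the repetitions are what make the random selection later behave like the corpus: a word that follows "the" often will be picked often.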
To generate a new text from an existing corpus using a Markov model, the idea is to start with a single word, look that word up in the Markov model, get the list of all possible words that might come next, randomly select one, and then repeat the process for that newly selected word.
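Sketched with a tiny hand-built model (the word lists here are made up for illustration, not derived from a real corpus), the whole generation idea looks like this:

```python
import random

# A hypothetical hand-built model, just to show the idea:
markov = {
    "I":    ["like", "like", "went"],   # "like" is twice as likely as "went"
    "like": ["cats.", "dogs."],
    "went": ["home."],
}

word = "I"          # the starting word
output = word
while not output.endswith("."):
    next_word = random.choice(markov[word])  # pick one possible follower
    output = output + " " + next_word
    word = next_word                         # repeat from the new word

print(output)   # e.g. "I like cats." or "I went home."
```

This is exactly the loop we build step by step in the rest of these notes.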
We'll prompt the user for the first word to start with.
We'll start our implementation with the code that we ended with in the last lesson, specifically the code from the end of section II (not the alternative approach that we considered in section III).
Starting with that, we have:
import pprint

file = open("unit2lesson3_simple.txt","r")
markov = {}
words = file.read().replace("\n"," ").split()
i = 0
while i < len(words) - 1:
    word = words[i]
    next_word = words[i+1]
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    i = i + 1
pprint.pp(markov)
Let's start by commenting out the pprint.pp() call. Add a prompt with input() to ask the user to enter a word to start with. Then look up that word in our Markov data structure, which returns a list, and use random.choice() to randomly select an item from that list. That becomes the next word. Remember to import random.
import pprint
import random

file = open("unit2lesson3_simple.txt","r")
markov = {}
words = file.read().replace("\n"," ").split()
i = 0
while i < len(words) - 1:
    word = words[i]
    next_word = words[i+1]
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    i = i + 1
# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")
next_word = random.choice( markov[word] )
How do we go further from there?
Let's make a variable to collect all the words that we'll be finding. I will call this variable output. We will then need to repeat this process of finding next words. How do we create repetition? A loop!
Let's repeat until we get to the end of one sentence. In other words, repeat until output ends with a period (.). Check this with output.endswith(".").
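A quick reminder of how endswith() behaves:

```python
# endswith() returns True only if the string's final characters match:
print("the cat sat on the mat.".endswith("."))   # True
print("the cat sat on the mat".endswith("."))    # False

# So a loop like `while not output.endswith("."):` keeps going
# until the most recently appended word ends with a period.
```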
Inside the loop, after finding the next word, which I'm calling next_word, append next_word to the output variable, so that it accumulates all the words that we are selecting.
Note that the user prompt (input()) happens before the loop (so it only happens once), while the random.choice() line happens inside the loop, so it runs over and over again.
import pprint
import random

file = open("unit2lesson3_simple.txt","r")
markov = {}
words = file.read().replace("\n"," ").split()
i = 0
while i < len(words) - 1:
    word = words[i]
    next_word = words[i+1]
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    i = i + 1
# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

output = word
while not output.endswith("."):
    next_word = random.choice( markov[word] )
    output = output + " " + next_word
    word = next_word
Note that last line. That's the crucial step that allows the process to repeat: it takes the next word that the algorithm has just selected and assigns it to word, so that when the loop repeats, the entire process continues from there.
And that's really it! All you should need to do now is print the output variable:
import pprint
import random

file = open("unit2lesson3_simple.txt","r")
markov = {}
words = file.read().replace("\n"," ").split()
i = 0
while i < len(words) - 1:
    word = words[i]
    next_word = words[i+1]
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    i = i + 1
# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

output = word
while not output.endswith("."):
    next_word = random.choice( markov[word] )
    output = output + " " + next_word
    word = next_word
print(output)
Last thing. What if the user enters a word that does not appear in our Markov model? Test it. You get an error, right? It will probably be a KeyError. A solution here is to add some error checking:
import pprint
import random

file = open("unit2lesson3_simple.txt","r")
markov = {}
words = file.read().replace("\n"," ").split()
i = 0
while i < len(words) - 1:
    word = words[i]
    next_word = words[i+1]
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    i = i + 1
# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

# Do some error checking
while word not in markov:
    word = input("Your word was not in my corpus. Please try again: ")

output = word
while not output.endswith("."):
    next_word = random.choice( markov[word] )
    output = output + " " + next_word
    word = next_word
print(output)
If the word that the user has just entered is not in our Markov data structure, then repeat until it is, each time prompting the user to enter a new word, with an informative prompt that explains that their previous word won't work for this corpus.
We experimented with output from the above and talked about some ways we might improve on it.
One method I suggested was that instead of just looking at one-word Markov states (modeling the transitions from each single word to the next word), we might try adding some additional context by thinking about how pairs of two words lead to a next word. So for example the word "I" might lead to many possibilities, but "I like" might be a little bit narrower in the next possible states, and that might add a little bit more sensibility to our outcome.
Here is how you could add a 2-gram (a.k.a. a bigram) to our Markov model. I will implement this by adding a new data structure, which I will call markov_2gram. This will be similar to the previous markov model that we created (a dictionary of lists), but instead of the keys being single words, the keys will now be tuples, or more specifically, pairs of words.
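In case tuple keys are new to you, here is a quick sketch (with made-up words) of how they work in a dictionary:

```python
# Tuples can be dictionary keys (lists cannot, because lists are mutable):
markov_2gram = {}
pair = ("I", "like")            # a 2-gram: two consecutive words
markov_2gram[pair] = ["cats."]  # words seen after "I like"
markov_2gram[pair].append("dogs.")

print(markov_2gram[("I", "like")])   # → ['cats.', 'dogs.']
```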
Then in the generation phase, we will first check whether the current pair of words (the previous word and the current word) is in our bigram model, and if not, we will fall back on checking our regular unigram model.
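That fallback can be sketched like this (the two little models here are hypothetical, just to show the check):

```python
# Hypothetical models, just to demonstrate the fallback:
markov = {"like": ["cats.", "dogs."]}
markov_2gram = {("I", "like"): ["cats."]}

previous_word, word = "you", "like"
pair = (previous_word, word)
if pair in markov_2gram:
    possible_next_words = markov_2gram[pair]   # use the richer 2-gram context
else:
    possible_next_words = markov[word]         # fall back to the 1-gram model

print(possible_next_words)   # → ['cats.', 'dogs.'] (("you", "like") is not a key)
```

Checking membership first (rather than indexing directly) is what saves us from a KeyError when a pair never occurred in the corpus.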
Here is what I came up with:
import pprint
import random

file = open("unit2lesson3_dickens_copperfield.txt","r")
markov = {}
markov_2gram = {}
words = file.read().replace("\n"," ").split()
previous_word = ""
i = 0
while i < len(words) - 1:
    word = words[i]
    next_word = words[i+1]
    # 1gram
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    # 2gram
    if previous_word:
        word_2gram = (previous_word,word)
        if word_2gram in markov_2gram:
            markov_2gram[word_2gram].append(next_word)
        else:
            markov_2gram[word_2gram] = [next_word]
    previous_word = word
    i = i + 1
# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

# Do some error checking
while word not in markov:
    word = input("Your word was not in my corpus. Please try again: ")

output = word
previous_word = ""
while not output.endswith("."):
    possible_next_words = None
    if previous_word:
        word_2gram = (previous_word,word)
        if word_2gram in markov_2gram:
            possible_next_words = markov_2gram[word_2gram]
    if not possible_next_words:
        possible_next_words = markov[word]
    next_word = random.choice( possible_next_words )
    output = output + " " + next_word
    previous_word = word
    word = next_word
print(output)
Just for kicks, here is another Markov example that I came up with that I thought I would share. I was thinking we might have time to get to this in class today. We didn't, but I'll just share this here in any case. Presented here without too much explanation.
I wanted to demonstrate to you all how we could use a Markov model to generate rhyming couplets. This uses a Python library called pronouncing, which is a Python interface to the CMU Pronouncing Dictionary. It lets you take a word and do things like generate a list of words that rhyme with it, or get its syllable count.
This generates one 10-syllable line of a poem, then takes its last word and looks up a list of words that rhyme with it. The challenge then was how to generate another line of text using the Markov model that ends with one of those rhyming words.
At first I tried to "brute force" it: just generate another line of text, and if it ends with a rhyming word, use it, but if it does not end with a rhyming word, try again, and keep trying again until you get a rhyme. This was not working well.
So instead, I built another Markov model, similar in approach to the bigram version above, but in reverse: it models not the next words that come after a given word, but the previous words that come before it.
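Building the reverse model is the same loop as before with the roles of the two words swapped. A sketch on a made-up word list (standing in for a real corpus):

```python
# A made-up word list standing in for a real corpus:
words = ["the", "cat", "sat", "on", "the", "mat"]

markov_reverse = {}
for i in range(len(words) - 1):
    word = words[i]
    next_word = words[i + 1]
    # key on the *following* word, collecting the words that precede it
    if next_word in markov_reverse:
        markov_reverse[next_word].append(word)
    else:
        markov_reverse[next_word] = [word]

print(markov_reverse["the"])   # → ['on'] (the first "the" has no predecessor)
```

With this dictionary, random.choice(markov_reverse[word]) walks the text backwards one word at a time.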
This way, I generated my first line, took the last word, found a list of rhymes, selected one of them randomly, and then used the reverse Markov model to generate the next line in reverse, starting with the rhyming word and finding preceding words until I had another 10-syllable line.
Here is what that looks like.
I would love if someone wanted to take this and apply it to a different kind of corpus — maybe a different author, or maybe legal documents, or ... If you want to work with this and have any questions, just let me know!
Here's an interesting example that it produced:
If you more ; of dreams I was going ; when I for pleasant my in hotel, the all up look ai look business-like as he anoints his toast, and liable to turn farther gone had he and ; know exactly burn exactly the room, the close to take the waiter, in must everybody and elsewhere and night, snowy
import pprint
import random
import pronouncing

file = open("unit2lesson3_dickens_copperfield.txt","r")
markov = {}
markov_reverse = {}
words = file.read().replace("\n"," ").split()
i = 0
while i < len(words) - 1:  #for (i,word) in enumerate(words):
    word = words[i]
    next_word = words[i+1]
    # 1gram
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    # reverse
    if next_word in markov_reverse:
        markov_reverse[next_word].append(word)
    else:
        markov_reverse[next_word] = [word]
    i = i + 1

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

# Do some error checking
while word not in markov:
    word = input("Your word was not in my corpus. Please try again: ")

poem = ""
line = word + " "

# generate 10 lines (5 rhyming couplets) starting with WORD
for line_count in range(5):

    # first generate the first line of 10 syllables, starting with WORD
    syllables = 0
    while syllables < 10:
        next_word = random.choice( markov[word] )
        line = line + next_word + " "
        pronunciation_list = pronouncing.phones_for_word(next_word)
        if pronunciation_list:
            syllables = syllables + pronouncing.syllable_count(pronunciation_list[0])
        word = next_word
    poem = poem + line + "\n"

    # find a rhyme
    rhymes = pronouncing.rhymes(word)
    word = ""
    for r in rhymes:
        if r in markov_reverse:
            word = r
            break
    if word == "":
        word = random.choice( [w for w in markov_reverse.keys()] )
    rhyme_line_last_word = word

    # generate a line of 10 syllables using markov_reverse
    syllables = 0
    line = word
    while syllables < 10:
        possible_previous_words = markov_reverse[word]
        previous_word = random.choice(possible_previous_words)
        line = previous_word + " " + line
        pronunciation_list = pronouncing.phones_for_word(previous_word)
        if pronunciation_list:
            syllables = syllables + pronouncing.syllable_count(pronunciation_list[0])
        else:
            syllables = syllables + 1
        word = previous_word
    poem = poem + " " + line + "\n\n"

    line = word + " "
    word = rhyme_line_last_word

print("\n\n")
print("Poem:")
print(poem)
print("\n\n")