Code as a Liberal Art, Spring 2025

Unit 2, Lesson 4 — Thursday, April 10

Using the Markov chain data structure

Today we talked about using the data structure that we built in Unit 2, Lesson 3 as the basis from which to actually generate text.

We started with the code from last week that reads a corpus and generates a Markov model data structure from it. That data structure is a dictionary of lists: the keys of the dictionary are the unique words in the corpus, and the value for each key is a list of all the words that follow that word in the corpus, with repetitions if the same word follows it more than once.
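For example, if the corpus were nothing but the made-up sentence "the cat sat on the mat." then the resulting data structure would look like this:

markov = {
    "the": ["cat", "mat."],
    "cat": ["sat"],
    "sat": ["on"],
    "on": ["the"],
}

Notice that "the" occurs twice in the corpus, so its list has two entries, and the final word "mat." never becomes a key because nothing follows it.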

If you are still confused about this, please study section II.a and section II.b from last week's class notes, which illustrate a very simple corpus and what the data model will look like for it.

Table of contents

  1. Simple Markov text generator
  2. Bigram Markov text generator
  3. A Markov rhyming couplet generator

I. Simple Markov text generator

To generate a new text from an existing corpus using a Markov model, the idea is to start with a single word, look that word up in the Markov model, get the list of all possible words that might come next, randomly select one, and then repeat the process with that newly selected word.

We'll prompt the user for the first word to start with.

We'll start our implementation with the code that we ended with in the last lesson, specifically the version from the end of section II (not the alternative approach that we considered in section III).

Starting with that, we have:

import pprint
    
file = open("unit2lesson3_simple.txt","r")

markov = {}

words = file.read().replace("\n"," ").split()

i = 0
while i < len(words) - 1:

    word = words[i]
    next_word = words[i+1]

    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]

    i = i + 1

pprint.pp(markov)

Let's start by commenting out the pprint.pp().

Add a prompt with input() to ask the user to enter a word to start with. Then look up that word in our Markov data structure, which gives us a list, and use random.choice() to randomly select an item from that list. That becomes the next word. Remember to import random.

import pprint
import random
    
file = open("unit2lesson3_simple.txt","r")

markov = {}

words = file.read().replace("\n"," ").split()

i = 0
while i < len(words) - 1:

    word = words[i]
    next_word = words[i+1]

    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]

    i = i + 1

# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

next_word = random.choice( markov[word] )

How do we go further from there?

Let's make a variable to collect all the words that we'll be finding. I will call this variable output. We will then need to repeat this process of finding next words. How do we create repetition?

A loop!

Let's repeat until we get to the end of one sentence. In other words, repeat until output ends with a period ("."). Check this with output.endswith(".").

Inside the loop, after finding the next word, which I'm calling next_word, add next_word (with a space in front of it) onto the end of the output variable, so that output accumulates all the words that we are selecting.

Note that the user prompt (input()) is happening before the loop (so it only happens once), while the random.choice() line happens inside the loop, so it happens over and over again.

import pprint
import random
    
file = open("unit2lesson3_simple.txt","r")

markov = {}

words = file.read().replace("\n"," ").split()

i = 0
while i < len(words) - 1:

    word = words[i]
    next_word = words[i+1]

    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]

    i = i + 1

# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

output = word

while not output.endswith("."):

    next_word = random.choice( markov[word] )
    output = output + " " + next_word

    word = next_word

Note that last line. That's the crucial step that allows the process to repeat. It takes the next word that the algorithm has just selected and assigns it to word, so that when the loop repeats, the process continues from that word.
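For example, using the tiny made-up "the cat sat on the mat." model from earlier, a run that starts with the word "cat" might proceed like this (what happens at "the" depends on which item random.choice() picks):

word = "cat"                            output = "cat"
next_word = "sat",   word = "sat"       output = "cat sat"
next_word = "on",    word = "on"        output = "cat sat on"
next_word = "the",   word = "the"       output = "cat sat on the"
next_word = "mat.",  word = "mat."      output = "cat sat on the mat."

At that point output ends with a period, so the loop stops.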

And that's really it! All you should need to do now is print the output variable:

import pprint
import random
    
file = open("unit2lesson3_simple.txt","r")

markov = {}

words = file.read().replace("\n"," ").split()

i = 0
while i < len(words) - 1:

    word = words[i]
    next_word = words[i+1]

    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]

    i = i + 1

# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")

output = word

while not output.endswith("."):
    next_word = random.choice( markov[word] )
    output = output + " " + next_word

    word = next_word

print(output)

One last thing. What if the user enters a word that does not appear in our Markov model? Test it. You get an error, right? It will be a KeyError, because we are looking up a key that the dictionary does not contain. A solution here is to add some error checking:

import pprint
import random
    
file = open("unit2lesson3_simple.txt","r")

markov = {}

words = file.read().replace("\n"," ").split()

i = 0
while i < len(words) - 1:

    word = words[i]
    next_word = words[i+1]

    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]

    i = i + 1

# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")
# Do some error checking
while word not in markov:
    word = input("Your word was not in my corpus. Please try again: ")

output = word

while not output.endswith("."):
    next_word = random.choice( markov[word] )
    output = output + " " + next_word

    word = next_word

print(output)

If the word that the user has just entered is not in our Markov data structure, then repeat until it is, each time prompting the user to enter a new word, with an informative prompt that explains that their previous word won't work for this corpus.
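By the way, if you have seen try/except before, another way to handle this is to catch the KeyError directly. Here is a quick sketch of that alternative, assuming the markov dictionary has already been built as above; the while loop version works just as well:

while True:
    word = input("Enter a word to start a sentence: ")
    try:
        # This lookup raises a KeyError if word is not a key in the model
        next_choices = markov[word]
        break
    except KeyError:
        print("Your word was not in my corpus. Please try again.")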

(jump back up to table of contents)

II. Bigram Markov text generator

We experimented with output from the above and talked about some ways we might improve on it.

One method I suggested was that instead of just looking at one-word Markov states (modeling the transitions from each single word to the next word), we might try adding some additional context by thinking about how pairs of words lead to a next word. So for example the word "I" might lead to many possibilities, but the pair "I like" narrows down the possible next states, and that might make our output read a little more sensibly.

Here is how you could add a 2-gram (a.k.a. bigram) model to our setup. I will implement this by adding a new data structure, which I will call markov_2gram. This will be similar to the previous Markov model that we created (a dictionary of lists), but instead of the keys being single words, the keys will now be tuples, specifically pairs of words.
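To see the shape of this, here is what markov_2gram would contain for the made-up corpus "the cat sat on the mat." from earlier:

markov_2gram = {
    ("the", "cat"): ["sat"],
    ("cat", "sat"): ["on"],
    ("sat", "on"): ["the"],
    ("on", "the"): ["mat."],
}

Notice that the pair ("on", "the") leads only to "mat.", while the single word "the" on its own could lead to either "cat" or "mat."; that narrowing is exactly the extra context we are after.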

Then in the generation phase, we will first check whether the current pair of words (previous word and current word) is in our bigram model, and if not, we will fall back on checking our regular unigram (one-word) model.

Here is what I came up with:

import pprint
import random

file = open("unit2lesson3_dickens_copperfield.txt","r")

markov = {}
markov_2gram = {}

words = file.read().replace("\n"," ").split()

previous_word = ""

i = 0
while i < len(words) - 1:
    
    word = words[i]
    next_word = words[i+1]
    
    # 1gram
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    
    # 2gram
    if previous_word:
        word_2gram = (previous_word,word)
        if word_2gram in markov_2gram:
            markov_2gram[word_2gram].append(next_word)
        else:
            markov_2gram[word_2gram] = [next_word]
    
    previous_word = word
        
    i = i + 1

# pprint.pp( markov )

# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")
# Do some error checking
while word not in markov:
    word = input("Your word was not in my corpus. Please try again: ")

output = word

previous_word = ""

while not output.endswith("."):

    possible_next_words = None

    if previous_word:
        word_2gram = (previous_word,word)
        # Only use the bigram model if this pair actually appears in it
        if word_2gram in markov_2gram:
            possible_next_words = markov_2gram[word_2gram]
        
    if not possible_next_words:
        possible_next_words = markov[word]

    next_word = random.choice( possible_next_words )
    
    output = output + " " + next_word
    
    previous_word = word
    word = next_word

print(output)

(jump back up to table of contents)

III. A Markov rhyming couplet generator

Just for kicks, here is another Markov example that I came up with and thought I would share. I was thinking we might have time to get to this in class today; we didn't, but I'll share it here anyway, presented without too much explanation.

I wanted to demonstrate to you all how we could use a Markov model to generate rhyming couplets. This uses a Python library called pronouncing, which is a Python interface to the CMU Pronouncing Dictionary. It lets you take a word and do things like get a list of words that rhyme with it, or count its syllables.
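For example, here is a tiny standalone sketch of the pronouncing functions that the code below relies on. You would need to install the library first (pip install pronouncing), and the exact words and numbers you get back come from the CMU data:

import pronouncing

# A list of words that rhyme with "toast" (it includes words like "coast", "ghost", "most", ...)
print(pronouncing.rhymes("toast"))

# To count syllables, first look up the word's possible pronunciations,
# then count the syllables in the first pronunciation found
pronunciations = pronouncing.phones_for_word("liable")
if pronunciations:
    print(pronouncing.syllable_count(pronunciations[0]))   # prints 3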

This generates one 10-syllable line of a poem, then takes the last word of that line and looks up a list of words that rhyme with it. The challenge then was how to use the Markov model to generate another line of text that ends with one of those rhyming words.

At first I tried to "brute force" it: just generate another line of text, and if it ends with a rhyming word, use it; if it does not, throw it away and try again, and keep trying until you get a rhyme. This did not work well.

So instead, I built another Markov model, similar in spirit to the bigram version above, but this time it's a Markov model in reverse. It models not the words that come after a given word, but the words that come before it.
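To make that concrete, here is what the reverse model would look like for the made-up corpus "the cat sat on the mat." from earlier: each word maps to a list of the words that appear immediately before it.

markov_reverse = {
    "cat": ["the"],
    "sat": ["cat"],
    "on": ["sat"],
    "the": ["on"],
    "mat.": ["the"],
}

Choosing repeatedly from these lists lets us build a line backwards, from its last word toward its first.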

This way, I generated my first line, took its last word, found a list of rhymes, selected one of them at random, and then used the reverse Markov model to generate the next line in reverse, starting with the rhyming word and finding preceding words until I had another 10-syllable line.

Here is what that looks like.

I would love if someone wanted to take this and apply it to a different kind of corpus — maybe a different author, or maybe legal documents, or ... If you want to work with this and have any questions, just let me know!

Here's an interesting example that it produced:

If you more ; of dreams I was going ; when I
  for pleasant my in hotel, the all up look ai

look business-like as he anoints his toast, and liable to turn
  farther gone had he and ; know exactly burn

exactly the room, the close to take the waiter, in
  must everybody and elsewhere and night, snowy

And here is the full code:

import pprint
import random
import pronouncing

file = open("unit2lesson3_dickens_copperfield.txt","r")

markov = {}
markov_reverse = {}

words = file.read().replace("\n"," ").split()

i = 0
while i < len(words) - 1:

    word = words[i]
    next_word = words[i+1]
    
    # 1gram
    if word in markov:
        markov[word].append(next_word)
    else:
        markov[word] = [next_word]
    
    # reverse
    if next_word in markov_reverse:
        markov_reverse[next_word].append(word)
    else:
        markov_reverse[next_word] = [word]
        
    i = i + 1


# Prompt the user for a word:
word = input("Enter a word to start a sentence: ")
# Do some error checking
while word not in markov:
    word = input("Your word was not in my corpus. Please try again: ")

poem = ""
line = word + " "

# generate 10 lines (5 rhyming couplets) starting with WORD
for line_count in range(5):

    # first generate the first line of 10 syllables, starting with WORD    
    syllables = 0

    while syllables < 10:
        next_word = random.choice( markov[word] )

        line = line + next_word + " "
        
        pronunciation_list = pronouncing.phones_for_word(next_word)
        if pronunciation_list:
            syllables = syllables + pronouncing.syllable_count(pronunciation_list[0])

        word = next_word 
    
    poem = poem + line + "\n"
    
    # find a rhyme
    rhymes = pronouncing.rhymes(word)
    word = ""
    
    for r in rhymes:
        if r in markov_reverse:
            word = r
            break
    if word == "":
        word = random.choice( [w for w in markov_reverse.keys()] )
        
    rhyme_line_last_word = word
        
    # generate a line of 10 syllables using markov_reverse
    syllables = 0
    line = word
    while syllables < 10:
        possible_previous_words = markov_reverse[word]
        previous_word = random.choice(possible_previous_words)
    
        line = previous_word + " " + line
        
        pronunciation_list = pronouncing.phones_for_word(previous_word)
        if pronunciation_list:
            syllables = syllables + pronouncing.syllable_count(pronunciation_list[0])
        else:
            syllables = syllables + 1

        word = previous_word 
    
    poem = poem + "  " + line + "\n\n"
    line = word + " "

    word = rhyme_line_last_word

print("\n\n")
print("Poem:")
print(poem)
print("\n\n")