We started this week where we left off last time, talking about data scraping. Refer back to the Unit 2 Lesson 2 class notes, starting with the section on downloading a webpage using Python, and continuing from there.
Then we picked up again and began building a Markov chain algorithm together. We started by processing a text and generating a data structure that we'll use to implement our algorithm next time.
We talked about what this data structure should look like and came up with something like this: (Note: I made a slight error in the data and explanation below. I've corrected it here. If you previously read this and were confused, I apologize. Make sure that it makes sense to you now.)
{ "For": { "example": 3, "any": 1, "once": 2, ... }, "The": { "person": 2, "time": 1, "next": 1, ... }, "I": { ... }, ... }A dictionary for which the keys are words in the text, and the values are a dictionary of "next words", by which we mean every word that comes after this word in the text. We represent each "next word" in our list as a key/value pair in which the key is the "next word", and the value is the number of times that it appears after this word in the text.
The code that we came up with together is below. We implemented this using the text of Dickens' David Copperfield, which I downloaded from archive.org. You can download the David Copperfield text here.
file = open("dickens_copperfield.txt","r") markov_data = {} lines = file.readlines() for line in lines: words = line.split() i = 0 while i < len(words) - 1: current_word = words[i] next_word = words[i+1] if current_word not in markov_data: markov_data[current_word] = { next_word: 1 } else: # (we've encountered this word before) if next_word not in markov_data[current_word]: markov_data[current_word][next_word] = 1 else: markov_data[current_word][next_word] = markov_data[current_word][next_word] + 1 i = i + 1 # experiment here to see what we've generated print(markov_data["The"])
In class, Viyan had the great idea that we might modify our Markov chain data structure slightly to look like this:
{ "For": [ "example", "once", "example", "any", "example", "once", ... ], "The": [ "person", "time", "person", "next", ... ], "I": [ ... ], ... }This example here would be the data structure for the same example as my pseudocode above. In Viyan's version, instead of storing each "next word" as a dictionary of the word and its word count, we're simply storing each "next word" in a list, with repetitions. With this data structure, we could use the Python
random.choice()
function to choose an item from the list of "next words", and
statistically the choice would yield a random word with the same
probability that it appears in the text. There are definitely some
advantages to going with this approach! Maybe also some disadvantages
too.
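Just to illustrate the random.choice() behavior (without giving away this week's challenge), here's a minimal sketch on a hand-typed list like the "For" entry above:

import random

# A toy "next words" list, typed by hand for illustration.
# Because "example" appears 3 times out of 6 entries,
# random.choice() returns it about half the time.
next_words = ["example", "once", "example", "any", "example", "once"]
print(random.choice(next_words))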
For your homework this week, I'd like you to get the above code running on a sample corpus and see what the results look like. And then I challenge you to modify the code above so that it produces the data structure that Viyan described. Use the same sample corpus and see if you can get results that look right.