Due: Tuesday, March 2, 8pm
The class notes, in developing Example 1, discuss using a dictionary as "buckets" to count the frequencies of colors in an image. Re-use this technique to count the frequencies of words in a large text document.
Start with a fairly large text. Perhaps you can look through the collections of texts at archive.org. For example, here is a US military planning document from 1984. From that page, click on "Full Text" on the right-hand side to access the complete, scanned contents of the document as plain text.
Refer back to the code snippet in homework 2 which opens a text file and loops over it. That code generates a list of words, but instead of building a list of words, you will now create a dictionary that keeps track of word frequencies (word counts).
Start by creating a dictionary. You can call it whatever you'd like, for example:
    word_counts = {}
Then you'll include the nested for loops from homework 2, which loop over the file by each line and by each word in each line. Within those loops, follow the pattern of Example 1 from the class notes by checking if word in word_counts, and if so, increment the count for that word, and if not, add word to the word_counts dictionary with the value 1.
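As a rough sketch of the counting pattern described above (not the homework 2 snippet itself, which is not reproduced here), the loop might look like the following. The function name count_words and the sample lines are made up for illustration, and the sketch assumes words are separated by whitespace, so "dog" and "dog," would be counted as different words.

```python
def count_words(lines):
    """Return a dictionary mapping each word to how many times it appears."""
    word_counts = {}
    for line in lines:               # outer loop: one line at a time
        for word in line.split():    # inner loop: one word at a time
            if word in word_counts:
                # Seen before: increment the existing count.
                word_counts[word] = word_counts[word] + 1
            else:
                # First time we see this word: start its count at 1.
                word_counts[word] = 1
    return word_counts


# Made-up sample lines standing in for the lines of a text file.
sample_lines = ["the cat saw the dog", "the dog ran"]
word_counts = count_words(sample_lines)
print(word_counts)
```

Because an open file can be looped over line by line, the same function should also work if you pass it a file object instead of a list of strings.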
After you have completed this, you can do what Example 1 from the class notes does and print() out len(word_counts) to display the number of distinct words in the document, or use the code snippet at the end that prints the most frequent word.
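One detail worth keeping in mind: len() on a dictionary counts its keys, so it reports how many different words appear, not how many words the document contains overall. A small sketch of the difference, using made-up counts:

```python
# Made-up counts for illustration only.
word_counts = {"the": 3, "dog": 2, "cat": 1}

distinct_words = len(word_counts)          # number of keys (different words)
total_words = sum(word_counts.values())    # add up every word's count

print(distinct_words)
print(total_words)
```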
In Unit 1 we talked about how you might implement a sorting algorithm yourself, but Python has some built-in sorting functions. (Some of you have already experimented with this.) Once you have the word_counts dictionary, you can sort it like this:
    sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)
Now, sorted_words will contain the list of words in order of frequency. So sorted_words[0] is the most frequent word, sorted_words[1] is the second most frequent, and so on. You could then say:
    w = sorted_words[0]
    print(word_counts[w])
to print the word count of the most frequent word. Print out a few of the most frequent words and their word counts.
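One possible way to print a few of the most frequent words is to slice the sorted list and loop over the slice. This is only a sketch: the counts below are made up, and the name top_words and the choice of three words are arbitrary.

```python
# Made-up counts for illustration only.
word_counts = {"the": 5, "cat": 1, "dog": 3, "ran": 2}

# Sort the words (the dictionary's keys) by their counts, largest first.
sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)

# Keep only the first three words in the sorted list.
top_words = sorted_words[:3]

for w in top_words:
    print(w, word_counts[w])
```

Note that sorted() returns a new list of the keys and leaves the word_counts dictionary itself unchanged, which is why the counts can still be looked up afterwards.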
Experiment with Exif data, as illustrated at the end of the class notes. Modify my code snippet to try to access different Exif properties for a directory full of images.
(Optional, advanced.) At the end of the class notes, I suggested you might think about how to sort a list of images based on some Exif property. Accessing Exif data as I showed in the class notes, and using some techniques above, can you think how you might do this? For example, you could sort a list of images based on the time of day when they were taken. It could be interesting to create a collage of images ordered by time of day, or by shutter speed or f-stop.
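As a hint for the sorting part only, here is a sketch that reuses the sorted() technique with a key function. It assumes you have already read each image's capture time using the Exif code from the class notes and stored it in a dictionary. The filenames, the times, and the names capture_times and time_of_day are all made up. The strings follow the "YYYY:MM:DD HH:MM:SS" layout that Exif uses for its date and time fields.

```python
# Hypothetical data: filename -> Exif capture time, already read from each image.
capture_times = {
    "beach.jpg": "2021:02:14 17:45:10",
    "breakfast.jpg": "2021:02:20 08:05:33",
    "skyline.jpg": "2021:01:30 12:30:00",
}


def time_of_day(filename):
    # Drop the date and keep only "HH:MM:SS". Because the fields are
    # zero-padded, comparing these strings orders them from early to late.
    return capture_times[filename].split(" ")[1]


# Sort the filenames by time of day, ignoring the date they were taken.
sorted_images = sorted(capture_times, key=time_of_day)
print(sorted_images)
```

Sorting by shutter speed or f-stop would follow the same shape, with a key function that looks up that property instead.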
Answer the following questions about hypothetical scenarios and what data structure you might use in each.