Due: Tuesday, March 7, 8pm
The class notes, in developing the code for counting the occurrences of colors in an image (see Example 1), discuss using a dictionary as "buckets" to count the number of occurrences of each color. Re-use this technique to count the number of occurrences of words in a large text document. (Note: Feel free to put parts a, b, and c in their own .py files, or put them all in a single .py file named "part-1.py" and clearly label each part with Python comments: #.)
Start with a fairly large text. A couple weeks ago, the President delivered the 2023 State of the Union address, so let's use that. I will copy the text to a file for you to use here:
Start by saving that file into the folder where you will be working. Then, in your Python file, add the following line:

file = open("u2t1-2023-sotu.txt", "r")

Now you can proceed similarly to Example 1 from the class notes.
Create a dictionary. Instead of color_counts, you can call it whatever you'd like, for example:

word_counts = {}
Next, include a nested loop, but where Example 1 was looping over an image file by rows and columns, you will be looping over a text file, first by lines of text, and then by words within each line. Like this:
for line in file:
    for word in line.split():
Now, the inner loop will repeat once for each word in the file, and each time it does, the variable word will correspond to the current word. Follow the logic of Example 1, but instead of using p and color_counts, use word and word_counts.
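Putting those pieces together, the counting step might look like the sketch below. Here a small in-memory list of lines stands in for the open file object (iterating a file yields lines the same way), so you can see the logic in isolation; in your program you would loop over file instead.

```python
# Sketch of the Example 1 counting logic, adapted from pixels to words.
# `lines` stands in for the open file object for illustration.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

word_counts = {}  # the dictionary of "buckets": word -> count
for line in lines:
    for word in line.split():
        if word in word_counts:
            word_counts[word] = word_counts[word] + 1  # seen before: bump its bucket
        else:
            word_counts[word] = 1  # first time: start a new bucket

print(word_counts["the"])  # → 3
print(word_counts["fox"])  # → 2
```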
Next, you can follow the four-line algorithm from Example 1 that calculates the most frequent color, but instead do so for the most frequent word. Refer back to the Unit 1 class notes if you need help. Notice that Example 1 starts by setting most_frequent_color to the (0,0) pixel. In your case you'll want to set that variable to the first word of the file, but that's a little tricky. So if your variable is called most_frequent_word, here are the two lines to do it:

file.seek(0)
most_frequent_word = file.readline().split()[0]
What is the most frequent word from the speech?
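To make the search concrete, here is a sketch of that max-finding loop over a small hand-made word_counts dictionary (the dictionary and starting word are made up for illustration; in your program the starting word comes from file.readline() as shown above):

```python
# Sketch: find the key with the largest count, mirroring Example 1's
# most-frequent-color search.
word_counts = {"the": 3, "fox": 2, "quick": 1}

most_frequent_word = "quick"  # stands in for the first word of the file
for word in word_counts:
    if word_counts[word] > word_counts[most_frequent_word]:
        most_frequent_word = word

print(most_frequent_word)  # → the
```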
Stop words. Probably the most frequent word is a little boring. Make a new list at the top of your file called stop_words, and add to it every word that you don't want to consider in your calculation - "the", "a", etc. Now, in your inner loop above, after the line that says line.split(), add the following:

if word in stop_words:
    continue

Through some experimentation, see if you can skip enough words to start getting interesting results. This might be pretty tedious. We'll see some better techniques for achieving this result in a couple weeks.
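For example, with the filter in place the counting loop might look like this (the particular stop words and sample line are just a starting point for your own experimentation):

```python
# A few common stop words to ignore; extend this list by experimentation.
stop_words = ["the", "a", "an", "and", "of", "to", "in"]

lines = ["the fox and the dog"]  # stands in for the open file object
word_counts = {}
for line in lines:
    for word in line.split():
        if word in stop_words:
            continue  # skip stop words entirely
        if word in word_counts:
            word_counts[word] = word_counts[word] + 1
        else:
            word_counts[word] = 1

print(sorted(word_counts))  # → ['dog', 'fox']
```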
Once you have your word_counts dictionary, you can sort it from largest to smallest word count like this:

sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)

Now sorted_words will contain the list of words in order of frequency. You can print out the top 10 most common like so:

for i in range(10):
    w = sorted_words[i]
    print(w + " " + str(word_counts[w]))

What are they? Can you run this a few times, and modify your stop_words variable until you start to get some interesting results? What can you find?
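To see the sorting idiom in isolation: sorted with key=word_counts.get orders the dictionary's keys by their counts, and reverse=True puts the largest first. A tiny sketch with made-up counts:

```python
# Sort a dictionary's keys by their values, largest count first.
word_counts = {"fox": 2, "dog": 1, "freedom": 5}
sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)
print(sorted_words)  # → ['freedom', 'fox', 'dog']
```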
Experiment with Exif data, as illustrated at the end of the class notes. Modify my code snippet to try to access different Exif properties for a directory full of images. (Note: please put this part in its own file named "part-2.py".)
(Optional, advanced.) At the end of the class notes, I suggested you might think about how to sort a list of images based on some Exif property. Accessing Exif data as I showed in the class notes, and using some techniques above, can you think how you might do this? For example, you could sort a list of images based on the time of day when they were taken. It could be interesting to create a collage of images ordered by time of day, or by shutter speed or f-stop.
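One hedged sketch of the sorting half of that idea: suppose you have already used the class-notes Exif code to pull each image's date-taken string into a dictionary mapping filename to timestamp (the filenames and timestamps below are invented for illustration). Exif stores timestamps as "YYYY:MM:DD HH:MM:SS", which happens to sort correctly as a plain string, so the same sorted(..., key=...) idiom from Part 1 applies:

```python
# Hypothetical filename -> Exif date-taken strings, gathered beforehand
# using the Exif-reading code from the class notes.
taken_at = {
    "beach.jpg":   "2023:02:18 17:45:10",
    "sunrise.jpg": "2023:02:18 06:12:03",
    "lunch.jpg":   "2023:02:18 12:30:44",
}

# Sort filenames by their timestamp string (chronological order).
by_time = sorted(taken_at, key=taken_at.get)
print(by_time)  # → ['sunrise.jpg', 'lunch.jpg', 'beach.jpg']
```

The same pattern works for any numeric Exif property, such as shutter speed or f-stop, once you have pulled it into a dictionary keyed by filename.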
Answer the following questions about hypothetical scenarios and what data structure you might use in each. (Note: Please put parts 3 and 4 in a Google Doc named "Parts 3 and 4".)