Due: Wednesday, March 20, 8pm
Answer the following questions about hypothetical scenarios and what data structure you might use in each. Please put your answers in a Google Doc named "Part 1", in your Google Drive folder for this week (named "Unit 2, Lesson 1").
You are working with a bunch of images, and each one has a timestamp of when it was taken. You want to display them in the order they were taken. Should you store the images in a list or a dictionary, and why?
You are working with Twitter data, and you are trying to organize a bunch of tweets from a small group of users. Each user has several tweets. You are going to want to access all the tweets for a given user, and later on you may want to add more tweets from these users to your data structure. How might you combine a list and dictionary to store this data?
You have data about a bunch of people and their primary address. You want to find out if any of the people live together. How might you combine list(s) and/or dictionary(s) to store this data to simplify the process of determining if any people live at the same address?
The class notes for this week explain how to use a dictionary as "buckets" to count the number of occurrences of each color in an image (see Example 1). Re-use this technique to count the number of occurrences of words in a large text document. (Note: Feel free to put parts a, b, and c below in their own .py files, or put them all in a single .py file named "part_2.py" and clearly label each part with Python comments: #.)
Start with a large text file. You can pick any text of your choosing or use the one I've provided below. If you would like to use your own, consider: something from the text section of archive.org, a large document from the Canterbury Corpus, or a freely available book from Project Gutenberg. Or feel free to use my example: given that the State of the Union address is happening March 7 this year, here is the State of the Union address from last year (2023). It's not that large of a file, but it will work for this exercise.
Whatever you would like to work with, make sure you can access it as plain text (not a PDF or some other format). Start by saving your text as a plain text file into your code folder for this week. Begin part_2.py with a line of code to open this file. If you are working with the example I've provided, that line would be the following:
file = open("unit2-lesson1-hw-sotu2023.txt","r")
Now you can proceed similarly to Example 1 from the class notes.
Create a dictionary variable, naming it whatever you'd like. In the class notes I named this color_counts, so for this case you might do the following:
word_counts = {}
Next, loop over the contents of this file. There are different options for how to do this. Analogous to working with images, one approach would be a nested loop, and one would be a single loop.
With images, the nested loop option iterated over x and y, the width and height of an image; with text files, a nested loop could be used to iterate over lines, and within each line, all words, like this:
for line in file:
    for word in line.split():
        # code in here will be repeated once per word
        # with the variable WORD holding each word of the file in turn
Here, split() in Python (and in many other programming languages) breaks up one string into a list of smaller strings. With no arguments, it breaks up the string on whitespace (spaces, tabs, and line breaks), but you can specify other characters. line and word are not special keywords; they are simply variable names I have made up to correspond to the values they will hold.
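To see what split() does, here is a minimal sketch on made-up strings (the sample text is an assumption for illustration, not taken from the speech):

```python
# A made-up line of text, standing in for one line of the file.
line = "the quick  brown fox\n"

# With no arguments, split() breaks on any run of whitespace
# (spaces, tabs, line breaks) and drops the empty pieces.
words = line.split()
print(words)

# With an argument, split() breaks on exactly that character.
parts = "red,green,blue".split(",")
print(parts)
```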
The single loop option would be to convert the file into one string with no line breaks, and to loop over the entire file by word. Like this:
for word in file.read().replace("\n"," ").split():
    # code in here will be repeated once per word
    # with the variable WORD holding each word of the file in turn
In this example, the command replace() is replacing all line breaks with a space. The symbol in most programming languages for a new line is \n. Once all line breaks are replaced with spaces, this snippet then breaks up the string by spaces into a list of words using split() as above.
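Here is a small sketch of the same idea on a made-up two-line string, used in place of file.read() (the sample text is an assumption). It also checks that split() with no arguments already treats line breaks as whitespace, so the replace() step mostly serves to make the intent explicit:

```python
# A made-up string with a line break, standing in for file.read().
text = "four score and\nseven years ago"

# Replace each line break with a space, then split into words.
words = text.replace("\n", " ").split()
print(words)

# split() with no arguments also splits on line breaks by itself.
print(text.split() == words)
```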
Now we can follow the logic of Example 1, but instead of using p and color_counts, use word and word_counts.
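As a rough sketch of the bucket pattern applied to words, here is one way it might look on a short made-up list rather than on the file. This assumes Example 1 checks whether a key is already in the dictionary before adding to it; the class notes may do it differently, and sample_words is a hypothetical name.

```python
# A short made-up list of words, standing in for the words of the file.
sample_words = ["we", "the", "people", "the", "union", "the", "people"]

word_counts = {}
for word in sample_words:
    if word in word_counts:
        # Seen before: add one to this word's bucket.
        word_counts[word] = word_counts[word] + 1
    else:
        # First time seeing this word: start its bucket at 1.
        word_counts[word] = 1

print(word_counts)
```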
Next, follow the four-line algorithm at the end of Example 1 that calculates the most frequent color, but instead do so for the most frequent word. (If you need a refresher here, refer back to the Unit 1, Lesson 2 HW, and our discussion of that in Unit 1, Lesson 3.)
Notice that Example 1 starts by setting most_frequent_color to the (0,0) pixel. In this case, you'll want to set that variable to the first word of the file, but that's a little tricky. So if your variable is called most_frequent_word, here are the two lines to do it:
file.seek(0)
most_frequent_word = file.readline().split()[0]
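The reason for file.seek(0) is that once a loop has read through the file, the read position sits at the end, so a further readline() returns an empty string. A minimal sketch of that behavior, using io.StringIO as an in-memory stand-in for a real file (the sample text is made up):

```python
import io

# An in-memory stand-in for an open text file (made-up contents).
file = io.StringIO("Mr. Speaker, Madam Vice President\nand fellow Americans\n")

# Looping over the file reads it all the way to the end.
for line in file:
    pass

# Nothing is left to read, so readline() returns an empty string.
at_end = file.readline()

# seek(0) moves the read position back to the start of the file.
file.seek(0)
most_frequent_word = file.readline().split()[0]
print(repr(at_end), most_frequent_word)
```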
What is the most frequent word from the speech?
Stop words. Probably the most frequent word is a little boring. Make a new list at the top of your file called stop_words, and add to it every word that you don't want to consider in your calculation - "the", "a", etc. Now, in your inner loop above, after the line that says line.split(), add the following:
if word in stop_words:
    continue
Through some experimentation, see if you can skip enough words to start getting interesting results. This might be pretty tedious. We'll see some better techniques for achieving this result in a couple weeks.
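A minimal sketch of how continue skips stop words, using a made-up list of lines in place of the file (both lines and the stop_words entries here are illustrative assumptions):

```python
# Words to leave out of the count; extend this list as you experiment.
stop_words = ["the", "a", "of", "and"]

# Made-up lines, standing in for the lines of the file.
lines = ["the state of the union", "a union of the people"]

word_counts = {}
for line in lines:
    for word in line.split():
        if word in stop_words:
            # Skip the rest of this loop body and move on to the next word.
            continue
        if word in word_counts:
            word_counts[word] = word_counts[word] + 1
        else:
            word_counts[word] = 1

print(word_counts)
```

Note that this comparison is case-sensitive, so "The" would not match "the" unless you add it to the list as well.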
Once you have built your word_counts dictionary, you can sort it from largest to smallest word count like this:
sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)
Now sorted_words will contain the list of words in order of frequency. You can print out the top 10 most common like so:
for i in range(10):
    w = sorted_words[i]
    print(w + " " + str(word_counts[w]))
What are they? Can you run this a few times, and modify your stop_words variable until you start to get some interesting results? What can you find?
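To see why the sorting line works: sorted() applied to a dictionary goes over its keys, and key=word_counts.get tells it to order those keys by their counts. A small sketch with made-up counts:

```python
# Made-up counts, standing in for the real word_counts dictionary.
word_counts = {"people": 2, "union": 5, "state": 1, "america": 3}

# Sort the words (the keys) by their counts, largest first.
sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)

# Slicing with [:3] avoids an error if there are fewer words than requested.
for w in sorted_words[:3]:
    print(w + " " + str(word_counts[w]))
```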
Experiment with Exif data, as illustrated at the end of the class notes. Modify my code snippet to try to access different Exif properties for a directory full of images. (Note: please put this part in its own file named "part_3.py".)
(Optional, advanced.) At the end of the class notes, I suggested you might think about how to sort a list of images based on some Exif property. Accessing Exif data as I showed in the class notes, and using some techniques above, can you think how you might do this? For example, you could sort a list of images based on the time of day when they were taken. It could be interesting to create a collage of images ordered by time of day, or by shutter speed or f-stop.
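One possible starting point, sketched under assumptions: suppose you have already collected each image's Exif timestamp into a dictionary, as strings in the usual Exif "YYYY:MM:DD HH:MM:SS" form. The file names and timestamps below are made up, time_of_day is a hypothetical helper, and reading the values from actual images is left to the approach in the class notes.

```python
from datetime import datetime

# Made-up file names and Exif-style timestamps (hypothetical data).
taken = {
    "beach.jpg": "2023:07:04 18:45:10",
    "breakfast.jpg": "2024:01:15 08:05:00",
    "skyline.jpg": "2022:11:30 13:20:30",
}

def time_of_day(name):
    # Parse the timestamp and keep only the time, ignoring the date.
    return datetime.strptime(taken[name], "%Y:%m:%d %H:%M:%S").time()

# Order the file names by time of day, earliest first.
by_time = sorted(taken, key=time_of_day)
print(by_time)
```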