Code as a Liberal Art, Spring 2023

Unit 2, Tutorial 1 Homework

Due: Tuesday, March 7, 8pm

  1. Review the class notes for this week.
  2. The class notes, in developing the code for counting the occurrences of colors in an image (see Example 1), discuss using a dictionary as "buckets" to count the number of occurrences of each color. Re-use this technique to count the number of occurrences of words in a large text document. (Note: Feel free to put parts a, b, and c in their own .py files, or put them all in a single .py file named "part-1.py" and clearly label each part with Python comments: #.)

    Start with a fairly large text. A couple weeks ago, the President delivered the 2023 State of the Union address, so let's use that. I will copy the text to a file for you to use here:

    Start by saving that file into the folder where you will be working. Then, in your Python program, open it with the following line:

    file = open("u2t1-2023-sotu.txt","r")
    
    Now you can proceed similarly to Example 1 from the class notes.

    Create a dictionary. Instead of color_counts you can call it whatever you'd like, for example:

    word_counts = {}
    

    Next, include a nested loop, but where Example 1 was looping over an image file by rows and columns, you will be looping over a text file, first by lines of text, and then by words within each line. Like this:

    for line in file:
        for word in line.split():
    

    Now, the inner loop will repeat once for each word in the file, and each time it does, the variable word will correspond to the current word. Follow the logic of Example 1 but instead of using p and color_counts, use word and word_counts.
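    To make that counting logic concrete, here is a minimal sketch of it, run on a short sample string instead of the speech file so it stands on its own:

    ```python
    # A small stand-in for the speech file, just to show the bucket logic.
    sample = """four score and seven years ago
    and seven years before that"""

    word_counts = {}
    for line in sample.split("\n"):     # stands in for: for line in file:
        for word in line.split():
            if word in word_counts:
                word_counts[word] += 1  # seen before: bump its bucket
            else:
                word_counts[word] = 1   # first time: start a new bucket

    print(word_counts["seven"])  # → 2
    ```

    In your program the outer loop will read lines from the file instead, but the inner bucket-counting logic is the same.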

    Next, you can follow the four-line algorithm from Example 1 that calculates the most frequent color, but instead do so for the most frequent word. Refer back to the Unit 1 class notes if you need help. Notice that Example 1 starts by setting most_frequent_color to the (0,0) pixel. In your case you'll want to set that variable to the first word of the file, but that's a little tricky. So if your variable is called most_frequent_word, here are the two lines to do it:

    file.seek(0)
    most_frequent_word = file.readline().split()[0]
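
    Here is a sketch of that four-line scan, run on a small hand-made word_counts dictionary rather than one built from the speech, so you can see the shape of the logic on its own:

    ```python
    # A tiny hand-made dictionary standing in for the real word_counts.
    word_counts = {"people": 3, "the": 5, "union": 2}

    # In your program this initial value comes from file.readline().split()[0];
    # here we just pick one word from the dictionary to start from.
    most_frequent_word = "people"
    for word in word_counts:
        if word_counts[word] > word_counts[most_frequent_word]:
            most_frequent_word = word

    print(most_frequent_word)  # → the
    ```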
    

    1. What is the most frequent word from the speech?

    2. Stop words. Probably the most frequent word is a little boring. Make a new list at the top of your file called stop_words, and add to it every word that you don't want to consider in your calculation - "the", "a", etc. Now, in your inner loop above, after the line that says line.split(), add the following:

      if word in stop_words:
          continue
      
      Through some experimentation, see if you can skip enough words to start getting interesting results. This might be pretty tedious. We'll see some better techniques for achieving this result in a couple weeks.
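
      Put together, the counting loop with stop words looks something like this sketch (the sample text and stop_words list here are just illustrative):

      ```python
      # Words we don't want to count; extend this list as you experiment.
      stop_words = ["the", "a", "and", "of", "to", "is", "are"]
      sample = "the state of the union is strong and the people are strong"

      word_counts = {}
      for word in sample.split():
          if word in stop_words:
              continue                  # skip boring words entirely
          if word in word_counts:
              word_counts[word] += 1
          else:
              word_counts[word] = 1

      print("the" in word_counts)   # → False
      print(word_counts["strong"])  # → 2
      ```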

    3. A different approach. In Unit 1 we talked about how you can implement a sorting algorithm yourself. This was a useful learning exercise, but Python also has some built-in sorting functions. (Some of you have already experimented with this.) Once you have the word_counts dictionary, you can sort it from largest to smallest word count like this:
      sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)
      
      Now sorted_words will contain the list of words in order of frequency. You can print out the top 10 most common like so:
      for i in range(10):
          w = sorted_words[i]
          print( w + " " + str(word_counts[ w ]))
      
      What are they? Can you run this a few times, and modify your stop_words variable until you start to get some interesting results? What can you find?
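
      Before running this on the whole speech, you can sanity-check the sorting idiom on a tiny hand-made dictionary:

      ```python
      # A tiny dictionary standing in for the real word_counts.
      word_counts = {"union": 4, "state": 7, "people": 6}

      # sorted() walks the dictionary's keys; key=word_counts.get ranks each
      # word by its count, and reverse=True puts the biggest counts first.
      sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)

      print(sorted_words)  # → ['state', 'people', 'union']
      ```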
  3. Experiment with Exif data, as illustrated at the end of the class notes. Modify my code snippet to try to access different Exif properties for a directory full of images. (Note: please put this part in its own file named "part-2.py".)

    (Optional, advanced.) At the end of the class notes, I suggested you might think about how to sort a list of images based on some Exif property. Accessing Exif data as I showed in the class notes, and using some techniques above, can you think how you might do this? For example, you could sort a list of images based on the time of day when they were taken. It could be interesting to create a collage of images ordered by time of day, or by shutter speed or f-stop.
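
    One way to sketch the time-of-day idea (assuming the Pillow library; the folder name and helper functions here are just placeholders, so check the tag-access calls against the Exif example in the class notes):

    ```python
    # Sketch: sort images in a folder by the time of day they were taken,
    # using the Exif DateTimeOriginal tag. Assumes Pillow is installed.
    import os
    from PIL import Image

    def time_taken(path):
        """Return the Exif DateTimeOriginal string, e.g. '2023:03:07 14:05:22',
        or None if the file isn't an image or has no such tag."""
        try:
            exif = Image.open(path).getexif()
        except OSError:
            return None                      # not an image Pillow can read
        return exif.get_ifd(0x8769).get(36867)  # 36867 = DateTimeOriginal

    def time_of_day(exif_datetime):
        """Pull just the 'HH:MM:SS' part out of an Exif datetime string."""
        return exif_datetime.split()[1]

    folder = "photos"  # placeholder: your directory full of images
    if os.path.isdir(folder):
        paths = [os.path.join(folder, f) for f in os.listdir(folder)]
        tagged = [p for p in paths if time_taken(p) is not None]
        by_time = sorted(tagged, key=lambda p: time_of_day(time_taken(p)))
        for p in by_time:
            print(time_of_day(time_taken(p)), p)
    ```

    Notice that this combines the Exif access from the class notes with the sorted() idiom from part 2: the key function decides what each image is ranked by.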

  4. Answer the following questions about hypothetical scenarios and what data structure you might use in each. (Note: Please put parts 3 and 4 in a Google Doc named "Parts 3 and 4".)

    1. You are working with a bunch of images, and each one has a timestamp of when it was taken. You want to display them in the order they were taken. Should you store the images in a list or a dictionary, and why?
    2. You are working with Twitter data, and you are trying to organize a bunch of tweets from a small group of users. Each user has several tweets. You are going to want to access all the tweets for a given user, and later on you may want to add more tweets from these users to your data structure. How might you combine a list and dictionary to store this data?
    3. You have data about a bunch of people and their primary address. You want to find out if any of the people live together. How might you combine list(s) and/or dictionary(s) to store this data to simplify the process of determining if any people live at the same address?

  5. Think about a question you might want to ask using some of these methods. It can be related to another class - something you're reading, or writing, or another project you might be investigating that could perhaps benefit from quantitative algorithmic analysis. Describe this question. Say what the input data is for this question. Consider working with this as the final project for unit 2.