Code as a Liberal Art, Spring 2024

Unit 2, Lesson 1 Homework

Due: Wednesday, March 20, 8pm

  1. Review the class notes for this week.
  2. Answer the following questions about hypothetical scenarios and what data structure you might use in each. Please put your answers in a Google Doc named "Part 1", in your Google Drive folder for this week (named "Unit 2, Lesson 1").

    1. You are working with a bunch of images, and each one has a timestamp of when it was taken. You want to display them in the order they were taken. Should you store the images in a list or a dictionary, and why?

    2. You are working with Twitter data, and you are trying to organize a bunch of tweets from a small group of users. Each user has several tweets. You are going to want to access all the tweets for a given user, and later on you may want to add more tweets from these users to your data structure. How might you combine a list and dictionary to store this data?

    3. You have data about a bunch of people and their primary address. You want to find out if any of the people live together. How might you combine lists and/or dictionaries to store this data to simplify the process of determining whether any people live at the same address?

  3. The class notes for this week explain how to use a dictionary as "buckets" to count the number of occurrences of each color in an image (see Example 1). Re-use this technique to count the number of occurrences of words in a large text document. (Note: Feel free to put parts 1, 2, and 3 below in their own .py files, or put them all in a single .py file named "part_2.py" and clearly label each part with Python comments: #.)

    Start with a large text file. You can pick any text of your choosing or use the one I've provided below. If you would like to use your own, consider: something from the text section of archive.org, a large document from the Canterbury Corpus, or a freely available book from Project Gutenberg. Or feel free to use my example: given that the State of the Union address is happening March 7 this year, here is the State of the Union address from last year (2023). It's not that large of a file, but it will work for this exercise.

    Whatever you would like to work with, make sure you can access it as plain text (not a PDF or some other format). Start by saving your text as a plain text file into your code folder for this week. Begin part_2.py with a line of code to open this file. If you are working with the example I've provided, that line would be the following:

    file = open("unit2-lesson1-hw-sotu2023.txt","r")
    
    Now you can proceed similarly to Example 1 from the class notes.

    Create a dictionary variable, naming it whatever you'd like. In the class notes I named this color_counts, so in this case you might do the following:

    word_counts = {}
    

    Next, loop over the contents of this file. There are different options for how to do this. Analogous to working with images, one approach would be a nested loop, and one would be a single loop.

    With images, the nested loop option iterated over x and y, the width and height of an image; with text files, a nested loop could be used to iterate over lines, and within each line, all words, like this:

    for line in file:
        for word in line.split():
            # code in here will be repeated once per word,
            # with the variable word holding each word of the file in turn
    

    Here, split() in Python (many other programming languages have something similar) breaks one string up into a list of smaller strings. With no arguments, it splits the string on whitespace (spaces, tabs, and line breaks), but you can pass it a different separator character instead.
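
    For example:

    words = "this is a test".split()
    # words now holds the list ["this", "is", "a", "test"]

    parts = "one,two,three".split(",")
    # parts now holds the list ["one", "two", "three"]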

    line and word are not special keywords; they are simply variable names I have made up to correspond to the values they will hold.

    The single loop option would be to convert the file into one string with no line breaks, and then to loop over the entire file word by word, like this:

    for word in file.read().replace("\n"," ").split():
        # code in here will be repeated once per word,
        # with the variable word holding each word of the file in turn
     

    In this example, the replace() method replaces every line break with a space. The symbol for a new line in most programming languages is \n. Once all line breaks are replaced with spaces, this snippet breaks the resulting string up by spaces into a list of words using split(), as above.
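
    For example:

    text = "line one\nline two".replace("\n", " ")
    # text now holds "line one line two"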

    Now we can follow the logic of Example 1 but instead of using p and color_counts, use word and word_counts.
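
    Using the nested loop version, that bucket-counting logic would look something like this (just a sketch - if your Example 1 code is shaped a little differently, follow that instead):

    for line in file:
        for word in line.split():
            if word in word_counts:
                word_counts[word] = word_counts[word] + 1
            else:
                word_counts[word] = 1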

    Next, follow the four-line algorithm at the end of Example 1 that calculates the most frequent color, but instead do so for the most frequent word. (If you need a refresher here, refer back to the Unit 1 Lesson 2 HW, and our discussion of that in Unit 1, Lesson 3.) Notice that Example 1 starts by setting most_frequent_color to the (0,0) pixel. In this case, you'll want to set that variable to the first word of the file, but that's a little tricky. So if your variable is called most_frequent_word, here are the two lines to do it:

    file.seek(0)  # rewind to the start of the file, since the loop above read to the end
    most_frequent_word = file.readline().split()[0]  # the first word of the first line
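
    Then the comparison loop itself would look something like this (a sketch - adapt the variable names to match your Example 1 code):

    for word in word_counts:
        if word_counts[word] > word_counts[most_frequent_word]:
            most_frequent_word = word
    print(most_frequent_word)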
    

    1. What is the most frequent word from the speech?

    2. Stop words. Probably the most frequent word is a little boring. Make a new list at the top of your file called stop_words, and add to it every word that you don't want to consider in your calculation - "the", "a", etc. Now, in your inner loop above, after the line that says line.split(), add the following:

      if word in stop_words:
          continue
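
      For example, a starting list might look like this (the words here are just a first guess - grow the list as you experiment):

      stop_words = ["the", "a", "an", "and", "of", "to", "in", "is", "that"]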
      
      Through some experimentation, see if you can skip enough words to start getting interesting results. This might be pretty tedious. We'll see some better techniques for achieving this result in a couple of weeks.

    3. A different approach. In Unit 1 we talked about how you can implement a sorting algorithm yourself. This was a useful learning exercise, but Python also has some built-in sorting functions. (Some of you have already experimented with this.) Once you have the word_counts dictionary, you can sort it from largest to smallest word count like this:
      sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)
      
      Now sorted_words will contain the list of words in order of frequency. You can print out the top 10 most common like so:
      for i in range(10):
          w = sorted_words[i]
          print( w + " " + str(word_counts[ w ]))
      
      What are they? Can you run this a few times, and modify your stop_words variable until you start to get some interesting results? What can you find?
  4. Experiment with Exif data, as illustrated at the end of the class notes. Modify my code snippet to try to access different Exif properties for a directory full of images. (Note: please put this part in its own file named "part_3.py".)
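
    As a reminder, the general shape of that approach looks something like the sketch below. This sketch uses the Pillow library, and the folder name is a placeholder - if my class-notes snippet looks a little different, follow that version instead.

    import os
    from PIL import Image
    from PIL.ExifTags import TAGS  # maps numeric Exif tag ids to readable names

    folder = "images"  # placeholder: use the name of your own image directory

    for filename in os.listdir(folder):
        path = os.path.join(folder, filename)
        try:
            img = Image.open(path)
        except OSError:
            continue  # skip anything in the folder that isn't an image
        exif = img.getexif()
        for tag_id, value in exif.items():
            name = TAGS.get(tag_id, tag_id)  # fall back to the raw id if unknown
            print(filename, name, value)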

    (Optional, advanced.) At the end of the class notes, I suggested you might think about how to sort a list of images based on some Exif property. Using the Exif-access code from the class notes, together with some of the techniques above, can you work out how you might do this? For example, you could sort a list of images based on the time of day when they were taken. It could be interesting to create a collage of images ordered by time of day, or by shutter speed or f-stop.
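
    If you try it, one possible shape combines the Exif sketch above with the sorted() technique from part 3. (This is only a sketch: tag id 306 is the standard Exif DateTime tag, and this assumes the folder contains only images.)

    def taken_at(path):
        # return the Exif DateTime string ("YYYY:MM:DD HH:MM:SS"), or "" if missing
        return Image.open(path).getexif().get(306, "")

    paths = [os.path.join(folder, f) for f in os.listdir(folder)]
    by_time = sorted(paths, key=taken_at)
    # by_time now holds the image paths in chronological order

    Because Exif stores dates in a fixed "YYYY:MM:DD HH:MM:SS" format, sorting those strings alphabetically also sorts them chronologically.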

  5. Think about a question you might want to ask using some of these methods. It can be related to another class - something you're reading, or writing, or another project you might be investigating that could perhaps benefit from quantitative algorithmic analysis. Desribe this question. Say what the input data is for this question. Put your thoughts here in a Google Doc named "Part 4". Consider working with this as your Unit 2 project.