Code as a Liberal Art, Spring 2021

Unit 2, Exercise 1 homework

Due: Tuesday, March 2, 8pm

  1. Review the class notes for this week.
  2. The class notes, in developing Example 1, discuss using a dictionary as "buckets" to count the frequencies of colors in an image. Re-use this technique to count the frequencies of words in a large text document.

    Start with a fairly large text. Perhaps you can look through the collections of texts at archive.org. For example, here is a US military planning document from 1984. From that page, click on "Full Text" on the right-hand side to access the complete, scanned contents of the document as plain text.

    Refer back to the code snippet in homework 2 which opens a text file and loops over it. That code generates a list of words, but instead of building a list of words, you will now create a dictionary that keeps track of word frequencies (word counts).

    Start by creating a dictionary. You can call it whatever you'd like, for example:

    word_counts = {}
    
    Then include the nested for loops from homework 2, which loop over the file line by line and over each word in each line. Within those loops, follow the pattern of Example 1 from the class notes: check if word in word_counts. If so, increment the count for that word; if not, add word to the word_counts dictionary with the value 1.
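    As a rough sketch of that pattern (not necessarily identical to the homework 2 snippet or to Example 1), the loop might look something like the following. The lines list is a made-up stand-in for a file so that the sketch runs on its own; with a real text you would loop over the opened file instead.

```python
# Made-up stand-in for the lines of a text file, so this runs without one.
# With a real document you would loop over the opened file instead.
lines = [
    "the plan is the plan",
    "a plan is not the territory",
]

word_counts = {}

for line in lines:
    # split() breaks the line into words wherever there is whitespace
    for word in line.split():
        if word in word_counts:
            # Seen before: add one to this word's bucket
            word_counts[word] = word_counts[word] + 1
        else:
            # First time we see this word: start its bucket at 1
            word_counts[word] = 1

print(word_counts)
```

    Note that this simple version treats "Plan" and "plan," as different words from "plan". Whether to lowercase the text or strip punctuation first is a choice you can make for your own document.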

    After you have completed this, you can do what Example 1 from the class notes does and print() out len(word_counts) to display the number of distinct words in the document (each word appears in the dictionary only once, so this is not the total word count), or use the code snippet at the end that prints the most frequent word.

    In Unit 1 we talked about how you might implement a sorting algorithm yourself, but Python has some built-in sorting functions. (Some of you have already experimented with this.) Once you have the word_counts dictionary, you can sort it like this:

    sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)
    
    Now, sorted_words will contain the list of words ordered from most frequent to least frequent. So sorted_words[0] is the most frequent word, sorted_words[1] is the second most frequent, and so on. You could then say:

    w = sorted_words[0]
    print( word_counts[w] )
    
    to print the word count of the most frequent word. Print out a few of the most frequent words and their word counts.
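    One possible way to print several of them at once is to loop over the first few entries of sorted_words. This is only a sketch: the counts in this word_counts dictionary are invented for illustration, and in your own code you would use the dictionary you built from your document.

```python
# Invented counts, standing in for the dictionary built from your text
word_counts = {"the": 12, "of": 7, "plan": 5, "and": 9, "forces": 2}

# Sort the words (the dictionary keys) by their counts, largest first
sorted_words = sorted(word_counts, key=word_counts.get, reverse=True)

# sorted_words[:3] is a slice containing just the first three words
for w in sorted_words[:3]:
    print(w, word_counts[w])
```

    Changing the 3 in the slice changes how many words are printed.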

  3. Experiment with Exif data, as illustrated at the end of the class notes. Modify my code snippet to try to access different Exif properties for a directory full of images.

    (Optional, advanced.) At the end of the class notes, I suggested you might think about how to sort a list of images based on some Exif property. Accessing Exif data as I showed in the class notes, and using some of the techniques above, can you think of how you might do this? For example, you could sort a list of images based on the time of day when they were taken. It could be interesting to create a collage of images ordered by time of day, or by shutter speed or f-stop.
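    One way to approach this, sketched under some assumptions, is to reuse the sorted() technique from above with a dictionary that maps each filename to an Exif value. The filenames and timestamps below are made up, and the sketch assumes the timestamps are strings in the usual Exif "YYYY:MM:DD HH:MM:SS" form. In practice you would fill the dictionary by reading each image's Exif data as shown in the class notes.

```python
# Made-up data: filename -> Exif timestamp string.
# Real values would come from reading each image's Exif data.
image_times = {
    "beach.jpg": "2021:02:14 16:45:10",
    "breakfast.jpg": "2021:02:20 08:05:33",
    "skyline.jpg": "2021:01:30 21:12:00",
}

def time_of_day(filename):
    # Keep only the "HH:MM:SS" part, so the date is ignored
    return image_times[filename].split(" ")[1]

# Sort the filenames by time of day, earliest first
sorted_images = sorted(image_times, key=time_of_day)
print(sorted_images)
```

    Because the hours, minutes, and seconds are zero-padded, comparing the time strings alphabetically gives the same order as comparing the times themselves. A numeric property such as shutter speed or f-stop would need to be converted to a number before sorting.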

  4. Answer the following questions about hypothetical scenarios and what data structure you might use in each.

    1. You are working with a bunch of images, and each one has a timestamp of when it was taken. You're going to want to display them in the order they were taken. Would you use a list or a dictionary, and why?
    2. You are working with Twitter data, and you are trying to organize a bunch of tweets from a small group of users. Each user has several tweets. You are going to want to access all the tweets for a given user, and later on you may want to add more tweets from these users to your data structure. How might you combine a list and dictionary to store this data?
    3. You have data about a bunch of people and their primary addresses. You want to find out if any of the people live together. How might you combine lists and/or dictionaries to store this data to simplify the process of determining if any people live at the same address?

  5. Think about a question you might want to ask using some of these methods. It can be related to another class: something you're reading, or writing, or another project you might be investigating that could perhaps benefit from quantitative algorithmic analysis. Describe this question. Say what the input data is for this question. Consider working with this as the final project for unit 2.