Code as a Liberal Art, Spring 2021

Unit 2, Exercise 1 lesson — Wednesday, February 24

Data / Metadata / Data structure(s)

Digital stuff is often thought of as divided into algorithms and data. Sometimes these categories are described as, on the one hand, computer programs of executable instructions or actions, and on the other hand, the objects that are acted on. We might even go as far as calling this a kind of "digital ontology", in the sense that all digital stuff, all digital code, can plausibly be divided into one of these two categories.

In Unit 1 we looked at some algorithms, made some patterns, and generated some digital files, but we didn't really think about those files as data, per se. As I've mentioned, the graphical patterns that you generated looked somewhat like data visualizations, but there wasn't actually any data there that we were visualizing. It's almost like they were visualizations of pure process, visualizing the patterns of algorithm.

For Unit 2, we are going to turn our attention more toward data. Where Unit 1 focused on formalist digital explorations and using algorithms for creative, generative production, Unit 2 will shift to how data is gathered, how it is structured, and how we can use algorithms as techniques of analysis.

I poached this image from an online course titled "Introduction to Digital Humanities" at Harvard. I like that it looks like it's building on the work of Unit 1, in the direction of Unit 2. We will be talking about the digital humanities in the coming weeks.

Data comes from the Latin dare, meaning "to give", and more specifically from the past participle of this verb, conjuring a sense of "something given." Data is the plural of datum, as if to say "datums". This sounds awkward to our ears today, as we usually speak of data as a singular. (We'd probably say "There is much data", not "There are many data", even though the latter is more strictly grammatically correct.)

But data usually evokes a sense of something that is or was there, a given, a starting point. It is typically thought of as something to gather, to mine, to extract, harvest, or reap; and then something to structure, to organize, and to analyze.

In this unit we will touch on programming operations for structuring data, techniques for gathering data, and algorithms for analyzing data.

Data structure: lists, tuples, and dictionaries

The topic of data structures is a big one in computer science. Many undergraduate computer science curricula dedicate an entire semester just to learning various data structures, when they should be used, and how they operate. Finding and optimizing data structures is also an ongoing area of computer science research today. We will be thinking about data structures throughout the semester (indeed, we already have), but we will only spend a few weeks with them as the center of our attention, so our treatment will necessarily be brief. We will focus on a few of the most commonly used data structures, how they are used in Python, how they can be combined, and when you might want to use one over another.

Lists

We have already been working with one common Python data structure: lists. A list is just an ordered sequence of values. In Python, those values can be heterogeneous. In other words, they can all be different types. Some other programming languages are not so forgiving and require all items in a list to be the same kind of thing, i.e., all numbers, or all text. But in Python we can mix it up.

Some of the operations of lists: you can create a list with empty square brackets and give it a name; you can use square bracket notation to specify the items in a list; you can use print() to print a list with all its contents; you can use append() to add new items to the end of a list; you can also use square bracket notation to access a specific value in a list based on its position in that list; and you can perform other operations like getting the size of a list with the len() command. Have a look at how you can do each of these things in the Python shell. (Remember: please pay careful attention to the command prompts in my notes here. $ indicates the default command prompt that you get in Terminal, while >>> is the Python shell, which you can run by typing python (or python3) at the command prompt.)

$ python
>>> my_list = []
>>> my_list = [101, 13, 42, "a", "N", "xyz"]
>>> print(my_list)
[101, 13, 42, 'a', 'N', 'xyz']
>>> my_list.append("spaghetti")
>>> print(my_list)
[101, 13, 42, 'a', 'N', 'xyz', 'spaghetti']
>>> my_list[0]
101
>>> my_list[5]
'xyz'
>>> len(my_list)
7

Accessing a specific item in a list by using square bracket notation is called indexing the list. The number in brackets refers to the position in the list of the value that you want, and is called the index. Positions are counted starting from 0, which is why my_list[0] above gave us the first item. You can also use variables as the index:

>>> i = 3
>>> my_list[i]
'a'

Using variables as the list index like this is useful because then you can reference the items from a list within a loop, for example, a process I usually describe as "looping over a list". Here I'm looping over just a part of this list, to illustrate that one needn't loop over the whole thing:

>>> for n in range(3,6):
...   print(my_list[n])
... 
a
N
xyz
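
A loop also doesn't have to go through index numbers at all. As a small sketch (reusing the same values as my_list above, and using a variable named positions that is just for illustration), Python can hand you each item directly, and the built-in enumerate() gives you the position and the item together:

```python
my_list = [101, 13, 42, "a", "N", "xyz", "spaghetti"]

# Loop over the items directly, without using an index.
for item in my_list:
    print(item)

# enumerate() pairs each position with the item at that position.
positions = []
for i, item in enumerate(my_list):
    positions.append((i, item))

print(positions[3])
```

Looping with range() is still the better fit when you only want part of the list or when you need to do arithmetic with the position.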

Thus far, you have seen lists used as a long sequence of values representing the data of a digital image. In this case, each value of the list represented a pixel color, usually specified as (red, green, blue). So if an image was 10 pixels wide and 10 pixels tall, this list of pixel color data would contain 100 values in total (10 x 10).

It is important to keep in mind that data structures can contain other data structures. Because a digital image is usually a two dimensional object (it has a width and height), it usually makes sense to think about the pixels of a digital image as stored in a two dimensional data structure. To achieve this, instead of one list of many pixel values, one could have a list corresponding to the rows, where each row is represented as another list, of columns. In the Python shell, for a white and very tiny 3 x 3 image, that might look like this:

>>> img = []
>>> img.append( [ (255,255,255), (255,255,255), (255,255,255) ] )
>>> img.append( [ (255,255,255), (255,255,255), (255,255,255) ] )
>>> img.append( [ (255,255,255), (255,255,255), (255,255,255) ] )
>>> img
[ [(255, 255, 255), (255, 255, 255), (255, 255, 255)],
  [(255, 255, 255), (255, 255, 255), (255, 255, 255)],
  [(255, 255, 255), (255, 255, 255), (255, 255, 255)] ]

Notice that now the list contains other brackets. Each inner pair of square brackets corresponds to one row of three pixels. (A list within a list.)
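
To connect this back to the one long list of pixel values from before, here is a minimal sketch of cutting a flat list into rows. The names flat and rows and the pixel values are made up for illustration, and it assumes the flat list stores the image one row after another:

```python
width = 3
height = 3

# A flat list of 9 pixel values (hypothetical data: all white).
flat = [(255, 255, 255)] * (width * height)
flat[4] = (0, 0, 0)  # make the middle pixel black

# Cut the flat list into rows of `width` pixels each.
# flat[start:end] means "the items from position start up to,
# but not including, position end".
rows = []
for y in range(height):
    start = y * width
    rows.append(flat[start:start + width])

print(rows[1][1])
```

Position 4 in the flat list ends up at row 1, column 1, because 4 = 1 * width + 1.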

Question from class: Does Python somehow know this is an image? Good question. No, Python does not think of this data structure at this point as an image in any special way. It is just a data structure for storing numbers, that we are thinking of as pixel values. We could use this data, structured in this way, to pass in to a library like Pillow to make it into an image. But data has an abstract quality. We could also use this same data structure if we were implementing some kind of spreadsheet application, for example. Or really, any other kind of data that had a "two dimensional" structure. Say, if we were implementing a chess or checkers game.

You could access a single row of this data with a regular index:

>>> img[0]
[(255, 255, 255), (255, 255, 255), (255, 255, 255)]

and you could access an item in that row like this: (row is not a special keyword here)

>>> row = img[0]
>>> row[1]
(255, 255, 255)

You could shortcut this by using two indices at once, like this:

>>> img[0][1] = (0,0,0)
>>> img
[ [(255, 255, 255), (0, 0, 0), (255, 255, 255)],
  [(255, 255, 255), (255, 255, 255), (255, 255, 255)],
  [(255, 255, 255), (255, 255, 255), (255, 255, 255)] ]

Notice that I've changed the second item in the first row, as indicated by my indices.
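
Two indices pair naturally with two nested loops. The following is only a sketch (the names white, black, and black_count are not from anything above) that visits every pixel in a tiny list-of-rows image like this one and counts the black ones:

```python
white = (255, 255, 255)
black = (0, 0, 0)

# The same 3 x 3 structure as above: a list of rows.
img = [
    [white, black, white],
    [white, white, white],
    [white, white, white],
]

black_count = 0
for y in range(len(img)):          # y picks the row
    for x in range(len(img[y])):   # x picks the column within that row
        if img[y][x] == black:
            black_count = black_count + 1

print(black_count)
```

Note the order: with a list of rows, the first index is the row and the second is the column.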

Tuples

Each of these pixel values with its red, green, and blue components (or hue, saturation, and brightness if you're working with that mode) is called a tuple. This weird word comes from thinking about: a pair, a triple, a quadruple, quintuple, sextuple, septuple, octuple, etc, etc, and wanting to generalize that idea. Hence, an n-tuple, or simply, tuple.

The primary difference between a list and a tuple is that a tuple cannot change. It's called immutable.

Think of a list as a dynamic way to store an ordered sequence of values that may grow or shrink as data is added, removed, filtered, sorted, or re-organized. A tuple, by contrast, is a way to represent a fixed group of two or three values together, like the x, y coordinates of a point on a grid.
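
Here is a short sketch of what immutable means in practice. The variable names are just for illustration, and it uses try / except, a way of catching an error instead of letting the program stop, which we have not covered yet:

```python
pixel = (255, 128, 0)

# A tuple can be "unpacked" into separate variables.
(r, g, b) = pixel

# Trying to change one component of a tuple raises a TypeError.
try:
    pixel[0] = 0
    changed = True
except TypeError:
    changed = False

# To get a different color, build a new tuple instead.
darker = (r // 2, g // 2, b // 2)

print(changed)
print(darker)
```

The original pixel is untouched; darker is a separate, new tuple.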

Dictionaries

One of the most powerful data structures in Python is the dictionary. Like a regular dictionary, in which you can look up a word and get a definition, in a Python dictionary (often called a dict) you can look up a key and access a value. A dictionary creates a mapping, between keys and values; or, we might say it creates a set of key-value pairs.

It has some operations in common with lists, and some differences. A dictionary is created with curly braces; new key-value pairs are added by specifying the key in square brackets and assigning a value; values are also retrieved using square brackets like with lists, but the indices need not be numbers; you can get the number of key-value pairs in the dictionary with len(); and you can check if a key is already in a dictionary with the in operator:

>>> d = {}
>>> d["hair"] = "orange"
>>> d["species"] = "monster"
>>> d["name"] = "Gritty"
>>> print(d)
{'hair': 'orange', 'species': 'monster', 'name': 'Gritty'}
>>> d["hair"]
'orange'
>>> len(d)
3
>>> "name" in d
True
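
Asking for a key that is not in the dictionary is an error (a KeyError), which is why the in check matters. As a minimal sketch, using a made-up key "age" that Gritty's dictionary does not have, here are two ways of handling a missing key, plus what happens when you assign to a key that already exists:

```python
d = {"hair": "orange", "species": "monster", "name": "Gritty"}

# Check first, then look up.
if "age" in d:
    age = d["age"]
else:
    age = "unknown"

# get() does the same check in one step: it returns the value if the
# key exists, and otherwise returns the second argument.
age_again = d.get("age", "unknown")
hair = d.get("hair", "unknown")

# Assigning to an existing key replaces its value; it does not add a pair.
d["hair"] = "bright orange"

print(age, age_again, hair, len(d))
```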

Just like with lists and tuples, you can combine dictionaries with lists in all sorts of ways. First I'll put a list into my dictionary, then I'll create another dictionary and put them both in a list:

>>> d["lucky numbers"] = [ 4, 11, 42, 101 ]
>>> d
{ 'hair': 'orange', 'species': 'monster', 'name': 'Gritty',
  'lucky numbers': [4, 11, 42, 101] }
>>> 
>>> d2 = {}
>>> d2["hair"] = "brown"
>>> d2["species"] = "human"
>>> d2["name"] = "Rory"
>>> 
>>> ds = [ d, d2 ]
>>> ds
[ { 'hair': 'orange', 'species': 'monster', 'name': 'Gritty',
    'lucky numbers': [4, 11, 42, 101] },
  { 'hair': 'brown', 'species': 'human', 'name': 'Rory' } ]
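
Getting values back out of a combined structure is a matter of chaining the lookups together. A small sketch, rebuilding the same two dictionaries (the loop variable creature and the names list are only for illustration):

```python
d = {"hair": "orange", "species": "monster", "name": "Gritty",
     "lucky numbers": [4, 11, 42, 101]}
d2 = {"hair": "brown", "species": "human", "name": "Rory"}
ds = [d, d2]

# Loop over the list; each item is a whole dictionary.
names = []
for creature in ds:
    names.append(creature["name"])
    # Not every dictionary has every key, so check before using it.
    if "lucky numbers" in creature:
        print(creature["name"] + " has " +
              str(len(creature["lucky numbers"])) + " lucky numbers.")

# List index, then dictionary key, then list index again.
first_lucky = ds[0]["lucky numbers"][0]
print(first_lucky)
```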

This all might seem fairly abstract at the moment, but we'll work through some examples today which will hopefully add some concrete understanding and clarity.

We can use dictionaries like buckets, to count things like colors in an image.

Using data structures to gather data

For the moment, let's continue working with images since you've been getting familiar with the Pillow image library. Let's calculate a histogram for the colors of an image. A histogram is usually a chart, where the x-axis is divided into many buckets, and the y-axis indicates how many items in some data set fall into each bucket. (We talked about this briefly when we discussed randomness.) In this case, we won't actually draw this chart, but conceptually, the horizontal axis represents all pixel colors in an image, and for each one, we want to count how many times it occurs. This is a great use for a dictionary.

Start by importing Pillow, opening an image file, and getting its size:

from PIL import Image

img = Image.open("fire.jpg")

(width,height) = img.size

(Now hopefully you have some insight here and can see that img.size gives us a tuple, which is why we use the parentheses on the left of the equal sign.)

Next, let's create a new dictionary. The keys of this dictionary will be colors (represented by tuples), and the value corresponding to each tuple will be how many times it has occurred:

from PIL import Image

img = Image.open("fire.jpg")

(width,height) = img.size

color_counts = {}

Let's start by looping over the image and getting the color value for each pixel:

from PIL import Image

img = Image.open("fire.jpg")

(width,height) = img.size

color_counts = {}
for x in range(width):
    for y in range(height):
        p = img.getpixel((x,y))

Now, we'll use that pixel value p as the key in our dictionary. If this color (this key) is already in the dictionary, then we want to increment its value. But if the key is not in the dictionary (else), we'll simply add it to the dictionary with an initial value of 1:

from PIL import Image

img = Image.open("fire.jpg")

(width,height) = img.size

color_counts = {}
for x in range(width):
    for y in range(height):
        p = img.getpixel((x,y))
        if p in color_counts:
            color_counts[p] = color_counts[p] + 1
        else:
            color_counts[p] = 1
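
The same counting pattern can be tried without an image file at all. This is only a sketch on a short, made-up list of pixel tuples. It also swaps the if / else for the dictionary's get() method, which returns the current count if the key exists and 0 if it does not:

```python
# Hypothetical pixel data standing in for an image.
pixels = [
    (255, 0, 0), (255, 0, 0), (0, 0, 255),
    (255, 0, 0), (0, 0, 255), (0, 0, 0),
]

color_counts = {}
for p in pixels:
    # get(p, 0) is the count so far, or 0 the first time we see p.
    color_counts[p] = color_counts.get(p, 0) + 1

print(color_counts)
```

Both versions build the same dictionary; the if / else version just spells out the two cases.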

That's all there is to it for calculating color frequencies. But what if we now wanted to determine the most frequent color of this image? Think back to the work from week 2 about finding the largest or smallest number in a list. We can re-use that algorithm here:

from PIL import Image

img = Image.open("fire.jpg")

(width,height) = img.size

color_counts = {}
for x in range(width):
    for y in range(height):
        p = img.getpixel((x,y))
        if p in color_counts:
            color_counts[p] = color_counts[p] + 1
        else:
            color_counts[p] = 1

print( "There were " + str( len(color_counts) ) +
       " total colors in this image.")

most_frequent_color = img.getpixel((0,0))
for p in color_counts:
    if color_counts[p] > color_counts[most_frequent_color]:
        most_frequent_color = p

And that's all there is to it. Now we can just print out the results of our calculation:

Example 1: Counting color frequencies in an image.

from PIL import Image

img = Image.open("fire.jpg")

(width,height) = img.size

color_counts = {}
for x in range(width):
    for y in range(height):
        p = img.getpixel((x,y))
        if p in color_counts:
            color_counts[p] = color_counts[p] + 1
        else:
            color_counts[p] = 1

print( "There were " + str( len(color_counts) ) +
       " total colors in this image.")

most_frequent_color = img.getpixel((0,0))
for p in color_counts:
    if color_counts[p] > color_counts[most_frequent_color]:
        most_frequent_color = p

print("The most frequent color is: " + str(most_frequent_color) +
      " and it occurs " +
      str(color_counts[most_frequent_color]) + " times.")
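
Python's standard library also has shortcuts for both steps. This is not what the example above does; it is an alternative sketch, again on made-up pixel data, using collections.Counter to do the counting and the built-in max() to find the key with the largest value:

```python
from collections import Counter

# Hypothetical pixel data standing in for an image.
pixels = [
    (255, 0, 0), (255, 0, 0), (0, 0, 255),
    (255, 0, 0), (0, 0, 255), (0, 0, 0),
]

# Counter builds the color -> count dictionary in one step.
color_counts = Counter(pixels)

# max() compares the keys by their counts rather than by the keys themselves.
most_frequent_color = max(color_counts, key=color_counts.get)

# most_common(n) returns the n most frequent (key, count) pairs.
top_two = color_counts.most_common(2)

print(most_frequent_color, color_counts[most_frequent_color])
print(top_two)
```

Writing the loops out by hand, as in Example 1, is still worth doing at least once, since it shows what these shortcuts are doing for you.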

For the homework, I will be challenging you to use the code from the Week 2 homework that opened up a text file and split it into words, and to merge it with the above code example and modify it so it determines the most frequent words in a text.

Data and metadata

This week I asked you to read an article about metadata by Matthew Mayernik. In it, he explains that the term "metadata" likely originated in the late 1960s "in the context of computer system design to refer to the use of one data element to describe or represent some characteristic of another data element." [1] In other words, it's data about other data.

We might debate the usefulness of this distinction (after all, isn't it all, just, data?) and perhaps there is some ideological baggage that this comes loaded with in terms of computer scientists wanting to create this distinction, as if to say that as a field, it does not want to get its hands dirty mucking about with actual data itself, which might be the messy, poorly structured stuff of culture and humanistic communication. But it is a distinction that seems to have some usefulness, and while Mayernik does not offer any hard and fast definitions, he lists many contenders, such as "data attributes that describe, provide context, indicate the quality, or document other object (or data) characteristics." [2] Let's go with that ... and let's see if we can manage some successful metadata experiments that use these data attributes to provide context or analysis about some data objects.

The type of metadata that I'd like to start exploring is Exif data. This term stands for "exchangeable image file format", and it refers to a number of standardized fields that can be embedded in image files, currently only JPG and TIFF image formats. Exif data can include information about the camera device that was used to take the picture, including numerous photographic settings on that camera; it can also include information about the resolution and color properties of the image; and it can include contextual information like the date, time, and location of where and when the photo was taken.

As you are probably thinking, this can pose some serious privacy concerns. One high-profile case illustrating this is the story of John McAfee, who was on the run, and whose location was revealed to be Guatemala after he did an interview with Vice, who left GPS information embedded in the Exif data of a photograph they took of him. (thenextweb.com)

Exif data can be very challenging to work with in a reliable way. First of all it is usually only included in photographic images, not other types of graphics or images created with other tools. Also, not all cameras save the same fields. And lastly, owing to the privacy concerns(*) many platforms strip out some or all of the Exif data in images before sharing them on their platforms. So while most of the pictures you take with your phone probably have this embedded, trying to find images on the internet with these fields embedded can be frustrating.

[* - I would argue that concern for privacy is not the primary motivating concern for why these platforms strip Exif data from images before publishing them. For one, most of these platforms, such as Facebook, have proven themselves to be wholly unconcerned with the privacy of their users. On the contrary: their entire business model is predicated on extracting as much data about their users as possible. I contend that the performance of privacy concern regarding Exif data is actually a way of holding on to as much data as they can. After all, it is not the case that these companies ignore Exif data. Rather, they strip it out and save it for their own uses. Facebook and Google will show you where a photo was taken, even though they don't let you collect this data from files yourself. Removing it from the photo files that are published, I argue, is a way for them to prevent everyone from deriving value from this data and to maintain exclusive access to it.]

For this work, I'd like to work with a collection of 15 images. I saved these from Flickr, and I found them by doing a search for "people", sorted by "most interesting" (a property which Flickr determines by means that I'm not aware of) and filtered to only allow images with a Creative Commons license. Most of these have Exif data, but not all. I plan to work with these images as examples over the next couple of weeks. You can find a zip file containing the folder of images here.

We can get started by playing around with Exif data in the Python shell. Make sure that you download and unzip that folder of images. Move it to the folder where you want to keep it, and then cd to the parent of that folder to be able to access the images from Python. In other words, you should be able to type ls and see the people folder in the listing.

>>> from PIL import Image, ExifTags
>>> img = Image.open("people/11767919503_335aa61249_o.jpg")
>>> img.getexif()
<PIL.Image.Exif object at 0x7fba16ed2a60>

These few commands simply import the libraries we need, open an image, and get the Exif data for that image. What we have here is an object, which we can treat like a dictionary. But in order to do that we have to figure out what the keys are. img.getexif().keys() will give us that, but we have to loop to see these. This should work:

>>> for key in img.getexif().keys():
...   key
... 
36864
37377
37378
36867
36868
# etc ...

But what's this? We can't tell what these keys are supposed to represent. Unfortunately the Exif standard dictates that all the keys should be these numeric codes. Fortunately there are commands that can help us figure out what these keys mean. That is what the ExifTags library is for that we imported above. ExifTags gives us its own dictionary. This one contains key-value pairs that translate numeric Exif codes into human-readable descriptions:

>>> for key in img.getexif().keys():
...   print( str(key) + " corresponds to: " + ExifTags.TAGS[key] )
... 
36864 corresponds to: ExifVersion
37377 corresponds to: ShutterSpeedValue
37378 corresponds to: ApertureValue
36867 corresponds to: DateTimeOriginal
36868 corresponds to: DateTimeDigitized
# etc ...
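
Since this translation table is just a dictionary, we can also turn it around and look up a code by its name. The sketch below does not import Pillow; it uses a small stand-in dictionary named tags holding only the five codes printed above, and 99999 is a made-up code. The same loop should work on ExifTags.TAGS, assuming it behaves like an ordinary dictionary:

```python
# A small stand-in for ExifTags.TAGS (only the five codes shown above).
tags = {
    36864: "ExifVersion",
    37377: "ShutterSpeedValue",
    37378: "ApertureValue",
    36867: "DateTimeOriginal",
    36868: "DateTimeDigitized",
}

# Build the reverse mapping: name -> numeric code.
codes_by_name = {}
for code in tags:
    codes_by_name[tags[code]] = code

print(codes_by_name["DateTimeOriginal"])

# A code that is not in the table would raise a KeyError with tags[code],
# so get() with a fallback label is safer. (99999 is a made-up code.)
label = tags.get(99999, "unknown tag")
print(label)
```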

OK, now we're getting somewhere.

Let's make a new Python file. In that, let's make a loop that loops over all images in this folder and tries to print some metadata about each one. I'll see if I can print the value of the key 36867, which corresponds to "DateTimeOriginal", which I presume is meant to indicate the day and time when the photo was taken — I assume this would be determined by the date and time on the camera as set by the user, so this probably is not super reliable, but it might be interesting.

from os import listdir
from PIL import Image, ExifTags

files = listdir("people")

for f in files:
    img = Image.open( "people/" + f )
    exif_dictionary = img.getexif()
    date_time = exif_dictionary[36867]
    print(date_time)

Running this from the command line for me yields the following output:

$ python exif.py
2016:03:12 14:37:08
2008:04:12 13:34:59
2018:02:05 17:34:50
2015:03:29 15:24:02
2016:04:10 14:25:22
2014:06:07 21:00:58
2012:02:28 13:09:19
Traceback (most recent call last):
  File "exif.py", line 9, in <module>
    date_time = exif_dictionary[36867]
  File "/Users/rory/Library/Python/3.8/lib/python/site-packages/PIL/Image.py", line 3471, in __getitem__
    return self._data[tag]
KeyError: 36867

This is easy enough to fix. We got an error because one of the images does not have this Exif key. Remember that we can use the in operator to check whether a key is present before we ask for its value:

from os import listdir
from PIL import Image, ExifTags

files = listdir("people")

for f in files:
    img = Image.open( "people/" + f )
    exif_dictionary = img.getexif()
    if 36867 in exif_dictionary:
        date_time = exif_dictionary[36867]
        print(date_time)

Running that command, now I can see the date / time for all images that contain this field, and my program simply skips over images that don't contain it.
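
One more thing can trip up a loop over a folder: listdir() returns every file name in it, including any that are not images (on a Mac, a hidden .DS_Store file is a common one). As a hedged sketch, a hypothetical helper named jpg_paths could keep only the JPG files and build each full path with os.path.join. The demonstration runs in a temporary folder with made-up file names rather than the people folder:

```python
import os
import tempfile

def jpg_paths(folder):
    """Return the paths of the .jpg / .jpeg files in folder, sorted by name."""
    paths = []
    for name in sorted(os.listdir(folder)):
        # lower() makes the check work for .JPG as well as .jpg
        if name.lower().endswith((".jpg", ".jpeg")):
            paths.append(os.path.join(folder, name))
    return paths

# Demonstration in a throwaway folder with hypothetical, empty files.
with tempfile.TemporaryDirectory() as folder:
    for name in ["a.jpg", "b.JPG", ".DS_Store", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()
    found = []
    for path in jpg_paths(folder):
        found.append(os.path.basename(path))

print(found)
```

In the script above, the idea would be to loop over jpg_paths("people") and pass each path straight to Image.open().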

Can you start to think about some things that we might do with this data? In the homework I'll ask you to try to sort the images based on date, or any other Exif field you wish.
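
As one starting point, the date / time values printed above are just strings of text. A minimal sketch, using Python's standard datetime module and the first value from my output (the variable name taken is only for illustration), shows how such a string could be turned into a date object whose parts you can ask about:

```python
from datetime import datetime

# One of the values printed above, as a plain string.
date_time = "2016:03:12 14:37:08"

# The format string describes the layout: year:month:day hour:minute:second
taken = datetime.strptime(date_time, "%Y:%m:%d %H:%M:%S")

print(taken.year)
print(taken.hour)
print(taken.weekday())  # Monday is 0, Sunday is 6
```

This assumes every value follows the same layout; a value in a different layout would make strptime() raise a ValueError.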

In working with this Exif stuff, it is useful to have a non-programmatic tool to quickly access the Exif (meta)data. I can recommend this simple online tool: http://exif.regex.info/exif.cgi. I have been using it and it works well for me. There are many other ways you could access this. On Mac I believe you can use built-in tools such as clicking the file in Finder, and then clicking File > Get Info.