This week we're going to talk about data serialization, which is a process for translating data in the form of variables and data structures into a format that can be stored for later use (like in a file), or transmitted to others (like across a network).
Serialization can be accomplished with many different types of data and file formats. The serialization format that we will be using today is JSON (usually pronounced JAY-sawn), which stands for JavaScript Object Notation. JSON is a set of rules for formatting text, and it is a very common serialization format today. It comes out of JavaScript, but it is used with many programming languages, and most languages in use today have libraries that make it easier to read and write data in JSON format.
Those of you who did a data visualization for your midterm have already worked with data serialization, although we did not refer to it by that name when we were working on your code. Everyone who did data visualization this semester worked with CSV data, which stands for comma-separated values. This is an older format that existed long before JSON, and is still useful in many cases. A common use case for CSV is to represent tabular data, like a spreadsheet. A CSV-formatted file looks like this:
    Rory,human,brown,computers
    Gritty,monster,orange,hockey
    Fido,dog,beige,running

As you can see, and as the name implies, CSV is a data format that simply lists rows of data, in which each value is separated by commas. Sometimes CSV files will contain a header row indicating what each of these values represents, like the first row of a spreadsheet. In this case, that might look like this:
    name,species,hair color,favorite thing

Which data format you decide to use when you need to serialize data depends on the structure of the data, and the operations you need to do on it. Hopefully by the end of today you will have a sense of what JSON is good for, and when you might elect to use this format.
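To make the CSV idea concrete, here is a minimal sketch of reading the rows shown above with Python's built-in csv module. It assumes Python 3, and the names CSV_TEXT and parse_csv are made up for this illustration; the header row becomes one list and every other row becomes its own list of values.

```python
import csv
import io

# The same rows shown above, with the header row on top.
CSV_TEXT = """name,species,hair color,favorite thing
Rory,human,brown,computers
Gritty,monster,orange,hockey
Fido,dog,beige,running
"""

def parse_csv(text):
    """Split CSV text into a header list and a list of row lists."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[0], rows[1:]

header, data = parse_csv(CSV_TEXT)
print(header)      # ['name', 'species', 'hair color', 'favorite thing']
print(data[1][1])  # monster
```

Notice that each value is only identified by its position in the row, which is part of why CSV suits tabular data.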
But before we get to serialization, I want to introduce a new
kind of data structure. In the process, I'll
introduce the command line
and show how you can run
Python code in this new mode. And we'll end the class today with
an example that shows how you can access serialized JSON data
via a network with a very simple API to create
a shared, interactive experience.
So far we have talked about one kind of data structure: lists (week 7). Lists are good at storing values in a fixed, sequential order. Remember that you create a list with empty square brackets (x = []), then you add values to it using the append() command, which are then stored in the order in which they were added, and you reference those values with a numerical index inside square brackets (x[0], x[3], etc.).
During our discussion of lists, I mentioned that Python offers many different data structures. Which data structure to use in a given situation depends on the structure of the data that you are trying to model, and the operations that you will need to do with it. Determining which data structure(s) to use to solve a given problem can often be a question of some subjective debate, and there is often not one clear right answer.
The data structure we'll learn about today is the dictionary. While lists are sequential (added to in order, and accessed by numerical index), dictionaries are accessed by a key, which can be a number, a string, or certain other kinds of values. A dictionary contains a collection of mappings, also referred to as key-value pairs. Think of a key-value pair as a pairing of a word and its definition; this is where the dictionary gets its name. In the version of Python we are using (2.7), dictionaries do not preserve the order in which items are added to them (Python 3.7 and later do keep insertion order). Either way, dictionaries are meant to be accessed, we might say, randomly, by retrieving the value (the definition) for a given key (the word).
Since at this point in our class we are starting to move out of Processing and into Python itself, I want to introduce dictionary syntax to you outside of Processing. But first, we need to take a detour into the command line.
The command line is just a different kind of interface for accessing files and running programs on your computer. While Finder (on Mac) and Explorer (on Windows) use a graphical user interface (GUI) to manage our files, folders, and programs, the command line is a text-based interface. The command line is typically thought of as a kind of legacy mode of computer use: text-based command line interfaces were common before the prominence of the GUI.
Why learn the command line? A command line context is often an easier way to develop computer programs because we as developers do not need to worry about designing and implementing graphical interfaces, which often require extra code. Conversely, this means that using programs in a text-based command line mode is often less intuitive and less user-friendly than in graphical modes. Working in a command line mode is an important skill for a developer, however, as it often allows you to more rapidly run and test new code, without having to build up a more developed interface. Also, since the command line is text-based, there are some situations, like data processing, where running a program in a command line context may actually be easier. Lastly, there may be times when you will be working in a server context (for example, installing code over a network onto a cloud-based machine that is physically distant and inaccessible in person). In these cases it is often impossible to access a GUI, so you would need to navigate a command line interface.
Sidenote: If you would like to learn more about the history of the command line, and watch an elaborated technical discussion, you can access this two part video tutorial that I created for another class: Command line tutorial part 1: history, Command line tutorial part 2: technical instruction. Keep in mind that this was created for another class, so a couple comments will be irrelevant, but the history and explanation of how to use the command line will all be relevant.
Let's go through a few basic commands that you'll need to get started working with the command line. Note the new formatting: whenever I use a fixed-width font on a black background with rounded corners and a thin gray bar on top, I am showing you valid instructions for the command line, like this:
    $ pwd
    /Users/rory/dev/code-toolkit
    $ ls
    dictionary.py

The way to read this is to look and see that pwd and ls are valid command line instructions. You can type them in at a command prompt and press enter. The lines that come below are examples of what you may see, but your results will be different depending on the files and folders on your computer.
Please note: you should not enter the dollar sign ($) when you are entering command line instructions. I use the dollar sign to signify the command prompt. The command prompt may be signified by different punctuation in your terminal. That is how the command line tells you that it's ready for more input, and that is where you will enter new commands. Always press enter to execute the command you've typed.
pwd: stands for "print working directory", and this command will display the directory (equivalent to a folder) that you are currently working in.

ls: stands for "list", and this command will list all of the files in your current directory. This is the command line equivalent of simply opening a folder in a Finder (Explorer) window and viewing its contents.

cd: stands for "change directory", and this is how you change the directory that you are currently in. Mac has some really nice interaction between the command line and the GUI, so you can type cd, and then drag a folder into the Terminal window, and press enter, and that will change to that directory. I'm not sure if there is an equivalent to this on Windows.
The command line allows you to type the name of any executable program on your system, press enter, and the command line will then try to run that program. Since we're working in Python today, let's run the Python program. Do this by simply typing python, and you should see something like this:
    $ python
    Python 2.7.7 (default, Jun 2 2014, 18:55:26)
    [GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

Notice that the punctuation indicating my command prompt has changed from a $ to >>>, which means that we are in the Python shell. This is a place where you can type any valid Python code, and it will be interpreted and run. To exit out of the Python shell you can type CONTROL-D (on Windows, CONTROL-Z followed by enter), or exit(). Note that regular command line instructions will not work inside the Python shell, so if you want to cd or ls, you need to exit out of Python.
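As an aside, Python's built-in os module offers rough equivalents of these three commands, which can be handy when you are already inside Python. The following is a minimal sketch, not something the rest of these notes depend on; the helper name list_directory and the throwaway directory are made up for this illustration.

```python
import os
import tempfile

def list_directory(path):
    """Return the names in a directory, sorted (like a basic ls)."""
    return sorted(os.listdir(path))

# Work in a throwaway directory so this example does not touch real files.
start_dir = os.getcwd()                      # like pwd
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, "dictionary.py"), "w").close()

os.chdir(scratch)                            # like cd
names = list_directory(".")                  # like ls
print(names)                                 # ['dictionary.py']

os.chdir(start_dir)                          # go back where we started
```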
Now that we have a Python shell, let's learn about how to use dictionaries in this interactive Python shell. First let's review some list commands:
    >>> a = []
    >>> a.append(100)
    >>> a.append(200)
    >>> a.append(300)
    >>> print(a)
    [100, 200, 300]
    >>> print(a[0])
    100
    >>> print(a[1])
    200
    >>> print(len(a))
    3

Note that you don't need to type print() in the Python shell to see a variable value. You can simply type the variable name, and the Python shell will evaluate it and print its value:
    >>> a
    [100, 200, 300]
    >>> a[0]
    100
    >>> len(a)
    3

It is also valid Python code to initialize a list by using this notation:
    >>> a = [ 100, 200, 300 ]
    >>> print(a)
    [100, 200, 300]
    >>> print(a[0])
    100
    >>> print(a[1])
    200
Now let's learn some new syntax for dictionaries:
    >>> r = {}
    >>> r["name"] = "Rory"
    >>> r["species"] = "human"
    >>> r["hair color"] = "brown"
    >>> r["favorite thing"] = "computers"
    >>> r
    {'hair color': 'brown', 'name': 'Rory', 'favorite thing': 'computers', 'species': 'human'}
    >>> r["name"]
    'Rory'

Notice that the order in which the key-value pairs are printed is not the same order in which I entered them. This is because dictionaries do not preserve order, as I mentioned above.
You can also use the len() command as with lists. In this case, len() will return the number of key-value pairs:

    >>> len(r)
    4

If you try to access a key that is not in your dictionary, you will get a KeyError. For example:
    >>> r["date of birth"]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'date of birth'

This very helpful error message is telling you precisely the problem: that the key "date of birth" does not exist in this dictionary.

There is a command that you can use to help avoid these errors: in. This command checks if a given key is in the dictionary, like so:
    >>> "date of birth" in r
    False

So you can use this boolean inside an if statement, like so:
    >>> if "date of birth" in r:
    ...     r["date of birth"]
    ... else:
    ...     "Key not in this dictionary"
    ...
    'Key not in this dictionary'

Similar to the new list initialization above, you can also initialize a dictionary with some new syntax:
    >>> r = { "name": "Rory", "species": "human", "hair color": "brown", "favorite thing": "computers" }
    >>> r["name"]
    'Rory'

One last thing: you can even combine data structures in useful and important ways. So for example:
    >>> r = { "name": "Rory", "species": "human", "hair color": "brown", "favorite thing": "computers" }
    >>> g = { "name": "Gritty", "species": "monster", "hair color": "orange", "favorite thing": "hockey" }
    >>> a = [ r, g ]
    >>> a
    [{'hair color': 'brown', 'name': 'Rory', 'favorite thing': 'computers', 'species': 'human'}, {'hair color': 'orange', 'name': 'Gritty', 'favorite thing': 'hockey', 'species': 'monster'}]

So what I've done here is create two dictionaries, r and g, and then created a list a, which holds those two dictionaries. Note the format that Python uses to display the contents of a. This is essentially what the JSON text format looks like! (The one visible difference is that JSON requires double quotes around strings, where Python prints single quotes here.)
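To see actual JSON text rather than Python's own printout, the built-in json module can convert in both directions. This is a minimal sketch, assuming the same two dictionaries as above; the variable names text and b are made up here, and sort_keys=True is only used so that the printed key order is predictable.

```python
import json

r = {"name": "Rory", "species": "human", "hair color": "brown", "favorite thing": "computers"}
g = {"name": "Gritty", "species": "monster", "hair color": "orange", "favorite thing": "hockey"}
a = [r, g]

# Serialize: Python list of dictionaries -> JSON text (a string).
text = json.dumps(a, sort_keys=True)
print(text)

# Deserialize: JSON text -> Python list of dictionaries again.
b = json.loads(text)
print(b[1]["name"])   # Gritty
print(b == a)         # True
```

The string in text uses double quotes throughout, and it could be saved to a file or sent over a network and turned back into the same list later.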
For the remainder of class, let's work through an example about how dictionaries work, how they can be used to serialize JSON data, and how that JSON data can be retrieved via a network.
Step 1. Let's start with a basic Processing example that reviews many of the concepts from throughout the semester so far. Create a new Processing sketch and add one graphical element that the user can move with the mouse. I'll draw this graphical element with a simple rectangle, but I'll call this a "creature" and we can imagine that it could be something more visually interesting.
Start with one creature. If we want it to move horizontally and vertically and to have its own color, how many variables do we need to represent it? Three: x, y, color. So create these three variables, give them initial values, do all the usual Processing stuff like specifying window size, and add a keyPressed() block to respond to user input. You should end up with something like this:
(Note: this code file has a .html extension so that you can view it in a browser. You can copy/paste this code into a Processing window. Or, if you want to download and run this code, save the file and remove the .html extension, but you will also need to remove the HTML code from the file, like <pre> tags.)
Step 2. Great. Now let's add a bunch of other creatures. How do we do that? With a list. And if we want these other creatures to be represented in the same way as our single creature, how many lists do we need? Three. A list for x, a list for y, and a list for color. Add those lists, then append two initial values to each list in setup(), and add a loop in draw() to iterate over the lists to draw all the creatures represented by the lists. Don't worry about moving these around on the screen yet. After doing this, you should end up with something like this:
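As a side illustration (this is not the sketch file itself), the parallel-list bookkeeping from this step can be tried in plain Python without Processing. In this rough sketch the helper add_creature is made up, and colors are plain (r, g, b) tuples standing in for Processing's color().

```python
# Three parallel lists: index i in each list describes creature number i.
# Colors are plain (r, g, b) tuples standing in for Processing's color().
xList = []
yList = []
cList = []

def add_creature(x, y, c):
    """Append one creature by adding a value to each of the three lists."""
    xList.append(x)
    yList.append(y)
    cList.append(c)

add_creature(100, 100, (255, 155, 155))
add_creature(200, 200, (155, 255, 155))

# In draw(), a loop like this would visit every creature by index.
for i in range(len(xList)):
    print(i, xList[i], yList[i], cList[i])
```

The three lists only describe the same creature as long as they stay the same length and in the same order, which is the weakness the next step addresses.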
Step 3. Now let's say we want to add a size to our creatures, so that each one can have its own unique size. How would we do this? We could add a size variable. We would do this by adding one size variable for our single moving creature, and then a list to hold the size values for the creatures that we're representing with lists. But there is a different way to do this, one that involves dictionaries. And I think this will be a good example of how we can use this new data structure.
Start by replacing the three variables that represent our moving creature with one dictionary that contains three key-value pairs holding those values. So go from this:

    myX = 250
    myY = 250
    myColor = color(155,155,255)

to this:
    myCreature = {}
    myCreature["x"] = 250
    myCreature["y"] = 250
    myCreature["color"] = color(155,155,255)

Similarly, inside setup(), where we are initializing our lists, instead of append()ing values to three different lists, let's make some dictionaries, and append those to one single list. So in global space, go from this:

    xList = []
    yList = []
    cList = []

to this:
    creatureList = []

and then, inside setup(), go from this:

    xList.append(100)
    yList.append(100)
    cList.append( color(255,155,155) )
    xList.append(200)
    yList.append(200)
    cList.append( color(155,255,155) )

to this:
    c = {}
    c["x"] = 100
    c["y"] = 100
    c["color"] = color(255,155,155)
    creatureList.append(c)
    c = {}
    c["x"] = 200
    c["y"] = 200
    c["color"] = color(155,255,155)
    creatureList.append(c)

Note that now I only have one list, and it contains a collection of dictionary objects. Putting that all together will look like this:

Also note that I needed to modify draw() and keyPressed() to use these new dictionaries.
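To get a feel for how draw() and keyPressed() change, here is a rough plain-Python sketch of the same data without any Processing calls. The move helper is hypothetical (it only stands in for what a keyPressed() block might do), and colors are again plain (r, g, b) tuples.

```python
# Colors are plain (r, g, b) tuples standing in for Processing's color().
myCreature = {"x": 250, "y": 250, "color": (155, 155, 255)}

creatureList = [
    {"x": 100, "y": 100, "color": (255, 155, 155)},
    {"x": 200, "y": 200, "color": (155, 255, 155)},
]

def move(creature, dx, dy):
    """Hypothetical helper: shift one creature in place, as keyPressed() might."""
    creature["x"] += dx
    creature["y"] += dy

move(myCreature, 10, 0)   # e.g. in response to a right-arrow key press

# One loop, one list: each item carries all of its own values.
for c in creatureList:
    print(c["x"], c["y"], c["color"])
```

Because every creature is a single dictionary, a function like move only needs one argument to identify which creature to change.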
Step 4. Now we can finally add that size variable that I was talking about in Step 3. To do this, simply add this new line in global space:
    myCreature["size"] = 10

and use this new key-value pair inside draw():

    rect(myCreature["x"],myCreature["y"],myCreature["size"],myCreature["size"])
Then, do something similar for the list of creatures. Inside setup(), add a new key-value pair to each creature, like this:

    c["size"] = 30

and use that when you are looping over the creatures to draw them. Putting that all together should look something like this:
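One thing to watch for when adding a new key (this is a side illustration, not part of the sketch file): any creature that lacks the "size" key will cause a KeyError when draw() asks for it. A dictionary's get() method returns a fallback value instead. In this minimal sketch, DEFAULT_SIZE and creature_size are made-up names, and the fallback of 10 is just an assumption borrowed from myCreature.

```python
DEFAULT_SIZE = 10  # assumed fallback, matching the size given to myCreature

creatureList = [
    {"x": 100, "y": 100, "size": 30},
    {"x": 200, "y": 200},            # no "size" key on this one
]

def creature_size(creature):
    """Return the creature's size, or DEFAULT_SIZE if the key is missing."""
    return creature.get("size", DEFAULT_SIZE)

sizes = []
for c in creatureList:
    sizes.append(creature_size(c))
print(sizes)   # [30, 10]
```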
Step 5: Serializing this data. Hopefully you might see some advantages to working this way. There are some conveniences to working with key-value pairs instead of individual variables. But maybe not that many — you could still do all of this work so far without using dictionaries, as we have been doing throughout the whole semester! But one advantage to using data structures in this way is that they can be easily serialized.
So let's modify this sketch so that instead of hard-coding the list of creature values, we are reading that data from a file. Add the following import statement as the first line of your program:

    import json

This will let us use some new commands to read and write JSON data. Now add the following lines inside setup():
    f = open("data.json")
    j = json.load(f)
    for c in j:
        creatureList.append(c)

This is going to open a file named data.json, read its contents, and populate your list of creatures automatically with data from that file. But first you need to add this file to your sketch directory. Have a look at this file: week11_data.json. That is JSON data. It looks a lot like the way Python prints out the contents of dictionaries and lists. Save this file into your sketch folder, and run your sketch. Now, you should be able to modify that JSON file to change values like size or color, or add new creatures. Be careful when modifying a JSON file: the format is very precise. You need commas separating each item, but you cannot have a comma after the last item.
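The loading step and the trailing-comma rule can both be tried outside Processing. The following is a minimal sketch under some assumptions: the JSON text in GOOD is invented for this example (the real week11_data.json may contain different keys and values), load_creatures is a made-up helper, and the file is written to a temporary folder so the example is self-contained.

```python
import json
import os
import tempfile

# Hypothetical file contents: the real week11_data.json may differ.
GOOD = '[{"x": 100, "y": 100, "size": 30}, {"x": 200, "y": 200, "size": 30}]'
BAD = '[{"x": 100, "y": 100, "size": 30},]'   # comma after the last item

def load_creatures(path):
    """Read a JSON file and return its contents as a list of dictionaries."""
    with open(path) as f:
        return json.load(f)

# Write the sample to a temporary data.json so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    f.write(GOOD)

creatureList = load_creatures(path)
print(len(creatureList))          # 2
print(creatureList[0]["size"])    # 30

# A trailing comma is rejected rather than silently ignored.
try:
    json.loads(BAD)
    trailing_comma_ok = True
except ValueError:
    trailing_comma_ok = False
print(trailing_comma_ok)          # False
```

If your sketch stops with an error right after you edit the JSON file, a stray or missing comma is a good first thing to check.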
Putting that all together should look something like this:
Step 6: Networking. In Step 5 we saw how we could read JSON data from a local file, in other words, a data file that is saved on disk, on the same computer that we are working on. (The word local perhaps gets overused in computer science. So far this semester we have talked about local in opposition to global, in terms of variable scope. But now, we are talking about local in opposition to remote, in terms of whether a file is located on your own computer, or on a different computer over a network.)
Save the following code file: week11_network.pyde. Once you download it, you should be able to simply drag it into your sketch window to add it as a new tab to your sketch. If you'd like, have a look at this code. It defines two functions: getData(cList) and sendData(c). getData() retrieves a JSON file from a webserver at the address 174.138.45.118:5000 (an IP address followed by a port number), then it populates a list of dictionaries, and returns that list. You would use it like this:

    creatureList = getData(creatureList)

sendData(c) sends the data about one single "creature" to this webserver. This webserver has its own Python code which saves this data into a list and distributes it to anyone who calls getData(cList).
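To illustrate the role JSON plays in this exchange, here is a rough sketch that imitates the round trip with no network at all. Everything in it is hypothetical: server_store stands in for the webserver's memory, and send_data_sketch and get_data_sketch are made-up stand-ins, not the functions in week11_network.pyde, which may work quite differently.

```python
import json

# Stand-in for the webserver's memory: a list of JSON strings it has received.
# (Hypothetical: the real server and week11_network.pyde may work differently.)
server_store = []

def send_data_sketch(c):
    """Serialize one creature dictionary to JSON text and 'send' it."""
    server_store.append(json.dumps(c))

def get_data_sketch(cList):
    """Replace the contents of cList with every creature the 'server' holds."""
    del cList[:]
    for text in server_store:
        cList.append(json.loads(text))
    return cList

creatureList = []
send_data_sketch({"x": 250, "y": 250, "size": 10})
send_data_sketch({"x": 100, "y": 120, "size": 30})
creatureList = get_data_sketch(creatureList)
print(len(creatureList))        # 2
print(creatureList[1]["y"])     # 120
```

The point of the sketch is only that dictionaries become text before they travel and become dictionaries again when they arrive; a real version would replace the shared list with requests to the webserver.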
Putting that all together would look like this:
Now, if you run this code (and the webserver is turned on and running!) you should be able to move your creature around, while also seeing the moving creatures of anyone else currently connecting to this webserver.
If you would like to see what the webserver code looks like, you can have a look here: week11-web-server.py. This is a relatively short amount of code! Just keep in mind that this looks simple, but it is using a lot of things that we have not talked about yet. Mainly, this is using a Python web server library called Flask, which you can read about if you are curious.