Code Toolkit: Python, Fall 2020

Week 13 — Wednesday, December 2 — Class notes

(We actually discussed this material in class on December 9.)

Saving data in a database

If you remember back to week 11, we talked about how to serialize data for the purposes of transmitting it across a nework, or in order to persist it (meaning, to save the data to be used at a later time). I explained that data can be serialized using the CSV file format (which stands for comma-separated values) or using JSON (which stands for JavaScript object notation).

We ended that lesson by modifying Processing code to act as a network client — in other words, code that sends and receives data with a server. Last week, you leanred how to create your own server code by writing Python to run as a CGI script, and in the homework for last week, you experimented with this technique to create a CGI script that received user data from an HTML form as an HTTP request, and used that to generate an HTTP response.

But if you remember, during the week 11 exercise, as everyone got their Processing network client working, everyone's client was communicating with the same server, and the server replied with a response comprised of JSON data about all other clients that were currently making requests. By using that JSON data to draw something on the screen, your client acted like a window to a shared online data space, and it gave the impression that we were all moving our creatures around in the same world.

To implement this type of functionality, the server cannot simply use user data to generate a response (as with your MadLib), but must also save or persist that user data. By saving data from a user request, a server is then able to manipulate, aggregate, or process that data in various ways (for example creating tallies or averages) and then to share that data with other users. This pattern (receiving data via HTTP requests, processing and saving it some way, and then sharing it with other users later) is the premise behind the way that nearly all websites today function. Think about all social media sites, blogs, e-commerce, etc. On all of these sites, server-side computer programs receive data from users, process it, save it, and then share it with others.

The final piece of that process — saving data in a server context — is what we will be learning today. By understanding how this works, you should understand all the pieces of the networked exercise from week 11, and you should be able to implement your final project by applying these principles — whether your final project is a shared, multi-user game, a game with saved data, a data visualization with shared data or a data entry interface, or something else.

Working with a simple database

During step 5 of the exercise from week 11, and in part 2 of your homework assignment for that week, you saw how it is possible to save data using a JSON file for later retrieval. This works well in many cases. For example, this might work if a computer program wanted to save user-specified settings that would apply next time the user opened that program. You could also use it so save the state of a game, to allow the player to continue where they had left off. However, in the context of a server on a network, to which many user may be connecting this could create problems. Imagine if two users tried to save data at nearly the same moment. If your code was responding to two nearly simultaneous requests, it might try to access the same file twice. In the best case, you might have some lost data, but in the worst case, the structure of this shared file could be corrupted, causing errors.

The solution to this problem of networked, shared storage is to use a database. A database is sophisticated computer program that manages files of data. It allows simultaneous access by carefully managing which sections of the shared files are being accessed and when. Databases are a complicated type of computer program, and we will not get into the details of how they work. Indeed, we could spend an entire semester on this topic alone — many computer science programs do spend an entire semetser (or more) on this topic, and even to this data there are computer science PhD students and research programs investigating how to make database systems more effective and efficient in the context of different types of problems and scenarios.

For various reasons, databases are often indicated with this cylindrical iconography, and in most diagrams about digital technology, the cylinder is used as the symbol of the database.

One thing that is useful to think about is that a database is usually structured around transactions. A transaction is usually a small number of access operations, for example, access some data, modify it, and save it again. Most database systems ensure that transactions are atomic, meaning that they are guaranteed to all happen at once. Often times, this is referred to as ACID compliance. This stands for atomicity, consistency, isolation, durability. If a database is ACID compliant, it means that any read and write transactions will be guaranteed to work without stepping on each other or corrupting data, even in a high volume situation when there may be many simultaneous requests.

We won't get further into the details of that. But suffice it so say, if you want to save data, and you want it to be accessed in a shared context with multiple simultaneous users, you need to think about using a database.

While we won't talk how databases are implemented, we will talk about how to use them. Even this topic can get complicated very quickly. Traditionally, most databases have used a thing called SQL, which stands for structured query language. This is a programming language for specifying how data is structured in a database, and for making queries, which is the term for requests to access or modify and save data.

SQL databases are structured around tabular data. Think of like a spreadsheet: a list of rows and columns. The structure of data in an SQL database specifies all of the columns of the spreadsheet, and then later on, data is added as rows. A statement to define that data structure of one spreadsheet (i.e. one table) looks like this:

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
   ....
);

while an SQL statement to access data in that table might look like this:

SELECT column1, column2, ...
FROM table_name;

Again, we will not be getting into the complicated details of this type of coding work.

Instead, we will be using a Python library that allows us to save dictionaries directly into a database. This will give us all the functionality of a dictionary data structure, which you are hopefully getting familiar with by now, while also gaining all the features of a database, such as ACID compliance.

The database that we will be using for this is a well known open source tool called SQLite. SQLite is a very established, long-running open source project. It is a small, simple database system that also implements many of the sophisticated data management features of larger database management systems like MySQL, Postgres, Oracle, or Microsoft Access.

As the name suggests, SQLite is a database that does use the SQL language, but we will not be needing to learn that because we will be using a Python tool called SQLitedict. As this project website describes, this is "a lightweight wrapper around Python’s sqlite3 database with a simple, Pythonic dict-like interface and support for multi-thread access." What this means is that we can save dictionaries to a SQLite database without needing to get into all the complexity of SQL. (SQL is actually really great and I encourage you to look for opportunities to learn it in the future!)

Working with a dictionary database (SQLitedict)

To get started, install the SqliteDict library onto your system:

$ python -m pip install sqlitedict

As that website explains, this should allow you to get started with some of the following commands that you can experiment with by entering them into the Python shell. So open up a command shell, run python, and then experiment with the following: (Remember, you should not type the command prompts: >>> or ....)

>>> from sqlitedict import SqliteDict
≫>> d = SqliteDict('./my_db.sqlite', autocommit=True)
>>> d['some_key'] = 5 
>>> print(d['some_key'])  # prints the new value
>>> for key, value in d.items():
...     print(key, value)
...
>>> print(len(d)) # etc... all dict functions work

(Note that while this example is adapted from the SqliteDict documentation, there are a few key differences. The example code on the site was written to work with Python 2, but the code I have included here should work well with Python 2 or Python 3 — I believe many of your computers will be running Python 3. For example, I have used items() instead of iteritems(), explained below, and use parentheses with the print() command.)

As a sidenote, this bit of Python code introduces a new dictionary command that we have not talked about yet: items(). This command helps you more easily loop over all the key-value pairs of a dictionary. If you want to experiment with this, open a new Python shell and run the following commands:
>>> d = {}
>>> d["name"] = "Gritty"
>>> d["species"] = "monster"
>>> d["color"] = "orange"
>>> d.items()
This should print out some code that is trying to explain what this object is.

To see how to use this, use a for loop, which works very similarly to a while loop, but you specify your looping conditions differently. Type the following:
>>> for k,v in d.items():
...   print(k, v)
This will print out all the key-value pairs in this dictionary.

Back to the SqliteDict example above. Try adding additional key-value pairs into this dictionary, and experimenting by recalling what you remember about using dictionaries. If you need, you can refer back to the dictionary lesson from week 11. When you are done, type:

>>> d.close()

Now, you can exit Python (exit() or CTRL-D), type ls in the command prompt, and see that a new file has been created: my_db.sqlite. This file is storing your dictionary with all of its data. If you try to view this file (for example in Atom), it will look like a bunch of digital garbage. That is because this is a binary file, not a plain text file like the .py files and .json and .csv files that we have been working with so far.

But if you again run python, you should be able to type the following:

>>> from sqlitedict import SqliteDict
>>> d = SqliteDict('./my_db.sqlite', autocommit=True)
>>> for key, value in d.items():
...     print(key, value)

And that will show you all of the key-value pairs that you entered before. What this means is that you can exit Python, which will delete the dictionary variable that you defined, but it was saved into this SQLite database, meaning you can re-run Python, create a new dictionary that references this same database file (by specifying my_db.sqlite), and retrieve all the key-value pairs that you put into this dictionary.

Here is how you can put this into a server context in a CGI script. This will return

#!/usr/bin/python

import json
from sqlitedict import SqliteDict
import cgi

form = cgi.FieldStorage()

print("Content-Type: text/html")


d = SqliteDict('./my_db.sqlite', autocommit=True)

# (Option 1.) This FOR loop will return a user-friendly output of the
# contents of the dictionary stored in the SqliteDict database. This
# type of output is more like what you would use in the Mad Lib
# example — in any situation where you were trying to display
# data to someone who is using a web browser, for example to submit
# data via a web form.
for key, value in d.items():
    print("The value of " + str(key) + " is " + str(value))


# (Option 2.) Alternatively, if you are returning data to a computer
# program, like a Processing client that will be using this data to
# draw something, use this type of output. This will return JSON data:
return_data = {}
for key, value in d.items():
    return_data[key] = value
print(json.dumps(return_data))

To be clear, you probably would not ever want to do both the options illustrated by that code snippet. Depending on what this is for, you would use either option.

The homework for this week offers an example with which you could experiment more with these techniques.