Code as a Liberal Art, Spring 2025

Unit 2 Project

Assigned: Thursday, March 27

Due for in-class presentation: Thursday, April 17 (with 2-4 people presenting works-in-progress earlier, on Tuesday, April 15)

Final draft of code: due in your Google Drive Folder on Friday, April 18, 8pm.

Using "scrapism" techniques, create an "active archive". Both terms are borrowed from our reading by Sam Lavigne, so consider reviewing that text to make sure you are clear on their meanings.

Start by using scraping techniques to build a corpus of text. You can either:

  1. Write an algorithm that starts from a list of URLs that you gather manually, iterating over the list, then downloading and saving some part of the response. In this case, we could think of your archiving process as more an example of "web scraping".
  2. Write an algorithm that starts from one URL and uses a queue to download the content of that URL, then find URLs within that content, add them to the queue, and repeat the process. In this case, we could think of your archiving process as more an example of "web crawling".
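As a rough illustration of the second option, here is a minimal sketch of a queue-based crawler that uses only the Python standard library. The names LinkParser, fetch, and crawl, the max_pages limit, and the delay between requests are all assumptions made for this example. They are not requirements of the assignment, and your own version may look quite different.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, delay=1.0, fetch_page=fetch):
    """Crawl outward from start_url and return a dict of {url: html}."""
    queue = deque([start_url])   # URLs waiting to be downloaded
    seen = {start_url}           # URLs already added to the queue
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue             # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)        # pause between requests (assumed courtesy)
    return pages
```

The first option is the same idea without the queue: loop over your hand-gathered list of URLs and call something like fetch on each one. In either case you would still need to decide which part of each response to save into your corpus.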

Ultimately, you will submit a written text, or a collection of texts, generated by an algorithm from the data in your corpus.

We will see how to use a Markov chain to build a data structure that models an entire corpus and the relationships of words within it. You will also see how to use this Markov chain to generate new sentences or phrases that seem like they could plausibly be from that corpus.
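To make the idea concrete, here is a minimal sketch of a word-level Markov chain, assuming the simplest possible model in which each word maps to the list of words that follow it somewhere in the corpus. The function names build_chain and generate are hypothetical, and the version we build in class may differ.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words=20, rng=random):
    """Walk the chain from start, choosing each next word at random."""
    word = start
    output = [word]
    # Stop at the length limit or when a word has no recorded followers.
    while len(output) < max_words and word in chain:
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


chain = build_chain("the cat sat on the mat")
print(generate(chain, "the", max_words=8))
```

Because a word that appears often after another word is stored multiple times in its list, random.choice naturally picks common followers more often than rare ones.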

Your final output can be entirely Markov-generated, generated in a more structured way (think of a Mad Lib), or some hybrid of these two approaches. For example, you might use a Markov process to generate sentences within the larger template of a story, legal document, or some other form.
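One way to picture the hybrid approach is a template with named slots, each filled by a generated phrase. The sketch below is only an illustration: the TEMPLATE text, the fill_template function, and the placeholder_phrase stand-in are all made up for this example. In your project, the stand-in would be replaced by a function that produces a phrase from your Markov chain.

```python
import random
from string import Formatter

# A made-up template in the style of a legal document.
TEMPLATE = "WHEREAS {first_clause}, the parties agree that {second_clause}."


def fill_template(template, make_phrase):
    """Fill every named slot in the template with a newly made phrase."""
    slots = [name for _, name, _, _ in Formatter().parse(template) if name]
    return template.format(**{slot: make_phrase() for slot in slots})


def placeholder_phrase():
    """Stand-in for a Markov generator; returns a canned phrase."""
    return random.choice(["the archive is active", "the corpus may grow"])


print(fill_template(TEMPLATE, placeholder_phrase))
```

Passing the phrase generator in as an argument keeps the template code separate from the Markov code, so you can adjust either one without rewriting the other.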

Refer back to Unit 2 lessons 2-4 for how to implement this.

Your Unit 2 Project folder in Google Drive should have two separate Python programs:

  1. One called build_corpus.py which will include your web scraping and/or crawling code, and will generate one or more text files within a subfolder called corpus.
  2. Another called markov.py, which will read from the file(s) in this corpus and generate your output in a folder called output.
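The hand-off between the two programs could be sketched as follows, assuming the corpus is stored as .txt files and using Python's pathlib module. The function names read_corpus and write_output and the file name generated.txt are hypothetical. Only the folder names corpus and output come from the assignment.

```python
from pathlib import Path


def read_corpus(corpus_dir="corpus"):
    """Join the contents of every .txt file in the corpus folder."""
    paths = sorted(Path(corpus_dir).glob("*.txt"))
    return "\n".join(path.read_text(encoding="utf-8") for path in paths)


def write_output(text, output_dir="output", filename="generated.txt"):
    """Save generated text in the output folder, creating it if needed."""
    folder = Path(output_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename
    path.write_text(text, encoding="utf-8")
    return path
```

In this arrangement, build_corpus.py would write its files into corpus, and markov.py would call something like read_corpus at the start and write_output at the end.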

To complete the assignment, experiment with running the Markov algorithm several times, adjusting it until you get something that you like. Then, in class, you can present your output text, the code used to generate it, and the code used to build your corpus.