Code as a Liberal Art, Spring 2022

Unit 2, Tutorial 2 homework

Due: Tuesday, March 22, 8pm

  1. Review the class notes for this week.
  2. Adapt the example that we worked through in class to a different case that you are interested in. Start with a different site and URL, try to run the data scraper, and see what happens.

    Keep in mind the ethics and legality of this work. An excessive amount of data scraping (for example while you may be writing and testing this code) could be burdensome to a small site or someone whose hosting provider charges a lot. Also, certain sites may not take kindly to you attempting to scrape their data. I don't think you will find yourself in any legal trouble, but keep in mind that folks have gotten in trouble for violating the terms of service of various platforms in the past. If you are unsure, try to find the terms of service for the site you want to work on. If you can't or if that is unclear, consider reaching out to a site administrator and ask if they have an issue with data scraping.

    Try to modify the data processing section to do something interesting in relation to your example. Can you find the most or least frequently used word on a page for example? Or other algorithmic techniques that we've worked with so far this semester.

    As you explore and decide what site you will work with, pose a question and genuinely see if your code can make a gesture toward answering it. Include a .py code file with your work, as well as a short Google Doc file (< 300 words) that states your site, your question, and whether you made any progress toward answering it.

Solution

Linked here is a solution to this exercise, based on our in-class review. During that discussion, we discovered a bug in this code whereby URLs that were already processed could be added again to teh queue, causing them to be processed again, potentially repeating that infinitely. We fixed the problem by adding a new list called urls_visited to keep track of URLs that we had already visited:

unit2-tutorial2-hw-solution.py