Code as a Liberal Art, Spring 2024

Unit 2, Tutorial 2 homework

Due: Wednesday, March 27, 8pm

Review the class notes for this week.
Adapt the example that we worked through in class to a different case that you are interested in. Start with a different site and URL, try to run the data scraper, and see what happens. Can get your code to run and collect a large amount of data in an automated way? Consider scraping public repositories of large quantities of textual data, such as archive.org, public legal filings databases, nyc.gov, data.gov, or things like Congressional Hearings at govinfo.gov.

Keep in mind the ethics and legality of this work. An excessive amount of data scraping (for example while you may be writing and testing this code) could be burdensome to a small site or someone whose hosting provider charges a lot. Also, certain sites may not take kindly to you attempting to scrape their data. I don't think you will find yourself in any legal trouble, but keep in mind that folks have gotten in trouble for violating the terms of service of various platforms in the past. If you are unsure, try to find the terms of service for the site you want to work on. If you can't or if that is unclear, consider reaching out to a site administrator and ask if they have an issue with data scraping.
Can you modify the data processing section to do something interesting in relation to your example. For example, could you find the most or least frequently used word on each page that you request? Or use other algorithmic techniques that we've worked with so far this semester.
Revist the question that you posed from part 4 of last week's homework. What type of scraping or crawling might you do to gather data that would help investigate that question? If your question was not conducive to this type of data gathering, can you think of how you might adjust it to be so? What types of questions are well-suited for investigation based on a process of collecting large amounts of digital data?

Include another Google Doc in your homework folder (< 300 words) that explores this, elaborating on a your question, and more specific details about how you might investigate it through a process of collecting data by scraping or crawling.