Due: Tuesday, March 22, 8pm
Adapt the example that we worked through in class to a different case that you are interested in. Start with a different site and URL, try to run the data scraper, and see what happens.
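If you want a reminder of the basic shape of the scraper before adapting it, here is a minimal sketch using only the standard library. This is a stand-in for the code we built in class, not a copy of it; the function name `fetch` and the user-agent string are made up for this example, and your version may use a different library entirely.

```python
from urllib.request import Request, urlopen

def fetch(url):
    """Download a page and return its HTML as a string.

    A minimal stand-in for the scraper from class; adapt as needed.
    """
    # Some hosts reject the default Python user agent, so set a plain one.
    req = Request(url, headers={"User-Agent": "class-scraper-exercise"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Swap in the URL of the site you have chosen, e.g.:
# html = fetch("https://example.com/")
```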
Keep in mind the ethics and legality of this work. Excessive data scraping (for example, while you are writing and testing this code) could burden a small site, or someone whose hosting provider charges a lot. Also, certain sites may not take kindly to attempts to scrape their data. You are unlikely to find yourself in legal trouble, but keep in mind that people have gotten in trouble in the past for violating the terms of service of various platforms. If you are unsure, look for the terms of service of the site you want to work with. If you can't find them, or they are unclear, consider reaching out to a site administrator and asking whether they have an issue with data scraping.
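One lightweight courtesy check you can add is reading the site's robots.txt file, which many sites use to state what crawlers may fetch. The sketch below uses the standard library's `urllib.robotparser`; the rules here are a made-up sample parsed inline (in practice you would call `rp.read()` against the live site's `/robots.txt`).

```python
from urllib import robotparser

# A sample robots.txt (hypothetical rules; a real site's file will differ).
sample = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(sample)

# can_fetch reports whether the rules allow a given user agent to fetch a URL.
print(rp.can_fetch("*", "https://example.com/index.html"))      # True
print(rp.can_fetch("*", "https://example.com/private/a.html"))  # False
```

Note that robots.txt is advisory, not a substitute for reading the terms of service, but respecting it is a good baseline.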
Try to modify the data processing section to do something interesting in relation to your example. Can you find the most or least frequently used word on a page, for example? Or apply another algorithmic technique that we've worked with so far this semester.
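As a sketch of the word-frequency idea: once you have the visible text of a page, `collections.Counter` does most of the work. The sample string below stands in for real scraped page content, and the regex is only a rough tokenizer.

```python
import re
from collections import Counter

# Assume `text` holds the visible text of a page you have already scraped;
# this sample string stands in for real page content.
text = "the cat sat on the mat and the cat slept"

# Lowercase and pull out runs of letters to get a rough word list.
words = re.findall(r"[a-z']+", text.lower())

counts = Counter(words)
most_common = counts.most_common(1)[0]   # highest-frequency word
least_common = counts.most_common()[-1]  # one of the lowest-frequency words

print("most frequent:", most_common)     # ('the', 3)
print("least frequent:", least_common)
```

Ties among the least frequent words are broken arbitrarily, so you may want to print all words with the minimum count instead.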
As you explore and decide what site you will work with, pose a question and genuinely see if your code can make a gesture toward answering it. Include a .py code file with your work, as well as a short Google Doc (< 300 words) that states your site, your question, and whether you made any progress toward answering it.
Linked here is a solution to this exercise, based on our in-class review. During that discussion, we discovered a bug in this code whereby URLs that were already processed could be added to the queue again and processed repeatedly, potentially looping forever. We fixed the problem by adding a new list called urls_visited to keep track of URLs that we had already visited:
unit2-tutorial2-hw-solution.py
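The shape of that fix looks roughly like the sketch below. The names follow the description above (urls_visited), but the linked solution file may differ in its details; `get_links` is a stand-in for whatever function extracts the links from a fetched page.

```python
def crawl(start_url, get_links):
    """Walk pages breadth-first; get_links(url) returns URLs linked from url."""
    urls_to_visit = [start_url]
    urls_visited = []
    while urls_to_visit:
        url = urls_to_visit.pop(0)
        if url in urls_visited:
            continue
        urls_visited.append(url)
        for link in get_links(url):
            # The bug: without this check, already-processed URLs were
            # re-queued, potentially looping forever on circular links.
            if link not in urls_visited and link not in urls_to_visit:
                urls_to_visit.append(link)
    return urls_visited

# Toy link graph with a cycle (a -> b -> a) to show the loop is avoided.
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
print(crawl("a", graph.get))  # ['a', 'b', 'c']
```

For larger crawls, a set is a faster container for the visited check than a list, but a list matches what we wrote in class.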