Due: Wednesday, March 26, 8pm
Answer the following questions about hypothetical scenarios and what data structure you might use in each. Please put your answers in a Google Doc named "Part 1", in your Google Drive folder for this week (named "Unit 2, Lesson 2").
You are working with a bunch of images, and each one has a timestamp of when it was taken. You want to display them in the order they were taken. Should you store the images in a list or a dictionary, and why?
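To make the comparison concrete, here is a minimal sketch of the image data in one of the two structures (the filenames and timestamps are invented); think about whether the other structure would support ordering as easily:

```python
from datetime import datetime

# Invented data: each image paired with the moment it was taken.
photos = [
    ("beach.jpg", datetime(2014, 3, 1, 14, 30)),
    ("park.jpg", datetime(2014, 2, 27, 9, 15)),
]

# A list can be sorted by timestamp before display.
photos.sort(key=lambda pair: pair[1])
for filename, taken_at in photos:
    print(taken_at, filename)
```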
You are working with Twitter data, and you are trying to organize a bunch of tweets from a small group of users. Each user has several tweets. You are going to want to access all the tweets for a given user, and later on you may want to add more tweets from these users to your data structure. How might you combine a list and dictionary to store this data?
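As a hint at what "combining" the two structures can look like, here is a minimal sketch with invented usernames and tweets:

```python
# Invented data: a dictionary mapping each user to a list of their tweets.
tweets_by_user = {
    "alice": ["First tweet!", "Another thought..."],
    "bob": ["Hello, Twitter."],
}

# Access all the tweets for a given user.
print(tweets_by_user["alice"])

# Add more tweets from these users later on.
tweets_by_user["bob"].append("A second tweet.")
```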
You have data about a bunch of people and their primary address. You want to find out if any of the people live together. How might you combine lists and/or dictionaries to store this data to simplify the process of determining whether any people live at the same address?
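One structure worth considering is sketched below, with invented names and addresses:

```python
# Invented data: each person paired with their primary address.
people = [
    ("Ana", "12 Oak St"),
    ("Ben", "99 Pine Ave"),
    ("Cho", "12 Oak St"),
]

# Map each address to the list of people who live there.
residents_by_address = {}
for name, address in people:
    residents_by_address.setdefault(address, []).append(name)

# Any address with more than one resident means people live together.
for address, names in residents_by_address.items():
    if len(names) > 1:
        print(names, "live together at", address)
```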
Create a web crawler that implements a version of the Wikipedia philosophy game.
Hard code a URL in your program to serve as the starting point, i.e., the first URL.

Add that URL to a queue (a list), and iterate with a while loop to visit all the URLs in that queue.

Use Beautiful Soup to target the first paragraph of body text (the first <p> tag) on each page, use find_all() to collect the <a> tags within that paragraph, and add those URLs to the queue in turn.

Continue looping until you find a URL that equals https://en.wikipedia.org/wiki/Philosophy. You can check for string equality in Python with ==.
You may want to use some other stopping condition so the game does not go on forever. See my example with page_count. Perhaps make your crawler stop after 10 or 20 pages?
You should probably also slow this program down a bit so your crawler does not encounter any problems with accessing the Wikipedia server too rapidly. Add import time to the top of your code file, and at the bottom of your while loop, add the following command: time.sleep(.2). This will sleep for .2 seconds before proceeding to the next iteration of the loop. You can experiment with values here, but probably even that short pause should help.
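Putting those pieces together, here is a minimal sketch of one way the whole crawler could look. It assumes you fetch pages with the requests library, and the starting URL is just a hypothetical placeholder:

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TARGET = "https://en.wikipedia.org/wiki/Philosophy"

queue = ["https://en.wikipedia.org/wiki/Banana"]  # hypothetical starting URL
page_count = 0

while queue and page_count < 20:  # extra stopping condition: at most 20 pages
    url = queue.pop(0)
    page_count += 1
    if url == TARGET:  # string equality check with ==
        print("Reached Philosophy after", page_count, "pages!")
        break
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    first_paragraph = soup.find("p")  # first paragraph of body text
    if first_paragraph is None:
        continue
    for link in first_paragraph.find_all("a"):
        href = link.get("href")
        if href:
            # Wikipedia links are usually relative ("/wiki/..."),
            # so make them absolute before queueing them.
            queue.append(urljoin(url, href))
    time.sleep(.2)  # pause so we do not hit the server too rapidly
```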
Create a web scraper that starts with a list of URLs that you manually find and hard code in your program, iterates over those URLs, and (without any queueing or web crawling) simply fetches each one, gets the textual content of the response, and appends that to a string variable. After visiting all the URLs in your list, save that string to a file.
After running, you should have one file containing the text of many web pages.
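Here is a minimal sketch of that scraper, again assuming the requests library for fetching. "Textual content" is interpreted here with Beautiful Soup's get_text(); using response.text instead would keep the raw HTML. The URLs are placeholders to replace with ones you find yourself:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URLs; replace with the list you assemble by hand.
urls = [
    "https://en.wikipedia.org/wiki/Data",
    "https://en.wikipedia.org/wiki/Web_scraping",
]

all_text = ""
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    all_text += soup.get_text() + "\n"  # visible text only, tags stripped

# After visiting every URL, save the accumulated text to one file.
with open("scraped_text.txt", "w") as f:
    f.write(all_text)
```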
If you are able to get that to work, try starting with an "index page". This can be a Google Search result, a NY Times search result, a search on archive.org, a list of datasets on a governmental website, or something else. Start your program with a URL to this "index page", have your program gather all the <a> tags on that page, and add them to a list. Then use that as the list of URLs for this exercise. In other words, try to automate the creation of the URL list for scraping.
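For that variation, the automated list-building step might look something like the following sketch (the index URL is a hypothetical placeholder, and requests is again assumed):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

index_url = "https://archive.org/search.php?query=radio"  # hypothetical index page

response = requests.get(index_url)
soup = BeautifulSoup(response.text, "html.parser")

urls = []
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        urls.append(urljoin(index_url, href))  # make relative links absolute

# urls can now serve as the hard-coded list in the scraping loop above.
print(len(urls), "links gathered")
```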