Code as a Liberal Art, Spring 2022

Unit 2, Tutorial 3 lesson — Wednesday, March 23

Note that in class on Wednesday, March 23 we did not actually cover very much of the below because we spent most of our time reviewing the homework from last week. The homework for this week (here) was adapted to reflect that.

Instead, we returned to this material via Tutorial 4 on Wednesday, March 30, and the Tutorial 4 homework reflects that.

More advanced webscraping with DOM trees

Today we are going to talk about some more advanced techniques in data scraping. Last week we looked at the basic principles of scraping data: requesting a web page from code (in this case Python), retrieving the resulting webpage, and operating on that webpage in some way as data. The main processing we did was to look for all links on the page and create something like a sitemap: a diagram of links. In the context of our discussion about distant reading, digital humanities, and network diagrams (i.e. graphs), we could say that you used data scraping to create a network diagram about a collection of webpages.

Today we're going to experiment with some techniques that try to get more granular & detailed about collecting data from webpages.

In general, data scraping is a process of finding what we might call unstructured data, and imposing some structure on it, usually by storing it in a data structure or a database of some sort. This could even be as simple as finding all the entries of a search result and copy/pasting them into rows of an Excel or Google Sheets spreadsheet.

You could of course do something like that by hand, but usually when we talk about data scraping we're talking about automating that process with an algorithm implemented in code. To automate a process like this, you need to start with some data that you're interested in, then determine some rules or patterns that tell a computer program how to identify those data points, so that it can find them and save them into a structured form (like a data structure or a database) that you can operate on later.

For example, say you were conducting some investigative journalism and you wanted to find everyone who had donated to a politician's election campaign so that you could determine average contributions by geographical region or industry. You might find this information on a webpage, but there are thousands of donors, and there is no easy way to do that calculation. Or say you wanted to generate a list of all the people who had donated to a charity so that you could put their names and contact information into a spreadsheet, in order to contact them all via email in an automated way to solicit funds for a similar non-profit.

In each of those cases, you would need to locate the data, then determine some rule for how to identify those data points in order to automate the process. We'll do that by looking at the structure of HTML web pages, which we'll think of today as tree structures.

In most of these cases, the data probably does exist somewhere in a structured form. (The politician probably has a spreadsheet or database of contributors, the charity probably has a contact list of donors.) But the entities that hold that data in a structured form may not want to share it in an easy-to-use way. So data scraping is often a process of reverse engineering the structure of some collection of information.

Table of contents

Table of contents for the topics today: (Click to jump down to the corresponding section.)

  1. Trees
  2. Review: Absolute URLs versus relative URLs
  3. Navigating a parse tree
  4. More advanced techniques with BeautifulSoup
  5. Searching the DOM
  6. An example, NYTimes.com

I. Trees

To do this, we're going to think about a data structure known as a tree. A tree is a specific type of graph. Remember from our conceptual discussion this week about distant reading that a graph is comprised of a collection of nodes, connected by links called edges. (We actually talked about graphs and network diagrams today, Weds, because we ran out of time on Monday. The PDF of slides is here.)

A stand of trees in California viewed from a distance, by twitter user vermillion baby

In graph theory, a cycle (or loop) is any collection of one or more edges which will lead you from a node back to itself. In the image below on the left, hopefully you can see how the edge 2-3 would form a cycle, as would 2-5, 1-4, or 4-6.

A tree is defined as a graph with no cycles. This means that there is exactly one path (or route) from any node to any other node. As an implication of this, we can designate one node as a kind of base node, or starting point for the tree. This is called the root. We usually then depict trees as fanning out from the root, often arranged vertically, with the root at the top or bottom (usually at the top). Hopefully you can see how the below two graphs are structurally identical, with the one on the right merely rendered vertically, in the common manner of a tree.

Trees are great for representing hierarchy. Think about the org chart of a company — the structure of who reports to whom. The terminology of trees aligns with this aspect of hierarchy. We often refer to adjacent nodes in a tree as either parent or child nodes, in accordance with their position in the vertical representation. Each node in a tree (except the root node) has precisely one parent, and zero or more children. When a node has no children, we can say that it is a leaf node. In other words, it is at the end of one branch. If two nodes share the same parent, we might also say that they are sibling nodes, that they are at the same level in the tree hierarchy, or that they are at the same depth in the tree.

Trees are used for many things in computer science.

A tree is what we might call an abstract data structure: it is a data structure in the sense that it is a diagrammatic form, or a model for organizing data, but it is abstract in the sense that it can be implemented in many different ways using other more primitive data structures like lists and dictionaries. Python does not have a tree data structure built-in as it does lists and dictionaries. But you can implement a tree in different ways using those built-in data structures.

We've already talked about some ways you could implement a tree. In the case of a dynamically-generated sitemap, we talked about how you could use a dictionary to represent pages, with each one stored with its URL as a key pointing to a list of other URLs corresponding to linked pages.
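For example, a minimal sketch of that idea might look like this. (The URLs here are made-up placeholders, not pages from any real site.)

  # A sitemap stored as a dictionary: each key is a page's URL, and its value
  # is a list of the URLs that page links to.
  sitemap = {
      "http://example.com/index.html": ["http://example.com/about.html",
                                        "http://example.com/contact.html"],
      "http://example.com/about.html": ["http://example.com/index.html"],
      "http://example.com/contact.html": [],   # a leaf: no outgoing links
  }

  # Starting from http://example.com/index.html as the root, the links
  # fan out from it like the branches of a tree.
  for page, links in sitemap.items():
      print(page, "links to", len(links), "other pages")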

II. Review: Absolute URLs versus relative URLs

Before going further, I want to elaborate on a detail that I briefly discussed in last week's technical lesson. To explain this issue, I need to offer some context.

Many of the topics today will be relevant to Unit 3 of our class, when we'll examine code as a form of communication and argumentation in the context of hypermedia texts, multimedia, and web publishing. Some principles of web scraping can be confusing without pausing to consider these topics, so we'll introduce them here in the context of data, data structures, and web scraping, and we'll return to them in a few weeks.

In the earliest days of the World Wide Web, web sites consisted simply of files in a file system. Files were arranged in folders which a web server could access, and URLs corresponded to this file tree. Things have changed quite a bit with more advanced web development techniques, but the conceptual idea of organizing a web site in this way, as defined by a hierarchy of URLs, has more or less persisted.

A very contrived example of files and folders arranged as for a website.

For example, if the above files and folders were the basis for a simple website, and this website was installed on a server with the domain name http://rorysolomon.com, some of the URLs of this site would be:

  http://rorysolomon.com/homepage.html
  http://rorysolomon.com/contact-info.html
  http://rorysolomon.com/my-classes/class-list.html
  http://rorysolomon.com/my-classes/fall/code-toolkit.html
  http://rorysolomon.com/my-classes/fall/radical-software.html

Notice how a URL is essentially a single line of text representing a path in a tree from root to leaf.

As we considered briefly last week (and will elaborate on in Unit 3), to create a link from one web page to another, you use an HTML element called an <a> tag, with an href attribute that includes the URL you are linking to. For example if I wanted to link from one page in my site to another, I could do it like this:

Inside homepage.html:

  <a href="http://rorysolomon.com/contact-info.html">click here for my contact info</a>

This is called an absolute URL because it specifies the complete information to tell a browser how to access this page. Another way to specify this link would be with a relative URL, which would look like this:

Inside homepage.html:

  <a href="contact-info.html">click here for my contact info</a>

Since no domain name is specified, the browser infers the location of this linked file based on the URL of the file containing the link. In this case, the browser would infer that contact-info.html is also located at http://rorysolomon.com, and that it is also in the root directory /. This would work because homepage.html and contact-info.html are in the same folder.

However, if I added this link in homepage.html, it would break:

Inside homepage.html (error):

  <a href="class-list.html">click here for my classes</a>

because these files are not in the same folder.

To fix this error, I would have to specify the path from homepage.html to class-list.html in terms of any subfolders that class-list.html is contained in. You do that like this:

Inside homepage.html:

  <a href="my-classes/class-list.html">click here for my classes</a>

Similarly, if I was working in code-toolkit.html and I wanted to link to another class in that same folder, I could specify a relative URL like this:

Inside code-toolkit.html:

  <a href="radical-software.html">click here for Radical Software</a>

You specify multiple subfolders by concatenating them together with / like this:

Inside homepage.html:

  <a href="my-classes/fall/radical-software.html">click here for Radical Software</a>

And if you ever need to go "up", to a parent folder, you can do that with the funny notation .., like this:

Inside class-list.html:

  <a href="../homepage.html">click here for the homepage</a>

or to go up multiple levels:

Inside code-toolkit.html:

  <a href="../../homepage.html">click here for the homepage</a>

Why am I explaining all this?!? Because it can cause major headaches when web scraping, and I think several of you were encountering these problems this week. In the example from last week's class notes, I was using the New York Times website, which specified absolute URLs, so I was able to simply add each URL to my list. But in many of your programs, for example when working with Wikipedia, the HTML specifies relative URLs. Trying to process those later will give an error, because a relative URL on its own, with no domain name, is not a well-formed HTTP request.

The solution is to check whether the new URL that you are adding is relative or absolute, and if it is relative, add on the necessary parts to make it into an absolute URL. Have a look at the modified example from last week to see what I mean. (Changed portions are in blue.) Note that this code is not a perfect solution. This kind of work can get complicated, and this code will need to be tested in your particular case before assuming it will work smoothly. Feel free to ask me for guidance.
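One rough sketch of that check uses Python's built-in urllib.parse module. (This is not the exact code from the linked example; the page URL below is just a placeholder for whatever page your scraper is currently reading.)

  from urllib.parse import urljoin, urlparse

  page_url = "https://en.wikipedia.org/wiki/Main_Page"   # placeholder: the page being scraped

  def make_absolute(href, base=page_url):
      # If the URL already names a scheme (http/https), it is absolute; keep it.
      if urlparse(href).scheme:
          return href
      # Otherwise, resolve it relative to the URL of the page that contains it.
      return urljoin(base, href)

  print(make_absolute("http://example.com/page.html"))  # already absolute: unchanged
  print(make_absolute("/wiki/Data_scraping"))           # https://en.wikipedia.org/wiki/Data_scraping
  print(make_absolute("../homepage.html"))              # https://en.wikipedia.org/homepage.html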

III. Navigating a parse tree

Another part of the homework that was causing people trouble, and which can be clarified by thinking about trees, is the work of parsing HTML with Beautiful Soup.

We'll go into more detail about how to write and work with HTML in Unit 3, but for now, let's focus on some techniques to read HTML: in particular, the tree aspects of HTML and how an understanding of this can help with the work of data scraping.

(a) Anatomy of an HTML tag

Hypertext Markup Language (HTML) is comprised of tags. Each tag is written as a pair of less-than and greater-than signs, which we call left and right angle brackets or braces:

<   >
These must always appear in pairs. We call the left brace open and the right one closed, so we can say they must always appear as open-and-closed pairs.

We say the stuff in between the braces is inside the braces. The stuff inside the braces defines an HTML tag. The first bit of text after the open brace specifies which HTML tag this is. There are many (many) HTML tags. Again, we'll learn more about these and how to use them in Unit 3; today we'll focus on how they create a tree structure that you can navigate with code.

As an example, let's consider a tag that we've already been looking at:

<a href="http://nytimes.com" id="link1" class="some-style">click here</a>
In this case, the first character after the open brace indicates that this is an a tag, the type of tag used to create hyperlinks.

Other parts of this tag:

  - href="http://nytimes.com" is an attribute: a name (href) and a value ("http://nytimes.com") joined by an equals sign. For an <a> tag, the href attribute specifies the URL being linked to.
  - id="link1" and class="some-style" are two more attributes, discussed below.
  - click here is the content of the tag: the text that actually appears on the page.
  - </a> is the closing tag, written with a / before the tag name, marking where the tag's content ends.

(b) Special attributes: id and class

There are many other types of possible attributes, and we won't cover them all. But two special attributes in particular bear more attention and explanation:

An id attribute uniquely identifies a tag on the HTML page. In other words, the value of an id attribute must be unique within the page.

A class attribute describes some property of the tag, one which it probably has in common with other tags. This is used to style the tag (to control its appearance with CSS), which we'll talk about in Unit 3. Classes often indicate something about how this tag should look or how it fits into the structure of the page. For example: class="main-content". A given class value will likely be applied to many tags on a page. Also, a tag can have more than one class, in which case they will be separated by spaces. For example: class="main-content large left-side".

Let's consider the following snippet of HTML code:

<html>
  <head>
    <title>Basic webpage</title>
  </head>
  <body>
    <div>
      <h1>Hello</h1>
      <div>
        <p>A paragraph</p>
        <p>Another paragraph</p>
      </div>
    </div>
  </body>
</html>
This example contains seven different tag types, and nine tags total. Notice how the <title>, <h1>, and <p> tags contain textual content, while the other tags contain other tags. For example the second <div> contains two <p> tags.

A web browser is a computer program that parses this text and represents it internally as a tree. We call this tree of HTML the DOM, which stands for Document Object Model.

The above HTML would be represented by the following tree:
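(Sketched here in text form, with tags shown by name and text nodes in quotes:)

  html
  ├── head
  │   └── title
  │       └── "Basic webpage"
  └── body
      └── div
          ├── h1
          │   └── "Hello"
          └── div
              ├── p
              │   └── "A paragraph"
              └── p
                  └── "Another paragraph"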

Owing to this tree structure, when an HTML tag contains other tags, we say these are its children.

The four leaf nodes in this case are not actually HTML tags at all, but they are still nodes in the DOM: they are text nodes.

IV. More advanced techniques with BeautifulSoup

Hopefully this has already started to give you some greater understanding of how Beautiful Soup works.

Create a new file in Atom, copy/paste the above HTML code into it, and save it. You can call it whatever you like, but make sure it ends with .html. I'll call my file basic-webpage.html. Now, open Terminal, cd to the directory/folder containing this file, run Python, and let's start to experiment with Beautiful Soup interactively in the shell.

First, let's open the file in Python, and use it to create a BeautifulSoup object:

>>> from bs4 import BeautifulSoup
>>> file = open("basic-webpage.html","r")
>>> soup = BeautifulSoup(file, "html.parser")
>>> soup
<html>
<head>
<title>Basic webpage</title>
etc ...  

This opens the HTML file as textual data, and passes that data in to a new BeautifulSoup object. The Beautiful Soup library gives us a bunch of ways to programmatically navigate the DOM and to use Python code to search through the various HTML tags that comprise this document. We access all that through this BeautifulSoup object, which I've called soup, but which you can call whatever you'd like.

The Beautiful Soup documentation can start being very useful to you here, in particular the section titled "Navigating the tree".

As the documentation explains, we can start by using dot notation to access various HTML tags as properties of the soup object:

>>> soup.head
<head>
<title>Basic webpage</title>
</head>
>>> soup.body
<body>
<div>
<h1>Hello</h1>
etc ...    
>>> soup.body.div
<div>
<h1>Hello</h1>
etc ...
>>> soup.body.div.div.p
<p>A paragraph</p>

Analogous to how a URL is a textual representation of a path in a file system, this dot notation is one line of Python code that corresponds to a path in the DOM.

Note that on the last line, I've referenced the <p> child of that <div>, and what was displayed was only the first <p> tag, even though there are two <p> tags in that <div>. If you want to actually access all the children of a tag, you can use the Beautiful Soup property named .contents:

>>> soup.body.div.div.contents
['\n', <p>A paragraph</p>, '\n', <p>Another paragraph</p>, '\n']
Note those funky \n characters. \n is a control character that indicates a newline, i.e. a line break. Every bit of text in the DOM is a node, so even the line breaks between tags get returned to us as their own small text nodes.

Also, notice that this is returning something enclosed in square brackets [ ], indicating that this has returned a Python list. Remember that just like with any list, you can reference these items now by numerical index. (Keep in mind that this list also contains some \n characters.)

>>> soup.body.div.div.contents[0]
'\n'
>>> soup.body.div.div.contents[1]
<p>A paragraph</p>
>>> soup.body.div.div.contents[3]
<p>Another paragraph</p>
And you can loop over a list in Python with a for loop. This code loops over all the children of the inner <div> and prints each to the shell:
>>> for node in soup.body.div.div.contents:
...   node
... 
'\n'
<p>A paragraph</p>
'\n'
<p>Another paragraph</p>
'\n'
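If you want to skip those \n text nodes while looping, you can check each node's type: Beautiful Soup represents actual HTML tags with a class called Tag, in its bs4.element module, so isinstance() can filter for them. A minimal sketch:

>>> from bs4.element import Tag
>>> for node in soup.body.div.div.contents:
...   if isinstance(node, Tag):
...     node
... 
<p>A paragraph</p>
<p>Another paragraph</p>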

This can be useful, and for a small HTML document like this, maybe even sufficient. But this is very different from the kind of web page that you would typically encounter, which would normally be full of many, many more HTML tags. For pages like that, you will find it useful to search the tree.

V. Searching the DOM

Probably the most powerful, effective, all-purpose tool that Beautiful Soup offers you for searching the DOM is the find_all() command. (Docs here.)

This command is very flexible and takes several possible arguments. The first argument corresponds to the name of an HTML tag, and it will return all instances of that tag in the DOM. So the following code will return a list of all <div> tags:

>>> soup.find_all("div")

You can also search the DOM for any tags matching a specific class or id.

Let's make some additions to our HTML code:

<html>
  <head>
    <title>Basic webpage</title>
  </head>
  <body>
    <div>
      <h1 id="node-1">Hello</h1>
      <div class="cool">
        <p>A paragraph</p>
        <p class="cool">Another paragraph</p>
      </div>
    </div>
  </body>
</html>
You'll need to reload this to see these changes if you're working in the shell:
>>> file = open("basic-webpage.html","r")
>>> soup = BeautifulSoup(file, "html.parser")
And now we can search using find_all(), targeting tags based on class or id in the following way:
>>> soup.find_all(id="node-1")
[<h1 id="node-1">Hello</h1>]
>>> soup.find_all(class_="cool")
[<div class="cool">
<p>A paragraph</p>
<p class="cool">Another paragraph</p>
</div>, <p class="cool">Another paragraph</p>]
You can also combine these criteria in many ways. So now this code gets tags with the class cool, but only the <p> tags:
>>> soup.find_all("p",class_="cool")
[<p class="cool">Another paragraph</p>]
Note that these are all returning Python lists. So you can then loop over them or index them numerically.
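For example, this loop prints the text content of every <p> tag, using Beautiful Soup's .get_text() method (which returns all the text inside a tag):

>>> for tag in soup.find_all("p"):
...   print(tag.get_text())
... 
A paragraph
Another paragraph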

If you are trying to search for a tag and you only want one, you can use the find() command, which takes essentially the same arguments. Note that this returns the <h1> as before, but it returns one single element, not a list of length one:

>>> soup.find(id="node-1")
<h1 id="node-1">Hello</h1>

This gets even more powerful because you can call find_all() and find() on specific tags within the DOM, not only on the root object itself, which I've called soup. This will only search the descendants (children, children's children, etc.) of the object you're calling find() on. For example, this code gets the <div> with the class cool, and then searches within that tag for any other tags with class cool:

>>> d = soup.find_all("div",class_="cool")
>>> d
[<div class="cool">
<p>A paragraph</p>
<p class="cool">Another paragraph</p>
</div>]
>>> d[0].find_all(class_="cool")
[<p class="cool">Another paragraph</p>]
This is so powerful because now you can find() or find_all() a specific element within the DOM, and then only search for a given tag within that. For example, you can search for a <div> or <span> tag with a given class or id, and then retrieve an <a> tag within that. This may help you find a URL or piece of data that you are looking for that may not be easily targeted directly, but may be inside the DOM within a different tag that you are able to target.
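Here is a sketch of that pattern in general form. Note that the class name "main-content" is hypothetical here; a real page will have its own names, which you would discover with your browser's inspect tool:

  # Hypothetical sketch: target an enclosing tag first, then search inside it.
  container = soup.find("div", class_="main-content")
  if container:                        # find() returns None if nothing matched
      for link in container.find_all("a"):
          print(link.get("href"))      # .get() returns None if the attribute is missing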

I hope that this technique of combining multiple find()s or find_all()s will be super useful as you work through the homework for this week, including taking some time to improve on your work from last week.

VI. An example, NYTimes.com

(Why do I always turn for examples to the Gray Lady?)

Searching for a keyword, getting a list of articles:

Using the browser inspect tool to start digging around in the DOM tree of this webpage of search results.


Discovering that there is a convenient HTML tag (an <ol>) with an attribute called "data-testid" whose value is "search-results", which contains each search result as an <li>:

results = soup.find_all("ol", {"data-testid": "search-results"})
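From there, a sketch of pulling out the individual results might look something like this. (The markup of nytimes.com changes over time, so treat this as a starting point to verify with the inspect tool, not a guaranteed recipe.)

  # results is a list; if the <ol> was found, each search result is an <li> inside it.
  if results:
      for li in results[0].find_all("li"):
          print(li.get_text())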