Have you ever wanted to extract a specific piece of information from a website, so you can analyze it in your Google spreadsheet? Turns out, basic web scraping is possible right in your Google spreadsheet.
If you want to import social media data into your Google Sheets, you might also be interested in my follow up post to this one – using the same formulas as below to import popular social media network stats into your Google Sheets.
For example, recently I needed to find out the authors for a long list of blog posts from a Google Analytics report, to identify the star authors pulling in the page views. It would have been extremely tedious to open each link and manually enter each author’s name. Thankfully, there are some techniques available in Google Sheets to do this for us.
For the purposes of this post, I’m going to demonstrate the technique using posts from the New York Times.
Let’s take a recent New York Times article and copy the URL into our spreadsheet, in cell A1:
Navigate to the website, in this example the New York Times:
Note – I know what you’re thinking, wasn’t this supposed to be automated?!? Yes, and it is. But first we need to see how the New York Times labels the author on the webpage, so we can then create a formula to use going forward.
Hover over the author’s byline and right-click to bring up the menu and click
"Inspect Element" as shown in the following screenshot:
This brings up the developer inspection window where we can inspect the HTML element for the byline:
In the new developer console window, there is one line of HTML code that we’re interested in, and it’s the highlighted one:
<span class="byline-author" data-byline-name="JENNIFER MEDINA" itemprop="name" data-twitter-handle="jennymedina">JENNIFER MEDINA</span>
We’re going to use the importxml function in Google Sheets, with a second argument (called “xpath-query”) that accesses the specific HTML element above.
//span[@class='byline-author'], looks for span elements with a class name “byline-author”, and then returns the value of that element, which is the name of our author.
Copy this formula into the cell B1, next to our URL:
The final output for the New York Times example is as follows:
Consider the following article:
In this case there are two authors in the byline. The formula in step 4 above still works and will return both the names in separate cells, one under the other:
This is fine for a single-use case but if your data is structured in rows (i.e. a long list of URLs in column A), then you’ll want to adjust the formula to show both the author names on the same row.
To do this, I use a Query formula to limit the request to the first author, so the result exists only on that row. The new formula is:
=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1")
Notice the second argument is the SQL-style
"select * limit 1" statement, which limits to the first name.
Then in the adjancent cell, C1, I add another formula to collect the second author byline:
=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1 offset 1")
This works by using an
offset clause to return the second author’s name.
The result is:
For further details on the Query Language, check out the official Google Developer docs here.
Final New York Times example
Putting it all together for our New York Times example, first showing the formulas:
And now showing results:
Other media examples
Other websites use different HTML structures, so the formula has to be slightly modified to find the information by referencing the relevant, specific HTML tag. Again, the best way to do this for a new site is to follow the steps above.
Here are a couple of further examples:
For Business Insider, the author byline is accessed with:
For the Washington Post:
Other formulas for information gathering
If you’re interested in expanding this technique then you’ll want to check out these other Google Sheet formulas:
importdata – imports data at a given url in .csv or .tsv format
importfeed – imports an RSS or ATOM feed
importhtml – imports data from a table or list within an HTML page
importrange – imports a range of cells from a specified spreadsheet