Using Google Sheets as a basic web scraper

Have you ever wanted to extract a specific piece of information from a website, so you can analyze it in your Google spreadsheet? Turns out, basic web scraping is possible right in your Google spreadsheet.

If you want to import social media data into your Google Sheets, you might also be interested in my follow up post to this one – using the same formulas as below to import popular social media network stats into your Google Sheets.

For example, recently I needed to find out the authors for a long list of blog posts from a Google Analytics report, to identify the star authors pulling in the page views. It would have been extremely tedious to open each link and manually enter each author’s name. Thankfully, there are some techniques available in Google Sheets to do this for us.

Basic Example

For the purposes of this post, I’m going to demonstrate the technique using posts from the New York Times.

Step 1:

Let’s take a recent New York Times article and copy the URL into our spreadsheet, in cell A1:

Example New York Times URL
Example New York Times URL

Step 2:

Navigate to the website, in this example the New York Times:

New York Times screenshot
New York Times screenshot

Note – I know what you’re thinking, wasn’t this supposed to be automated?!? Yes, and it is. But first we need to see how the New York Times labels the author on the webpage, so we can then create a formula to use going forward.

Step 3:

Hover over the author’s byline and right-click to bring up the menu and click "Inspect Element" as shown in the following screenshot:

New York Times inspect element selection
New York Times inspect element selection

This brings up the developer inspection window where we can inspect the HTML element for the byline:

New York Times element in developer console
New York Times element in developer console

Step 4:

In the new developer console window, there is one line of HTML code that we’re interested in, and it’s the highlighted one:

<span class="byline-author" data-byline-name="JENNIFER MEDINA" itemprop="name" data-twitter-handle="jennymedina">JENNIFER MEDINA</span>

We’re going to use the importxml function in Google Sheets, with a second argument (called “xpath-query”) that accesses the specific HTML element above.

The xpath-query, //span[@class='byline-author'], looks for span elements with a class name “byline-author”, and then returns the value of that element, which is the name of our author.

Copy this formula into the cell B1, next to our URL:

=importxml(A1,"//span[@class='byline-author']")

The final output for the New York Times example is as follows:

Basic web scraping example using importXML in Google Sheets
Basic web scraping example using importXML in Google Sheets

Multi-author articles

Consider the following article:

http://www.nytimes.com/2015/09/25/us/pope-francis-congress-speech.html

In this case there are two authors in the byline. The formula in step 4 above still works and will return both the names in separate cells, one under the other:

Two author web scrape using importXMLTwo author web scrape using importXML
Two author web scrape using importXML

This is fine for a single-use case but if your data is structured in rows (i.e. a long list of URLs in column A), then you’ll want to adjust the formula to show both the author names on the same row.

To do this, I use a Query formula to limit the request to the first author, so the result exists only on that row. The new formula is:

=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1")

Notice the second argument is the SQL-style "select * limit 1" statement, which limits to the first name.

Then in the adjancent cell, C1, I add another formula to collect the second author byline:

=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1 offset 1")

This works by using an offset clause to return the second author’s name.

The result is:

Two author web scrape on same row
Two author web scrape on same row

For further details on the Query Language, check out the official Google Developer docs here.

Final New York Times example

Putting it all together for our New York Times example, first showing the formulas:

Web scraping formulas using importXML
Web scraping formulas using importXML

And now showing results:

Web scraping results using importXML
Web scraping results using importXML

Other media examples

Other websites use different HTML structures, so the formula has to be slightly modified to find the information by referencing the relevant, specific HTML tag. Again, the best way to do this for a new site is to follow the steps above.

Here are a couple of further examples:

For Business Insider, the author byline is accessed with:

=importxml(A1,"//li[@class='single-author']")

For the Washington Post:

=importxml(A1,"//span[@itemprop='name']")

Other formulas for information gathering

If you’re interested in expanding this technique then you’ll want to check out these other Google Sheet formulas:

importdata – imports data at a given url in .csv or .tsv format
importfeed – imports an RSS or ATOM feed
importhtml – imports data from a table or list within an HTML page
importrange – imports a range of cells from a specified spreadsheet

14 thoughts on “Using Google Sheets as a basic web scraper”

    1. Hey Cindi!

      Unfortunately these IMPORT formulas cannot work with data behind a password wall. What email service provider are you using? It’s relatively easy to connect to the MailChimp API (haven’t tried any others) and extract all sorts of email subscriber data that way. I’ll be publishing a post on this soon.

      Cheers,
      Ben

      1. Hi Ben – this doesn’t seem to be working form Trying to scrap using your formula but I’m getting ‘imported content is empty’.

        Has Business Insider changed their code?

        Thanks,

    1. Hey Dave,

      Unfortunately I’ve not been able to grab the client buys data either. I think the problem is that the page content is dynamically generated (it’s a Java Server Page or .jsp which means the content is programmatically created) so the import formulas can’t parse this data.

      I’ll let you know if I get anything working!

      Cheers,
      Ben

  1. Hello Guys,

    I don’t know is this specific for the region (europe) or google has changed something (post is 2-year old), but i had to use semicolon ” ; ” instead of colon ” , ” in the formula.

    1. Hi Piotrek,

      Yes, you’re right! This is a regional change that’s standard for syntax for continental Europe.

      Cheers,
      Ben

  2. Hi there,
    I’m trying to use this method for Street Easy listings. Mine will work and then I’ll come back to the sheet later and get the “Error
    Imported content is empty.” and ideas why this might be happening?

    Thanks

Leave a Reply

Your email address will not be published. Required fields are marked *