Using Google Sheets as a basic web scraper

Have you ever wanted to extract a specific piece of information from a website, so you can analyze it in your Google spreadsheet? Turns out, basic web scraping is possible right in your Google spreadsheet.

For example, recently I needed to find out the authors for a long list of blog posts from a Google Analytics report, to identify the star authors pulling in the page views. It would have been extremely tedious to open each link and manually enter each author’s name. Thankfully, there are some techniques available in Google Sheets to do this for us.

Basic Example

For the purposes of this post, I’m going to demonstrate the technique using posts from the New York Times.

Step 1:

Let’s take a recent New York Times article and copy the URL into our spreadsheet, in cell A1:

Example New York Times URL
Example New York Times URL

Step 2:

Navigate to the website, in this example the New York Times:

New York Times screenshot
New York Times screenshot

Note – I know what you’re thinking, wasn’t this supposed to be automated?!? Yes, and it is. But first we need to see how the New York Times labels the author on the webpage, so we can then create a formula to use going forward.

Step 3:

Hover over the author’s byline and right-click to bring up the menu and click "Inspect Element" as shown in the following screenshot:

New York Times inspect element selection
New York Times inspect element selection

This brings up the developer inspection window where we can inspect the HTML element for the byline:

New York Times element in developer console
New York Times element in developer console

Step 4:

In the new developer console window, there is one line of HTML code that we’re interested in, and it’s the highlighted one:

<span class="byline-author" data-byline-name="JENNIFER MEDINA" itemprop="name" data-twitter-handle="jennymedina">JENNIFER MEDINA</span>

We’re going to use the importxml function in Google Sheets, with a second argument (called “xpath-query”) that accesses the specific HTML element above.

The xpath-query, //span[@class='byline-author'], looks for span elements with a class name “byline-author”, and then returns the value of that element, which is the name of our author.

Copy this formula into the cell B1, next to our URL:


The final output for the New York Times example is as follows:

Basic web scraping example using importXML in Google Sheets
Basic web scraping example using importXML in Google Sheets

Multi-author articles

Consider the following article:

In this case there are two authors in the byline. The formula in step 4 above still works and will return both the names in separate cells, one under the other:

Two author web scrape using importXMLTwo author web scrape using importXML
Two author web scrape using importXML

This is fine for a single-use case but if your data is structured in rows (i.e. a long list of URLs in column A), then you’ll want to adjust the formula to show both the author names on the same row.

To do this, I use a Query formula to limit the request to the first author, so the result exists only on that row. The new formula is:

=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1")

Notice the second argument is the SQL-style "select * limit 1" statement, which limits to the first name.

Then in the adjancent cell, C1, I add another formula to collect the second author byline:

=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1 offset 1")

This works by using an offset clause to return the second author’s name.

The result is:

Two author web scrape on same row
Two author web scrape on same row

For further details on the Query Language, check out the official Google Developer docs here.

Final New York Times example

Putting it all together for our New York Times example, first showing the formulas:

Web scraping formulas using importXML
Web scraping formulas using importXML

And now showing results:

Web scraping results using importXML
Web scraping results using importXML

Other media examples

Other websites use different HTML structures, so the formula has to be slightly modified to find the information by referencing the relevant, specific HTML tag. Again, the best way to do this for a new site is to follow the steps above.

Here are a couple of further examples:

For Business Insider, the author byline is accessed with:


For the Washington Post:


Other formulas for information gathering

If you’re interested in expanding this technique then you’ll want to check out these other Google Sheet formulas:

importdata – imports data at a given url in .csv or .tsv format
importfeed – imports an RSS or ATOM feed
importhtml – imports data from a table or list within an HTML page
importrange – imports a range of cells from a specified spreadsheet

Leave a Reply

Your email address will not be published. Required fields are marked *