Want to build a web scraper in Google Sheets? Turns out, basic web scraping, automatically grabbing data from websites, is possible right in your Google Sheet, without needing to write any code.
You can extract specific information from a website and show it in your Google Sheet using some of Sheets’ special formulas.
For example, recently I needed to find out the authors for a long list of blog posts from a Google Analytics report, to identify the star authors pulling in the page views. It would have been extremely tedious to open each link and manually enter each author’s name. Thankfully, there are some techniques available in Google Sheets to do this for us.
Web Scraper Basic Example
Click here to get your own copy >>
For the purposes of this post, I’m going to demonstrate the technique using posts from the New York Times.
Let’s take a random New York Times article and copy the URL into our spreadsheet, in cell A1:
Navigate to the website, in this example the New York Times:
Note – I know what you’re thinking, wasn’t this supposed to be automated?!? Yes, and it is. But first we need to see how the New York Times labels the author on the webpage, so we can then create a formula to use going forward.
Hover over the author’s byline and right-click to bring up the menu and click
"Inspect Element" as shown in the following screenshot:
This brings up the developer inspection window where we can inspect the HTML element for the byline:
In the new developer console window, there is one line of HTML code that we’re interested in, and it’s the highlighted one:
<span class="byline-author" data-byline-name="JENNIFER MEDINA" itemprop="name" data-twitter-handle="jennymedina">JENNIFER MEDINA</span>
We’re going to use the IMPORTXML function in Google Sheets, with a second argument (called “xpath-query”) that accesses the specific HTML element above.
//span[@class='byline-author'], looks for span elements with a class name “byline-author”, and then returns the value of that element, which is the name of our author.
Copy this formula into the cell B1, next to our URL:
The final output for the New York Times example is as follows:
Web Scraper example with multi-author articles
Consider the following article:
In this case there are two authors in the byline. The formula in step 4 above still works and will return both the names in separate cells, one under the other:
This is fine for a single-use case but if your data is structured in rows (i.e. a long list of URLs in column A), then you’ll want to adjust the formula to show both the author names on the same row.
To do this, I use an Index formula to limit the request to the first author, so the result exists only on that row. The new formula is:
Notice the second argument is 1, which limits to the first name.
Then in the adjancent cell, C1, I add another formula to collect the second author byline:
This works by using 2 to return the author’s name in the second position of the array returned by the IMPORTXML function.
The result is:
Other media web scraper examples
Other websites use different HTML structures, so the formula has to be slightly modified to find the information by referencing the relevant, specific HTML tag. Again, the best way to do this for a new site is to follow the steps above.
Here are a couple of further examples:
For Business Insider, the author byline is accessed with:
For the Washington Post:
Using IMPORTHTML function to scrape tables on websites
Click here to get your own copy >>
Consider the following Wikipedia page, showing a table of the world’s tallest buildings:
Although we can simply copy and paste, this can be tedious for large tables and it’s not automatic. By using the IMPORTHTML formula, we can get Google Sheets to do the heavy lifting for us:
which gives us the output:
Finding the table number (in this example, 2) involves a bit of trial and error, testing out values starting from 1, until you get your desired output.
Note, this formula also works for lists on webpages, in which case you change the “table” reference in the formula to “list”.
For more advanced examples, check out:
Other IMPORT formulas:
If you’re interested in expanding this technique then you’ll want to check out these other Google Sheet formulas:
IMPORTDATA – imports data at a given url in .csv or .tsv format
IMPORTFEED – imports an RSS or ATOM feed
IMPORTRANGE – imports a range of cells from a specified spreadsheet.