Using Google Sheets as a basic web scraper

Want to build a web scraper in Google Sheets? Turns out, basic web scraping, automatically grabbing data from websites, is possible right in your Google Sheet, without needing to write any code.

You can extract specific information from a website and show it in your Google Sheet using some of Sheets’ special formulas.

If you want to import social media data into your Google Sheets, you might also be interested in my follow up post to this one – using the same formulas as below to import popular social media network stats into your Google Sheets.

For example, recently I needed to find out the authors for a long list of blog posts from a Google Analytics report, to identify the star authors pulling in the page views. It would have been extremely tedious to open each link and manually enter each author’s name. Thankfully, there are some techniques available in Google Sheets to do this for us.

Web Scraper Basic Example

For the purposes of this post, I’m going to demonstrate the technique using posts from the New York Times.

Step 1:

Let’s take a recent New York Times article and copy the URL into our spreadsheet, in cell A1:

Example New York Times URL
Example New York Times URL

Step 2:

Navigate to the website, in this example the New York Times:

New York Times screenshot
New York Times screenshot

Note – I know what you’re thinking, wasn’t this supposed to be automated?!? Yes, and it is. But first we need to see how the New York Times labels the author on the webpage, so we can then create a formula to use going forward.

Step 3:

Hover over the author’s byline and right-click to bring up the menu and click "Inspect Element" as shown in the following screenshot:

New York Times inspect element selection
New York Times inspect element selection

This brings up the developer inspection window where we can inspect the HTML element for the byline:

New York Times element in developer console
New York Times element in developer console

Step 4:

In the new developer console window, there is one line of HTML code that we’re interested in, and it’s the highlighted one:

<span class="byline-author" data-byline-name="JENNIFER MEDINA" itemprop="name" data-twitter-handle="jennymedina">JENNIFER MEDINA</span>

We’re going to use the importxml function in Google Sheets, with a second argument (called “xpath-query”) that accesses the specific HTML element above.

The xpath-query, //span[@class='byline-author'], looks for span elements with a class name “byline-author”, and then returns the value of that element, which is the name of our author.

Copy this formula into the cell B1, next to our URL:

=importxml(A1,"//span[@class='byline-author']")

The final output for the New York Times example is as follows:

Basic web scraper example using importXML in Google Sheets
Basic web scraping example using importXML in Google Sheets

Web Scraper example with multi-author articles

Consider the following article:

http://www.nytimes.com/2015/09/25/us/pope-francis-congress-speech.html

In this case there are two authors in the byline. The formula in step 4 above still works and will return both the names in separate cells, one under the other:

Two author web scrape using importXMLTwo author web scrape using importXML
Two author web scrape using importXML

This is fine for a single-use case but if your data is structured in rows (i.e. a long list of URLs in column A), then you’ll want to adjust the formula to show both the author names on the same row.

To do this, I use a Query formula to limit the request to the first author, so the result exists only on that row. The new formula is:

=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1")

Notice the second argument is the SQL-style "select * limit 1" statement, which limits to the first name.

Then in the adjancent cell, C1, I add another formula to collect the second author byline:

=query(importxml(A1,"//span[@class='byline-author']"),"select * limit 1 offset 1")

This works by using an offset clause to return the second author’s name.

The result is:

Two author web scrape on same row
Two author web scrape on same row

For further details on the Query Language, check out the official Google Developer docs here.

Final New York Times web scraper example

Putting it all together for our New York Times example, first showing the formulas:

Web scraping formulas using importXML
Web scraping formulas using importXML

And now showing results:

Web scraping results using importXML
Web scraping results using importXML

Other media web scraper examples

Other websites use different HTML structures, so the formula has to be slightly modified to find the information by referencing the relevant, specific HTML tag. Again, the best way to do this for a new site is to follow the steps above.

Here are a couple of further examples:

For Business Insider, the author byline is accessed with:

=importxml(A1,"//li[@class='single-author']")

For the Washington Post:

=importxml(A1,"//span[@itemprop='name']")

Other formulas for web scraping

If you’re interested in expanding this technique then you’ll want to check out these other Google Sheet formulas:

importdata – imports data at a given url in .csv or .tsv format
importfeed – imports an RSS or ATOM feed
importhtml – imports data from a table or list within an HTML page
importrange – imports a range of cells from a specified spreadsheet.

If you enjoyed this content, then you’ll definitely want to check out my follow up post on how to import popular social media network statistics into your Google Sheets, using these same formulas.

37 thoughts on “Using Google Sheets as a basic web scraper”

    1. Hey Cindi!

      Unfortunately these IMPORT formulas cannot work with data behind a password wall. What email service provider are you using? It’s relatively easy to connect to the MailChimp API (haven’t tried any others) and extract all sorts of email subscriber data that way. I’ll be publishing a post on this soon.

      Cheers,
      Ben

      1. Hi Ben – this doesn’t seem to be working form Trying to scrap using your formula but I’m getting ‘imported content is empty’.

        Has Business Insider changed their code?

        Thanks,

    1. Hey Dave,

      Unfortunately I’ve not been able to grab the client buys data either. I think the problem is that the page content is dynamically generated (it’s a Java Server Page or .jsp which means the content is programmatically created) so the import formulas can’t parse this data.

      I’ll let you know if I get anything working!

      Cheers,
      Ben

  1. Hello Guys,

    I don’t know is this specific for the region (europe) or google has changed something (post is 2-year old), but i had to use semicolon ” ; ” instead of colon ” , ” in the formula.

    1. Hi Piotrek,

      Yes, you’re right! This is a regional change that’s standard for syntax for continental Europe.

      Cheers,
      Ben

  2. Hi there,
    I’m trying to use this method for Street Easy listings. Mine will work and then I’ll come back to the sheet later and get the “Error
    Imported content is empty.” and ideas why this might be happening?

    Thanks

  3. As someone who uses programming for web scraping, this article was very helpful. I often start with complex frameworks like Scrapy for Python, not realizing there are simpler solutions out there. Cheers, Simon

  4. Hello Ben;
    First of all, I would like to admit that those infos are awesome and amazing, thank you so much! and I would like to ask you 2 more questions;

    1-Is there a way to grab the data from website which includes multiple pages?
    2-(Actually, this question depends on your response to the 1st question) So, if there is a way for pagination then is there an automated way to grab the data from the pages which require age verification (by entering the birth date) among those multiple pages?

    Sorry for my bad English I hope that it is good enough to make you understand my questions 🙂

    Thank you in advance

    Regards

    1. Hi Ozan,

      So there’s no way for standard formulas to deal with pagination and scrape data from multiple pages. Potentially you could modify the URL each time, depending on how it was set up, so that you could change the pagination number each time, e.g.

      In one cell put https://techcrunch.com/page/2/
      then in next cell: https://techcrunch.com/page/3/
      In next cell: https://techcrunch.com/page/4/
      etc..

      and then run the IMPORT formula for each of those.

      There’s no way you can grab data that requires age verification first, without using Apps Script to programmatically do that. The IMPORT formulas are pretty basic (but powerful!).

      Hope that helps!

      Ben

  5. Hi Ben,

    I’m trying to scrape a value for a crypto coin from this page

    https://bittrex.com/Market/Index?MarketName=BTC-ANS

    The value I want is in the html as this

    0.00203969

    I’ve tried a few different syntax, for example this

    =importxml(A1,”//span[@span data-bind=’text: summary.displayLast()’]”)

    But cannot find the correct syntax.

    Could you help out please.

    Thanks

    1. Hey Sean,

      Unfortunately I don’t think it’s going to be possible with these IMPORT formulas, as the Bittrex site is dynamically generated client-side javascript. I’ve tried all the tricks I’m aware of, but can only get the static parts like headings, not any of the data. Read more about this problem here: https://www.benlcollins.com/spreadsheets/import-social-media-statistics/#notWorking

      Your next step would be to look at writing some Apps Script to extract this data via the API they have (docs here: https://bittrex.com/Home/Api).

      Cheers,
      Ben

        1. Thanks Ben that is great, next question might be daft but once I’ve pulled that data how do I use it in anther formula on the sheet?

          I’m trying to calculate a sum using that price value * how many coins I hold and it isn’t working.

          Thanks!

          1. Hey Sean,

            I found this formula =IMPORTXML(A1,"//span[@id='quote_price']"), with the URL in cell A1, worked ok to use in other formulas (e.g. sum). I had the cell formatted as a number (or accounting). You can also try wrapping your existing formula with the VAlUE formula, like this:

            =VALUE(existing formula in here)

            Hope that helps! Feel free to share your formula if it’s still causing issues.

            Ben

            1. Sorry Ben you lost me 🙂

              At Q1 I have the URL http://coinmarketcap.com/currencies/antshares

              At O1 I have this =IMPORTXML(Q1,”//span[@id=’quote_price’]”)

              O1 is formatted as Financial but still for some reason has the $ symbol

              At L13 I have a sum of my coins

              At L15 I’d like to have a calculation and have tried this but its not working =sum(L13*O1)

              Not sure what you mean by wrapping the existing formala with the value parameter.

              Can you explain with an example as per my fields above?

              Many thanks!

              Sean

              1. Hey Sean,

                The $ is just formatting that’s been added, it’s still a number.

                If you have your total number of coins in cell L13, using the SUM formula, e.g. SUM(L1:L10) say, then you just need to multiply that by the $ rate from the IMPORT formula in cell O1 to get the value, you don’t need to use the sum here as well.

                Cheers,
                Ben

                1. I’m obviously doing something wrong….

                  L13 value is 120

                  L10 value is 10

                  at L15 I can do =sum(L13*L10) and I get the expected result of 1200

                  At O1 I have the imported value $8

                  If I try to get a total using that value as follows =sum(01*L13)

                  Instead of a number I get #VALUE!

                  I don’t see what the problem is 🙁

  6. Sorry, just seen this

    Error
    Function MULTIPLY parameter 1 expects number values. But ‘$8.04’ is a text and cannot be coerced to a number.

    I’ve formatted O1 as Number – Financial

    Any idea why it is still being treated as text?

    1. Not sure why yours is showing up as text, so you could try the following two tricks wherever you want to use O1, replace with:

      VALUE(O1)

      or

      O1 * 1

      Feel free to share your sheet if still no joy!

        1. Hey Sean,

          Honestly not sure why it’s being imported as text with a “$” at the front and not being recognized as a number. Anyway, here’s a new formula which will just deal with this issue (I’ve updated your sheet with the formula in the yellow box):

          =iferror(value(IMPORTXML(Q3,”//span[@id=’quote_price’]”)),value(substitute(IMPORTXML(Q3,”//span[@id=’quote_price’]”),”$”,””)))

          Hope that helps! Once it’s sorted I’ll remove the link to your sheet from the comments here.

          Cheers,
          Ben

  7. Hey Ben,

    Thanks for the example. I’m trying to get the only number available in this URL: https://blockexplorer.com/api/status?q=getDifficulty

    When I hit “inspect item”, it doesn’t seem to have any tags, headers and it’s not even defined explicitly as an HTML or XML document -it’s just plain text! And I can’t find the right syntax using the IMPORTXML function so that Google Sheets likes it. I just keep getting a #N/A or #VALUE!

    I’ve been trying to put in wildcards and some other things, but I can’t get it to work…

    1. Hey Daniel,

      Assuming you have that URL in cell A1, then you can use this formula to get the data into your Sheet:

      =importdata(A1)

      To extract just the number, try the following formula:

      =regexextract(importdata(A1),"\d+.\d+")

      Hope that helps!

      Ben

  8. Hi Ben,
    That was very helpful.

    I have a question. How can I get multiple data from a web page.
    For example: I want to extract all the image URLs and the corresponding alt tags from a list of web pages.
    Could you please explain?
    Thanks,
    Abarna

  9. Hi Ben,

    I’m having a similar problem to those above regarding Crypto Currency sheets. I have successfully managed to pull current values from coinmarketcap.com and it’s working well. However, I’m now trying to pull data from another site https://www.cryptocompare.com/coins/ndc/overview to be specific, using this formula – =IMPORTXML(J23,”//div[@class=’price-value’]”).
    I’m pretty sure I’ve made a fundamental error, but I only found out this was possible yesterday….lol I’ve been trawling youtube and forums for a solution and stumbled upon this one… The above is the best I’ve managed to come up with. Any help with this would be appreciated.

    Cheers

    Carlton

  10. Hi Ben,

    Thanks for answering all questions, I am having an issue could you please address it.
    Issue: I am trying grab data from google maps url (https://www.google.com/maps/place/Inox+Movies+-+LEPL+Icon/@16.5027383,80.657804,17z/data=!3m1!4b1!4m5!3m4!1s0x3a35fac8af6a8e6f:0x31b258f18dfbe910!8m2!3d16.5027383!4d80.6599927)

    from this url I would like business name it is in below format
    (Inox Movies – LEPL Icon)

    Formula is
    =query(importxml(A1,”//h1[@class=’section-hero-header-title’]”))

    I am not getting any data(showing : ErrorImported content is empty.) so could please verify once and correct us what we are doing wrong.

    Thanks in advance…

Leave a Reply

Your email address will not be published. Required fields are marked *