Using Google Sheets as a basic web scraper

Want to build a web scraper in Google Sheets? It turns out that basic web scraping, automatically grabbing data from websites, is possible right in your Google Sheet, without writing any code.

You can extract specific information from a website and show it in your Google Sheet using some of Sheets’ special formulas.

For example, recently I needed to find out the authors for a long list of blog posts from a Google Analytics report, to identify the star authors pulling in the page views. It would have been extremely tedious to open each link and manually enter each author’s name. Thankfully, there are some techniques available in Google Sheets to do this for us.

Web Scraper Basic Example

Grab the solution file for this tutorial:
Click here to get your own copy >>

For the purposes of this post, I’m going to demonstrate the technique using posts from the New York Times.

Step 1:

Let’s take a random New York Times article and copy the URL into our spreadsheet, in cell A1:

Example New York Times URL

Step 2:

Navigate to the website, in this example the New York Times:

New York Times screenshot

Note – I know what you’re thinking, wasn’t this supposed to be automated?!? Yes, and it is. But first we need to see how the New York Times labels the author on the webpage, so we can then create a formula to use going forward.

Step 3:

Hover over the author’s byline, right-click to bring up the menu, and click "Inspect Element" as shown in the following screenshot:

New York Times inspect element selection

This brings up the developer inspection window where we can inspect the HTML element for the byline:

New York Times element in developer console

Step 4:

In the new developer console window, there is one line of HTML code that we’re interested in, and it’s the highlighted one:

<span class="byline-author" data-byline-name="JENNIFER MEDINA" itemprop="name" data-twitter-handle="jennymedina">JENNIFER MEDINA</span>

We’re going to use the IMPORTXML function in Google Sheets, with a second argument (called “xpath-query”) that accesses the specific HTML element above.

The xpath-query, //span[@class='byline-author'], looks for span elements with a class name “byline-author”, and then returns the value of that element, which is the name of our author.

Copy this formula into cell B1, next to our URL:

=IMPORTXML(A1,"//span[@class='byline-author']")

The final output for the New York Times example is as follows:

Basic web scraper example using importXML in Google Sheets

Web Scraper example with multi-author articles

Consider the following article:

http://www.nytimes.com/2015/09/25/us/pope-francis-congress-speech.html

In this case there are two authors in the byline. The formula in step 4 above still works, and will return both names in separate cells, one under the other:

Two author web scrape using importXML

This is fine for a one-off lookup, but if your data is structured in rows (i.e. a long list of URLs in column A), then you’ll want to adjust the formula so that both author names show on the same row.

To do this, I use an Index formula to limit the request to the first author, so the result exists only on that row. The new formula is:

=INDEX(IMPORTXML(A1,"//span[@class='byline-author']"),1)

Notice the second argument is 1, which limits to the first name.

Then in the adjacent cell, C1, I add another formula to collect the second author byline:

=INDEX(IMPORTXML(A1,"//span[@class='byline-author']"),2)

This works by using 2 to return the author’s name in the second position of the array returned by the IMPORTXML function.

The result is:

Two author web scrape on same row
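
If you’d rather not write one INDEX formula per author, another option worth trying is to wrap the import in the TRANSPOSE function, which flips the vertical output onto a single row (a sketch, assuming the same byline markup):

=TRANSPOSE(IMPORTXML(A1,"//span[@class='byline-author']"))

Note that this spills across as many columns as there are authors, so keep the cells to the right of it empty.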

Other media web scraper examples

Other websites use different HTML structures, so the formula has to be slightly modified to find the information by referencing the relevant, specific HTML tag. Again, the best way to do this for a new site is to follow the steps above.

Here are a couple of further examples:

For Business Insider, the author byline is accessed with:

=IMPORTXML(A1,"//li[@class='single-author']")

For the Washington Post:

=INDEX(IMPORTXML(A1,"//span[@itemprop='name']"),1)
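
If a Washington Post article has multiple authors, the INDEX pattern from the multi-author section above should extend in the same way (a sketch I haven’t verified against every article layout):

=INDEX(IMPORTXML(A1,"//span[@itemprop='name']"),2)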

Using IMPORTHTML function to scrape tables on websites

Grab the solution file for this tutorial:
Click here to get your own copy >>

Consider the following Wikipedia page, showing a table of the world’s tallest buildings:

https://en.wikipedia.org/wiki/List_of_tallest_buildings

Although we can simply copy and paste, this can be tedious for large tables and it’s not automatic. By using the IMPORTHTML formula, we can get Google Sheets to do the heavy lifting for us:

=importhtml(A1,"table",2)

which gives us the output:

Google Sheets import of Wikipedia table

Finding the table number (in this example, 2) involves a bit of trial and error, testing out values starting from 1, until you get your desired output.
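
One way to speed up that trial and error (a minimal sketch): keep the table number in its own cell, say B1, and point the formula at it, so you only change one value between attempts:

=IMPORTHTML($A$1,"table",$B$1)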

Note, this formula also works for lists on webpages, in which case you change the “table” reference in the formula to “list”.
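
For example, to pull the first list from the page in cell A1 (a sketch; finding the right index takes the same trial and error as it does for tables):

=IMPORTHTML(A1,"list",1)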

Further reading

For more advanced examples, check out:

How to import popular social media network statistics into Google Sheets

Other IMPORT formulas:

If you’re interested in expanding this technique then you’ll want to check out these other Google Sheets formulas (example syntax follows the list):

IMPORTDATA – imports data at a given URL in .csv or .tsv format

IMPORTFEED – imports an RSS or Atom feed

IMPORTRANGE – imports a range of cells from a specified spreadsheet
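
To give a flavor of the syntax, here is one example call for each (sketches only, with placeholder URLs and ranges that you’d swap for your own):

=IMPORTDATA("https://example.com/data.csv")

=IMPORTFEED("https://example.com/feed","items title",FALSE,5)

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/abc123","Sheet1!A1:C10")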

78 thoughts on “Using Google Sheets as a basic web scraper”

    1. Hey Cindi!

      Unfortunately these IMPORT formulas cannot work with data behind a password wall. What email service provider are you using? It’s relatively easy to connect to the MailChimp API (haven’t tried any others) and extract all sorts of email subscriber data that way. I’ll be publishing a post on this soon.

      Cheers,
      Ben

      1. Hi Ben – this doesn’t seem to be working for me. Trying to scrape using your formula, but I’m getting ‘imported content is empty’.

        Has Business Insider changed their code?

        Thanks,

    1. Hey Dave,

      Unfortunately I’ve not been able to grab the client buys data either. I think the problem is that the page content is dynamically generated (it’s a Java Server Page or .jsp which means the content is programmatically created) so the import formulas can’t parse this data.

      I’ll let you know if I get anything working!

      Cheers,
      Ben

  1. Hello Guys,

    I don’t know if this is specific to the region (Europe) or if Google has changed something (the post is 2 years old), but I had to use a semicolon ” ; ” instead of a comma ” , ” in the formula.

    1. Hi Piotrek,

      Yes, you’re right! This is a regional difference: the semicolon is the standard formula separator in continental Europe.

      Cheers,
      Ben

  2. Hi there,
    I’m trying to use this method for Street Easy listings. Mine will work, and then I’ll come back to the sheet later and get the error
    “Imported content is empty.” Any ideas why this might be happening?

    Thanks

  3. As someone who uses programming for web scraping, I found this article very helpful. I often start with complex frameworks like Scrapy for Python, not realizing there are simpler solutions out there. Cheers, Simon

  4. Hello Ben;
    First of all, I would like to say that this info is awesome and amazing, thank you so much! And I would like to ask you 2 more questions:

    1. Is there a way to grab the data from a website which spans multiple pages?
    2. (Actually, this depends on your response to the 1st question.) If there is a way to handle pagination, is there an automated way to grab the data from pages which require age verification (by entering the birth date) among those multiple pages?

    Sorry for my bad English; I hope it is good enough for you to understand my questions 🙂

    Thank you in advance

    Regards

    1. Hi Ozan,

      So there’s no way for the standard formulas to deal with pagination and scrape data from multiple pages. Potentially you could modify the URL each time, depending on how the site is set up, changing the page number, e.g.

      In one cell put https://techcrunch.com/page/2/
      then in next cell: https://techcrunch.com/page/3/
      In next cell: https://techcrunch.com/page/4/
      etc..

      and then run the IMPORT formula for each of those.
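
      A sketch to save typing those URLs by hand (assuming the /page/N/ pattern holds): put this formula in row 2 and drag it down, so ROW() supplies the page number:

      ="https://techcrunch.com/page/"&ROW()&"/"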

      There’s no way you can grab data that requires age verification first, without using Apps Script to programmatically do that. The IMPORT formulas are pretty basic (but powerful!).

      Hope that helps!

      Ben

  5. Hi Ben,

    I’m trying to scrape a value for a crypto coin from this page

    https://bittrex.com/Market/Index?MarketName=BTC-ANS

    The value I want is in the html as this

    0.00203969

    I’ve tried a few different syntaxes, for example this:

    =importxml(A1,"//span[@span data-bind='text: summary.displayLast()']")

    But cannot find the correct syntax.

    Could you help out please.

    Thanks

    1. Hey Sean,

      Unfortunately I don’t think it’s going to be possible with these IMPORT formulas, as the Bittrex site is dynamically generated with client-side JavaScript. I’ve tried all the tricks I’m aware of, but can only get the static parts like headings, not any of the data. Read more about this problem here: https://www.benlcollins.com/spreadsheets/import-social-media-statistics/#notWorking

      Your next step would be to look at writing some Apps Script to extract this data via the API they have (docs here: https://bittrex.com/Home/Api).

      Cheers,
      Ben

        1. Thanks Ben, that is great. Next question might be daft, but once I’ve pulled that data how do I use it in another formula on the sheet?

          I’m trying to calculate a sum using that price value * how many coins I hold and it isn’t working.

          Thanks!

          1. Hey Sean,

            I found this formula =IMPORTXML(A1,"//span[@id='quote_price']"), with the URL in cell A1, worked OK to use in other formulas (e.g. SUM). I had the cell formatted as a number (or accounting). You can also try wrapping your existing formula with the VALUE formula, like this:

            =VALUE(existing formula in here)
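
            In your case that would look something like this (a sketch, assuming your existing import formula stays the same):

            =VALUE(IMPORTXML(A1,"//span[@id='quote_price']"))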

            Hope that helps! Feel free to share your formula if it’s still causing issues.

            Ben

            1. Sorry Ben you lost me 🙂

              At Q1 I have the URL http://coinmarketcap.com/currencies/antshares

              At O1 I have this =IMPORTXML(Q1,"//span[@id='quote_price']")

              O1 is formatted as Financial but still for some reason has the $ symbol

              At L13 I have a sum of my coins

              At L15 I’d like to have a calculation and have tried this, but it’s not working: =sum(L13*O1)

              Not sure what you mean by wrapping the existing formula with the value parameter.

              Can you explain with an example as per my fields above?

              Many thanks!

              Sean

              1. Hey Sean,

                The $ is just formatting that’s been added, it’s still a number.

                If you have your total number of coins in cell L13, using the SUM formula, e.g. SUM(L1:L10) say, then you just need to multiply that by the $ rate from the IMPORT formula in cell O1 to get the value; you don’t need to use SUM here as well.

                Cheers,
                Ben

                1. I’m obviously doing something wrong….

                  L13 value is 120

                  L10 value is 10

                  at L15 I can do =sum(L13*L10) and I get the expected result of 1200

                  At O1 I have the imported value $8

                  If I try to get a total using that value as follows =sum(O1*L13)

                  Instead of a number I get #VALUE!

                  I don’t see what the problem is 🙁

  6. Sorry, just seen this

    Error
    Function MULTIPLY parameter 1 expects number values. But ‘$8.04’ is a text and cannot be coerced to a number.

    I’ve formatted O1 as Number – Financial

    Any idea why it is still being treated as text?

    1. Not sure why yours is showing up as text, so you could try the following two tricks: wherever you want to use O1, replace it with:

      VALUE(O1)

      or

      O1 * 1

      Feel free to share your sheet if still no joy!

        1. Hey Sean,

          Honestly not sure why it’s being imported as text with a “$” at the front and not being recognized as a number. Anyway, here’s a new formula which will just deal with this issue (I’ve updated your sheet with the formula in the yellow box):

          =iferror(value(IMPORTXML(Q3,"//span[@id='quote_price']")),value(substitute(IMPORTXML(Q3,"//span[@id='quote_price']"),"$","")))

          Hope that helps! Once it’s sorted I’ll remove the link to your sheet from the comments here.

          Cheers,
          Ben

  7. Hey Ben,

    Thanks for the example. I’m trying to get the only number available in this URL: https://blockexplorer.com/api/status?q=getDifficulty

    When I hit “inspect item”, it doesn’t seem to have any tags or headers, and it’s not even defined explicitly as an HTML or XML document – it’s just plain text! And I can’t find the right syntax using the IMPORTXML function so that Google Sheets likes it. I just keep getting a #N/A or #VALUE!

    I’ve been trying to put in wildcards and some other things, but I can’t get it to work…

    1. Hey Daniel,

      Assuming you have that URL in cell A1, then you can use this formula to get the data into your Sheet:

      =importdata(A1)

      To extract just the number, try the following formula:

      =regexextract(importdata(A1),"\d+.\d+")

      Hope that helps!

      Ben

  8. Hi Ben,
    That was very helpful.

    I have a question: how can I get multiple pieces of data from a web page?
    For example: I want to extract all the image URLs and the corresponding alt tags from a list of web pages.
    Could you please explain?
    Thanks,
    Abarna

  9. Hi Ben,

    I’m having a similar problem to those above regarding cryptocurrency sheets. I have successfully managed to pull current values from coinmarketcap.com and it’s working well. However, I’m now trying to pull data from another site, https://www.cryptocompare.com/coins/ndc/overview to be specific, using this formula: =IMPORTXML(J23,"//div[@class='price-value']")
    I’m pretty sure I’ve made a fundamental error, but I only found out this was possible yesterday… lol. I’ve been trawling YouTube and forums for a solution and stumbled upon this one. The above is the best I’ve managed to come up with. Any help with this would be appreciated.

    Cheers

    Carlton

  10. Hi Ben,

    Thanks for answering all these questions. I am having an issue, could you please address it?
    Issue: I am trying to grab data from a Google Maps URL (https://www.google.com/maps/place/Inox+Movies+-+LEPL+Icon/@16.5027383,80.657804,17z/data=!3m1!4b1!4m5!3m4!1s0x3a35fac8af6a8e6f:0x31b258f18dfbe910!8m2!3d16.5027383!4d80.6599927)

    From this URL I would like the business name, which is in the format below:
    (Inox Movies – LEPL Icon)

    Formula is
    =query(importxml(A1,"//h1[@class='section-hero-header-title']"))

    I am not getting any data (it shows “Error: Imported content is empty.”), so could you please check and tell us what we are doing wrong?

    Thanks in advance…

    1. Hey Sivaji,

      I’ve tried to extract this data too but unfortunately haven’t been able to either. Not sure it’ll be possible with these IMPORT formulas 🤔

      Ben

  11. Hi Ben,
    Thanks for the article, it looks great, and it looks like you’re answering all beginners’ questions!
    I am trying to do this:
    1. the following web page (http://www.gcatholic.org/events/year/2017.htm#end) lists all nominations and resignations from bishops in the Catholic Church.
    2. I would like to regularly import into a Google spreadsheet the first name, last name, title, and country of the most recently nominated bishops (and only the nominated ones, not the info concerning those who resigned).
    I was trying to do it by myself after reading your article, but it’s more complicated than I thought, and I’m pretty much a beginner.
    Thanks for any help,
    Cédric

    1. Hi Cédric,

      You can use this formula to quickly get all the data from that page into your Google Sheet:

      =importhtml("http://www.gcatholic.org/events/year/2017.htm","table",4)

      This should save you some time. The quickest way to then extract details from this data is probably manually, because the data is unstructured, i.e. textual, so it changes from row to row.

      Hope that helps!

      Ben

      1. Thank you Ben.
        Your formula kind of works, but I guess I was too ambitious in wanting to import from an unstructured page.
        Cédric

  12. Hello Ben,
    Your article is really helpful, thank you for using the QUERY function here. Unfortunately your suggestion can’t solve my problem. Let me explain with this image below:
    https://cdn.pbrd.co/images/GNYpBTL.jpg

    I am scraping data from the Collins thesaurus for a word collection of my own. When I scrape data from Collins I get more than 15 synonyms, with each synonym in a new row and its example in the next column. The image above shows the synonyms and their examples filling all the rows of the 3rd and 4th columns of the sheet for the single word Constitution. As it fills so many rows, I can’t use the IMPORTXML function on the other rows for different words. With your example it becomes cumbersome to add an offset for 15 different synonyms. So what I want is to scrape the data so that all synonyms and their respective examples are in a single row. Is there any way to do it?

    1. Hey Bharath,

      Try wrapping your function with the transpose function, so it’ll look something like this:

      =TRANSPOSE(IMPORTXML(...))

      That should transform the data into two rows for you. It’s a lot more complex to get that data into a single row in the correct order, but it’s possible. Have a look at this example sheet: https://docs.google.com/a/benlcollins.com/spreadsheets/d/1e-mjf_HNWyYvkhFYWqRb2lTanAx77l-YXZO1YYPs71w/edit?usp=sharing

      Hope that helps!

      Ben

  13. Hi, great post and resources B.

    Would you happen to know the IMPORTXML Google Sheets function to identify if a site is using VideoObject schema like this?

    {
      "@context": "http://schema.org",
      "@type": "VideoObject",
      "name": "Title",
      "description": "Video description",
      "thumbnailUrl": [
        "https://example.com/photos/1x1/photo.jpg",
        "https://example.com/photos/4x3/photo.jpg",
        "https://example.com/photos/16x9/photo.jpg"
      ],
      "uploadDate": "2015-02-05T08:00:00+08:00",
      "duration": "PT1M33S",
      "contentUrl": "http://www.example.com/video123.flv",
      "embedUrl": "http://www.example.com/videoplayer.swf?video=123",
      "interactionCount": "2347"
    }

  14. Ben, it seems like you are the guy to turn to for extracting data from web pages. So here goes:
    1. I have a list of business entities in a Google Sheet where each entity’s name is a hyperlink, and I want to extract the hyperlink to a cell in the same row.
    2. The link that was just discussed above then takes me to a page with information for that business entity – see this example – http://www.americanwineryguide.com/wineries/bridge-press-cellars/
    3. I am looking to pull details like address, email, phone, founded, cases, etc. into cells in the same row.

    What do you think?

    1. Hey Doug,

      Yes, this should be possible, based on the url you shared. The IMPORT formula to extract the data for that example is as follows:

      =importxml(A1,"//div[@id='winery_detail_box1a']")

      The //div[@id='winery_detail_box1a'] might vary for different websites, so you have to use the Inspect Element function to find the identity of the element (see post above).

      Cheers,
      Ben

  15. Hello Ben!

    Thank you for a very helpful guide! I have been testing this out on several pages and it works perfectly.

    However, I fail continuously on this site (in Swedish):
    http://www.morningstar.se/Funds/Quicktake/Overview.aspx?perfid=0P00009NT9&programid=0000000000

    I’m trying to extract e.g. the table with “Årlig avkastning %” as the headline. Actually I can’t extract anything from this page without using IMPORTDATA.

    Additionally, Chrome add-ons, e.g. Scraper, can extract it with the XPath
    //*[@id='ctl00_ctl01_cphContent_cphMain_quicktake1_col1_ctl00_ctl04']/table

    But I fail in my Google Sheet so I suspect it has something to do with Java?

    Do you have any proposals?

    1. Hey Emil,

      If you use the IMPORTDATA function and scroll way down the data (line 1075 for me), you’ll see the data table for Årlig avkastning %. However, it’s obviously not very useful in this format. You may be able to extract with REGEX formulas but it would be pretty difficult to do…

      You’re right that the IMPORTXML function does not seem to be able to return anything. Most likely because the page is being (partly) generated by javascript.

      Cheers,
      Ben

    1. Hey Ritu – Unfortunately I don’t think you can with these formulas. The values are being generated by JavaScript, so the actual td and span elements are just empty placeholders for the JavaScript to insert values into. If you try doing =importdata("http://www.bitcoinrates.in/") you’ll see that the tables are blank.

    1. Hey Robi,

      Finally catching up on comments.

      You might be better off grabbing the whole table by using the IMPORTHTML formula like this:

      =importhtml(A1,"table",2)

      That way you get all the columns.

      If you wanted to then extract a specific entry, you can use the QUERY function to pull that out:

      =query(importhtml(A1,"table",2),"select * where Col2='#52'")

      where in this case I have specified that I want #52.

      Hope that helps!

      Ben

  16. Hi Ben,
    Is it possible to do the same with Google Trends, and can a line chart on the website be pulled as a table using IMPORTXML()?

    thanks,
    Sid

  17. Hi Ben,
    Many thanks for your article.
    I’m trying to scrape data from a website with an attributes list like this:

    Design:
    Winter boots
    Recommended use:
    Everyday; Leisure; Winter hiking

    All the elements have the same class. How can I solve this?

  18. Hi Ben,

    Wow, this is a super interesting post…. I’m no Google Sheets wizard either, but even a beginner like me can see how powerful something like this could become.

    If I want to check URL/link status (similar to this post…https://medium.com/@the.benhawy/how-to-use-google-spreadsheets-to-check-for-broken-links-1bb0b35c8525) for hundreds of URLs at a time, wouldn’t the website ban my IP?

    I’m just wondering if I use these “fetchURL” or “IMPORTXML” codes/formulas on my Google Sheet, won’t I run into trouble with the site I am scraping? Does Google Sheets use my laptop’s IP address to conduct these URL fetch commands or am I safe to run hundreds of these checks at a time?

    I have hundreds of links pointing to the same site on my sheet at the moment so if I create the formula and drag it down my sheet I’m scared I may get my IP banned!

    Thanks for your time and attention with my NOOB question and again, great work on this post.

  19. Hi, thanks for sharing such a useful article. But can the IMPORT formulas be used in a use case where a user fills in a Google Form with his/her personal details and then pays through Paytm, with all the details stored in a Google Sheet in parallel, and once the payment is done on Paytm, the status column in the Google Sheet is updated automatically to “payment received” for whoever made the payment?
    Is this possible with web scraping? Please let me know.
    I hope I was able to explain the use case.

    1. Hi Naeem,

      Try this formula to grab all the Amazon ranking data:

      =index(importhtml($A$2,"list",10),9,1)

      This uses ImportHtml to extract the list from the Amazon page that contains the ranking data, and then uses an Index wrapper to extract the 9th row of the array, which has the ranking data in.

      To get an individual ranking number, extend it to this:

      =value(regexextract(index(split(index(importhtml($A$2,"list",10),9,1),"#"),1,2),"[0-9,]+"))

      This splits the result by the “#” to get the different rankings, then uses INDEX to get the relevant part of the array, REGEXEXTRACT to grab the numbers, and VALUE to convert them from strings to numbers.

      Hope that helps!

      Ben

  20. Thanks Ben, really good article and I made use of it but I’m also frustrated with the “Loading” error. It’s not even consistent. Do you have any idea how to circumvent it, maybe?

    1. Thanks Onder. Yes, these formulas can be temperamental, not sure there’s anything you can do about it though. Once you’ve used them to gather your data, I’d suggest converting them to static values, so you don’t lose the data if the formula stops working.

  21. Hi Ben! Thank you for this insightful article. I was wondering if this web scraper formula will work with websites like SimilarWeb, where I want to extract the integer value for the amount of traffic that a specific website is receiving as shown on the SimilarWeb results? Thank you!

  22. Hello Ben

    Wondering if this method or something similar works to extract specific data off PDF documents. I need a system to enter data from my Google Sheet into a website’s form, click through to a specific link on two consecutive pages, and then scrub data from a PDF.

    The website is http://nycprop.nyc.gov/nycproperty/nynav/jsp/selectbbl.jsp

    For example
    Borough: 1
    Block: 40
    Lot: 3

    Page 1 – View Quarterly Prop Tax Bill (QTPB)
    Next Page – Click to view QPTB
    Scrub Mailing address from pdf

  23. Hi Ben

    This is amazing! It has solved so many problems for me!
    Does the data automatically refresh as it updates on the page it’s pulling from, or is there additional work needed to make it update regularly?

    Cheers
