Detailed guide on pushing scraped data to Google Sheets API using Python

When scraping data from the web there is a need to store that data somewhere. One of the best options to store that data in Google Sheets because of numerous reasons. First of all, it is free since Google Sheets are stored inside Google Drive which provides 15 GB of storage for free. Secondly, it has a nice and easy-to-use API. Thirdly, it has a proper request limit, which is 500 requests per 100 seconds per project, and 100 requests per 100 seconds per user. The fourth reason is accessibility, Google Cloud Servers are up 24/7. The fifth reason is. a spreadsheet can be shared with clients easily and they can edit it whenever they want rather than storing data in your private servers where the client doesn’t have access. And the sixth reason is Google Sheets is the only data connection type that have a live connection in Tableau Public.

So lets, get started!

In order to accomplish our goal, we need to create a google cloud project, enable google sheets API, create credentials and download private key to use in our python code to modify data in the spreadsheet.

Step 1. Create Google Cloud project

Go to Google Cloud Console and register if you don’t have an account. Click on Select a project to select or create a project.

Click on New Project

Type your Project Name and click Create

Step 2. Enable Google Drive API and Google Sheets API for newly created project

Click on navigation menu icon, hover on APIs & Services and click on Library

Search for Google Drive API and Google Sheets API and click on them.

Activate both Goodle Drive API and Google Sheets API by clicking Enable button

Step 3. Create Service Account to access Google Sheets from code

Click on Navigation Icon, hover on APIs & Services and click on Credentials

Click on Create Credentials and select Service Account. I selected Service Account because it is for server-to-server communication, just want we want: our automated web spider talking to Google Server.

Give a Name to your service account and click Create to go to the next step.

Select a Role that defines privileges and click Continue. I set Role to Project > Owner so I get all access and modify privileges on my spreadsheet.

No need to fill any more inputs, click Done.

Step 4. Download Private Key

By this step, we have our service account created. Just need to go to Navigation Menu Icon, APIs & Services, and select Credentials. Click on our newly created service account, and create and download a new private key in the following manner. Select JSON when prompted, and put the JSON file to the same directory where your python code will be for the sake of easiness.

Step 5. Create spreadsheet and share with your service account user

If you want to create spreadhsheet from python code skip this step. But since our spider will be pushing data in the same format, I prefer to create spredsheet and put headers manually.

So go to google driver and create a new spreadsheet, after that right-click on spreadsheet file and select share. Copy email of your service account and paste in the input field to share like this:

Step 6. Download required Python libraries

We will use PIP to download the required libraries. We will download Google Authentication Libraries and grpead – Python API for Google Sheets.

pip install oauth2client
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
pip install gspread

Done! Now we can use grpead librariy to access and modify our spreadhsheet.

Following sample code snippet illustrates how we can connect to sheet,read value of some column, check for duplicate and add row:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

#authenticate
scope = ['https://www.googleapis.com/auth/drive']
key = ServiceAccountCredentials.from_json_keyfile_name('google service account key.json')
client = gspread.authorize(key)

#open first workbook of spreadsheet
title_of_spreadhsheet = "Businesses"
spreadsheet = client.open(title_or_spreadhsheet)
#index argument 0 indicates that we want to open 1st worksheet 
worksheet = spreadsheet.get_worksheet(0)

#Lets assume our spreadhsheet has 3 columns: Business ID,Name,Website

#retreive business ids which reside in first column to check for duplicates
businessIds = worksheet.col_values(1)

rows = []
rows.append(["1","Apple","apple.com"])
rows.append(["2","Tesla","tesla.com"])

#add single row
worksheet.append_row(rows[0])

#add several rows
worksheet.append_rows(rows)

Gspread documentation is pretty friendly, make sure to check their site for other functionalities.

This is it! Let me know if you face any challenges or you have any suggestions.