Detailed guide on pushing scraped data to Google Sheets API using Python
When scraping data from the web there is a need to store that data somewhere. One of the best options to store that data in Google Sheets because of numerous reasons. First of all, it is free since Google Sheets are stored inside Google Drive which provides 15 GB of storage for free. Secondly, it has a nice and easy-to-use API. Thirdly, it has a proper request limit, which is 500 requests per 100 seconds per project, and 100 requests per 100 seconds per user. The fourth reason is accessibility, Google Cloud Servers are up 24/7. The fifth reason is. a spreadsheet can be shared with clients easily and they can edit it whenever they want rather than storing data in your private servers where the client doesn’t have access. And the sixth reason is Google Sheets is the only data connection type that have a live connection in Tableau Public.
So lets, get started!
In order to accomplish our goal, we need to create a google cloud project, enable google sheets API, create credentials and download private key to use in our python code to modify data in the spreadsheet.
Step 1. Create Google Cloud project
Go to Google Cloud Console and register if you don’t have an account. Click on Select a project to select or create a project.
Click on New Project
Type your Project Name and click Create
Step 2. Enable Google Drive API and Google Sheets API for newly created project
Click on navigation menu icon, hover on APIs & Services and click on Library
Search for Google Drive API and Google Sheets API and click on them.
Activate both Goodle Drive API and Google Sheets API by clicking Enable button
Step 3. Create Service Account to access Google Sheets from code
Click on Navigation Icon, hover on APIs & Services and click on Credentials
Click on Create Credentials and select Service Account. I selected Service Account because it is for server-to-server communication, just want we want: our automated web spider talking to Google Server.
Give a Name to your service account and click Create to go to the next step.
Select a Role that defines privileges and click Continue. I set Role to Project > Owner so I get all access and modify privileges on my spreadsheet.
No need to fill any more inputs, click Done.
Step 4. Download Private Key
By this step, we have our service account created. Just need to go to Navigation Menu Icon, APIs & Services, and select Credentials. Click on our newly created service account, and create and download a new private key in the following manner. Select JSON when prompted, and put the JSON file to the same directory where your python code will be for the sake of easiness.
Step 5. Create spreadsheet and share with your service account user
If you want to create spreadhsheet from python code skip this step. But since our spider will be pushing data in the same format, I prefer to create spredsheet and put headers manually.
So go to google driver and create a new spreadsheet, after that right-click on spreadsheet file and select share. Copy email of your service account and paste in the input field to share like this:
Step 6. Download required Python libraries
pip install oauth2client pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib pip install gspread
Following sample code snippet illustrates how we can connect to sheet,read value of some column, check for duplicate and add row:
import gspread from oauth2client.service_account import ServiceAccountCredentials #authenticate scope = ['https://www.googleapis.com/auth/drive'] key = ServiceAccountCredentials.from_json_keyfile_name('google service account key.json') client = gspread.authorize(key) #open first workbook of spreadsheet title_of_spreadhsheet = "Businesses" spreadsheet = client.open(title_or_spreadhsheet) #index argument 0 indicates that we want to open 1st worksheet worksheet = spreadsheet.get_worksheet(0) #Lets assume our spreadhsheet has 3 columns: Business ID,Name,Website #retreive business ids which reside in first column to check for duplicates businessIds = worksheet.col_values(1) rows =  rows.append(["1","Apple","apple.com"]) rows.append(["2","Tesla","tesla.com"]) #add single row worksheet.append_row(rows) #add several rows worksheet.append_rows(rows)
Gspread documentation is pretty friendly, make sure to check their site for other functionalities.
This is it! Let me know if you face any challenges or you have any suggestions.