
How to Upload Data to Google BigQuery Using Python: In 3 Steps


Automate your API data updates in Google’s cloud data warehouse

Google BigQuery is a fast, scalable data storage solution that easily integrates with some of the top data science applications like Power BI and Tableau.

If you’ve ever used BigQuery before, you probably know that it has a lot of features. Like, a LOT of features. This can definitely be intimidating for new users, but if you stick with it you’ll get a lot of great usage out of the platform!

One feature I regularly use BigQuery for is uploading data for sharing and integrating with other applications. Instead of editing and uploading a CSV file over and over, you’ll find that it’s much easier to have one function linked to all your applications! In this article, I’ll show you how you can take data from an API and make it easy to share and access across a wide variety of platforms and applications.

This should work for just about any API, as long as you’re able to output a pandas data frame.

Step 1: Create a Cloud Function

After signing into your account, the first thing you’ll want to do is go to the “Console” section in the upper right. If you don’t already have a project, go ahead and set one up to your preferences. Once you’re in your console, open the navigation menu in the upper left and scroll down to “Cloud Functions”.

Screenshot of Google BigQuery by Author

You will need to enable billing to use this feature, but depending on how much data you’re moving, it won’t be very expensive. Cloud Functions can use a variety of languages to do practically anything, from simple API calls to machine learning, so it’s well worth it!

Once you’re in Cloud Functions, hit the “Create Function” button. The name and region can be whatever you prefer; for the trigger type, I usually keep it at HTTP.

Then, for the runtime settings, I prefer to set the timeout to the maximum of 540 seconds. The memory allocation and the maximum number of instances are up to you, but this is what I usually go with:

Screenshot of Google BigQuery by Author

Then just hit “Save” and “Next” and you can finally add your custom API pull. Before you copy and paste your code into Google’s inline editor, you’ll have to add a few BigQuery-specific functions. Luckily for you, I’ve got everything you need in Step 2!

Step 2: Add BigQuery Specific Functions

The structure for these BigQuery Functions can seem a bit complicated, but in simple terms, it looks like this:

  • function 1: validate HTTP response
  • function 2: your custom API pull
  • function 3: load data frame into BigQuery table

So what does that look like in Python? Something like this:

import pandas as pd
import requests
import pandas_gbq

'''
function 1: All this function does is respond to and validate any HTTP request; this is
important if you want to schedule an automatic refresh or test the function locally.
'''
def validate_http(request):
    request_json = request.get_json(silent=True)
    if request.args:
        get_api_data()
        return 'Data pull complete'
    elif request_json:
        get_api_data()
        return 'Data pull complete'
    else:
        get_api_data()
        return 'Data pull complete'

'''
function 2: This is where you put your own code; as long as the output is a
pandas dataframe you can write it however you want. Here's an example:
'''
def get_api_data():
    url = 'https://www.apidata.com'
    r = requests.get(url)
    data = r.json()
    df = pd.DataFrame.from_dict(data)
    # This is the only extra line you need to add to your code, just make sure
    # you pick a table name and pass in your pandas dataframe!
    bq_load('TABLE NAME', df)

'''
function 3: This function just converts your pandas dataframe into a BigQuery table;
you'll also need to set the project and dataset names in the variables below.
'''
def bq_load(key, value):
    project_name = 'YOUR PROJECT NAME'
    dataset_name = 'YOUR DATASET NAME'
    table_name = key
    value.to_gbq(destination_table='{}.{}'.format(dataset_name, table_name),
                 project_id=project_name, if_exists='replace')


If you’ve already got an API pull script written, you can just paste it into the get_api_data() function and then copy and paste this whole script into your Cloud Function. It’s that simple! Just make sure that your API call runs locally before attempting to run it in the cloud. A quick local sanity check might look like the sketch below; the URL is the same placeholder used in the script above.
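import pandas as pd
import requests

# Placeholder URL; swap in your real API endpoint.
url = 'https://www.apidata.com'
r = requests.get(url)
r.raise_for_status()  # fail loudly if the API call didn't succeed
df = pd.DataFrame.from_dict(r.json())
print(df.head())

If that prints a sensible dataframe, the same code will behave the same way inside the Cloud Function.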

The last couple of things to do before deploying your function are to make sure that the “Entry Point” is the first function, validate_http, and to make sure you have all the correct dependencies listed for any libraries you’re using. Here’s what that looks like for this example:
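For reference, the requirements.txt for this example only needs the libraries imported in the script; something along these lines should work (package names only — pin versions if you prefer):

# requirements.txt (a minimal sketch for this example)
pandas
requests
pandas-gbq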

Screenshot of Google BigQuery by Author

Finally, you can deploy your function! After you hit deploy, it will build for a few seconds, and if you see a green checkmark, that means it deployed successfully.

Step 3: Test and Refresh Your Table

There are lots of ways to get your new table to show up with data in it, but I think the simplest is to click on your Cloud Function, hit “Testing”, and then “Test the Function”. This will create a BigQuery table that should contain all the data from your API pull! To see this table, just go to the main menu in the upper left and scroll down to “BigQuery”.

Screenshot of Google BigQuery by Author

From here you can view the data’s schema, see when it was updated, and preview what the dataset looks like. And just like that, you’ve got data stored in BigQuery! Now you can link that data to Tableau, Google Data Studio, Power BI, or whatever application best suits you.
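If you’d rather check the table from Python instead of the console, a quick read-back with pandas_gbq works too. The project, dataset, and table names below are placeholders — use the same names you set in bq_load():

import pandas_gbq

# Placeholder identifiers; replace with your own project, dataset, and table.
query = 'SELECT * FROM `YOUR PROJECT NAME.YOUR DATASET NAME.TABLE NAME` LIMIT 10'
df = pandas_gbq.read_gbq(query, project_id='YOUR PROJECT NAME')
print(df)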

Having the ability to access this data from anywhere and link it to multiple applications is well worth the setup, and you can even automate the process by using Google Cloud’s “Cloud Scheduler”. This can also be found in the main menu by scrolling down until you see this:

Screenshot of Google BigQuery by Author

From there you can decide how often you want your data to refresh, and Cloud Scheduler will trigger the function for you automatically. That’s just one less step for you to worry about!
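Under the hood, the scheduler simply sends an HTTP request to the function’s trigger URL — the same thing you can do yourself from Python to kick off a refresh on demand. The URL below is a hypothetical placeholder, and this assumes you allowed unauthenticated invocations when creating the function:

import requests

# Hypothetical trigger URL, copied from the function's "Trigger" tab.
trigger_url = 'https://REGION-PROJECT.cloudfunctions.net/FUNCTION_NAME'

resp = requests.get(trigger_url)
print(resp.status_code, resp.text)  # expect 200 and 'Data pull complete'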

Final Thoughts

If you’ve ever thought about using a cloud data storage solution, I would definitely recommend checking out Google BigQuery! It can be a lot at first, but I think if you follow the steps in this article you should be able to get comfortable with some of its features. I hope that this article helped you get a better understanding of BigQuery and how to get your data uploaded to the cloud. Thank you so much for reading and I hope you have a great day!


Data Scientist | Austin, TX | email: ben@benchamblee.blog | Linkedin: https://www.linkedin.com/in/benchamblee/ | Github: https://github.com/Bench-amblee