It’s true: there are many ways to extract data from Google Analytics (Universal Analytics) into other platforms for ad-hoc analysis, dashboards, data science models, and much more. While we know a great implementation speaks for itself, unlocking additional value for your organization often means extracting your Google Analytics data and using it elsewhere. In this post, I’ll cover some of the most common ways to extract your data and share insights from my years of working with organizations of all shapes and sizes, to hopefully save you some headaches and time as you think about what’s best for your needs. Below are a few questions to consider before we hop into the different methods:
Does this data already exist?
I have seen a number of organizations sponsor projects that build very similar data sets from Google Analytics, which is a challenge to manage from a governance, cost, and data quality perspective. If you can use a previously developed dataset, we’d encourage it; if not, think about how your dataset could be used by others in the future. In either case, the less duplicate data you have, the easier it is to manage quality and consistency.
How quickly and frequently do you need data?
Some projects may need real-time data, while others may only need a monthly summary delivered three business days after the close of the preceding month. Think through what you will need and write it down, as this will help guide your choices.
How detailed is your reporting?
While this may seem like a difficult question to answer, I recommend first documenting every dimension and metric you’d like to report on. This matters because summarized data is easier to work with, but not all metrics summarize cleanly, and hit-level data may therefore be required. Ratios and time-based metrics are the likely culprits: for example, you can’t average daily bounce rates to get a weekly bounce rate; you have to recompute total bounces divided by total sessions. Likewise, if you use moderate to complex segments for reporting, the data is far easier to work with at its most granular level.
How perfect does the data need to be?
There are often trade-offs between velocity, accuracy, and cost. Google Analytics will sometimes sample data to deliver reports in certain scenarios. This may be perfectly fine for you or the end of the world, so be mindful of your requirements and strike a good balance.
Are there geographic restrictions to consider?
GDPR is a hot topic for many of our clients, and if you’re a global organization you will need to be mindful of where your data sits and is being processed. GDPR restricts transferring personal data outside of the EU unless appropriate safeguards are in place; if that applies to you, what implications could it have? Additionally, time zones and currencies are often painful to align in reporting and should be considered if relevant for your project.
Which platforms will be consuming this data?
Ensuring compatibility between your platforms is also a key consideration. More often than not, organizations have multiple cloud environments and multiple visualization, ETL, and modeling platforms. We always encourage clients to make optimal use of their platforms and to ensure that wherever Google Analytics data will be stored and transformed is compatible end to end.
Now that we have some requirements for how the data will be used, let’s cover some of the ways to extract it. In most cases, you will use the API or the data transfer service within BigQuery to load your data. However, it should not be forgotten that you can pull unsampled or custom reports from Google Analytics directly, or use the Google Sheets add-on. There are also third-party connectors for extracting data, but they rely on the services outlined below; the advantage is that you don’t have to write code to extract your data. If a simpler solution meets your needs, fantastic! Below are high-level considerations and a comparison between using BigQuery DTS and the Google Analytics 360 API.
| | BigQuery Transfer Service | Google Analytics 360 API |
|---|---|---|
| Granularity | Raw session- and hit-level data | Aggregated data |
| Data Availability | Within 24 hours; one file of previous-day data, generally available by morning. Streaming (intraday) data is also an option; it is updated continuously but is generally incomplete for the 4 hours preceding the pull | Intraday reporting is available, as is historical |
| Backfilling | 13 months or 10 billion hits (whichever is smaller) of historical data backfilled to BigQuery | Historical data is limited to up to 14 months |
| Data Freshness | Updated with up to an 8-hour delay | Updated with up to a 4-hour delay |
| Sampling | N/A (export data is unsampled) | Both sampled and unsampled data are available (see the table below) |
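To make the export side of the comparison concrete, here is a minimal sketch of querying the daily `ga_sessions_` export tables with the BigQuery Python client. The project ID and dataset are placeholders; in a real Google Analytics 360 export the dataset name is your numeric view ID.

```python
# Minimal sketch: summarize sessions and pageviews from the GA360 BigQuery export.
# Assumes the google-cloud-bigquery package; the project ID and dataset below
# are placeholders (the export dataset is named after your view ID).
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID

sql = """
SELECT
  date,
  SUM(totals.visits)    AS sessions,
  SUM(totals.pageviews) AS pageviews
FROM `my-gcp-project.123456789.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230107'
GROUP BY date
ORDER BY date
"""

for row in client.query(sql).result():
    print(row.date, row.sessions, row.pageviews)
```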
Additionally, if you’re using the API, the following sampling limits and quotas apply:
| Sampled | Unsampled |
|---|---|
| Available in both standard Google Analytics and the premium Google Analytics 360 version | Available only in the premium Google Analytics 360 version |
| Provides 7 parameters | Provides only 4 parameters |
| Requests per day per project: 50,000<br>Requests per view (profile) per day: 10,000 (cannot be increased)<br>Requests per 100 seconds per project: 2,000<br>Requests per 100 seconds per user per project: 100 (can be increased to 1,000 in the Google API Console) | Maximum of 3 million unique dimension values (i.e. rows) per report |
| Analytics Standard: 500k sessions<br>Analytics 360: 1M sessions<br>Analytics 360 using resource-based quota: 100M sessions | No session limits |
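For comparison, here is a minimal sketch of pulling the same daily summary through the Reporting API v4 with the google-api-python-client library. The view ID and service account key path are placeholders.

```python
# Minimal sketch: pull sessions and pageviews by date via the GA Reporting API v4.
# Assumes google-api-python-client and google-auth are installed; the key file
# path and view ID below are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/analytics.readonly"]
credentials = service_account.Credentials.from_service_account_file(
    "service-account-key.json", scopes=SCOPES
)
analytics = build("analyticsreporting", "v4", credentials=credentials)

response = analytics.reports().batchGet(
    body={
        "reportRequests": [{
            "viewId": "123456789",  # placeholder view ID
            "dateRanges": [{"startDate": "7daysAgo", "endDate": "yesterday"}],
            "dimensions": [{"name": "ga:date"}],
            "metrics": [
                {"expression": "ga:sessions"},
                {"expression": "ga:pageviews"},
            ],
            "samplingLevel": "LARGE",  # request the largest sampling space
        }]
    }
).execute()

for row in response["reports"][0]["data"].get("rows", []):
    print(row["dimensions"][0], row["metrics"][0]["values"])
```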
The summary tables above speak to some of the limitations, but our experience has taught us to be mindful of the following:
Data Availability and SLAs
There is nothing worse than building a dashboard, perfecting it, and training your stakeholders, only to have it disrupted by inconsistent data arrival. It is widely known that the BigQuery DTS tables show up when they want to: in many cases your data will land before noon for the preceding day, but not always. The API is the only surefire way to ensure your data is consistently available at a specific time, so if you’re looking for a consistent delivery time, use the API. We have in the past used both the API and BigQuery DTS in the same project to balance data quality and availability.
Data Sampling Using the API
This goes without saying, but stakeholders generally expect perfect data that matches the Google Analytics console at all times. As you work with your data, check whether sampling is happening and to what degree, and make sure it is communicated to stakeholders. In some cases you’ll have more flexibility, but perfect data is never guaranteed when working with the API unless you stay within the limits noted above.
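One way to check for sampling programmatically: the Reporting API v4 response includes `samplesReadCounts` and `samplingSpaceSizes` whenever a report is sampled. A small sketch that inspects the response dict returned by a `batchGet` call like the one in the API example above:

```python
# Sketch: inspect a Reporting API v4 batchGet response for sampling.
# `response` is the dict returned by analytics.reports().batchGet(...).execute().
def describe_sampling(response: dict) -> str:
    data = response["reports"][0]["data"]
    if "samplesReadCounts" not in data:
        return "Unsampled report"
    sessions_read = int(data["samplesReadCounts"][0])
    sampling_space = int(data["samplingSpaceSizes"][0])
    return f"Sampled report: {sessions_read / sampling_space:.1%} of sessions read"
```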
Governance
It is sad to say, but we’ve seen organizations report different numbers for the same metric across dashboards because the metric was calculated with different methodologies. I touched on this earlier, but if you can reuse a dataset, that is ideal for maintaining consistency in reporting. Conversely, adding new fields to an existing dataset isn’t a bad idea and could save a ton of time in the long run. I’ve also seen departments hampered by API quotas when data requests are not optimized and orchestrated, meaning reports that worked well for years begin to fail because someone else started running extracts at the same time, causing outright failure or sampling. This can happen out of the blue if there’s no governance over how data is pulled from Google Analytics, so be mindful of how many folks in your organization are using the API and make sure you’re staying within the limits.
Tips and Tricks (BONUS)
If you’re working with the BigQuery DTS, consider triggering your ETL process based on when the data actually shows up, as recorded in your log files, rather than at a fixed time. Some organizations schedule their ETL processes to run at set times only to find the data is missing and the job needs to be re-run later. Triggering on arrival is a foolproof way of ensuring your data is there, as in the sketch below.
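A minimal sketch of that trigger, assuming the google-cloud-bigquery client: poll for the previous day’s `ga_sessions_` table and only kick off downstream ETL once it exists. The project ID, dataset, polling interval, and the `run_etl` hook are all placeholders.

```python
# Sketch: poll for yesterday's ga_sessions_ export table before running ETL.
# The project ID, dataset, and run_etl() below are placeholders.
import time
from datetime import date, timedelta

from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID
suffix = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
table_id = f"my-gcp-project.123456789.ga_sessions_{suffix}"


def run_etl(table: str) -> None:
    """Placeholder for your downstream ETL job."""
    print(f"Export landed, starting ETL against {table}")


while True:
    try:
        client.get_table(table_id)   # raises NotFound until the export lands
        run_etl(table_id)
        break
    except NotFound:
        time.sleep(15 * 60)          # check again in 15 minutes
```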
When using the API, also make use of the `isDataGolden` field to ensure your reporting is complete. Retry extracting data until this field is `true` to ensure the highest data quality.
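A sketch of that retry loop, reusing the `analytics` service object from the Reporting API example above; the retry interval and attempt count are arbitrary choices.

```python
# Sketch: re-pull a v4 report until isDataGolden is true, meaning the data for
# the requested range will no longer change. `analytics` is the service object
# built in the Reporting API example; wait/attempt values are arbitrary.
import time


def fetch_when_golden(analytics, report_request, max_attempts=6, wait_seconds=3600):
    for _ in range(max_attempts):
        response = analytics.reports().batchGet(
            body={"reportRequests": [report_request]}
        ).execute()
        if response["reports"][0]["data"].get("isDataGolden", False):
            return response
        time.sleep(wait_seconds)
    raise RuntimeError("Report never reached the 'golden' state; try again later")
```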
Stakeholders will appreciate knowing when your data was last refreshed. We typically capture this when an ETL process completes and embed it in the report or dashboard frontend. Ultimately, it will create fewer questions for you, so we highly recommend this little trick to help better manage stakeholder expectations.
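One lightweight way to capture this is a small metadata table that your dashboard reads for its freshness banner; the table name, schema, and job name below are placeholders.

```python
# Sketch: record when an ETL run finished so dashboards can show "last refreshed".
# Assumes a placeholder table reporting.etl_metadata(job_name STRING,
# refreshed_at TIMESTAMP) that the dashboard queries.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project ID
client.query(
    """
    INSERT INTO `my-gcp-project.reporting.etl_metadata` (job_name, refreshed_at)
    VALUES ('ga_daily_summary', CURRENT_TIMESTAMP())
    """
).result()
```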
Technical debt is the arch-nemesis of a well-run ETL process. When something breaks, do you have the people and documentation to support fixing it? Make time to document how things work today; your future self will thank you.