Data Concepts and How They Apply to BigQuery

Estimated Reading Time: 7 minutes

Big data. Data lakes. Data warehouses. Unstructured data. Raw data. These buzzwords are everywhere today, but what do they mean? How does Google’s BigQuery fit into the picture, and why do I need it? In this post, we will explore these questions.

What is the difference between structured and unstructured data?

Let’s use a text field within a database as an example. If an application reads an email address from a text field, we might constrain that field so that it must contain an @ and a domain name, the minimum needed to look like an email address. Because the field serves a specific purpose, the data needs to be structured. Now consider a feedback form. It can have a suggestion field that users fill in with free-form sentences. While we can put constraints such as a maximum length on those sentences, we cannot force them to follow a particular structure. This is a good example of unstructured data: data that does not need to follow an exact format. Other examples include images, videos, and any other free-form input.
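The difference can be sketched in a few lines of Python. This is a minimal illustration rather than a production validator: the function names `is_structured_email` and `accept_feedback`, the regular expression, and the 500-character limit are all assumptions made for the example.

```python
import re

# Hypothetical pattern: some text, an @, then a domain that contains a dot.
# Real email validation is more involved; this only mirrors the rule in the text.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

MAX_FEEDBACK_LENGTH = 500  # assumed limit, for illustration only


def is_structured_email(value: str) -> bool:
    """Structured field: the value must match an exact format."""
    return EMAIL_PATTERN.match(value) is not None


def accept_feedback(value: str) -> bool:
    """Unstructured field: only the length is constrained, not the form."""
    return len(value) <= MAX_FEEDBACK_LENGTH


print(is_structured_email("reader@example.com"))
print(is_structured_email("not an email"))
print(accept_feedback("I love articles about BigQuery and data lakes!"))
```

The email check rejects anything that does not fit the expected shape, while the feedback check accepts any sentence that is short enough.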

Another big word in data is raw. What does that mean exactly?

In terms of data, raw typically means unprocessed, and its opposite is processed. Let’s walk through an example of processing data using the feedback form. If a user enters the sentence “I love articles about BigQuery and data lakes!”, the raw form stores the sentence in our database table exactly as it appears. To “process” that data, we apply logic to it. For example, imagine we are looking for responses that contain the word BigQuery. Instead of storing the raw sentence, we would store the value True, since the sentence does contain the word BigQuery. In this case, we processed raw, unstructured data into a structured True or False field.
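As a rough sketch of that processing step in Python: the helper `mentions_term` and the dictionary keys below are hypothetical names chosen for this example, and the case-insensitive match is an assumption about how the check would work.

```python
def mentions_term(raw_text: str, term: str) -> bool:
    """Return True if the term appears in the text, ignoring case."""
    return term.lower() in raw_text.lower()


raw_feedback = "I love articles about BigQuery and data lakes!"

# Raw storage keeps the sentence exactly as the user entered it.
raw_row = {"feedback": raw_feedback}

# Processed storage keeps only the answer to one specific question.
processed_row = {"mentions_bigquery": mentions_term(raw_feedback, "BigQuery")}

print(raw_row)
print(processed_row)
```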

We obtained what we needed from the data, so why store the raw form?

In the example above, we had a clear problem with a clear solution: storing True or False depending on whether BigQuery was in the sentence. However, what if our focus changes from looking for BigQuery to looking for the term ‘data lake’? If we had stored only the processed data, we could no longer find other insights in it. We store raw data because we want to leave open the opportunity for future analysis we haven’t yet considered.
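A short Python sketch shows why the raw form matters. The sample responses and the helper `flag_rows` are invented for illustration.

```python
# Hypothetical raw feedback responses, stored exactly as entered.
raw_rows = [
    "I love articles about BigQuery and data lakes!",
    "More posts about dashboards, please.",
    "Our data lake project starts next month.",
]


def flag_rows(rows, term):
    """Process each raw sentence into True/False for a single search term."""
    return [term.lower() in row.lower() for row in rows]


# The original question: which responses mention BigQuery?
bigquery_flags = flag_rows(raw_rows, "BigQuery")

# A new question asked later. It can only be answered because the raw
# sentences were kept; the BigQuery flags alone carry no trace of it.
data_lake_flags = flag_rows(raw_rows, "data lake")

print(bigquery_flags)
print(data_lake_flags)
```

The third response never mentions BigQuery, so a table holding only the first set of flags would hide the fact that it is about a data lake.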

As the digital world has improved at data collection, we as a community have evolved in the way we think about data. In the past, we collected everything with structure and purpose, but over time we started to see value in collecting and keeping raw data. Keeping raw data lets us solve problems we did not know about when we first collected it. The term big data took hold in the digital community as this mentality shifted. The simplest way to describe big data is collecting anything and everything, everywhere we can. As this approach became popular, companies began releasing products for storing large sets of data. Google’s response to the big data movement was BigQuery, a cloud service designed for storing and querying large datasets that lets end users manage the data from all of their platforms in one place.

Big data changed the mentality of how we collect data, but it came with complications as well. When it came time to use the data, it was spread across different systems. Marketing data, back-end application data, analytics data, and many other data sources were all stored in their own separate places. This created the need to consolidate data into one place where it can be both analyzed and reported on, which brought the terms data lake and data warehouse into common use. Both bring all of the data into one place. However, a data lake is specifically for raw data, whereas a data warehouse is typically for structured data with a specific purpose.

Why do I need a data lake or a data warehouse? Do I need both?

A data lake holds raw data because it is meant to be explored and analyzed. As data science teams form, there is a greater need to collect and store large amounts of raw data that can be explored for different purposes. Much of data science work requires a significant amount of data. Getting ahead on collecting large datasets over time will therefore strengthen any advanced analysis you do later, such as predicting revenue or customer lifetime value. The larger and more open a data lake, the deeper the insights can be.

A data warehouse, on the other hand, is a collection of processed and structured tables designed with a specific problem in mind. For example, a single data warehouse table might display all of your marketing spending versus total revenue. This still serves the purpose of having your data in one place, but the data is clearly processed for a specific purpose. As new insights are found in data lakes, the processed results may be moved to data warehouses where they can be reported on and interacted with.

It is important to note that data warehouses and data lakes are concepts and not platforms. It is up to us as end users to apply and adhere to these concepts when implementing them in platforms such as Google BigQuery.
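To make the lake-to-warehouse step concrete, here is a small Python sketch that turns raw events into a marketing spend versus revenue table. The event fields, the sample amounts, and the function `build_spend_vs_revenue` are all hypothetical; a real pipeline would read from actual source systems.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a data lake,
# coming from different systems and kept unaggregated.
raw_events = [
    {"source": "ads", "month": "2019-05", "type": "spend", "amount": 1200.0},
    {"source": "shop", "month": "2019-05", "type": "revenue", "amount": 4300.0},
    {"source": "ads", "month": "2019-06", "type": "spend", "amount": 900.0},
    {"source": "shop", "month": "2019-06", "type": "revenue", "amount": 3100.0},
    {"source": "shop", "month": "2019-06", "type": "revenue", "amount": 650.0},
]


def build_spend_vs_revenue(events):
    """Aggregate raw events into one row per month: a warehouse-style table."""
    totals = defaultdict(lambda: {"spend": 0.0, "revenue": 0.0})
    for event in events:
        totals[event["month"]][event["type"]] += event["amount"]
    return [{"month": month, **totals[month]} for month in sorted(totals)]


warehouse_table = build_spend_vs_revenue(raw_events)
for row in warehouse_table:
    print(row)
```

The raw events stay available for other questions, while the aggregated table answers exactly one: how spending compares with revenue each month.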

How does BigQuery fit into these concepts?

BigQuery is Google’s serverless cloud data warehouse, built specifically for handling large datasets. Because it is serverless, it scales naturally and can be cost-effective, which makes it easy to create, manage, and interact with large datasets. Within a BigQuery project, you can create multiple datasets, which are collections of tables that you query with SQL. You can create separate datasets for a data warehouse and a data lake, each containing tables that adhere to the corresponding concept. Having a data lake in BigQuery makes it easier to integrate with other Google Cloud products such as Google’s Cloud Machine Learning Engine, where you can explore unstructured data to find new business insights. Using BigQuery for your data warehouse allows you to hook your data directly into Google’s dashboarding tool, Data Studio, for stylized reporting. Features like BigQuery’s table partitioning by date allow you to quickly access small subsets of your data without having to scan an entire large table to get focused results.
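Splitting a table by date, usually called partitioning, can be illustrated in plain Python. This is a conceptual sketch only and does not use the BigQuery API: the sample rows and the functions `partition_by_date` and `read_partition` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical page-view rows; in practice this would be a very large table.
rows = [
    {"event_date": date(2019, 6, 10), "page": "/home"},
    {"event_date": date(2019, 6, 11), "page": "/blog"},
    {"event_date": date(2019, 6, 11), "page": "/contact"},
    {"event_date": date(2019, 6, 12), "page": "/blog"},
]


def partition_by_date(all_rows):
    """Group rows into one bucket per day, mimicking a date-partitioned table."""
    partitions = defaultdict(list)
    for row in all_rows:
        partitions[row["event_date"]].append(row)
    return dict(partitions)


def read_partition(partitions, day):
    """Return only the rows for one day without touching the other buckets."""
    return partitions.get(day, [])


partitions = partition_by_date(rows)
june_11 = read_partition(partitions, date(2019, 6, 11))
print(len(june_11), "of", len(rows), "rows read")
```

A query for a single day looks at one bucket instead of every row, which is the reason date-based splitting keeps focused queries fast on large tables.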


Author

  • Tyler Blatt

    Tyler Blatt is the Manager of Emerging Solution Strategy at InfoTrust. Using his 10 years of experience in martech, he assists clients in constructing reliable frameworks to navigate the ever-changing technical environment resulting from privacy and technology advancements in the fields of marketing and analytics. In his free time, he can be found wrestling with his new baby girl, streaming on Twitch, or trying to improve his golf game.

Originally Published: June 12, 2019

July 31, 2020