Data Concepts and How They Apply to BigQuery

Big data. Data lakes. Data warehouses. Unstructured data. Raw data. These buzzwords are everywhere today, but what do they mean? How does Google’s BigQuery fit into the picture, and why do I need it? In this post, we will explore these questions.

What is the difference between structured and unstructured data?

Let’s use a text field within a database as an example. If an application needs to read an email address from a text field, we might structure that field so that it must contain an @ and a domain name in order to be accepted as an email address. In this case, because we have a specific purpose, the data needs to be structured. Now, let’s take a feedback form as an example. A feedback form can have a suggestion field that is filled in as free-form sentences. While we can put constraints such as length on those sentences, we cannot force them to follow a certain structure. This is a good example of unstructured data: data that does not need to follow an exact format. Other examples include images, videos, and any other free-form input.
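
To make the contrast concrete, here is a minimal Python sketch. The pattern, the function names, and the 500-character limit are all assumptions made for illustration; real email validation is considerably more involved than checking for an @ and a domain.

```python
import re

# Hypothetical rule: some text, an "@", then a domain that contains a dot.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_structured_email(value: str) -> bool:
    """Structured field: the value must match the expected email shape."""
    return EMAIL_PATTERN.fullmatch(value) is not None


def is_valid_suggestion(value: str, max_length: int = 500) -> bool:
    """Unstructured field: only a length limit applies, no required format."""
    return 0 < len(value) <= max_length


print(is_structured_email("reader@example.com"))  # True
print(is_structured_email("not an email"))        # False
print(is_valid_suggestion("I love articles about BigQuery and data lakes!"))  # True
```

The email check rejects anything that does not fit the shape, while the suggestion check accepts any sentence that fits within the length limit.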

Another big word in data is raw. What does that mean exactly?

In terms of data, raw typically means unprocessed; the opposite of raw is processed. Let’s walk through an example of processing data using the feedback form. If a user inputs the sentence “I love articles about BigQuery and data lakes!”, the raw form would store the sentence in our database table exactly as it appears. To “process” that data, we would apply logic to it. For example, imagine we are looking for responses that contain the word BigQuery. Instead of storing the raw sentence, we would store the value True, since the sentence does contain the word BigQuery. In this case, we processed raw, unstructured data into a structured True or False field.
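
As a rough sketch of that processing step in Python (the function name and the case-insensitive match are assumptions for illustration, not part of any particular product):

```python
def mentions_bigquery(raw_feedback: str) -> bool:
    """Reduce a free-form sentence to a structured True/False value."""
    # Lowercase both sides so "BigQuery" and "bigquery" are treated the same.
    return "bigquery" in raw_feedback.lower()


raw_value = "I love articles about BigQuery and data lakes!"
processed_value = mentions_bigquery(raw_value)

print(raw_value)        # the raw form, stored exactly as entered
print(processed_value)  # True
```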

We obtained what we needed from the data, so why store the raw form?

In the example above, we had a clear problem with a clear solution: storing True or False depending on whether BigQuery was in the sentence. However, what if our focus changes from looking for BigQuery to looking for the term ‘data lake’? If we only stored the processed data, we would no longer be able to find other insights from the data. We store raw data because we want to leave open the opportunity for future analysis we haven’t yet considered.
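
The same point can be shown in a short Python sketch. The sample responses and the helper name are made up for illustration. Once only the True/False flags are kept, the second question cannot be answered; with the raw sentences still available, it is a one-line change.

```python
# Hypothetical raw feedback responses, stored exactly as entered.
raw_feedback = [
    "I love articles about BigQuery and data lakes!",
    "More posts on data lakes, please.",
    "The site loads slowly on mobile.",
]


def mentions(term: str, text: str) -> bool:
    """Case-insensitive check for a term inside a free-form sentence."""
    return term.lower() in text.lower()


# The original question: which responses mention BigQuery?
bigquery_flags = [mentions("BigQuery", text) for text in raw_feedback]

# A later question, answerable only because the raw sentences were kept.
data_lake_flags = [mentions("data lake", text) for text in raw_feedback]

print(bigquery_flags)   # [True, False, False]
print(data_lake_flags)  # [True, True, False]
```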

As the digital world has improved at data collection, we as a community have evolved in the way we think about data. In the past, we collected everything with structure and purpose, but over time we started to see value in collecting and keeping raw data. Collecting raw data allows new solutions to be found for problems we didn’t know we were collecting data for originally. The term big data took hold in the digital community as this mentality for data collection shifted. The simplest way to describe big data is that it means collecting anything and everything, everywhere we can. As this methodology became popular, companies began releasing products for storing large sets of data. Google’s response to the big data movement was BigQuery. BigQuery is specifically designed to store and query large datasets in the cloud, allowing the end user to manage data from all of their platforms in one place.

Big data changed how we thought about collecting data, but it came with complications as well. When it came time to use this data, the problem was that it lived in different systems. Marketing data, back-end application data, analytics data, and many other data sources were all stored in their own separate places. This created the need to consolidate data into one place where it can be both analyzed and reported on, which led to the terms data lake and data warehouse. Both bring all of the data into one place; however, a data lake is specifically for raw data, whereas a data warehouse is typically for structured data with a specific purpose.

Why do I need a data lake or a data warehouse? Do I need both?

A data lake is raw because it is meant to be explored and analyzed. As data science teams form, there is a greater need to collect and store large amounts of raw data to be explored for different purposes. A large portion of data science work requires a significant amount of data. Therefore, getting ahead on collecting large datasets over time will only strengthen any advanced analysis you do in the future, such as predicting revenue or customer lifetime value. The larger and more open a data lake, the deeper the insights can be.

A data warehouse, on the other hand, is a collection of processed and structured tables that are designed for a specific purpose. The tables in these warehouses are designed to be analyzed with a problem in mind. For example, a single data warehouse table might display all of your marketing spending versus total revenue. This serves the purpose of having all of your data in one place, but it is clearly processed for a specific purpose.

As information is discovered and new insights are found in data lakes, the processed results may be moved to data warehouses where they can be reported on and interacted with. It is important to note that data warehouses and data lakes are concepts and not platforms. It is up to us as end users to apply and adhere to these concepts while implementing them in platforms such as Google BigQuery.
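
To illustrate how raw lake data might become a purpose-built warehouse table, here is a small Python sketch of the marketing spending versus revenue example. The event records, field names, and function name are all hypothetical; in practice this step would more likely be a SQL query or a pipeline job than in-memory Python.

```python
from collections import defaultdict

# Hypothetical raw events from different systems, as they might sit in a data lake.
raw_events = [
    {"source": "ads", "date": "2019-08-01", "type": "spend", "amount": 120.0},
    {"source": "shop", "date": "2019-08-01", "type": "revenue", "amount": 300.0},
    {"source": "ads", "date": "2019-08-02", "type": "spend", "amount": 80.0},
    {"source": "shop", "date": "2019-08-02", "type": "revenue", "amount": 150.0},
    {"source": "shop", "date": "2019-08-02", "type": "revenue", "amount": 50.0},
]


def build_spend_vs_revenue(events):
    """Aggregate raw events into one processed row per date."""
    totals = defaultdict(lambda: {"spend": 0.0, "revenue": 0.0})
    for event in events:
        # Ignore event types this table was not designed for.
        if event["type"] in ("spend", "revenue"):
            totals[event["date"]][event["type"]] += event["amount"]
    return [{"date": date, **totals[date]} for date in sorted(totals)]


warehouse_table = build_spend_vs_revenue(raw_events)
for row in warehouse_table:
    print(row)
```

The resulting table answers one question well, while the raw events stay available for questions nobody has asked yet.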

How does BigQuery fit into these concepts?

BigQuery is Google’s serverless cloud service for storing and querying large datasets. Because it is serverless, there is no infrastructure for you to manage as your data grows, which makes it easy to create, manage, and interact with large datasets. Within a BigQuery project, you can create multiple datasets, which are collections of tables that you query with SQL. You can create datasets for both a data warehouse and a data lake, where each contains tables adhering to those data practice concepts. Having a data lake in BigQuery results in easier integration with other Google Cloud products such as Google’s Cloud Machine Learning Engine, where you can explore unstructured data to find new business insights. Using BigQuery for your data warehouse allows you to hook your data directly into Google’s dashboarding tool, Data Studio, for stylized reporting. Features like BigQuery’s table partitioning by date allow you to quickly access small subsets of your data without having to search through large datasets to get focused results.
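
The idea behind splitting a table by date can be sketched in plain Python. This is a conceptual model only, not the BigQuery API: the class, its methods, and the sample rows are assumptions made for illustration. Rows are grouped by date when they are written, so a question about one day reads only that day's group.

```python
from collections import defaultdict


class DatePartitionedTable:
    """Toy model of a table whose rows are grouped by their date value."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, row):
        # Each row lands in the group for its own date.
        self._partitions[row["date"]].append(row)

    def rows_for_date(self, date):
        # Only the matching group is read; other dates are never touched.
        return list(self._partitions.get(date, []))

    def total_rows(self):
        return sum(len(rows) for rows in self._partitions.values())


table = DatePartitionedTable()
table.insert({"date": "2019-08-01", "page": "/home"})
table.insert({"date": "2019-08-02", "page": "/blog"})
table.insert({"date": "2019-08-02", "page": "/contact"})

subset = table.rows_for_date("2019-08-02")
print(len(subset), "of", table.total_rows(), "rows read")  # 2 of 3 rows read
```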

Have BigQuery Questions?

Contact the InfoTrust analytics consulting team today for answers.