
Data Concepts and How They Apply to BigQuery


Big data. Data lakes. Data warehouses. Unstructured data. Raw data. These buzzwords are everywhere today, but what do they mean? How does Google’s BigQuery fit into the picture, and why do I need it? This post explores these questions.

What is the difference between structured and unstructured data?

Let’s use a text field in a database as an example. If an application reads an email address from that field, we might structure the field so that its value must contain an @ and a domain name. Because the field has a specific purpose, the data needs to be structured. Now consider a feedback form with a suggestion field where users type full sentences. We can put constraints such as a maximum length on that field, but we cannot force the sentences to follow a set structure. This is a good example of unstructured data: data that does not need to follow an exact format. Other examples include images, videos, and any other freeform input.
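The difference can be sketched in a few lines of Python. This is only an illustration: the pattern, the function names, and the 500-character limit are assumptions made for the example, and the check is far looser than real email validation.

```python
import re

# Hypothetical rule for the structured field: some text, an "@",
# then a domain name that contains a dot.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_structured_email(value: str) -> bool:
    """Structured field: the whole value must match the expected shape."""
    return EMAIL_PATTERN.fullmatch(value) is not None


def is_acceptable_suggestion(value: str, max_length: int = 500) -> bool:
    """Unstructured field: only the length is constrained, not the shape."""
    return 0 < len(value) <= max_length


print(is_structured_email("reader@example.com"))
print(is_structured_email("I love articles about BigQuery"))
print(is_acceptable_suggestion("I love articles about BigQuery and data lakes!"))
```

The email check rejects anything that does not fit the format, while the suggestion check accepts any sentence that fits within the length limit.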

Another big word in data is raw. What does that mean exactly?

In terms of data, raw typically means unprocessed, so the opposite of raw is processed. Let’s walk through an example of processing data using the feedback form. If a user enters the sentence “I love articles about BigQuery and data lakes!”, the raw form stores the sentence in our database table exactly as it appears. To “process” that data, we apply logic to it. For example, imagine we are looking for responses that contain the word BigQuery. Instead of storing the raw sentence, we would store the value True, since the sentence does contain the word BigQuery. In this case, processing turned raw, unstructured text into a structured True or False field.
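As a minimal sketch of that processing step, assuming a simple case-insensitive substring check (the function name is made up for this example):

```python
def mentions_bigquery(raw_feedback: str) -> bool:
    """Turn a raw sentence into a structured True/False value."""
    return "bigquery" in raw_feedback.lower()


# The raw form: the sentence exactly as the user typed it.
raw_value = "I love articles about BigQuery and data lakes!"

# The processed form: a single boolean derived from the raw sentence.
processed_value = mentions_bigquery(raw_value)
print(processed_value)
```

Once only `processed_value` is stored, the original sentence cannot be recovered from it.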

We obtained what we needed from the data, so why store the raw form?

In the example above, we had a clear problem with a clear solution: store True or False depending on whether BigQuery appears in the sentence. However, what if our focus changes from looking for BigQuery to looking for the term ‘data lake’? If we had stored only the processed data, we could no longer find other insights in it. We store raw data because we want to leave open the opportunity for future analysis we haven’t yet considered.
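A short Python sketch makes the trade-off concrete. The sample responses and the helper name are hypothetical, chosen only for illustration.

```python
# Hypothetical raw feedback, stored exactly as users typed it.
raw_responses = [
    "I love articles about BigQuery and data lakes!",
    "More posts on dashboards, please.",
    "How do I build a data lake?",
]


def contains_term(text: str, term: str) -> bool:
    """Case-insensitive check for a term inside free-form text."""
    return term.lower() in text.lower()


# The original question: which responses mention BigQuery?
bigquery_flags = [contains_term(r, "BigQuery") for r in raw_responses]

# A later question we had not planned for. It can only be answered
# because the raw sentences were kept; bigquery_flags alone cannot
# tell us anything about data lakes.
data_lake_flags = [contains_term(r, "data lake") for r in raw_responses]

print(bigquery_flags)
print(data_lake_flags)
```

The third response never mentions BigQuery, so a table holding only the first set of flags would have hidden it from the second analysis.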

As the digital world has improved at data collection, we as a community have evolved in the way we think about data. In the past, we collected everything with structure and purpose, but over time we started to see value in collecting and keeping raw data. Collecting raw data lets us find solutions to problems we didn’t know about when we first collected it. The term big data took hold in the digital community as this mentality shifted. The simplest way to describe big data is collecting anything and everything, everywhere we can. As this approach became popular, companies began releasing products for storing large sets of data. Google’s response to the big data movement was BigQuery, which is designed to store large datasets in the cloud so that end users can manage the data from all of their platforms in one place.

Big data changed how we think about collecting data, but it came with complications as well. When it came time to use this data, it was spread across different systems. Marketing data, back-end application data, analytics data, and many other data sources were all stored in their own separate places. This created the need to consolidate data into one place where it can be both analyzed and reported on, which in turn led to the terms data lake and data warehouse. Both bring all of the data into one place. However, a data lake is specifically for raw data, whereas a data warehouse is typically for structured data with a specific purpose.

Why do I need a data lake or a data warehouse? Do I need both?

A data lake is raw because it is meant to be explored and analyzed. As data science teams form, there is a greater need to collect and store large amounts of raw data that can be explored for different purposes. A large portion of data science work requires a significant amount of data. Getting ahead on collecting large datasets over time will therefore strengthen any advanced analysis you do later, such as predicting revenue or customer lifetime value. The larger and more open a data lake, the deeper the insights can be.

A data warehouse, on the other hand, is a collection of processed and structured tables designed for a specific purpose. The tables in a warehouse are designed to be analyzed with a problem in mind. For example, a single data warehouse table might show all of your marketing spending versus total revenue. This serves the purpose of having all of your data in one place, but it is clearly processed for a specific purpose. As new insights are discovered in data lakes, the processed results may be moved to data warehouses where they can be reported on and interacted with.

It is important to note that data warehouses and data lakes are concepts, not platforms. It is up to us as end users to apply and adhere to these concepts when implementing them in platforms such as Google BigQuery.
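To illustrate the relationship, here is a small Python sketch that turns raw, lake-style records into a purpose-built spend-versus-revenue table. The record layout, field names, and amounts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records as they might land in a data lake,
# mixing ad spend and order revenue from different systems.
raw_events = [
    {"source": "ads", "channel": "search", "amount": 120.0},
    {"source": "ads", "channel": "social", "amount": 80.0},
    {"source": "orders", "channel": "search", "amount": 400.0},
    {"source": "orders", "channel": "social", "amount": 150.0},
    {"source": "orders", "channel": "search", "amount": 100.0},
]


def build_spend_vs_revenue(events):
    """Aggregate raw events into one row per channel."""
    table = defaultdict(lambda: {"spend": 0.0, "revenue": 0.0})
    for event in events:
        column = "spend" if event["source"] == "ads" else "revenue"
        table[event["channel"]][column] += event["amount"]
    return dict(table)


warehouse_table = build_spend_vs_revenue(raw_events)
for channel, totals in warehouse_table.items():
    print(channel, totals)
```

The resulting table answers one question well, while the raw events remain available for questions nobody has asked yet.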

How does BigQuery fit into these concepts?

BigQuery is Google’s serverless cloud storage solution built specifically for handling large datasets. Because it is serverless, it scales naturally and is affordable, which makes it easy to create, manage, and interact with large datasets. Within a BigQuery project, you can create multiple datasets, which are collections of tables that you query with SQL. You can create one dataset for a data warehouse and another for a data lake, with the tables in each adhering to the corresponding concept. Having a data lake in BigQuery makes it easier to integrate with other Google Cloud products such as Google’s Cloud Machine Learning Engine, where you can explore unstructured data to find new business insights. Using BigQuery for your data warehouse lets you connect your data directly to Google’s dashboarding tool, Data Studio, for stylized reporting. Features like BigQuery’s date-based table partitioning let you quickly access small subsets of your data without having to scan an entire large dataset to get focused results.
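To show why splitting a table by date helps, here is a plain-Python analogy. It is not BigQuery’s API or implementation; the class and its methods are hypothetical and only mimic the idea that a date-restricted query reads the matching date slices and skips the rest.

```python
from collections import defaultdict
from datetime import date


class DatePartitionedTable:
    """Toy in-memory table that keeps rows in one bucket per day."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, day: date, row: dict) -> None:
        self._partitions[day].append(row)

    def query(self, start: date, end: date):
        """Return matching rows and how many rows were actually read."""
        rows, rows_read = [], 0
        for day, partition in self._partitions.items():
            if start <= day <= end:  # skip buckets outside the range
                rows_read += len(partition)
                rows.extend(partition)
        return rows, rows_read


table = DatePartitionedTable()
table.insert(date(2019, 10, 20), {"sessions": 310})
table.insert(date(2019, 10, 21), {"sessions": 295})
table.insert(date(2019, 10, 22), {"sessions": 402})

rows, rows_read = table.query(date(2019, 10, 22), date(2019, 10, 22))
print(rows, rows_read)
```

Only the bucket for the requested day is read, which is the intuition behind getting focused results without searching the whole dataset.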

If you want to learn about more ways to find value in BigQuery, click here!

Have BigQuery Questions?

Contact the InfoTrust analytics consulting team today for answers.
