Big data. Data lakes. Data warehouses. Unstructured data. Raw data. These buzzwords are everywhere today, but what do they mean? How does Google’s BigQuery fit into the picture, and why would you need it? This post explores those questions.
What is the difference between structured and unstructured data?
Let’s use a text field in a database as an example. If an application reads an email address from that field, we might require the value to contain an @ and a domain name so that it always looks like an email address. Because the field has a specific purpose, the data needs to be structured. Now consider a feedback form with a suggestion field that users fill in with full sentences. We can put constraints such as a maximum length on those sentences, but we cannot force them to follow a particular structure. This is a good example of unstructured data: data that does not need to follow an exact format. Other examples include images, videos, and any other freeform input.
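The difference between the two fields can be sketched in a few lines of Python. This is only an illustration: the function names, the deliberately loose email pattern, and the 500-character limit are assumptions made for the example, not rules from any particular application.

```python
import re

# Hypothetical, deliberately loose pattern: something@domain.tld
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_email(value: str) -> bool:
    """Structured field: the whole value must match the expected shape."""
    return EMAIL_PATTERN.fullmatch(value) is not None


def is_valid_suggestion(value: str, max_length: int = 500) -> bool:
    """Unstructured field: only a length limit, no required format."""
    return 0 < len(value) <= max_length
```

The email check rejects anything that does not fit the pattern, while the suggestion check accepts any non-empty text up to the length limit.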
Another big word in data is raw. What does that mean exactly?
In terms of data, raw typically means unprocessed, so the opposite of raw is processed. Let’s walk through an example using the feedback form. If a user enters the sentence “I love articles about BigQuery and data lakes!”, the raw form stores that sentence in our database table exactly as it appears. To “process” the data, we apply logic to it. For example, imagine we are looking for responses that contain the word BigQuery. Instead of storing the raw sentence, we would store the value True, since the sentence does contain the word BigQuery. In this case, processing turned raw, unstructured text into a structured True or False field.
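As a rough sketch of that processing step, the snippet below turns the raw sentence into a True or False value. The helper name `mentions_term`, the dictionary field names, and the case-insensitive match are assumptions chosen for illustration.

```python
def mentions_term(sentence: str, term: str) -> bool:
    """Return True if the term appears in the sentence, ignoring case."""
    return term.lower() in sentence.lower()


raw_feedback = "I love articles about BigQuery and data lakes!"

# Raw form: the sentence is stored exactly as the user typed it.
raw_row = {"suggestion": raw_feedback}

# Processed form: only the answer to one specific question is stored.
processed_row = {"mentions_bigquery": mentions_term(raw_feedback, "BigQuery")}
```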
We obtained what we needed from the data, so why store the raw form?
In the example above, we had a clear problem with a clear solution: store True or False depending on whether BigQuery appears in the sentence. However, what if our focus changes from looking for BigQuery to looking for the term ‘data lake’? If we had stored only the processed data, we could no longer draw any other insights from it. We store raw data because we want to leave the door open to future analysis we haven’t yet considered.
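A small sketch makes the trade-off concrete. The second feedback sentence and the helper `count_mentions` are hypothetical; the point is only that a new question can be answered from the raw text but not from the stored booleans.

```python
# Raw rows keep the full text of each response.
raw_rows = [
    "I love articles about BigQuery and data lakes!",
    "More posts on dashboards, please.",  # hypothetical second response
]

# Processed rows only answer the original question: "mentions BigQuery?"
processed_rows = [True, False]


def count_mentions(rows, term):
    """Count how many raw rows contain the term, ignoring case."""
    return sum(term.lower() in row.lower() for row in rows)


# The new question can be answered from the raw rows...
data_lake_mentions = count_mentions(raw_rows, "data lake")
# ...but nothing in processed_rows says whether 'data lake' ever appeared.
```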
As data collection has improved, the way we think about data has evolved with it. In the past, we collected everything with structure and purpose, but over time we started to see value in collecting and keeping raw data. Raw data lets us find solutions to problems we didn’t know we were collecting data for in the first place. The term big data became popular as this mentality took hold. The simplest way to describe big data is collecting anything and everything, everywhere we can. As this approach spread, companies began releasing products for storing large sets of data. Google’s response to the big data movement was BigQuery. BigQuery is designed to store and query large datasets in the cloud, allowing end users to manage the data from all of their platforms in one place.
Big data changed how we think about collecting data, but it came with complications as well. When it came time to use the data, it was spread across different systems. Marketing data, back-end application data, analytics data, and many other sources were all stored in their own separate places. This created the need to consolidate data into one place where it can be both analyzed and reported on, and that need gave rise to the terms data lake and data warehouse. Both bring all of the data into one place. However, a data lake is specifically for raw data, whereas a data warehouse is typically for structured data with a specific purpose.
Why do I need a data lake or a data warehouse? Do I need both?
A data lake is raw because it is meant to be explored and analyzed. As data science teams form, there is a greater need to collect and store large amounts of raw data that can be explored for different purposes. Much of data science work requires a significant amount of data. Getting ahead on collecting large datasets over time will therefore strengthen any advanced analysis you do later, such as predicting revenue or customer lifetime value. The larger and more open a data lake, the deeper the insights can be. A data warehouse, on the other hand, is a collection of processed and structured tables, each designed to be analyzed with a specific problem in mind. For example, a single data warehouse table might show all of your marketing spending versus total revenue. This serves the purpose of having all of your data in one place, but it is clearly processed for a specific purpose. As new insights are found in the data lake, the processed results may be moved to the data warehouse, where they can be reported on and interacted with. It is important to note that data warehouses and data lakes are concepts, not platforms. It is up to us as end users to apply and adhere to these concepts when implementing them on platforms such as Google BigQuery.
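The step from lake to warehouse can be sketched as a small aggregation. Everything here is illustrative: the event records, their field names, and the function `build_spend_vs_revenue` are invented for the example and do not describe any real schema.

```python
from collections import defaultdict

# Hypothetical raw events from different systems, as they might land in a data lake.
raw_events = [
    {"source": "ads", "month": "2024-01", "spend": 120.0},
    {"source": "orders", "month": "2024-01", "revenue": 300.0},
    {"source": "ads", "month": "2024-02", "spend": 80.0},
    {"source": "orders", "month": "2024-02", "revenue": 150.0},
    {"source": "orders", "month": "2024-02", "revenue": 50.0},
]


def build_spend_vs_revenue(events):
    """Aggregate raw events into one purpose-built row per month."""
    totals = defaultdict(lambda: {"spend": 0.0, "revenue": 0.0})
    for event in events:
        row = totals[event["month"]]
        row["spend"] += event.get("spend", 0.0)
        row["revenue"] += event.get("revenue", 0.0)
    return [{"month": month, **totals[month]} for month in sorted(totals)]


# The processed, structured table a warehouse might hold for reporting.
warehouse_table = build_spend_vs_revenue(raw_events)
```

The raw events stay available for other questions, while the resulting table answers exactly one: marketing spend versus revenue per month.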
How does BigQuery fit into these concepts?
BigQuery is Google’s serverless cloud solution for storing and querying large datasets. Because it is serverless, it scales without infrastructure for you to manage and can be affordable, which makes it easy to create, manage, and interact with large datasets. Within a BigQuery project, you can create multiple datasets, which are collections of tables that you query with SQL. You can create one dataset for a data warehouse and another for a data lake, each containing tables that adhere to the corresponding concept. Keeping a data lake in BigQuery makes it easier to integrate with other Google Cloud products such as Google’s Cloud Machine Learning Engine, where you can crawl unstructured data to find new business insights. Using BigQuery for your data warehouse lets you hook your data directly into Google’s dashboarding tool, Data Studio, for stylized reporting. Features like BigQuery’s table partitioning by date allow you to quickly access small subsets of your data without having to search through an entire large dataset to get focused results.
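The idea of reaching a small, date-bounded subset without reading everything can be illustrated with a toy model in plain Python. This is not BigQuery’s API or its implementation (BigQuery’s documentation refers to the feature as date-partitioned tables); the class `DatePartitionedTable` and its methods are hypothetical and exist only to show why grouping rows by date makes focused queries cheaper.

```python
from collections import defaultdict
from datetime import date


class DatePartitionedTable:
    """Toy model: rows are grouped into one bucket (partition) per day."""

    def __init__(self):
        self._partitions = defaultdict(list)
        self.partitions_scanned = 0

    def insert(self, day: date, row: dict) -> None:
        self._partitions[day].append(row)

    def query(self, start: date, end: date) -> list:
        """Read rows only from partitions whose day falls within [start, end]."""
        results = []
        for day in sorted(self._partitions):
            if start <= day <= end:
                self.partitions_scanned += 1
                results.extend(self._partitions[day])
        return results


table = DatePartitionedTable()
table.insert(date(2024, 1, 1), {"event": "signup"})
table.insert(date(2024, 1, 2), {"event": "purchase"})
table.insert(date(2024, 1, 3), {"event": "refund"})

# Only the two matching daily partitions have their rows read.
recent = table.query(date(2024, 1, 2), date(2024, 1, 3))
```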