Clean Your Data Like a Sleuth Detective and Keep It That Way

Audit, clean, rinse, repeat.

Sometimes it feels like all we do is look at data and scrub-scrub-scrub it in an obsessive-compulsive fashion. “Just one more pump of soap. Then I can do my reporting!”

The fact of the matter is, no matter how savvily you set up your Google Analytics configurations, the internet, data, and bot capabilities are always changing. Realistically? You’re going to have to get creative, and it may be an ongoing battle. Dive deep into the recesses of your reactive brain (rather than your proactive brain). It’s time to get a little messy.

So you came across some messy data. (uh oh)

Despite your best efforts (you’ve done the AWS view filters, excluded hostnames with ‘porn’ spelled in several different characters, and accounted for errant security scans), bot traffic exists. If any vendor tells you they have foolproof ways of never gathering any garbage data, they’re living in a fantasy world.

Nevertheless, if and when we find undesirable data, there are methods for deleting it for good, and for filtering it out of sections of your reporting. Here we’re going to look at a few reactive methods of identifying (and hopefully getting rid of, or brushing aside) garbage data.

How can you tell? 

You found some iffy data. Let’s troubleshoot.

Assuming you haven’t made any dev, content, tagging, or campaign changes within the newly-messy time period, you may experience unexpected changes such as:

  1. Average session duration may have dropped from 4 minutes to 23 seconds
  2. Bounce rate may have shot through the ceiling to nearly 100% (40–60% is typical)
  3. Traffic spikes suddenly, in an unexpected but persistent rhythm
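
If you'd rather not eyeball these symptoms every morning, the three checks above can be sketched as a small script. This is a minimal sketch, not a Google Analytics feature: the `DailyStats` type, the `find_symptoms` function, and the threshold defaults are all hypothetical, and the sample numbers simply reuse the figures from this article.

```python
from dataclasses import dataclass


@dataclass
class DailyStats:
    sessions: int
    avg_session_seconds: float
    bounce_rate: float  # fraction between 0 and 1


def find_symptoms(baseline: DailyStats, today: DailyStats,
                  spike_ratio: float = 3.0,
                  duration_drop_ratio: float = 0.5,
                  bounce_ceiling: float = 0.9) -> list:
    """Compare one day against a baseline and list the symptoms that fire.

    The thresholds are illustrative guesses; tune them to your own traffic.
    """
    symptoms = []
    # Symptom 1: average session duration fell well below the baseline.
    if today.avg_session_seconds < baseline.avg_session_seconds * duration_drop_ratio:
        symptoms.append("session duration collapsed")
    # Symptom 2: bounce rate is close to 100%.
    if today.bounce_rate >= bounce_ceiling:
        symptoms.append("bounce rate near 100%")
    # Symptom 3: sessions jumped to a multiple of the baseline.
    if today.sessions >= baseline.sessions * spike_ratio:
        symptoms.append("session spike")
    return symptoms


# Illustrative numbers: 4 minutes vs. 23 seconds, 230 vs. 3,000 sessions.
baseline = DailyStats(sessions=230, avg_session_seconds=240, bounce_rate=0.5)
suspect = DailyStats(sessions=3000, avg_session_seconds=23, bounce_rate=0.98)
print(find_symptoms(baseline, suspect))
```

A day that trips all three checks at once, with no campaign or release to explain it, is a good candidate for the detective work described below.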

Method #1: Bot Traffic Scoring

I didn’t come up with this. But it really is brilliant. reCAPTCHA v3 allows bot detection and scoring at the server level. You’re familiar with those (well, largely annoying) reCAPTCHA boxes that interrupt you while you’re frantically logging into that thing you were late for today? Yeah, those.

reCAPTCHA v3 lets you apply the same type of verification, but without the annoyance. Your visitor never experiences anything inside of the UI. reCAPTCHA now has the ability to continuously monitor and score your hits numerically (v3 returns a score from 0.0 to 1.0, where higher means more likely human). This 1) gives you a better idea of whether your chunk of data is legit and 2) allows you to gather up all of the swill numbers, segment swiftly, and give their ashes to Poseidon.

Don’t let Simo Ahava’s super-genius, mammoth run-down intimidate you. You can do it! There are two different server methods, one involving a PHP file and the other a .js file. I can help you with either if you have questions.
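
To make the scoring idea concrete, here is a minimal server-side sketch in Python. It is not the PHP or .js method from the run-down mentioned above, just an illustration of the same flow: post the visitor's token to reCAPTCHA's `siteverify` endpoint, read the `score` from the response, and turn it into a label you could send to Google Analytics as a custom dimension. The function names, the bucket labels, and the 0.5 threshold are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def verify_token(secret: str, token: str) -> dict:
    """POST the token collected in the browser to the siteverify endpoint."""
    body = urllib.parse.urlencode({"secret": secret, "response": token}).encode()
    with urllib.request.urlopen(VERIFY_URL, data=body, timeout=5) as resp:
        return json.load(resp)


def bot_score_bucket(result: dict, threshold: float = 0.5) -> str:
    """Turn a siteverify response into a label for a custom dimension.

    The labels and the threshold are hypothetical; pick values that
    match what you see in your own traffic.
    """
    if not result.get("success"):
        return "unverified"
    score = float(result.get("score", 0.0))
    return "likely_human" if score >= threshold else "likely_bot"


# Example with a hand-written response, so no network call is needed here.
print(bot_score_bucket({"success": True, "score": 0.1}))  # likely_bot
```

Once the bucket (or the raw score) lands in a custom dimension, segmenting out the low scorers is an ordinary reporting task rather than a guessing game.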

Method #2: PII Detection (No One Likes to Talk About PII)

Lucas Long does a great job telling us about the importance of cleaning up PII with configuration methods and products alike. But I’d like to share an understated method of getting rid of PII data. (Ok, fair, you will have to do some proactive work to identify it, but once you do, you’ll be thrilled to see it working for you—in the unfortunate event that you need it.)

Step 1: Collect Client ID as a Custom Dimension

Please, please, please do not neglect this. Client ID is the random identifier that Google Analytics assigns to your browser and stores in a first-party cookie (_ga). Google Analytics already has it, but you cannot use it as a dimension in your standard reports out-of-the-box! (Not unless you store it as a custom dimension.)

So here’s an example: you accidentally collected PII in an Event Category or Event Label.
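
If you ever need to pull the Client ID out yourself (say, from a cookie value you logged server-side), the parsing is short. This is a minimal sketch that assumes the usual Universal Analytics `_ga` cookie layout, `GA1.2.<random number>.<timestamp>`, where the last two parts form the Client ID. The function name is hypothetical, and in most setups you would capture the value through your tag manager instead.

```python
from typing import Optional


def client_id_from_ga_cookie(cookie_value: str) -> Optional[str]:
    """Return the Client ID portion of a `_ga` cookie value, or None.

    Assumes the layout GA1.2.<random>.<timestamp>; the Client ID is the
    last two dot-separated parts joined back together.
    """
    parts = cookie_value.strip().split(".")
    if len(parts) < 4:
        return None
    return ".".join(parts[-2:])


print(client_id_from_ga_cookie("GA1.2.1234567890.1600000000"))
# 1234567890.1600000000
```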

Step 2: Find the Client ID associated with PII

If you are capturing Client ID as a Custom Dimension, you can add this as a secondary dimension to your event report.
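
If the event report is too long to scan by eye, you can export it with Client ID as the secondary dimension and let a script do the squinting. The sketch below assumes the export has been loaded into a list of dictionaries; the column names (`client_id`, `event_category`, `event_action`, `event_label`) are hypothetical, and the pattern only looks for email addresses, which is just one kind of PII.

```python
import re

# A deliberately simple email pattern; real PII detection needs more than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def client_ids_with_pii(rows, pii_fields=("event_category", "event_action", "event_label")):
    """Return the sorted, de-duplicated Client IDs whose event fields contain an email."""
    flagged = set()
    for row in rows:
        if any(EMAIL_PATTERN.search(row.get(field, "")) for field in pii_fields):
            flagged.add(row["client_id"])
    return sorted(flagged)


# Made-up rows standing in for an exported event report.
rows = [
    {"client_id": "111.222", "event_category": "form", "event_label": "submit"},
    {"client_id": "333.444", "event_category": "form", "event_label": "jane@example.com"},
    {"client_id": "333.444", "event_category": "form", "event_label": "jane@example.com"},
]
print(client_ids_with_pii(rows))  # ['333.444']
```

The output is the list of Client IDs to carry into the next step.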

Step 3: Get rid of it

Copy the Client IDs associated with this PII, take them to the crematorium (your User Explorer report tab), and smolder those bad boys.

To do this, simply paste in the Client IDs associated with PII that you discovered earlier, click into each ID in the User Explorer report, then scroll to the bottom left and BURN IT WITH FIRE (or just click delete).

Method #3: Narrow Your Data from the Top Down

If you ask me (and you shouldn’t, cause I’d love to take credit), I invented this one. 

Let’s pretend you didn’t set up your Client ID as a Custom Dimension and you aren’t using reCAPTCHA v3. Here’s the rest of the scenario:

You woke up Monday morning and received a Google Analytics Alert email telling you that sessions went from a daily average of 230 to a daily average of 3,000 last Wednesday. There have been no major campaigns, no recent dev releases, and you haven’t migrated content.

Here are some ways to go about identifying and isolating what is (likely) bad data, and then officially filter it out of your reports.

Typically, bot or spam traffic carries with it a smattering of other dimensions which populate your bloated session count. Meaning, if you have a bot infestation, it will usually generate a value for most of the following dimensions:

  • Browser
  • Browser Version
  • City ID
  • Country
  • Landing Page
  • Language (by the way, a language of ‘c’ is not ‘Canadian’; it’s the default “C” locale that scripts and crawlers tend to report, aka, BOT!)
  • Operating System
  • Device
  • Operating System Version
  • Screen Resolution
  • Flash Version
  • … and more!

The trick is to find the exact combination of data which accounts for 100% of your suspected bot traffic. The more specific you can make it, the better. 

For instance, even if you knew that bot traffic was coming from Safari version 11.1.2, you wouldn’t want to apply a filter which got rid of all Safari Version 11.1.2. Yes, you may be getting rid of all of the bot traffic, but at the same time you may be getting rid of 10% of your legitimate traffic (Ok, maybe <1% … BURN NOTICE, Safari! That’s for ITP. But I digress …). 

To test out your combo of values, and to see if your theories hold water, dig into your reports, grab a pencil and paper, and write down the dimension values that correspond to nonsense or inflated data. This is where your detective work comes into play! Once you have your theory, head over to the top of your Reporting interface and create a new Advanced Segment.

Click “+ New Segment”.

Next, click “Conditions” in the Advanced section of the left-hand menu.

At this point, grab some of the dimension and metric values you’ve written down along the way.

Be sure to use “and” between each metric and dimension condition, rather than “or”. Otherwise your logic won’t work. Start applying this to your Advanced Segment. You WON’T be permanently changing any data, so feel free to take some risks. It won’t bite.

Tip: I like to start building this segment as “include” only, and not exclude. Why? Because if you come up with five or six perfectly aligned dimension values that account for 100% of the bot traffic, you can then look through reports with ONLY bot traffic and see if there are MORE specific dimension values you can apply. The more specific your conditions, the more focused your exclusion will be for reporting, and the less likely you are to get rid of legitimate traffic.
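
The same “include first, then narrow” test can be sketched offline against an exported report. This is a minimal sketch, not how the Advanced Segment builder works internally: the rows, the dimension names, and the session counts are made up to mirror the 230-to-3,000 scenario, and both function names are hypothetical. Every condition has to match (the “and” logic), and running the same conditions against a normal day gives a rough idea of how much legitimate traffic you would throw out.

```python
def matches(row: dict, conditions: dict) -> bool:
    """AND logic: a row is in the segment only if every condition matches."""
    return all(row.get(dim) == value for dim, value in conditions.items())


def segment_sessions(rows: list, conditions: dict) -> int:
    """Total sessions for the rows an include-only segment would keep."""
    return sum(row["sessions"] for row in rows if matches(row, conditions))


# Made-up exports: one normal day (230 sessions) and the spike day (3,000).
normal_day = [
    {"browser": "Safari", "browser_version": "11.1.2", "language": "en-us",
     "screen_resolution": "1440x900", "sessions": 20},
    {"browser": "Chrome", "browser_version": "92.0", "language": "en-us",
     "screen_resolution": "1920x1080", "sessions": 210},
]
spike_day = [
    {"browser": "Safari", "browser_version": "11.1.2", "language": "en-us",
     "screen_resolution": "1440x900", "sessions": 22},
    {"browser": "Safari", "browser_version": "11.1.2", "language": "c",
     "screen_resolution": "800x600", "sessions": 2770},
    {"browser": "Chrome", "browser_version": "92.0", "language": "en-us",
     "screen_resolution": "1920x1080", "sessions": 208},
]

broad = {"browser": "Safari", "browser_version": "11.1.2"}
narrow = {**broad, "language": "c", "screen_resolution": "800x600"}

for name, conditions in (("broad", broad), ("narrow", narrow)):
    caught = segment_sessions(spike_day, conditions)
    collateral = segment_sessions(normal_day, conditions)
    print(f"{name}: {caught} sessions on the spike day, {collateral} on a normal day")
```

In this made-up data the broad segment also matches 20 sessions on a normal day, which are real visitors you would lose, while the narrow one matches none. That is the whole argument for getting as specific as you can before you turn the segment into an exclusion.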

What about you? Do you have any other tried and true, possibly hidden, or little-known methods of kickin’ those bots to the curb? Hit me up!

Originally Published: September 8, 2021

December 15, 2023