How Do I Get Session-Level Attribution in Google Analytics 4 BigQuery?

Estimated Reading Time: 12 minutes
August 14, 2024

In the world of digital analytics, understanding user behavior across sessions is crucial for making informed business decisions. Google Analytics 4 (GA4) has revolutionized how we collect and analyze data, but with this change comes new challenges, particularly when working with BigQuery (BQ) exports. One of the most daunting tasks analysts face is recreating session-level attribution in BQ to match the numbers seen in the GA4 user interface. This blog will guide you through the intricacies of this process and provide valuable insights to help you navigate these murky waters.

Key Considerations

Missing Session Data

Unlike Universal Analytics (UA), GA4 BQ primarily stores data at the event and user level, so session-level metrics are not readily available. For any period before July 19, 2024 (when Google added a session-scoped traffic source record to the export), we have to roll event-level data up into sessions ourselves.

Referencing the Right Traffic Source

To bridge the gap, we need to leverage the collected_traffic_source field within the event schema. This field captures traffic source data associated with each event, providing a snapshot of the immediate source that triggered the user’s action.

When you look at the schema in BQ, you’ll find two fields: traffic_source and collected_traffic_source.  

traffic_source

  • traffic_source is scoped at the user level (first visit) and indicates which source brought the user to your website for the very first time. It corresponds to the “First user” dimensions in the UI and API.
  • This user-level traffic parameter (source, medium, campaign name) will persist throughout the user’s lifetime on your website and is a good indicator of the performance of your acquisition campaigns for new users.
  • The user-level fields that can be queried in BQ are traffic_source.source, traffic_source.medium, and traffic_source.name (the campaign name).
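As a minimal sketch, a first-user acquisition breakdown could be queried like this. The project and dataset name (`your-project.analytics_123456789`) and the date range are placeholders, not values from this article:

```sql
-- Users by first-user source / medium / campaign.
-- Assumption: `your-project.analytics_123456789` is a placeholder for your own
-- GA4 export dataset; the date range is arbitrary.
SELECT
  traffic_source.source AS first_user_source,
  traffic_source.medium AS first_user_medium,
  traffic_source.name   AS first_user_campaign,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240601' AND '20240630'
GROUP BY first_user_source, first_user_medium, first_user_campaign
ORDER BY users DESC;
```

Because traffic_source is user-scoped, every event from the same user carries the same values, so this breakdown says nothing about which source started a given session.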

collected_traffic_source 

  • The collected_traffic_source RECORD contains the traffic source data that was collected with the event. Effectively, GA4 just strips data out of the event (utms, gclid, dclid, referrer, etc.) into that field with no processing.
  • This field will help us to build session-level attribution in BQ.
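  • The event-level fields that can be queried in BQ include collected_traffic_source.manual_source, manual_medium, manual_campaign_name, manual_campaign_id, manual_term, manual_content, gclid, dclid, and srsltid.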
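One possible way to roll these event-level values up to a session is to keep the first event in each session that carries any traffic information. The sketch below assumes device-based session keys and a placeholder dataset name; treating "first event with a source or gclid" as the session's source is our simplification, not necessarily GA4's exact processing:

```sql
-- Roll event-level collected_traffic_source up to one row per session.
-- Assumptions: placeholder dataset name; the session key is device-based
-- (user_pseudo_id + ga_session_id).
WITH events AS (
  SELECT
    CONCAT(user_pseudo_id,
           CAST((SELECT value.int_value
                 FROM UNNEST(event_params)
                 WHERE key = 'ga_session_id') AS STRING)) AS session_id,
    event_timestamp,
    collected_traffic_source.manual_source        AS source,
    collected_traffic_source.manual_medium        AS medium,
    collected_traffic_source.manual_campaign_name AS campaign,
    collected_traffic_source.gclid                AS gclid
  FROM `your-project.analytics_123456789.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20240601' AND '20240630'
),
first_touch AS (
  SELECT
    session_id,
    -- Keep source/medium/campaign/gclid from the same event by aggregating a STRUCT.
    ARRAY_AGG(
      IF(source IS NOT NULL OR gclid IS NOT NULL,
         STRUCT(source, medium, campaign, gclid),
         NULL)
      IGNORE NULLS ORDER BY event_timestamp LIMIT 1
    )[SAFE_OFFSET(0)] AS touch
  FROM events
  WHERE session_id IS NOT NULL  -- CONCAT returns NULL when ga_session_id is missing
  GROUP BY session_id
)
SELECT
  session_id,
  touch.source   AS session_source,
  touch.medium   AS session_medium,
  touch.campaign AS session_campaign,
  touch.gclid    AS session_gclid
FROM first_touch;
```

Sessions where no event carried a source or gclid come back with NULLs here; those are the sessions that need the last non-direct lookback described later.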

session_traffic_source_last_click

  • Google added a new record to the GA4 BQ Event export on July 19, 2024. This record contains the Google Ads and Manual campaign context information for each session in the export. This information is repeated for every event in a session. This makes it easy to analyze channels in BQ for any event within a session. The exported data is the same as the data used in GA4 behavioral reporting.
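For dates on or after July 19, 2024, a session-level channel breakdown can read this record directly. A minimal sketch, again with a placeholder dataset name and a device-based session key:

```sql
-- Sessions by session-scoped source / medium / campaign.
-- Assumption: placeholder dataset name; dates chosen to fall after the record
-- was added to the export.
SELECT
  session_traffic_source_last_click.manual_campaign.source        AS session_source,
  session_traffic_source_last_click.manual_campaign.medium        AS session_medium,
  session_traffic_source_last_click.manual_campaign.campaign_name AS session_campaign,
  session_traffic_source_last_click.google_ads_campaign.campaign_name AS google_ads_campaign,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value
          FROM UNNEST(event_params)
          WHERE key = 'ga_session_id') AS STRING))) AS sessions
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240720' AND '20240731'
GROUP BY session_source, session_medium, session_campaign, google_ads_campaign
ORDER BY sessions DESC;
```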

Referencing the Right Attribution Model

By default, session-level attribution in GA4 follows the last non-direct click attribution model.

The last non-direct click attribution model in GA4 is also known as the “paid and organic last click” model. This model attributes 100 percent of the conversion credit to the last non-direct touchpoint in a user’s journey before converting. Here’s a detailed explanation:

Key characteristics:

  1. If the current session has a source/medium in its session_start event, the model attributes 100 percent of the credit to that traffic touchpoint.
  2. If the current session has no source/medium, the model looks back through the same user’s past sessions for the last non-direct traffic touchpoint.
  3. It ignores direct traffic unless the entire conversion path consists only of direct visits.
  4. This model is useful for understanding which marketing efforts are most effective in driving conversions immediately before the conversion occurs.

Example scenario:

Let’s consider a user’s journey to purchasing a product on an e-commerce website:

  1. Day 1: User clicks on a display ad (Display)
  2. Day 3: User finds the website through organic search (Organic Search)
  3. Day 5: User clicks on a paid search ad (Paid Search)
  4. Day 7: User visits the website directly (Direct)
  5. Day 7: User completes the purchase (Conversion)

In this scenario, the last non-direct click attribution model would attribute 100% of the conversion credit to Paid Search, even though it wasn’t the last touchpoint before the conversion.

Here’s how the model evaluates this path:

  1. It starts from the conversion and moves backward.
  2. It ignores the direct visit on Day 7.
  3. It identifies Paid Search as the last non-direct touchpoint.
  4. It assigns 100% of the conversion credit to Paid Search.

This model is particularly useful when you want to identify which marketing channels are most effective at driving conversions while acknowledging that users often visit a site directly just before converting. By ignoring direct traffic (unless it’s the only traffic in the path), it helps highlight the impact of your marketing efforts.
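The backward walk described above can be sketched in BQ with a window function. The query below is self-contained: it uses the example journey as inline data rather than a real export, the user key and dates are made up for illustration, and a direct visit is represented as a NULL channel (no source collected):

```sql
-- Last non-direct click on the example journey.
-- Assumptions: inline sample rows (hypothetical user and dates);
-- NULL channel = direct visit.
WITH sessions AS (
  SELECT * FROM UNNEST([
    STRUCT('user_a' AS user_key, DATE '2024-07-01' AS session_date, 'Display' AS channel),
    ('user_a', DATE '2024-07-03', 'Organic Search'),
    ('user_a', DATE '2024-07-05', 'Paid Search'),
    ('user_a', DATE '2024-07-07', CAST(NULL AS STRING))  -- direct visit, converts
  ])
)
SELECT
  user_key,
  session_date,
  channel AS collected_channel,
  -- Most recent non-NULL channel up to and including this session;
  -- fall back to Direct only when the whole path so far is direct.
  COALESCE(
    LAST_VALUE(channel IGNORE NULLS) OVER (
      PARTITION BY user_key
      ORDER BY session_date
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ),
    'Direct'
  ) AS attributed_channel
FROM sessions
ORDER BY session_date;
-- The 2024-07-07 direct session is attributed to Paid Search.
```

This version looks back without any time limit; the lookback window still has to be applied on top of it.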

Referencing the Right Lookback Window

The lookback window in GA4 determines how far back in time an interaction is considered for attribution credit. For acquisition key events (first_open and first_visit), the default is 30 days and can be shortened to 7 days; for all other key events, the default is 90 days and can be set to 30 or 60 days. This setting controls how far past interactions can influence current attribution, and changes apply only going forward. Your BQ code needs to use the same lookback window that is selected in the admin settings.
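One way to enforce a lookback window in BQ is a RANGE window frame over the session start time. The sketch below assumes a 30-day window purely for illustration (use whatever your property is set to) and runs on made-up inline rows where NULL again stands for a direct visit:

```sql
-- Last non-direct click limited to a 30-day lookback window.
-- Assumptions: inline sample rows (hypothetical); 30 days = 2,592,000 seconds
-- is illustrative and should match your admin setting.
WITH sessions AS (
  SELECT * FROM UNNEST([
    STRUCT('user_a' AS user_key,
           TIMESTAMP '2024-05-01 10:00:00+00' AS session_start,
           'Paid Search' AS channel),
    ('user_a', TIMESTAMP '2024-05-20 09:00:00+00', CAST(NULL AS STRING)),  -- 19 days later
    ('user_a', TIMESTAMP '2024-07-01 09:00:00+00', CAST(NULL AS STRING))   -- 61 days later
  ])
)
SELECT
  user_key,
  session_start,
  COALESCE(
    LAST_VALUE(channel IGNORE NULLS) OVER (
      PARTITION BY user_key
      ORDER BY UNIX_SECONDS(session_start)
      -- Only sessions that started within the previous 30 days are considered.
      RANGE BETWEEN 2592000 PRECEDING AND CURRENT ROW
    ),
    'Direct'
  ) AS attributed_channel
FROM sessions
ORDER BY session_start;
-- The May 20 session inherits Paid Search; the July 1 session falls outside
-- the window and is attributed to Direct.
```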

Referencing the Right Calculations Based on Reporting Identity

The selected reporting identity also plays a role in how sessions are calculated. When calculating different metrics in BigQuery, ensure you are using the correct methodology.

For example:

The standard method of counting sessions for GA4 properties is counting the unique combinations of user_pseudo_id/user_id and ga_session_id regardless of the timeframe. In UA, sessions would reset at midnight. If you follow the UA model, calculate sessions for each day, and add them up to get a total session count, you would be double-counting the sessions that span across multiple days. Depending on your selected Reporting Identity, the user and session count calculation method will have to be updated.

For example, for Blended reporting, you would use the expression below in BQ to create a unique session ID:

    concat(coalesce(user_id, user_pseudo_id),
           ifnull((select coalesce(cast(value.int_value as string), value.string_value)
                   from unnest(event_params)
                   where key = 'ga_session_id'), '')
    ) as session_id

And for device-based reporting, the following:

    concat(user_pseudo_id,
           cast((select value.int_value
                 from unnest(event_params)
                 where key = 'ga_session_id') as string)
    ) as session_id
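To see the double-counting effect on your own data, the two counting methods can be compared side by side. This is a rough sketch using the device-based key and a placeholder dataset name:

```sql
-- Compare GA4-style session counting with a UA-style "sum of daily counts".
-- Assumption: placeholder dataset name; device-based reporting identity.
WITH keyed AS (
  SELECT
    event_date,
    CONCAT(user_pseudo_id,
           CAST((SELECT value.int_value
                 FROM UNNEST(event_params)
                 WHERE key = 'ga_session_id') AS STRING)) AS session_id
  FROM `your-project.analytics_123456789.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20240601' AND '20240630'
)
SELECT
  -- Unique sessions over the whole range (GA4 method).
  (SELECT COUNT(DISTINCT session_id) FROM keyed) AS sessions_over_range,
  -- Daily unique sessions added together; sessions that cross midnight count twice.
  (SELECT SUM(daily_sessions)
   FROM (SELECT COUNT(DISTINCT session_id) AS daily_sessions
         FROM keyed
         GROUP BY event_date)) AS sessions_summed_by_day;
```

The gap between the two numbers is the number of extra counts contributed by sessions that span more than one day.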

If the reporting identity is set to “Blended” and Consent Mode is implemented, Consent Mode heavily impacts what you see in the UI vs. what you see in BQ. Event totals will match, but any breakdown that requires stitching data together (e.g., traffic attribution) is handled in the UI by Consent Mode’s black-box modeling (if it’s enabled), which the BQ export does not include.

Other Issues and Recommended Solutions

1. Issues in source/medium where gclid/dclid is present

This is a bug in the GA4 export causing incorrect attribution of event traffic data, particularly affecting Google search ads, Display & Video, and Campaign Manager traffic. While Google is addressing this issue in the coming quarters, there’s a workaround you can implement using SQL queries in BQ. Here’s the general approach for gclid traffic:

  • Aggregate traffic source information at the session level.
  • Identify sessions with gclid parameters in the page_location.
  • Overwrite source and medium with hardcoded values “google/cpc” for sessions with gclid parameters.
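The three steps above might look roughly like this in BQ. The dataset name is a placeholder, and using the first non-null source and medium as the session's collected values is a simplifying assumption:

```sql
-- Overwrite source/medium with google/cpc for sessions that carry a gclid.
-- Assumption: placeholder dataset name; device-based session key.
WITH events AS (
  SELECT
    CONCAT(user_pseudo_id,
           CAST((SELECT value.int_value
                 FROM UNNEST(event_params)
                 WHERE key = 'ga_session_id') AS STRING)) AS session_id,
    event_timestamp,
    collected_traffic_source.manual_source AS source,
    collected_traffic_source.manual_medium AS medium,
    (SELECT value.string_value
     FROM UNNEST(event_params)
     WHERE key = 'page_location') AS page_location
  FROM `your-project.analytics_123456789.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20240601' AND '20240630'
),
sessions AS (
  -- Step 1: aggregate traffic source information at the session level.
  SELECT
    session_id,
    ARRAY_AGG(source IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS source,
    ARRAY_AGG(medium IGNORE NULLS ORDER BY event_timestamp LIMIT 1)[SAFE_OFFSET(0)] AS medium,
    -- Step 2: flag sessions where any page_location contains a gclid parameter.
    LOGICAL_OR(REGEXP_CONTAINS(page_location, r'[?&]gclid=')) AS has_gclid
  FROM events
  WHERE session_id IS NOT NULL
  GROUP BY session_id
)
-- Step 3: hardcode google / cpc for flagged sessions.
SELECT
  session_id,
  IF(has_gclid, 'google', source) AS source,
  IF(has_gclid, 'cpc', medium)    AS medium
FROM sessions;
```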

While this fix corrects source and medium attribution, it can’t fully repair campaign values for certain ad types like Performance Max campaigns. For more comprehensive campaign data correction, consider setting up Google Ads Data Transfer, which allows joining GA4 with Google Ads data using the gclid parameter as a join key.

This approach allows you to ‘repair’ not just the source and medium fields, but also the campaign field. Here’s how it works:

  • Extract the gclid parameter for every session where it’s available.
  • Join this data with campaign information from the Google Ads data transfer in BQ.
  • Retrieve detailed campaign information, including campaign names, on a session-by-session basis.

Prerequisites:

Before implementing this solution, ensure you have set up your Google Ads Data Transfer to BQ.
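As a hedged sketch of that join: the Google Ads table names (`ads_ClickStats_1234567890`, `ads_Campaign_1234567890`), the `google_ads` dataset, and the column names `click_view_gclid`, `_DATA_DATE`, and `_LATEST_DATE` are assumptions about how the transfer is laid out in your project, so check them against your own transfer schema before relying on this:

```sql
-- Recover campaign names per session by joining on gclid.
-- Assumptions: placeholder GA4 dataset; hypothetical Google Ads transfer dataset,
-- table suffix (customer ID), and column names. Verify against your schema.
WITH events AS (
  SELECT
    CONCAT(user_pseudo_id,
           CAST((SELECT value.int_value
                 FROM UNNEST(event_params)
                 WHERE key = 'ga_session_id') AS STRING)) AS session_id,
    COALESCE(
      collected_traffic_source.gclid,
      REGEXP_EXTRACT(
        (SELECT value.string_value
         FROM UNNEST(event_params)
         WHERE key = 'page_location'),
        r'[?&]gclid=([^&#]+)')
    ) AS gclid
  FROM `your-project.analytics_123456789.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20240601' AND '20240630'
),
ga4_sessions AS (
  -- One gclid per session (MAX ignores NULLs).
  SELECT session_id, MAX(gclid) AS gclid
  FROM events
  WHERE session_id IS NOT NULL
  GROUP BY session_id
)
SELECT
  s.session_id,
  s.gclid,
  c.campaign_id,
  c.campaign_name
FROM ga4_sessions AS s
LEFT JOIN `your-project.google_ads.ads_ClickStats_1234567890` AS cs
  ON cs.click_view_gclid = s.gclid
LEFT JOIN `your-project.google_ads.ads_Campaign_1234567890` AS c
  ON c.campaign_id = cs.campaign_id
 AND c._DATA_DATE = c._LATEST_DATE  -- latest snapshot of campaign attributes
WHERE s.gclid IS NOT NULL;
```

If the click table holds more than one row per gclid, deduplicate it before joining so sessions are not multiplied.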

2. iOS ATT challenges

  • gclid (Google Click Identifier) is not present on iOS devices due to ATT restrictions.
  • gbraid and wbraid parameters, which are ATT-compliant alternatives to gclid, don’t make it into BQ exports.
  • GA4’s integration with BQ is unable to decrypt gclid, leading to incomplete or inaccurate attribution data.
  • Workaround: add manual traffic parameters (UTMs) to Google campaigns, even when auto-tagging is enabled. This won’t affect the GA4 UI, as GA4 doesn’t allow manual override for auto-tagged landing pages, but it will let you see campaign values in BQ exports.

3. Data delays in GA4 

GA4 users often encounter a significant challenge when it comes to same-day or day-before traffic source processing. This issue can lead to substantial fluctuations in attributed sessions and conversions when reviewing recent data. Let’s explore this problem and discuss some practical solutions.

3.1 The attribution delay issue

GA4’s attribution model struggles with processing recent data accurately. When examining yesterday’s or today’s data, you’ll likely notice significant changes in attributed sessions and conversions over the next day or two. This inconsistency can be frustrating for marketers and analysts who rely on timely and accurate data for decision-making.

Stabilization Timeline

Based on experience, it typically takes until about 2-3pm the following day for traffic attribution to reach an acceptable level of accuracy. At this point, the data usually settles to within 2-3 percent of the ultimate “golden record” processing for the previous day’s data.

BQ Solutions

Those on a GA4 360 license can use the Fresh Daily export option to get data into BQ faster. The good news is that the schema is identical to the Daily export type, so your existing queries can be switched over to Fresh Daily with little effort.

4. Inflated numbers in the Direct category 

The last non-direct attribution model, as implemented in BQ, will assign traffic as Direct if it finds only null values in the entire conversion path. Because of this, you might see inflated numbers in the Direct category. In the UI, since this data is processed, you see null, (not set), and direct as separate categories, but in BQ they all fall under the Direct category.

Is It Worth It? BQ vs API route

When it comes to reporting on GA4 metrics, you have two main options: the GA4 dataset in BQ and the API. Each approach offers distinct advantages and disadvantages, particularly regarding data accuracy and stakeholder perception.

BigQuery: The Quest for Exact Figures

If achieving pinpoint accuracy is your priority, BQ is the way to go. The data in BQ reflects the most precise figures available. Here’s the upside:

  • Unmatched Accuracy: BQ offers the most accurate representation of your data.
  • Cost-Effective for Most Businesses: The query fees are usually negligible for most companies.
  • Customized Attribution: Empowers internal teams to comprehend the underlying logic embedded within queries, fostering transparency and providing the opportunity to develop custom attribution models tailored to specific business needs. Once modifications are implemented, BQ’s robust processing capabilities make it effortless to update historical data, ensuring consistency and accuracy across all time periods.

However, there are a couple of crucial caveats:

  • Prepare for Explanations: Be ready to explain discrepancies between BQ data and the GA4 interface values. Stakeholders might question the difference, and justifying BQ’s accuracy can be time-consuming.
  • Steep learning curve: You will require advanced SQL skills to create all the calculations in BQ.

API over BQ: Prioritizing User Experience

If you want to avoid constant explanations and maintain a familiar user experience, the API might be a better fit. Here’s why:

  • Values Match GA Interface: Data retrieved via the API aligns closely with what stakeholders see in the GA4 interface. This reduces confusion and resistance.
  • Batch Requests for Monthly Accuracy: Batching API requests by month can significantly improve the accuracy of user metrics.

However, keep these limitations in mind:

  • Subject to quota or token limits: There are still known issues with quota/token limits when using the API. 
  • Attribution Challenges: BQ’s source, medium, and channel attribution can’t be perfectly replicated with the API. You’ll need to develop your own logic for multi-session attribution.
  • Unavailability of Segments: Importing Segments from GA4 to Looker Studio is currently not an option.
  • Sampling: Sampling may occur if the report requests more events than the GA4 quota limit (10 million events for GA4 standard vs. 1 billion events for GA4 360). There is no indicator that the data is being sampled.
  • Separate pulls for each data scope: User-, session-, and event-scoped fields need to be requested separately, so adjust requirements and replace any metrics that mix scopes.

The Takeaway: Tailor Your Approach

The optimal choice depends on your priorities. If data precision is paramount, BQ is the champion. But if minimizing stakeholder confusion and maintaining a user-friendly experience are your goals, the API might be the better option. Remember, you can also leverage the API to retrieve data precisely as it appears in the GA4 interface.

By understanding the strengths and weaknesses of each approach, you can make an informed decision that aligns with your reporting goals and stakeholder needs.

Conclusion

While creating session-level attribution in BQ can be challenging, understanding these nuances and following best practices can help you achieve results that closely align with the GA4 UI. Remember that perfect alignment may not always be possible due to the complexities of data processing and attribution logic in GA4.

By investing time in mastering these concepts and techniques, you’ll be better equipped to extract valuable insights from your GA4 data in BQ, enabling more informed decision-making for your organization.

Do you have questions about session-level attribution in Google Analytics 4?

Our team is here to help whenever you need us.

Last Updated: August 14, 2024
