In the latest episode of our ‘Calling Kevin’ video series, we show you how to customise your GA4 report library – updating your Google Analytics reporting interface to include a new, personalised collection of reports.
Follow these quick and easy steps to begin tailoring your GA4 report library menu and navigation for a more efficient reporting experience. We also share a helpful refresher on how to work with topics, templates, plus more!
For more quick GA4 tips, be sure to check out other videos from our ‘Calling Kevin’ series.
The post (Video) How to customise your Report library in GA4 appeared first on Lynchpin.
At Lynchpin, we’ve spent considerable time testing and deploying Luigi in production to orchestrate some of the data pipelines we build and manage for clients. Luigi’s flexibility and transparency make it well suited to a range of complex business requirements and to seamlessly support more collaborative ways of working.
This blog post draws on our hands-on experience with the tool – we’ve stress-tested it to understand how it performs day to day in real-world contexts. We’ll walk through what Luigi is, why we like it, and where it could be improved, then share some practical tips for those considering using it to enhance their data orchestration processes and data pipeline capabilities.
Luigi is an open-source tool developed by Spotify that helps automate and orchestrate data pipelines. It allows users to define dependencies and build tasks with custom logic using Python, offering flexibility and a fairly low barrier to entry for its quite complex functionality.
Despite Spotify’s introduction of a newer orchestration tool, Flyte, Luigi is still widely used by many major brands and continues to receive updates – allowing it to keep maturing as a reliable choice for a range of data orchestration use cases.
Luigi sits amongst many popular tools used for data orchestration in the data engineering space – some of which are paid, while others are similarly open source.
Another tool we’ve used for data orchestration is Jenkins. Although it isn’t designed for more heavy-duty pipelines, we’ve found it to work very well as a lightweight orchestrator, managing tasks and dependencies.
In the following section, we’ll break down some benefits of using Luigi for your data pipelines and a few reasons why you may choose it over a comparable tool such as Jenkins.
Transparent version control:
One of the key advantages of Luigi is that it’s written in Python. This gives you transparent version control over your data pipelines – every change is committed and traceable: you know exactly what change has been made, you can inspect it, and you can see who made it and when. This becomes even more powerful when linked to a CI/CD pipeline, as we do for some of our clients, because the repository then becomes the single source of truth for the pipeline.
With Jenkins, for example, changes can be made and it’s not necessarily obvious what was changed or by which team member (unless explicitly communicated) – which becomes increasingly important when you’re managing more complex data pipelines with many moving parts and dependencies.
Dependency handling and custom logic capabilities:
Managing data pipeline dependencies is where Luigi truly stands out. In a tool like Jenkins, downstream tasks can be orchestrated, but this often requires careful scheduling or wrapper jobs, which can become complicated and manual depending on the complexity of your needs. Luigi simplifies this and enables smoother automation by letting you define all dependencies directly in Python, supporting logic such as: ‘Run a particular job only after a pipeline completes, and only do this on a Sunday or if it failed the previous Sunday.’
This level of custom logic is trivial in Python but can be difficult to replicate in Jenkins, where often the only option is a scheduled Sunday run with no conditions around it.
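To make this concrete, here is a minimal sketch of how that kind of conditional dependency could look in Luigi. The task names, output paths, and the ‘Sunday only’ rule are our own illustrative assumptions rather than a real client pipeline:

import datetime

import luigi


class PipelineComplete(luigi.Task):
    """Stand-in for the final task of the upstream pipeline."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"output/pipeline_{self.date:%Y%m%d}.done")

    def run(self):
        with self.output().open("w") as f:
            f.write("done\n")


class WeeklyReport(luigi.Task):
    """Hypothetical downstream job that only depends on the pipeline on Sundays."""
    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        # Plain Python decides the dependency: only require the upstream
        # pipeline when the run date is a Sunday.
        if self.date.weekday() == 6:
            return PipelineComplete(date=self.date)
        return []

    def output(self):
        return luigi.LocalTarget(f"output/weekly_report_{self.date:%Y%m%d}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("report placeholder\n")

Because the condition is just Python, extending it to ‘or if it failed the previous Sunday’ is a matter of adding another branch rather than fighting a scheduler.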
Pipeline failure handling:
Luigi considers all tasks idempotent. Once a task has run, it’s marked as ‘done’ and won’t be re-run unless you manually remove its output. This is a particularly useful feature if you have big, complex pipelines and only need to re-run certain jobs that have failed. You won’t need to re-run everything, but can find the failed task, delete its output file, and save time when re-executing the job.
Backfilling at the point of a task:
Luigi handles backfilling easily by allowing users to pass parameters directly into tasks.
This allows you to retrieve historical data (for example, backfilling from the beginning of last year to present) without having to change the script or config files.
Luigi treats each distinct set of parameters as a distinct task, so even if the job has previously run, passing new parameter values will simply run it again for those values rather than skipping it.
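As a rough illustration (the task and path names are ours, not a prescribed pattern), a date-parameterised task can be backfilled over an arbitrary range simply by building one task per date:

import datetime

import luigi


class ExtractDailyData(luigi.Task):
    """Hypothetical daily extract, parameterised by date."""
    date = luigi.DateParameter()

    def output(self):
        # The parameter forms part of the output path, so each date is a
        # distinct task that Luigi tracks (and skips) independently.
        return luigi.LocalTarget(f"output/extract_{self.date:%Y%m%d}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write(f"data for {self.date}\n")


if __name__ == "__main__":
    # Backfill from the start of last year to today without touching the
    # task code or any config files.
    start = datetime.date(datetime.date.today().year - 1, 1, 1)
    days = (datetime.date.today() - start).days
    luigi.build(
        [ExtractDailyData(date=start + datetime.timedelta(days=i)) for i in range(days + 1)],
        local_scheduler=True,
    )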
Efficient to set up, host, and use alongside existing infrastructure:
While tools such as Apache Airflow may require a Kubernetes cluster (and more) to get running, Luigi, by contrast, is far simpler to host. You can run it on a basic VM (virtual machine) or on a cloud platform such as Google Cloud Platform, using a Cloud Run job. This makes it a great choice for smaller data pipelines or client-specific pipelines where you may want to decouple from the main infrastructure.
Market maturity and active use and development by many large brands:
Luigi has been used by a host of major brands over the years – including Squarespace, Skyscanner, Glossier, SeatGeek, Stripe, Hotels.com, and more – which is integral to its maintenance and viability as a good open-source tool. Its core functionality rarely changes, making it a stable and reliable choice; the updates we’ve experienced are primarily focused on maintaining security rather than big overhauls of its functionality, which brings us to a few of its shortfalls…
Limited frontend and UI:
Luigi’s frontend leaves a lot to be desired. Firstly, it only really shows you jobs that are running or have recently succeeded, so if you have many jobs running in one day, the History tab fails to give you a useful overview.
When something fails, you’ll be notified and you can inspect logs in a location you specify in advance; however, it would be nice if the frontend provided a good summary of this information instead.
Workarounds do exist, such as saving your task history (e.g., tasks that ran, the status, how long they took, etc) in a separate table (for example, Postgres) where it can be visualised in an external run dashboard – providing a more personalised frontend for better monitoring, visibility into run times, failure rates, and so on.
Setting something like this up would provide more feature parity with a tool such as Jenkins, which, by contrast, does a great job at providing stats and visual indicators for task history, job health, what’s running, and more – right out of the box.
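As a sketch of that workaround, Luigi’s built-in event handlers can write a row per task run into a history table. We’ve used SQLite here purely to keep the example self-contained – the table name and columns are illustrative, and in practice this would point at something like the Postgres table mentioned above:

import datetime
import sqlite3

import luigi

# Illustrative local store standing in for a Postgres run-history table.
conn = sqlite3.connect("task_history.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS task_history (task_id TEXT, status TEXT, recorded_at TEXT)"
)


def record(task, status):
    conn.execute(
        "INSERT INTO task_history VALUES (?, ?, ?)",
        (task.task_id, status, datetime.datetime.utcnow().isoformat()),
    )
    conn.commit()


@luigi.Task.event_handler(luigi.Event.SUCCESS)
def on_success(task):
    record(task, "SUCCESS")


@luigi.Task.event_handler(luigi.Event.FAILURE)
def on_failure(task, exception):
    record(task, f"FAILURE: {exception}")

A dashboard pointed at this table can then show statuses and failure rates over time.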
Documentation could be improved:
While Luigi provides all the key documentation you need, it’s not always the easiest to find or navigate – compared to tools such as dbt, the documentation feels sparse in places, especially when dealing with more advanced features or plugins.
For instance, helpful features such as dependency diagrams or task history tracking require installing separate modules – a process that isn’t particularly well explained in the official documentation.
In many instances, users may find themselves gaining the most clarity about how the tool works by trying things out and learning as they go.
Python path issues – everything must be explicit or Luigi will struggle to find it:
To avoid a barrage of ‘module not found’ errors, Luigi will need to know exactly where everything lives in your environment.
A workaround we found useful is creating a shell script that sets all the necessary paths and everything else Luigi may need to run successfully.
While something like this may take a little time to set up, it’s a small amount of upfront effort to improve your workflow in Luigi and avoid issues in the long run.
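We describe doing this with a shell script; purely for illustration, the same idea in Python might look like the sketch below, where the project path, module, and task names are all hypothetical:

import os
import subprocess
import sys

# Hypothetical project layout.
PROJECT_ROOT = "/opt/pipelines/my_project"

# Make sure every module Luigi needs is importable before it starts.
env = os.environ.copy()
env["PYTHONPATH"] = os.pathsep.join(
    p for p in [PROJECT_ROOT, os.path.join(PROJECT_ROOT, "tasks"), env.get("PYTHONPATH", "")] if p
)

# Launch a task with the local scheduler (module and task names are examples).
subprocess.run(
    [sys.executable, "-m", "luigi", "--module", "tasks.weekly_report",
     "WeeklyReport", "--local-scheduler"],
    env=env,
    check=True,
)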
We think Luigi is a powerful data orchestration tool for anyone who is comfortable with Python, has experience managing data pipelines, and is happy to get to grips with a few quirks that can make onboarding a bit challenging.
If you’re looking for an alternative to tools like Apache Airflow or Jenkins, Luigi is definitely worth trying out. While we recognise that its UI and documentation are lacking compared to other tools in this space, we found that Luigi’s version control, dependency handling, and logic capabilities make it a handy tool for a range of our clients’ use cases.
The post Data pipelines using Luigi – Strengths, weaknesses, and some top tips for getting started appeared first on Lynchpin.
In this blog, we discuss the development of an automated testing project, using the AI and automation capabilities of Cursor to scale and enhance the robustness of our data testing services. We walk through project aims, key benefits, and considerations when leveraging automation for analytics testing.
We manage a JavaScript library that is deployed to numerous sites, and we upgrade it on an ongoing basis to include improvements and enhancements. The library integrates with different third-party web analytics tools and performs a number of data cleaning and manipulation actions. Once we upgrade the library, our main priorities are:
Feature testing: Verify new functionality across different sites/environments
Regression testing: Ensure existing functionality has not been negatively affected across different sites
To achieve this, we conduct a detailed testing review across different pages of the site. This involves performing specific user actions (such as page views, clicks, search, and other more exciting actions) and ensuring that the different events are triggered as expected. We capture outgoing network requests to vendors such as Adobe Analytics or Google Analytics through the browser’s developer tools or a network debugging tool (e.g., Charles) and verify that the correct events are triggered and the relevant parameters are captured accurately in the network requests. By ensuring that all events are tracked with the right data points, we can confirm that both new features and the existing setup are working as expected.
To optimise this process and reduce the manual effort involved, we developed an automated testing tool designed to streamline and speed up data testing. As an overview, this tool automatically simulates user actions on different sites and different browsers, triggering the associated events, and then checks network requests to ensure that the expected events are fired, and the correct parameters are captured.
In the era of AI, automation is a key driver of efficiency and increased productivity. Automating testing processes offers several key benefits to our development and data testing capabilities, such as:
We chose Python as the primary scripting language, as it offers flexibility for handling complex tasks. Python’s versatility and extensive libraries made it an ideal choice for rapid development and iteration.
For simulating a variety of user interactions and conducting tests across multiple browsers, we selected Playwright. Playwright is a powerful open-source automation tool/API for browser automation. It supports cross-browser data testing (including Chrome, Safari, Firefox), allowing us to validate network requests across a broad range of environments.
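As a simplified sketch of the approach (the site URL and expected event name are placeholders, and a real test suite checks many more parameters), Playwright’s request listener makes it straightforward to capture outgoing GA4 hits while an action is simulated:

from playwright.sync_api import sync_playwright

# Illustrative site URL and expected GA4 event name.
SITE_URL = "https://www.example.com"
EXPECTED_EVENT = "page_view"


def run_check():
    captured = []

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Record every outgoing GA4 hit so we can inspect its parameters.
        page.on(
            "request",
            lambda request: captured.append(request.url)
            if "google-analytics.com/g/collect" in request.url
            else None,
        )

        page.goto(SITE_URL)
        page.wait_for_load_state("networkidle")
        browser.close()

    # Very simple assertion: at least one hit carried the expected event name.
    assert any(f"en={EXPECTED_EVENT}" in url for url in captured), (
        f"No GA4 request with en={EXPECTED_EVENT} was captured"
    )


if __name__ == "__main__":
    run_check()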
We used the Cursor AI code editor to optimise the development process and quickly set up the tool. Cursor’s proprietary LLM, optimised for coding, enabled us to design and create scripts efficiently, accelerating development by streamlining debugging and iteration. Cursor’s AI assistant (chat sidebar) boosted productivity by providing intelligent code suggestions and speeding up investigation. We’ll dive into our experience using Cursor a bit further in the next section.
Lastly, we chose Flask to build the web interface where users can select different types of automated testing. Flask is a lightweight web framework for Python, which we’ve had experience with on other projects. It has its pros and cons, but a key benefit for this project was that it allowed us to get started quickly and focus more on the nuts and bolts of the program.
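For context, a stripped-down version of that interface could be as small as the sketch below – the route, suite names, and form are placeholders rather than the tool’s actual implementation:

from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical test suites a user could pick from the web interface.
TEST_SUITES = ["feature", "regression"]

FORM = """
<form method="post">
  <select name="suite">
    {% for suite in suites %}<option>{{ suite }}</option>{% endfor %}
  </select>
  <button type="submit">Run tests</button>
</form>
"""


@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        suite = request.form["suite"]
        # In the real tool this would kick off the Playwright checks;
        # here we just acknowledge the selection.
        return f"Started {suite} test run"
    return render_template_string(FORM, suites=TEST_SUITES)


if __name__ == "__main__":
    app.run(debug=True)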
Cursor AI played a crucial role in taking this project from ideation to MVP. By carefully prompting Cursor’s in-editor AI assistant, we were able to achieve the results we wanted. The tool allowed us to focus on the core structure of the program and the logic of each test without getting bogged down in documentation and finicky syntax errors.
Cursor also gave us the capability to include specific files, documentation links, and diagrams as context for prompts. This allowed us to provide relevant information for the model to find a solution. Compared to an earlier version of GitHub Copilot that we tested, we thought this was a clear benefit in leading the model to the most appropriate outcome.
Another useful benefit of Cursor AI was the automated code completion, which could identify bugs and propose fixes, as well as suggest code to add to the program. This feature was useful when it understood the outcome we were aiming for, which it did more often than not.
However, not everything was plain sailing, and our experience did reveal some drawbacks to using AI code editors to be mindful of. For example, relying too much on automated suggestions can distance you from the underlying code, making it harder to debug complex issues independently. It was important to review the suggested code and use Cursor’s helpful in-editor diffs to clearly see the proposed changes. This also allowed us to accept or reject those changes, giving us a good level of control.
Another drawback we noticed is that AI-generated code may not always follow best practices or be optimised for performance, so it’s crucial to review and validate the output carefully. For example, Cursor tended to create monolithic scripts instead of separating functionality into components, such as tests and Flask-related parts, which would be easier to manage in the long term.
Another point we noticed was that over-reliance on AI tools could easily lead to complacency, potentially affecting our problem-solving skills and creativity as developers. When asking Cursor to make large changes to the codebase, it can be easy to simply accept all changes and test whether they work without fully understanding the impact. When developing without AI assistance (like everyone did a couple of years ago), it’s better to make specific, relatively small changes at a time to reduce the risk of introducing breaking changes and to better understand the impact of each change. The same seems a sensible approach when working with a tool like Cursor.
The automated testing tool we developed significantly streamlined and optimised the data testing process in a number of key ways:
With AI, the classic engineering view of ‘why spend 1 hour doing something when I can spend 10 hours automating it?’ has now become ‘why spend 1 hour doing something when I can spend 2-3 hours automating it?’. In this instance, Cursor allowed us to lower the barrier for innovation and create a tool to meet a set of tight deadlines, whilst also giving us a feature-filled, reusable program moving forwards.
The post Automated testing: Developing a data testing tool using Cursor AI appeared first on Lynchpin.
In the latest episode of our ‘Calling Kevin’ video series, we show you how to clean up and filter URLs using a few simple expressions in Looker Studio.
By applying these Regular Expressions (RegEx), you can easily remove duplicates, fix casing issues, and tidy up troublesome URL data to standardise GA4 reporting – just as you would have been able to in Universal Analytics.
Expressions used:
For more quick GA4 tips, be sure to check out other videos from our ‘Calling Kevin’ series.
The post (Video) Applying RegEx filtering in Looker Studio to clean up and standardise GA4 reporting appeared first on Lynchpin.
How do you know what’s working and what isn’t – and how do you plan for success – as the tides of digital measurement continue to change?
The themes of privacy, measurement and marketing effectiveness triangulate around a natural trade-off and tension: balancing the anonymity of our behaviours and preferences against the ability of brands to reach us relevantly and efficiently.
In this briefing our CEO, Andrew Hood, gives you a practical and independent view of current industry trends and how to successfully navigate them.
Building on the themes introduced in the webinar, our white paper lays out an in-depth look at the privacy trends, advanced measurement strategies, and balanced approach you can take to optimise marketing effectiveness.
Unlock deep-dive insight and practical tips you can begin implementing today to guide your focus over the coming months.
To access a copy of the slides featured in the webinar, click the button below.
The post Webinar: Navigating Recent Trends in Privacy, Measurement & Marketing Effectiveness appeared first on Lynchpin.
The concept of marketing mix modelling (often referred to as just ‘MMM’) has been around for a while – as early as the 1960s in fact – which should be no surprise, as the business challenge of what marketing channels to use and where best to spend your money has always been the essence of good marketing, at least if somebody is holding you accountable for that spend and performance!
Marketing mix modelling has its foundations in statistical techniques and econometric modelling, which still holds largely true today. However, the mix of channels and advancements in end-to-end analytics create new challenges to be tackled, not least the expectations of what MMM is and what it can deliver.
In reality, there are various analytics techniques that can be undertaken to answer the overall business question: ‘how do my channels actually impact sales?’. In this blog we will answer some common questions about MMM, address some common (comparable) techniques, and share how and when you might look to choose one method over another.
MMM is a statistical technique, with its roots in regression, that aims to analyse the impact of various marketing tactics on sales over time (other KPIs are also available!). Marketing mix modelling will consider all aspects of marketing to do this, such as foundational frameworks like ‘The 4 Ps of Marketing’ (Product, Price, Place and Promotion).
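To make the regression intuition concrete, here is a toy example in Python using entirely made-up weekly data – a real MMM would also model adstock (carry-over) and diminishing returns rather than fitting a plain linear relationship:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy weekly data: spend per channel and sales (entirely made up).
rng = np.random.default_rng(42)
weeks = 104
df = pd.DataFrame({
    "tv_spend": rng.uniform(0, 100, weeks),
    "search_spend": rng.uniform(0, 50, weeks),
    "display_spend": rng.uniform(0, 30, weeks),
})
df["sales"] = (
    500
    + 3.0 * df["tv_spend"]
    + 5.0 * df["search_spend"]
    + 1.5 * df["display_spend"]
    + rng.normal(0, 50, weeks)
)

# Regress sales on channel spend; the fitted coefficients give a rough
# estimate of the incremental sales per unit of spend on each channel.
X = sm.add_constant(df[["tv_spend", "search_spend", "display_spend"]])
model = sm.OLS(df["sales"], X).fit()
print(model.summary())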
MMM is similar to econometric modelling in terms of techniques used, however there are some key differences. On the whole, econometrics is broader in its considerations and applications, often encompassing aspects of general economic factors in relation to politics, international trade, public policy and more. MMM, on the other hand, focusses more specifically on marketing activities and their impact on business outcomes.
You might also come across the term ‘media mix modelling’ (with the same unhelpful acronym, ‘MMM’). Much like econometrics, media mix modelling tends to differ from marketing mix modelling in its scope and general objective. Media mix modelling has an even narrower focus than marketing mix modelling; as the name implies, it’s aimed more specifically at optimising a mix of media channels, with a focus on optimising advertising spend.
Whether it’s marketing mix modelling or media mix modelling you are looking at, the key is to consider the business question you are looking to answer and ensure your model is trained using the best input variables to answer that question – nothing new in the world of a good analytics project!
In recent years, the general trend has been to measure everything, integrate everything, and link all of your data together, leaving no doubt about who did what, when, and to what end. However, increasing concerns (or at least considerations) around data privacy and ethics have caused marketers to take a second look at how they collect and utilise their data.
There is a growing need to adapt to new privacy regulations, but also a greater desire to respect an individual’s privacy and find better ways to understand what marketing activities drive positive or negative outcomes.
With limitations on the ability to track third-party cookies, approaches such as marketing attribution may become more difficult to implement, although the effectiveness of these data sources is in itself doubtful. And with consent management becoming increasingly granular, even first-party measurement can leave gaps in your data collection.
However, the power that marketing attribution gave marketers is well recognised now, and the desire to continue to be data-led is only increasing. Machine learning has become a commonplace tool for filling the gaps that are creeping back into the tracking of user behaviour. Organisations are also increasingly eager to build on what they have learnt from these joined-up customer journeys, and there is that need again to look across the whole of marketing, not just digital touchpoints, and replicate that approach in a more holistic way.
So in summary, while marketing mix modelling has never gone away, it is now seeing a revival as an essential tool in a marketer’s toolbelt.
MMM is a great tool for any organisation looking to be more data-led in their approach to planning and analysing marketing activities. Key benefits of MMM include:
Ability to measure and optimise the effectiveness of marketing and advertising campaigns:
The purpose of MMM is to measure the impact of your marketing activities on your business outcomes. A well-built marketing mix model will enable you to quantify ROI by channel and make better data-led decisions on the mix of marketing activities that will lead to more optimised campaigns.
Natural adeptness at cross-channel insights:
With increasing limitations on tracking users across multiple channels, MMM neatly sidesteps these restrictions by using data at an aggregated level. By its very nature it doesn’t require linking user identities across different devices or tracking individuals through offline channels.
Enables more strategic planning and budgeting:
MMM provides data-driven insight to inform budget planning processes. Its outputs are transparent, allowing organisations to understand the impact each of their channels have on business outcomes and how those channels influence each other within the mix. By incorporating MMM with other tools for scenario planning, spend optimisation and forecasting, organisations can better understand what happened in the past to plan more effectively for the future.
Can be used when granular level data is not available:
As mentioned earlier, MMM works with data at an aggregated level. This offers more flexibility when looking to integrate data inputs into your decision making such as:
Has a longer-term focus:
MMM is a powerful technique for longer-term planning and for assessing the impact of campaigns that don’t necessarily provide immediate impact (e.g. brand awareness campaigns, TV, display advertising, etc.). By incorporating MMM into a measurement strategy, businesses can ensure longer-term activity is appropriately considered.
Earlier in this blog we looked at how marketing mix modelling compares to econometrics and media mix modelling. Another very important modelling approach to consider when looking at marketing effectiveness is marketing attribution.
Marketing attribution differs from marketing mix modelling in a number of important ways – most importantly by relying on a more granular approach. It looks to assign weightings to each individual touchpoint on the customer journey, incorporating each user’s journey and determining whether that journey leads to a successful conversion or not.
This very detailed understanding of how each customer interacts with your channels can be very powerful, but also very complex and time-consuming to both collect and analyse; in addition, with the increasing limitations on tracking individuals without their consent, you may end up having to rely on only a partial picture of the user journey.
While of course it is possible to model on a subset of data, you would need to be careful that the user journey you are looking to understand is not unfairly weighted to those channels (or individuals) that are easier to track.
Marketing attribution also uses a wider range of modelling algorithms, from the simple (linear, time-decay) to the more complex (Markov Chains, Game Theory, ML models). This range of models to select from can be both a benefit and a hindrance, with difficulties arising when you’re not sure what marketing attribution model will suit your business needs best.
Marketing mix modelling does have its own drawbacks to consider too. The biggest consideration when determining if MMM is suitable for you is to understand how much historical data you have.
While a marketing attribution model can work on just a few months of data (so long as it has decent volume and is fairly representative of your typical user journeys), MMM relies on trends over longer periods of time – typically a minimum of 2 years’ worth of data is advised before undertaking an MMM project. MMM also works best when looking at the broader impact marketing has on your goals. Therefore, if you need to analyse specific campaign performance or delve deeper into specific channels, then marketing attribution will be the better bet.
In a previous blog, we discussed the merits of using both marketing attribution and MMM side by side to provide a more powerful and comprehensive understanding of marketing effectiveness.
While a marketing attribution model will focus on individual touchpoints and their contributions, MMM will take a holistic view, considering the overall impact of marketing inputs. By combining these two approaches, marketers can gain a more complete picture of how different marketing elements work together to drive business outcomes, and demystify the balance needed across marketing activity for maximum business performance.
Marketing mix modelling is a very powerful and well-established statistical technique. Most marketers should be at least exploring the benefits and insight it provides into the relationship between marketing activity and business performance to optimise planning and decision making.
One barrier to entry in starting an MMM project can be navigating what may appear to be a complex set of approaches and techniques. While variations of MMM do exist – econometrics, marketing mix modelling, and media mix modelling – the key difference lies in the scope and objective of the business question you aim to answer. Successfully choosing and developing a model depends on fully understanding your business needs and the data available to you. Investing time upfront to determine what you are looking to achieve is essential to getting the right outcomes.
MMM is best used for strategic planning and determining longer term impacts of your marketing activities. Therefore, if you require more in-depth campaign and channel analysis, then marketing attribution may be more suitable for your business needs. However, it’s important to note that MMM and marketing attribution can work side by side to develop a more complete picture of your marketing activities. While MMM allows greater flexibility when working with a mix of channels that are both tracked and not tracked, the ability of marketing attribution to provide a more granular analysis of your marketing journeys, channels, and campaigns allows for day-to-day optimisation of your marketing activities alongside the longer-term strategy set out by your MMM insights.
If you are ready to explore MMM, marketing attribution, or anything in between, we’d be delighted to discuss your needs in more detail.
The post Benefits of marketing mix modelling: Why is MMM so popular right now? appeared first on Lynchpin.
Google has recently updated their GA4 ‘Data Import’ feature to finally support Custom Event metadata. This is a significant development, but before we dive in, let’s remind ourselves of a key point: despite its name, which can give false hope after an outage, ‘Data Import’ is NOT a solution for repopulating lost data. It is, however, a powerful tool for augmenting existing data with information that isn’t directly collected in GA4. Common sources that we find our clients wanting to integrate include CRM systems, offline sales data, or other third-party analytics tools.
When would Custom Event Data Import be useful?
Well, there are many cases:
The information we import might not be available until after collection. This could include data that is processed or generated by third-party tools after the event has already occurred. A prime example would be cost data for non-Google ad clicks and impressions.
Some information might not be something we want exposed on our site. Importing such data ensures it remains secure and is only used for internal analysis. These might include things like a product’s wholesale price, or a user’s lifetime customer value.
Information collected offline, such as in-store purchases or interactions, could be integrated with your existing GA4 data to allow for a more complete view of customer behaviour across both online and offline touchpoints.
Although Data Import already supported Cost, Product, and User-scoped data, what was conspicuously absent until now was the ability to import data directly scoped to existing Custom Events. This is particularly significant because, as Google likes to remind us, GA4 is ultimately event-based.
To understand if this development could be useful for you, consider the events you already track. Is there any information directly related to these events and their custom dimensions that you don’t collect in GA4, but have available offline or in another tool? If so, Custom Event data import could be very handy.
It’s been a long and somewhat painful journey with GA4, but it’s great to see it gradually becoming feature complete.
Of course, if you’re looking to augment your GA4 data with information available at the point of collection, Lynchpin would recommend harnessing the power of a server-side GTM implementation to enrich the data before it even arrives in GA4 itself.
For more information on server-side GTM and its advantages we highly recommend reading the blogs below:
To discuss any of the topics mentioned in this blog or to find out how Lynchpin can support you with any other data and analytics query, please do not hesitate to reach out to a member of our team.
The post Google (finally) supports Custom Event Data Import in GA4 appeared first on Lynchpin.
Here at Lynchpin, we’ve found dbt to be an excellent tool for the transformation layer of our data pipelines. We’ve used both dbt Cloud and dbt Core, mostly on the BigQuery and Postgres adapters.
We’ve found developing dbt data pipelines to be a really clean experience, allowing you to get rid of a lot of boilerplate or repetitive code (which is so often the case when writing SQL pipelines!).
It also comes with some really nice bonuses like automatic documentation and testing along with fantastic integrations with tooling like SQLFluff and the VSCode dbt Power User extension.
As with everything, as we’ve used the tool more we have found a few counter-intuitive quirks that left us scratching our heads a little bit, so we thought we’d share our experiences!
All of these quirks have workarounds, so we’ll share our thoughts plus the workarounds that we use.
Summary:
Incremental loads in dbt are a really useful feature that allows you to cut down on the amount of source data a model needs to process. At the cost of some extra complexity, they can vastly reduce query size and the cost of the pipeline run.
For those who haven’t used it, this is controlled through the is_incremental() macro, meaning you can write super-efficient models like this:
SELECT *
FROM my_date_partitioned_table
{% if is_incremental() %}
WHERE date_column > (SELECT MAX(date_column) FROM {{ this }})
{% endif %}
This statement is looking at the underlying model and finding the most recent data based on date_column. It then only queries the source data for data after this. If the table my_date_partitioned_table is partitioned on date_column, then this can have massive savings on query costs.
Here at Lynchpin, we’re often working with the GA4 → BigQuery data export. This free feature loads a new BigQuery table events_yyyymmdd every day. You can query all the daily export tables with a wildcard * and also filter on the tables in the query using the pseudo-column _TABLE_SUFFIX:
SELECT *
FROM `lynchpin-marketing.analytics_262556649.events_*`
WHERE _TABLE_SUFFIX = '20240416';
The problem is incremental loads just don’t work very nicely with these wildcard tables – at least not in the same way as a partitioned table in the earlier example.
-- This performs a full scan of every table - rendering
-- incremental load logic completely useless!
SELECT
  *,
  _TABLE_SUFFIX AS source_table_suffix
FROM `lynchpin-marketing.analytics_262556649.events_*`
{% if is_incremental() %}
WHERE _TABLE_SUFFIX > (SELECT MAX(source_table_suffix) FROM {{ this }})
{% endif %}
This is pretty disastrous because scanning every daily table in a GA4 export can be an expensive query, and running this every time you load the model doesn’t do your cloud budget any favours.
The reason this happens is down to a quirk in the query optimiser in BigQuery – we have a full explanation and solution to it at the end of this blog if you want to fix this yourself.
The sql_header() macro is used to run SQL statements before the code block of your model runs, and we’ve actually found it to be necessary in the majority of our models. For instance, you need it for user defined functions, declaring and setting script variables, and for the solution to quirk #1.
The problem is that the sql_header() macro isn’t really fit for purpose and you run into a few issues:
dbt supports different environments, which can be easily switched at runtime using the --target command line flag. This is great for keeping a clean development environment separate from production.
One thing we did find a little annoying was configuring different data sources for your development and production runs, as you probably don’t want to have to run on all your prod data every time you run your pipeline in dev. Even if you have incremental loads set up, a change to a table schema soon means you need to run a full refresh which can get expensive if running on production data.
One solution is reducing the amount of data using a conditional like so:
{% if target.name == 'dev' %}
  AND date_column BETWEEN TIMESTAMP('{{ var("dev_data_start_date") }}')
                      AND TIMESTAMP('{{ var("dev_data_end_date") }}')
{% endif %}
This brings in extra complexity to your codebase and is annoying to do for every single one of your models that query a source.
The best solution we saw to this was here: https://discourse.getdbt.com/t/how-do-i-specify-a-different-schema-for-my-source-at-run-time/561/3
The solution is to create a dev version of each source in the YAML file, named {source name}_dev (e.g. my_source_dev for the dev version of my_source), and then have a macro that switches which source is used based on the target value at runtime.
Another example in this vein: getting dbt to enforce foreign key constraints requires this slightly ugly expression to switch between schemas in the schema.yaml file:
- type: foreign_key
  columns: ["blog_id"]
  expression: "`lynchpin-marketing.{{ 'ga4_reporting_pipeline' if target.name != 'dev' else 'ga4_reporting_pipeline_dev' }}.blogs` (blogs)"
Let’s revisit:
SELECT *
FROM `lynchpin-marketing.analytics_262556649.events_*`
WHERE _TABLE_SUFFIX = '20240416';
This is fine – the table scan performed here only scans tables with a suffix equal to 20240416 (i.e. one table), and bytes billed is 225 KB.
OK, so how about only wanting to query from the latest table?
If we firstly wanted to find out the latest table in the export:
-- At time of query, returns '20240416'
SELECT MAX(_TABLE_SUFFIX)
FROM `lynchpin-marketing.analytics_262556649.events_*`
This query actually has no cost!
Great, so we’ll just put that together in one query:
SELECT *
FROM `lynchpin-marketing.analytics_262556649.events_*`
WHERE _TABLE_SUFFIX = (
  SELECT MAX(_TABLE_SUFFIX)
  FROM `lynchpin-marketing.analytics_262556649.events_*`
)
Hang on… what!?
BigQuery’s query optimiser isn’t smart enough to evaluate the inner query first and use that value to reduce the scope of tables scanned in the outer query – so the outer query still scans every daily table.
Here’s our solution, which involves a slightly hacky way to ensure the header works in both incremental and non-incremental loads. We implemented this in a macro to make it reusable.
{% call set_sql_header(config) %}
  DECLARE table_size INT64;
  DECLARE max_table_suffix STRING;

  SET table_size = (
    SELECT size_bytes
    FROM {{ this.dataset }}.__TABLES__
    WHERE table_id = '{{ this.table }}'
  );

  IF table_size > 0 THEN
    SET max_table_suffix = (SELECT MAX(max_table_suffix) FROM {{ this }});
  ELSE
    SET max_table_suffix = '{{ var("start_date") }}';
  END IF;
{% endcall %}

-- Allows for using max_table_suffix to filter source data.
-- Example usage:
SELECT *
FROM {{ source('ga4_export', 'events') }}
{% if is_incremental() %}
WHERE _table_suffix > max_table_suffix
{% endif %}
We hope you found this blog useful. If you happen to use any of our solutions or come across any strange quirks yourself, we’d be keen to hear more!
To find out how Lynchpin can support you with data transformation, data pipelines, or any other measurement challenges, please visit our links below or reach out to a member of our team.
The post Working with dbt & BigQuery: Some issues we encountered and their solutions appeared first on Lynchpin.
In the latest episode of our ‘Calling Kevin’ video series, our Senior Data Consultant, Kevin, tackles a common issue many users face in Google Analytics: an increase in the number of ‘(not set)’ landing page values in GA4.
In the video below, Kevin covers:
For more quick GA4 tips, be sure to check out other videos from our ‘Calling Kevin’ series.
The post (Video) ‘(not set)’ landing page values in GA4: Explained appeared first on Lynchpin.
In the latest from our ‘Calling Kevin’ video series, our Senior Data Consultant, Kevin, covers a few common questions about user metrics in GA4.
In this quick walkthrough, Kevin runs through the definitions and differences between the user metrics available in GA4 – some of which may be familiar to those experienced with Universal Analytics, while other changes are exclusive to GA4, causing some confusion for Google Analytics users both new and old.
For more quick GA4 tips, be sure to check out other videos from our ‘Calling Kevin’ series.
The post (Video) Breaking down user metrics in GA4 appeared first on Lynchpin.