ON DATA ENGINEERING
How to handle the difficulty in leveraging and tracking time
There is a lot of focus on engagement analysis to track the time customers spend on different pieces of content. Time spent, however, is a metric that tends to suffer from quite a lot of data quality issues.
Time spent is an essential metric in engagement studies. It provides a general measure of engagement with the content on your website or the popularity of your app, and it allows us to compare across different types of material, such as pictures, text content, or even videos.
It provides a metric that informs on the share of your audience's time you can potentially capture, and helps identify your most engaged users.
The time spent metric also helps identify possible usability issues on the website. For instance, if too much time is spent during the checkout process, the overall checkout flow may need some work.
Time Spent definition
There is some work needed to establish how to define time spent for the specific use case being tackled. Do we consider any time spent when a tab is open? Do we only consider it when the tab is active? Or do we only consider time spent when there is some activity on the page?
Measuring time spent
There are different approaches to measuring time spent. The calculation can be done either on the front-end or a posteriori, by backtracking the information from the data collected.
There are three main methods of tracking time spent.
- Rely on message pings
- Rely on exit events
- Rely on event sessionization
The client (web browser or app) sends ping messages at a regular interval (e.g., every 10s), providing information about which content or page is currently active.
An example of how this could work is provided in the diagram above. In the example, the client sends pings every second of activity. To compute the time spent on each page, we sum the number of events and multiply by the interval (1s). There are five events for the landing page, so its time spent is 5 seconds; the checkout page has six events, hence 6 seconds of time spent; and the thank-you page has four events, for 4 seconds of time spent. Since no other page follows it, we can assume the page exit happened 4 seconds after reaching the thank-you page.
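As a rough sketch of this counting logic, the page names, timestamps, and 1s interval below are taken from the example above:

```python
from collections import Counter

# Hypothetical ping events as (page, client_timestamp) pairs,
# mirroring the example: one ping per second of activity.
PING_INTERVAL_S = 1

pings = (
    [("landing", t) for t in range(0, 5)]        # 5 pings
    + [("checkout", t) for t in range(5, 11)]    # 6 pings
    + [("thank-you", t) for t in range(11, 15)]  # 4 pings
)

# Time spent per page = number of pings * ping interval.
counts = Counter(page for page, _ in pings)
time_spent = {page: n * PING_INTERVAL_S for page, n in counts.items()}

print(time_spent)  # {'landing': 5, 'checkout': 6, 'thank-you': 4}
```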
This type of approach is taken by Parse.ly, a content optimization platform used on Medium. It can be used both for events while a tab is active/in focus and for background activity. The document visibility state provides the information as to whether a page is currently visible to the user.
Caution is needed if you are looking to capture time spent while a tab is inactive: most browsers clamp the minimum timer interval (setTimeout/setInterval) to 1000ms for inactive tabs.
The main advantage of using a heartbeat method is that you have a continuous feed of activity and can quickly pinpoint when there is a drop-off in user behavior. This approach is also quite resilient to crashes or data loss (only about one interval's worth of time spent is likely to be lost at a time).
There are further advantages to leveraging this approach. The total time spent can easily be computed by simply summing the time spent reported by each event, a calculation that can be done quickly even in real-time applications.
However, this kind of approach has the drawback of increasing the overall data volume flowing into the data platform while providing only limited information. A second drawback is the increased bandwidth and number of requests, which can potentially slow down the website.
Calculating time spent based on exit events relies on two types of events being present: 1) entry events and 2) exit events.
Entry events are typically events such as page views that mark the page as being the active page. Another typical type of entry event is the tab being put back into focus.
Exit events need to be sent in several scenarios: when clicking a different link, when refreshing the page, when typing a different URL in the browser, or when closing the tab or the browser directly. The typical way exit events are implemented is by hooking into the window unload event.
Exit-event time spent tracking requires only a limited volume of events to be sent. It can provide exact timing as to when a user exited. It also offers a reasonably easy way to check how much time was spent on a given page: you just need to look at the specific exit events. Calculating the total amount of time spent is also easy to do in real time; the aggregate can be obtained by summing the time spent of the individual events.
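A minimal sketch of this aggregation, assuming (hypothetically) that each exit event carries the page, its entry timestamp, and the exit timestamp:

```python
# Hypothetical exit events; the field names and values are
# illustrative assumptions, not a standard schema.
exit_events = [
    {"page": "landing", "entered_at": 0, "exited_at": 5},
    {"page": "checkout", "entered_at": 5, "exited_at": 11},
    {"page": "thank-you", "entered_at": 11, "exited_at": 15},
]

# Per-page time spent is carried directly by each exit event...
per_page = {e["page"]: e["exited_at"] - e["entered_at"] for e in exit_events}
# ...and the session total is just the sum of the individual events.
total = sum(per_page.values())

print(per_page, total)  # {'landing': 5, 'checkout': 6, 'thank-you': 4} 15
```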
On the downside, this kind of approach is not very resilient to data loss. A missing event can have a disproportionate impact on the time spent tracking. There also needs to be an exit event generated at all for the time spent to be tracked, which makes this approach tricky when tabs remain open for days or months.
It is possible to use activity events to track time spent on a page (well, if this is the definition you go for). The sequence of events is treated as part of a given session, primarily looking for entry and exit events within a lookahead window.
The time spent on a page is the difference between the timestamp of the page's entry event and the greater of the next entry event's timestamp and the timestamp of the last event associated with the page before that next entry event occurred. When no events happen within the lookahead window, the session is considered terminated at that event, and the time spent from that event is assumed to be a default value.
This method is the one used by Google Analytics to track time spent. For this calculation, Google Analytics only leverages a subset of events called interaction events, which are meant to represent users' true active behavior.
In the example above, there is a 7s lookahead window for events to be included in the time spent calculation. Both the landing page and the checkout page have another event present in their lookahead window. Their time spent is therefore calculated as the difference between the first event of the next page and the first event of their respective page: 5s for the landing page and 6s for the checkout page. For the thank-you page, no event was found within the lookahead window, so a default value is assumed for its time spent, 4s in the example.
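The sessionization logic above can be sketched as follows; the event timestamps, the 7s lookahead, and the 4s default are assumptions taken from the example:

```python
LOOKAHEAD_S = 7  # lookahead window from the example
DEFAULT_S = 4    # default time spent when no event follows

# Entry events as (timestamp, page), sorted by time (example values).
events = [(0, "landing"), (5, "checkout"), (11, "thank-you")]

time_spent = {}
for (ts, page), nxt in zip(events, events[1:] + [None]):
    if nxt is not None and nxt[0] - ts <= LOOKAHEAD_S:
        # Another event within the lookahead window: time spent is the gap.
        time_spent[page] = nxt[0] - ts
    else:
        # No event in the window: session terminates, assume the default.
        time_spent[page] = DEFAULT_S

print(time_spent)  # {'landing': 5, 'checkout': 6, 'thank-you': 4}
```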
There are a few advantages of taking this approach: being able to rely on existing tracking, reconstructing the time spent from historical data, and potentially changing the degree of granularity. Leveraging a session-based approach with a lookahead timeout is also able to reduce the number of significant outliers that need to be cleaned up from the data to have something meaningful.
However, there are drawbacks to the approach, such as needing a comprehensive set of events to be triggered to report accurate metrics. Failure to have the right coverage would put you at risk of under-reporting the time spent on the different pages. Another disadvantage is the computational complexity needed to calculate this time spent, making it difficult to rely on this approach for real-time applications.
A good and stable internet connection is not always available: in developing countries, in remote areas, or even when just taking the underground. When this happens and no mitigation strategies have been implemented to deal with these issues, there will be data loss, and its impact will depend on the approach used for time spent tracking.
A typical mitigation mechanism relies on queuing the events in local storage until they can be sent again. This approach needs to be taken with caution, however, as a few things can derail the quality of the data obtained.
First off, consider how the tracking happens: does it need to rely on client time? Client time can be changed on the device itself, leading to incorrect data being collected. Ping/heartbeat tracking is more robust to these issues, as it is collected based on system intervals.
The second major point is how to account for this time spent: should it be attributed to the time of collection or to when the data finally reaches the servers? Restating historical data in big data systems can be complex, and it might be preferable to add the collected time spent to the current date partition. This also avoids backtracking the actual collection time.
Additional issues arise when leveraging exit-event calculations. Exit events need to be tied to a specific entry event. In big data systems, which usually don't leverage an index, retrieving this entry record can be quite expensive. The exit record's timestamp also needs to be computed relative to this entry record, which makes the approach fairly fragile with respect to the client timestamps of both entry and exit events.
It is important to track the right amount of information. The more behavioral information is sent, the more accurate the time spent estimation can be. On top of events such as page views and clicks, including events such as scrolling can provide the right level of data to come up with a time spent estimation.
When doing so, there are a couple of things to take into account, such as making sure that the events sent are tied to a specific page. Otherwise, it can be difficult to attribute the time spent to the right page, when dealing with multiple open tabs.
Another thing to consider is leveraging a “hit sequence”: collecting, on the front-end side, the number of each event within the session. This type of numbering can help identify missing data.
If using the ping/heartbeat approach, it is possible to fix missing data using contextual data fills. If the data is otherwise consistent for a page, for instance, and no exit event has been received, it is possible to reconstruct the missing pings by assuming they should have occurred at the regular intervals.
In the example shown above, we received three pings, at server times 2, 4, and 10. There is a gap of 6s between the second and the last event received; we should normally have received two more events, at server times 6 and 8.
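A minimal sketch of this contextual fill, assuming a hypothetical 2s ping cadence matching the example:

```python
PING_INTERVAL_S = 2  # pings expected every 2s in this example

received = [2, 4, 10]  # server timestamps of the pings received

# Fill gaps between consecutive pings by assuming the regular cadence
# (validity checks against entry/exit events are not modeled here).
filled = [received[0]]
for prev, cur in zip(received, received[1:]):
    t = prev + PING_INTERVAL_S
    while t < cur:
        filled.append(t)  # reconstructed ping
        t += PING_INTERVAL_S
    filled.append(cur)

print(filled)  # [2, 4, 6, 8, 10]
```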
Whether or not it is wise to correct the data depends very much on the overall tracking implementation: entry or exit events that should have happened along the way may provide information proving these assumptions invalid.
Hit sequences sent alongside the different events to the server allow us to identify potentially missing data. Using this information, it is possible to reconstruct the missing events if there is some expectation of what should have happened in that gap. For instance, to go from a checkout page to a thank-you page, we might expect a form input and a form submit event.
A naive way to do this is to fill the gap by spacing each reconstructed event equally apart in time. Another way is to look at the expected time spent for each event type: for hit 3, we would look at the average time spent between a checkout page and a form input, and from a form input to a form submit event.
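The naive equal-spacing fill can be sketched as follows; the hit numbers and timestamps are hypothetical:

```python
# Known events by hit sequence number -> server timestamp;
# hits 3 and 4 are missing in this hypothetical example.
known = {2: 10.0, 5: 16.0}

first_hit, last_hit = 2, 5
span = known[last_hit] - known[first_hit]
n_gaps = last_hit - first_hit

# Naive fill: space the reconstructed hits equally across the gap.
reconstructed = {
    hit: known[first_hit] + span * (hit - first_hit) / n_gaps
    for hit in range(first_hit + 1, last_hit)
}

print(reconstructed)  # {3: 12.0, 4: 14.0}
```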
Extremely large time spent values
There are multiple reasons why extremely large time spent values could end up in the data you are collecting: from people “parking” open tabs, to scrapers going through your website, or just a tab that stayed open longer than expected. Capping the time spent per session limits the impact of these abnormally high values. A typical approach is to cap at the 99th percentile of time spent per session.
There are different approaches to treating this at the event level, from straight-up time spent scaling (multiplying each event's time spent by the ratio of capped time spent to collected time spent) to allocating zero time spent after the cap has been hit.
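Both event-level treatments can be sketched as follows; the 600s cap is a hypothetical stand-in for the 99th-percentile value:

```python
CAP_S = 600  # hypothetical per-session cap, in seconds

def scale_to_cap(event_times, cap=CAP_S):
    """Scale each event's time spent so the session total equals the cap."""
    total = sum(event_times)
    if total <= cap:
        return list(event_times)
    ratio = cap / total
    return [t * ratio for t in event_times]

def zero_after_cap(event_times, cap=CAP_S):
    """Keep time spent until the cap is reached, allocate zero afterwards."""
    out, running = [], 0
    for t in event_times:
        allowed = max(0, min(t, cap - running))
        out.append(allowed)
        running += allowed
    return out

session = [200, 300, 400]      # 900s collected, above the 600s cap
print(scale_to_cap(session))   # each event scaled by 600/900
print(zero_after_cap(session)) # [200, 300, 100]
```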
Time spent is one of the most important metrics to measure in engagement studies. Having a clear definition of what is meant to be collected is paramount for interpreting the data that is collected and avoiding some of its pitfalls.
There are a few approaches to collecting time spent, each with its advantages and drawbacks. The data collected, however, needs to be taken with a grain of salt: bad data is easily generated for time spent, and the data will generally need to be treated to correct missing values or abnormalities.