
Effective approaches to dealing with tracking time spent on content and webpages


ON DATA ENGINEERING

How to handle the difficulty in leveraging and tracking time

Julien Kervizic
Photo by Thomas Bormans on Unsplash

Engagement analysis puts a lot of focus on tracking the time customers spend on different pieces of content. Time spent, however, is a metric that usually suffers from quite a few data quality issues.

It provides a metric that informs on the share of time you can potentially capture from your audience, and it helps identify your most engaged users.

The time spent metric also helps identify possible usability issues on the website. For instance, if too much time is spent during the checkout process, the overall checkout flow might need some work.

Time Spent definition

Measuring time spent

There are three main methods of tracking time spent.

  1. Rely on message pings
  2. Rely on exit events
  3. Rely on event sessionization

Message Pings

An example of how this could work is provided in the diagram above. In the example, the client sends a ping for every second of activity. To compute the time spent on each page, we count the number of events and multiply it by the interval (1s). The landing page has five events, so its time spent is 5 seconds. The checkout page has six events, hence 6 seconds of time spent, while the thank-you page has four events and 4 seconds of time spent. No other page follows it, so we can assume that the page exit happened 4 seconds after reaching the thank-you page.
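The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function name `time_spent_from_pings` and the representation of a ping as a bare page name are assumptions, not part of any tracking library.

```python
from collections import Counter

PING_INTERVAL_S = 1  # assumed heartbeat interval, 1s as in the example


def time_spent_from_pings(pings, interval_s=PING_INTERVAL_S):
    """Return seconds spent per page: number of pings times the interval."""
    counts = Counter(pings)
    return {page: n * interval_s for page, n in counts.items()}


# Ping stream mirroring the example: 5 landing, 6 checkout, 4 thank-you pings.
pings = ["landing"] * 5 + ["checkout"] * 6 + ["thank-you"] * 4
print(time_spent_from_pings(pings))
# {'landing': 5, 'checkout': 6, 'thank-you': 4}
```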

Parsely’s heartbeat tracking event

This type of approach is taken by Parsely, a content optimization tool used on Medium. It can be used to track time only while a tab is active and in focus, or also while the tab sits in the background. The document visibility state indicates whether a page is currently visible to the user.

Caution is needed if you want the time spent to include periods when a tab is inactive. Most browsers throttle timers (setTimeout/setInterval) in inactive tabs to a minimum interval of 1000ms.

The main advantage of the heartbeat method is that you have a continuous feed of activity and can quickly pinpoint when the user drops off. This approach is also quite resilient to crashes or data loss: only one interval's worth of time spent is likely to be lost at a time.

There are other advantages to this approach. The time spent can be computed by simply summing the interval represented by each event, which can be done quickly even in real-time applications.

However, this kind of approach has the drawback of increasing the overall data volume flowing onto the data platforms, while providing only limited information. A second drawback is the increasing bandwidth used and the number of requests potentially slowing down the website.

Exit Events

Entry events are typically events such as page views that mark the page as being the active page. Another typical type of entry event is the tab being put back into focus.

Exit events need to be sent in several scenarios: when clicking on a different link, when refreshing the page, when typing a different URL in the browser, and when closing a tab or the browser directly. Exit events are typically implemented by hooking into a window unload event.

Tracking time spent with exit events requires only a limited volume of events to be sent. It can provide exact timing as to when a user has exited. It also provides a reasonably easy way to check how much time was spent on a given page: you just need to look at the specific exit events. Calculating the total amount of time spent is also easy to do in real time, since the aggregate can be obtained by summing the time spent of the individual events.

On the downside, this kind of approach is not very resilient to data loss. A missing event can have a disproportionate impact on the time spent tracking. It also requires an exit event to be generated before any time spent can be tracked. This can be tricky when tabs remain open for days or months.
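As a minimal sketch of that aggregation, assume each exit event carries the page name plus the entry and exit timestamps. The `ExitEvent` class and its field names are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ExitEvent:
    page: str
    entry_ts: float  # hypothetical field: when the page became active (seconds)
    exit_ts: float   # hypothetical field: when the exit was recorded (seconds)

    @property
    def time_spent_s(self) -> float:
        return self.exit_ts - self.entry_ts


def total_time_spent(exit_events):
    """Sum the time spent carried by each exit event, per page."""
    totals = defaultdict(float)
    for event in exit_events:
        totals[event.page] += event.time_spent_s
    return dict(totals)


events = [
    ExitEvent("landing", entry_ts=0, exit_ts=5),
    ExitEvent("checkout", entry_ts=5, exit_ts=11),
    ExitEvent("landing", entry_ts=20, exit_ts=22),
]
print(total_time_spent(events))
# {'landing': 7.0, 'checkout': 6.0}
```

Because each exit event is self-contained, the same running sum could be maintained incrementally as events arrive.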

Event Sessionization

The time spent on a page is the difference between the page's entry event and the later of two timestamps: the next entry event, or the last event associated with the page before that next entry event. When an event has no other event within its lookahead window, the session is considered terminated at that event, and the time spent from that event is assumed to be a default value.

This is the method Google Analytics uses to track time spent. For this calculation, Google Analytics only leverages a subset of events, called interaction events, that are meant to represent users’ true active behavior.

In the example above, an event must fall within a 7s lookahead window to be included in the time spent calculation. Both the landing page and the checkout page events have another event present in their lookahead window. Their time spent is therefore calculated as the difference between the first event of the next page and the first event of their own page. This amounts to 5s for the landing page and 6s for the checkout page. For the thank-you page, no event has been found within the lookahead window. A default value for the time spent is therefore assumed for this page, 4s in the example.
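The lookahead logic of this example could be sketched as follows. This is a simplified reading of the description, not Google Analytics' actual implementation; the function name, the `(timestamp, page)` tuple layout, and the timestamps 0, 5, and 11 are assumptions chosen to reproduce the 5s and 6s gaps.

```python
from collections import defaultdict


def sessionized_time_spent(events, lookahead_s=7, default_s=4):
    """Attribute time to pages from a list of (timestamp_s, page) events.

    Each event gets the gap to the next event if that event falls within
    the lookahead window; otherwise it gets the default value.
    """
    ordered = sorted(events)
    totals = defaultdict(float)
    for i, (ts, page) in enumerate(ordered):
        has_next = i + 1 < len(ordered)
        if has_next and ordered[i + 1][0] - ts <= lookahead_s:
            totals[page] += ordered[i + 1][0] - ts
        else:
            # No event in the lookahead window: the session ends here.
            totals[page] += default_s
    return dict(totals)


events = [(0, "landing"), (5, "checkout"), (11, "thank-you")]
print(sessionized_time_spent(events))
# {'landing': 5.0, 'checkout': 6.0, 'thank-you': 4.0}
```

Since the calculation only needs timestamps and page names, it can be replayed over historical events with a different lookahead or default value.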

There are a few advantages of taking this approach: being able to rely on existing tracking, reconstructing the time spent from historical data, and potentially changing the degree of granularity. Leveraging a session-based approach with a lookahead timeout is also able to reduce the number of significant outliers that need to be cleaned up from the data to have something meaningful.

However, there are drawbacks to the approach, such as needing a comprehensive set of events to be triggered to report accurate metrics. Failure to have the right coverage would put you at risk of under-reporting the time spent on the different pages. Another disadvantage is the computational complexity needed to calculate this time spent, making it difficult to rely on this approach for real-time applications.

Offline

A typical mitigation mechanism relies on queuing events in local storage until they can be sent again. This approach needs to be taken with caution, however, as a few things can derail the quality of the data obtained.

First, how is the tracking done, and does it need to rely on client time? Client time can be changed on the device itself, leading to incorrect data being collected. Ping/heartbeat tracking is more robust to this issue, as it is collected based on system intervals.

The second major point is how to account for this time spent. Should it be attributed to the time of collection or to the time the data finally reached the servers? It can be complex to restate historical data in Big Data systems, so it might be preferable to add the collected time spent to the current date partition. This also avoids having to backtrack to the actual collection time.

Additional issues can arise when calculating time spent from exit events. Each exit event needs to be tied to a specific entry event. In Big Data systems, which usually don’t leverage an index, it can be quite expensive to retrieve this entry record. The exit record timestamp also needs to be computed relative to this entry record, which makes the calculation reasonably fragile to the client timestamps of both the entry and exit events.

Behavioral Events

When doing so, there are a couple of things to take into account, such as making sure that the events sent are tied to a specific page. Otherwise, it can be difficult to attribute the time spent to the right page, when dealing with multiple open tabs.

Another thing to consider is leveraging a “hit sequence”: the front end attaches to each event its sequence number within the session. This type of numbering can help to identify missing data.
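A sketch of how such a sequence could be checked on the receiving side is shown below. The function name and the assumption that numbering starts at 1 within a session are illustrative choices.

```python
def missing_hits(received_sequence_numbers, first=1):
    """Return the hit sequence numbers absent between `first` and the highest one received.

    Assumes the front end numbers events consecutively starting at `first`.
    Hits lost after the last received event cannot be detected this way.
    """
    received = set(received_sequence_numbers)
    if not received:
        return []
    return [n for n in range(first, max(received) + 1) if n not in received]


print(missing_hits([1, 2, 4, 5, 8]))
# [3, 6, 7]
```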

Missing pings

In the example shown above, we have received three pings, at server time 2, 4, and 10. There is a gap of 6s between the second and the last event received. We would normally have received two more events, at server time 6 and 8.

Whether or not it is wise to correct the data depends very much on the overall tracking implementation, for instance on whether entry or exit events should have occurred along the way and would have shown these assumptions to be invalid.
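Detecting such gaps can be sketched as below, assuming pings are expected at a fixed interval (2s in this example) and that server times line up exactly on that interval. Real data would likely need a tolerance for jitter; the function name is hypothetical.

```python
def missing_ping_times(ping_times, interval_s):
    """Return the expected ping times that fall strictly between received pings."""
    ordered = sorted(ping_times)
    missing = []
    for previous, current in zip(ordered, ordered[1:]):
        expected = previous + interval_s
        while expected < current:
            missing.append(expected)
            expected += interval_s
    return missing


print(missing_ping_times([2, 4, 10], interval_s=2))
# [6, 8]
```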

Hit Sequence

A naive way to do this is to fill the gap by spacing the missing events equally apart in time. Another way to handle it is to look at the expected time spent for each event. For hit 3, we would look at the average time spent between a checkout page view and a form input, and between a form input and a form submit event.
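The naive, equally spaced fill could look like the following sketch. The function name and arguments are hypothetical: it takes the timestamps of the hits received just before and after the gap, and the number of hits known to be missing.

```python
def interpolate_missing_timestamps(before_ts, after_ts, missing_count):
    """Spread `missing_count` timestamps evenly between two received hits."""
    step = (after_ts - before_ts) / (missing_count + 1)
    return [before_ts + step * k for k in range(1, missing_count + 1)]


# Two hits missing between hits received at t=4 and t=10.
print(interpolate_missing_timestamps(4, 10, 2))
# [6.0, 8.0]
```

The alternative described above would replace the constant `step` with the average duration observed for each type of transition.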

Extremely large time spend

There are different approaches for treating this at an event level. They range from straight-up time spent scaling (multiplying each event's time spent by the ratio of capped time spent to collected time spent) to allocating zero time spent to events once the time spent cap has been hit.
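Both treatments can be sketched as follows. The function names are hypothetical, and letting the event that straddles the cap keep a partial value in the second function is one possible interpretation of "zero after the cap".

```python
def scale_to_cap(event_times, cap_s):
    """Scale every event's time spent by capped total / collected total."""
    total = sum(event_times)
    if total <= cap_s:
        return list(event_times)
    ratio = cap_s / total
    return [t * ratio for t in event_times]


def truncate_at_cap(event_times, cap_s):
    """Keep time spent in event order until the cap is reached, then allocate zero."""
    remaining = cap_s
    result = []
    for t in event_times:
        allocated = min(t, remaining)
        result.append(allocated)
        remaining -= allocated
    return result


times = [600, 1200, 1800]  # seconds per event, 3600s in total
print(scale_to_cap(times, cap_s=1800))
# [300.0, 600.0, 900.0]
print(truncate_at_cap(times, cap_s=1800))
# [600, 1200, 0]
```

Scaling preserves the relative weight of each event, while truncation preserves the original values of the earliest events.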

There are a few approaches to collecting time spent, each with its advantages and drawbacks. One, however, needs to take the data collected with a grain of salt. Bad data can easily be generated for time spent, and the data will generally need to be treated to correct either missing data or some of its abnormalities.


Written by Ayodeji Edunjobi
