Rohit Shinde

Counting Numbers: A Debugging Story

What are ads?

In an ad system, there are three main entities: a business that wants to show an ad to people, the group of people who are eligible to see that ad, and the platform which ties the two parties together. The platform generally charges businesses based on the number of views or clicks an ad gets.

Goto is a super-app that serves many use cases: you can order food, book a ride, pay for things, and more. The app serves tens of millions of users. The ads team at Goto monetizes some of the pages within the app by serving ads.

Why do we need to count the numbers?

How do you charge money? Based on the number of views the platform delivers for an ad. That involves counting. A business asks a simple question: “How many people saw my ad campaign today?” It’s a number, and the platform needs to keep a counter against each ad.

So what was the problem?

When I joined the ads team, an issue was brought to the tech team’s attention: some campaigns were not getting enough impressions (starving), while others were getting more than necessary. “More than necessary” means the platform is losing money: the same wasted view could have been served to a starving campaign, and the platform would have earned more.

Some lingo

Before we proceed further, let me set up some terms. Refer back to these whenever an unfamiliar term confuses you.

Architecture

Now, before we point out issues with the setup, let’s quickly paint what the system looked like.

As described above, the communication between the three entities is captured by the high-level diagram above. Below is a detailed architecture. Henceforth, we will ignore the campaign creation flow and focus on impression counting and campaign serving.

So, what’s happening in this?

  1. The mobile app used by consumers sends events (views, clicks ~ impressions) to an internal service called Clickstream.
  2. This service puts these events on a kafka topic. Let’s call it topic 1. Now the app is huge, and all sorts of events are recorded, all of them end up on topic 1. The ads impressions are a subset of them.
  3. Dagger essentially acts as a filter on top of topic 1. It picks the relevant impression events and puts them on another kafka topic. Let’s call it topic 2.
  4. The consumer on topic 2 picks up these events, and keeps a time window against each user session and aggregated counts against each campaign (more on this below).
  5. The Redis box keeps the impression counts against all campaigns. This is where we increment these numbers. Basically a mapping of campaign_id → impression count.

Here is what the event message (on Kafka) looked like; one event is one impression.

{
  campaign_id: <UUID>, // Campaign identifier
  user_id: <UUID>,
  session_id: <UUID>,
  event_type: <"CLICK" | "VIEW">, // Type of the impression
  event_timestamp: <unix_epoch> // Event generation time
}
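A minimal sketch of how such an event might be represented in code. The field names mirror the Kafka message above; the class name and validation are my own additions, not part of the actual service:

```python
import time
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpressionEvent:
    # Mirrors the Kafka message fields above.
    campaign_id: str
    user_id: str
    session_id: str
    event_type: str        # "CLICK" or "VIEW"
    event_timestamp: int   # unix epoch seconds

    def __post_init__(self):
        if self.event_type not in ("CLICK", "VIEW"):
            raise ValueError(f"unknown event_type: {self.event_type}")


# An example event as it might arrive on topic 2
event = ImpressionEvent(
    campaign_id=str(uuid.uuid4()),
    user_id=str(uuid.uuid4()),
    session_id=str(uuid.uuid4()),
    event_type="VIEW",
    event_timestamp=int(time.time()),
)
```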

What is a time window? (point 4 above)

How do counts work?

This is the responsibility of the consumer on topic 2.

  1. Pick a batch of events.
  2. Generate a dedup key for every message (to deduplicate events that fall within the same 40-second window). This was essentially a function of: campaign_id + user_id + event_type + time window identifier, where $$\text{time window identifier} = \Big\lfloor{\frac{\text{unix epoch timestamp}}{40}}\Big\rfloor \times 40$$
  3. Check which of these dedup keys exist in Redis.
  4. Filter out, from the original batch, the messages whose dedup keys already exist in Redis.
  5. For the remaining campaigns, make another batched call to Redis and increment their counts.
  6. For the same list of campaign_ids, set the dedup keys in Redis, so that another event falling within the same window isn’t counted twice (step 4 above).
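The steps above can be sketched as follows. I’m simulating Redis with an in-memory dict here, and the function names are my own; in production these would be batched Redis calls (MGET for the existence checks, INCR/SET for the updates):

```python
WINDOW_SECONDS = 40


def time_window_id(ts: int) -> int:
    # floor(ts / 40) * 40, as in the formula above
    return (ts // WINDOW_SECONDS) * WINDOW_SECONDS


def dedup_key(event: dict) -> str:
    # campaign_id + user_id + event_type + time window identifier
    return ":".join([
        event["campaign_id"],
        event["user_id"],
        event["event_type"],
        str(time_window_id(event["event_timestamp"])),
    ])


def process_batch(events: list[dict], store: dict) -> None:
    """store stands in for Redis: it holds both dedup keys and counters."""
    for e in events:
        key = dedup_key(e)
        if key in store:                 # steps 3-4: already seen this window
            continue
        counter = f"count:{e['campaign_id']}"
        store[counter] = store.get(counter, 0) + 1   # step 5: increment
        store[key] = 1                   # step 6: set the dedup key


store = {}
base = 1_700_000_000
events = [
    {"campaign_id": "c1", "user_id": "u1", "event_type": "VIEW",
     "event_timestamp": base},
    # same user and campaign 10s later -> same 40s window, deduplicated
    {"campaign_id": "c1", "user_id": "u1", "event_type": "VIEW",
     "event_timestamp": base + 10},
    # 50s later -> next window, counted again
    {"campaign_id": "c1", "user_id": "u1", "event_type": "VIEW",
     "event_timestamp": base + 50},
]
process_batch(events, store)
```

Three events come in, but only two land in distinct 40-second windows, so the campaign’s counter ends at 2.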

Back to the problem

On average, a campaign asked for about 15k impressions a day. Now there were two types of campaigns:

The overall effect of this was that the effective utilisation rate of the ads was ~40%, when it should have been at least 85%.

To fix the lag

The average number of impressions a campaign could receive in 1 minute: 15,000
The average number of impressions a campaign purchased for a day: 16,000

So, roughly the same numbers. What it means is that you could burn through all of a campaign’s impressions in a single minute. This would be the worst case; there were other constraints that decided which campaign to serve, but we’re not talking about them here.
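To make the mismatch concrete, here is the arithmetic (numbers from above; the variable names are mine):

```python
daily_budget = 16_000      # impressions a campaign purchased for a day
peak_rate = 15_000         # impressions it could receive in one minute
minutes_per_day = 24 * 60

# Spread evenly across the day, the campaign only needs ~11 impressions/minute
even_pace = daily_budget / minutes_per_day

# At the peak rate, the whole daily budget burns in just over a minute
worst_case_minutes = daily_budget / peak_rate
```

So the system can deliver in about one minute what a campaign wants spread over a whole day, which is why pacing matters.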

Is your cron running frequently enough?

Is the TTL long enough?

Get rid of the TTL, make counters idempotent - HyperLogLog

“Unique items can be difficult to count. Usually this means storing every unique item then recalling this information somehow. With Redis, this can be accomplished by using a set and a single command, however both the storage and time complexity of this with very large sets is prohibitive. HyperLogLog provides a probabilistic alternative.”

from https://redis.com/redis-best-practices/counting/hyperloglog/
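To illustrate why HyperLogLog makes the counter idempotent, here is a toy implementation of the idea; it is my own sketch, not what Redis does internally (Redis’s PFADD/PFCOUNT use 2^14 registers with sparse encoding and bias correction). The key property is that re-adding an element never changes a register, so counting the same impression twice has no effect:

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: estimates distinct-count from 2^p small registers."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p                       # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h & (self.m - 1)                # low p bits pick a register
        rest = h >> self.p                    # remaining 64 - p bits
        # rank = 1-based position of the first set bit in `rest`
        rank = (64 - self.p) - rest.bit_length() + 1
        # max() makes add() idempotent: re-adding never changes a register
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:     # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}:campaign-42")
estimate = hll.count()                        # close to 10,000, within a few %
```

With 1,024 registers the standard error is about 3%, using a few kilobytes instead of storing every dedup key, and there is no TTL to tune: the structure simply never double-counts.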

Summary and conclusion