14 September 2020

Taboola is one of the largest content recommendation companies in the world. We maintain hundreds of servers in multiple data centers around the world while being bound by strict SLAs, so you can understand why our engineers would appreciate a little heads-up when the system gets overloaded. Like most companies today, we use metrics to visualize our services’ health, and our challenge is to create an automatic system that detects issues in multiple metrics as soon as possible, without any performance impact.

A real-life example

Wouldn’t it be nice if we could predict the impact on our response time metric when major events are about to happen? Examples include “Black Friday”, “Cyber Monday”, or even the Kobe Bryant tragedy on 26/1/2020, as can be seen below:

Figure 1: Kobe Bryant’s downtime 26/1/20

And yes, the gap with no metrics around 26/1 is the downtime […]

7 December 2019

In this blog post I will describe how we, at Taboola, changed our metrics infrastructure twice as a result of continuous growth in metrics volume. In the past two years, we moved from supporting 20 million metrics/min with Graphite, to 80 million metrics/min using Metrictank, and finally to a framework that will enable us to grow to over 100 million metrics/min, with Prometheus and Thanos.

The journey to scale begins

Taboola is constantly growing. The number of our publishers and advertisers increases exponentially, and our data grows with it, leading to constant growth in metrics volume. We started with a basic metrics configuration of Graphite servers. We used a Graphite Reporter component to take a snapshot of metrics from MetricRegistry (a metrics collection class from the third-party Dropwizard library that we used) every minute and send them in batches to RabbitMQ for the carbon-relays to consume. The carbons are part of Graphite’s backend, and are […]
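
The snapshot-and-batch step described above can be sketched roughly as follows. This is a minimal Python illustration, not Taboola's actual reporter (which was built on the Java Dropwizard library). It assumes metrics are serialized in Graphite's plaintext format, `<path> <value> <timestamp>`, and that a caller-supplied `publish` function stands in for the RabbitMQ producer. All function names, the sample metric paths, and the batch size are hypothetical.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional


def snapshot_to_lines(snapshot: Dict[str, float], timestamp: int) -> List[str]:
    """Format a metrics snapshot as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    return [f"{name} {value} {timestamp}" for name, value in sorted(snapshot.items())]


def batches(lines: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size lines."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def report_once(
    snapshot: Dict[str, float],
    publish: Callable[[str], None],  # hypothetical stand-in for a RabbitMQ producer
    batch_size: int = 2,
    now: Optional[int] = None,
) -> int:
    """Send one snapshot as newline-delimited batches; return the number of batches sent."""
    timestamp = int(time.time()) if now is None else now
    sent = 0
    for batch in batches(snapshot_to_lines(snapshot, timestamp), batch_size):
        publish("\n".join(batch) + "\n")
        sent += 1
    return sent


if __name__ == "__main__":
    # Hypothetical metric names; a real reporter would read them from the registry
    # and call report_once on a one-minute schedule.
    sample = {"app.requests.count": 120, "app.latency.p99": 35.2, "app.errors.count": 3}
    report_once(sample, publish=print, batch_size=2)
```

Batching keeps the number of messages per reporting interval small relative to the number of metrics, which matters at tens of millions of metrics per minute.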

