Big Data Archives

Engineering

22 March 2022

How to Kill Underperforming Features and Why You Should Do It Now

Why is it important to remove underperforming features to improve the product’s key metrics? Find out here.

Big Data

17 February 2022

Attentive Audiences: How Taboola Increased Advertiser Success by Reducing Median CPA by 34% and Increasing Median CVR by 87%

Is there a way to lower Cost Per Action or increase Conversion Rate? There is! Taboola created the Attentive Audience program. Learn more about it here.

Engineering

3 February 2022

Dynamic Security Operations in Kubernetes

We wanted to see if there was a way we could sync our Kubernetes NetworkPolicies dynamically with tools we already use, like Consul and Calico.

Engineering

24 January 2022

How Taboola Powers the Conversion Data Pipe

Read this article to learn more about what conversions are, how Taboola handles billions of daily events at scale, and how it all presents meaningful data to customers.

Engineering

24 December 2021

Sneaky Peak to The Secrets of Kafka Assignment Strategy

Kafka is an open-source distributed event streaming platform. Something went wrong while we were working with it; let’s see how the problem was investigated and resolved.

Big Data

3 December 2021

Our Deployment Process Pillars: How Taboola Deploys & Manages the Servers that Bring You Recommendations

Find out the secrets to how Taboola deploys and manages the thousands of servers that bring you recommendations every day.

Big Data

7 October 2021

Samplex: Scale Up Your Spark Jobs

In this article you will learn what Samplex is and how it is used to make processing of large raw datasets more efficient.

Big Data

27 May 2021

ScORe – Schema On Read for Spark SQL

The world is not flat, it’s highly nested

With over 4 billion page views per day and over 100TB of data collected daily, scale at Taboola is no joke. Our primary data pipe deals with masses of data and endless read paths. Could we optimize our schema for all these read paths? Guess not… Our schema is HUGE and highly nested. After digesting the data, we keep it in hourly Parquet files on HDFS, where each hour consists of about 1-1.5TB of compressed data. Our schema roughly looks like this:

root
|– userSession: struct
|  |– maskedIp: long
|  |– geo: struct
|  |  |– country: string
|  |  |– region: string
|  |  |– city: string
|  |– pageViews: array
|  |  |– element: struct
|  |  |  |– url: string
|  |  |  |– referrer: string
|  |  |  |– widgets: array
|  |  |  |  |– element: […]
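To make the idea of "read paths" over a nested schema concrete, here is a minimal Python sketch that lists the leaf columns of the visible part of the schema above. The nested-dict encoding, the `SCHEMA` name, and the `leaf_paths` helper are assumptions made for illustration; they are not taken from the ScORe article, and the `widgets` element type stays elided because the excerpt cuts off there.

```python
# Hypothetical encoding of the visible part of the schema:
# structs are dicts, arrays are one-element lists holding the element type.
SCHEMA = {
    "userSession": {
        "maskedIp": "long",
        "geo": {"country": "string", "region": "string", "city": "string"},
        "pageViews": [
            {
                "url": "string",
                "referrer": "string",
                "widgets": ["..."],  # element type is elided in the excerpt
            }
        ],
    }
}


def leaf_paths(node, prefix=""):
    """Return the dotted path of every leaf column under `node`."""
    if isinstance(node, dict):
        paths = []
        for name, child in node.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            paths.extend(leaf_paths(child, child_prefix))
        return paths
    if isinstance(node, list):
        # Array: descend into the element type without extending the path.
        return leaf_paths(node[0], prefix)
    return [prefix]


for path in leaf_paths(SCHEMA):
    print(path)
```

Each query touches only some of these paths, which is why a single layout tuned for every read path is hard to find once the schema grows large.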

Big Data

25 March 2021

The Challenges Of Uploading 150TB/day From Spark To BigQuery – Part 1

Have you ever tried building an infrastructure to upload 150TB a day? Have you ever tried querying over 13PB without going bankrupt? These are some of the scale challenges of Taboola’s PV2Google (pageviews to Google) service that we deal with day to day. In this blog series, we’ll share how we do it, and the challenges we face. In this article (part 1) we’ll focus on the architecture. Part 2 covers the lessons we’ve learned over the years.

Hello, Pageviews!

Taboola’s goal is to power recommendations for publishers and advertisers. Our platform serves over 360 billion content recommendations and processes over two billion pageviews a day. A pageview is a record describing recommendations, user activity (such as a click), and much more on a user’s visit to a webpage. Currently, the pageview record has about 1,000 fields. Two billion pageviews generate a huge amount of data. This data is processed […]

Big Data

25 March 2021

The Challenges Of Uploading 150TB/day From Spark To BigQuery – Part 2

In part 1 of the series we shared the architecture of Taboola’s PV2Google service which uploads over 150TB/day to BigQuery. In this article (part 2), we’ll share the challenges and lessons we’ve learned over the course of a few years.

Lesson 1: queries might be (extremely) expensive

We continuously upload pageviews to BigQuery and keep them for six months. This translates to over 13PB of pageviews in BigQuery. Querying the entire dataset would be extremely expensive, about $65K/query (assuming $5/TB). We apply a few methods and guidelines to substantially reduce this cost:

Never use `SELECT *`: BigQuery’s query cost is based on the size of the data scanned. Most queries actually need only a few fields. Hence, selecting only the relevant fields will dramatically reduce the cost of the query.

Cluster tables: clustering is a neat BigQuery feature that reduces the scanned row count. With clustering, BigQuery optimizes the data […]
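The $65K figure can be checked with a few lines of Python. This is a back-of-the-envelope sketch only: it assumes decimal units (1PB = 1,000TB), the flat $5/TB rate quoted in the text, and a hypothetical query that scans 1% of the stored bytes by selecting a few fields. The function and constant names are made up for the example.

```python
TB_PER_PB = 1000            # assumption: decimal units
PRICE_PER_TB_USD = 5.0      # rate quoted in the text


def scan_cost_usd(scanned_tb, price_per_tb=PRICE_PER_TB_USD):
    """Cost of a query billed purely by the amount of data scanned."""
    return scanned_tb * price_per_tb


dataset_tb = 13 * TB_PER_PB  # "over 13PB" of pageviews

# Scanning everything, as SELECT * over the whole dataset would.
full_scan = scan_cost_usd(dataset_tb)
print(f"full scan: ${full_scan:,.0f}")  # full scan: $65,000

# Hypothetical query that reads only 1% of the bytes.
few_fields = scan_cost_usd(dataset_tb * 0.01)
print(f"few fields: ${few_fields:,.0f}")
```

Because the bill scales with bytes scanned, trimming the selected fields cuts the cost roughly in proportion to the bytes no longer read.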
