All Stories

27 May 2021

The world is not flat, it’s highly nested With over 4 billion page views per day and over 100TB of data collected daily, scale at Taboola is no joke. Our primary data pipe deals with masses of data and endless read paths. Could we optimize our schema for all these read paths? Guess not… Our schema is HUGE and highly nested. After digesting the data, we keep it in hourly Parquet files on HDFS, where each hour consists of about 1-1.5TB of compressed data. Our schema roughly looks like this: root |– userSession: struct | |– maskedIp: long | |– geo: struct | | |– country: string | | |– region: string | | |– city: string | |– pageViews: array | | |– element: struct | | | |– url: string | | | |– referrer: string | | | |– widgets: array | | | | |– element: […]

Start Your Taboola Career Today!