Intro

At Taboola we use Spark extensively throughout the pipeline. Regularly faced with Spark-related scalability challenges, we look for optimisations in order to squeeze the most out of the library. Often, the problems we encounter are related to shuffles. In this post we will present a technique we discovered which gave us up to an 8x boost in performance for jobs with huge data shuffles.

Shuffles

Shuffling is the process of redistributing data across partitions (aka repartitioning), which may or may not move data across JVM processes or even over the wire (between executors on separate machines). Shuffles, despite their drawbacks, are sometimes inevitable. In our case, here are some of the problems we faced:

- Performance hit – Jobs run longer because shuffles use network and IO resources intensively.
- Cluster stability – Heavy shuffles fill scratch disks of cluster machines. This affects other jobs on the same cluster, since […]
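As a rough illustration of the repartitioning described in the shuffle excerpt above, here is a minimal plain-Python sketch (not Spark code, and not the technique the post goes on to present). The function name `hash_repartition`, the sample records, and the modulo-based partitioner are assumptions made for this example. The sketch counts how many records end up in a different partition than they started in, which is the data that a real shuffle would have to move.

```python
def hash_repartition(partitions, num_partitions, key_fn):
    """Redistribute records so that records with the same key share a partition.

    Returns the new partitions and the number of records that changed partition.
    """
    new_partitions = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, partition in enumerate(partitions):
        for record in partition:
            # Hypothetical partitioner: integer key modulo the partition count.
            target_index = key_fn(record) % num_partitions
            new_partitions[target_index].append(record)
            if target_index != source_index:
                moved += 1
    return new_partitions, moved


# Hypothetical input: (user_id, event) records spread over two partitions.
partitions = [
    [(1, "click"), (2, "view"), (3, "click")],
    [(1, "view"), (2, "click"), (4, "view")],
]

shuffled, moved = hash_repartition(partitions, 2, key_fn=lambda record: record[0])
print(shuffled)
print(f"{moved} of 6 records changed partition")
```

In this toy input, most records change partition even though the data set is tiny, which hints at why large shuffles are heavy on network and disk.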
We all have these amazing machines in our development and testing labs, and we know that our real users do not share this wonderful world. They experience our products very differently from us. These differences result in two major challenges:

- We do not know what the users experience
- We cannot debug their machines

As a Video Advertisement Player team, these challenges are multiplied. Why? Our product is a third-party script that serves other third-party scripts for websites.

Your code runs on different platforms

As a third-party web product, you do not know which websites your code runs on. Websites have a variety of frameworks, architectures and styles.

- Frameworks – change the browser’s core behavior, for example by redefining methods, which challenges the product’s basic behavior.
- Architectures – affect the website’s performance, which impacts the product’s natural flow.
- Styles – manipulate the product’s look and feel.

Running […]
Prioritizing Kafka Topic Consumption: How I Developed a Mechanism to Optimize Message Handling. Discover how to handle messages efficiently.
Large production pipelines in TensorFlow are quite difficult to pull off. Training small models is easy, and we mostly do this at first, but as soon as we get to the rest of the pipeline, complexity rapidly mounts. One reason is that the “Computation Graph” abstraction used by TensorFlow is a close, but not exact, match for the ML model we expect to train and use. How so? Typically a model will be used in at least three ways:

- Training – finding the correct weights or parameters for the model given some training data. Often done periodically as new data arrives.
- Evaluation – calculating various metrics during training on a different data set to evaluate training quality or for cross validation.
- Serving – on-demand prediction for new data.

There could be more modes. For example, we could re-train an existing model or apply the model to a large amount of […]
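To make the three modes listed above concrete, here is a minimal plain-Python sketch. It deliberately avoids TensorFlow and is not the approach the post describes. The one-parameter model `y = weight * x`, the function names, and the sample data are all assumptions made for illustration. The point is only that training, evaluation, and serving reuse the same prediction logic while wrapping it differently.

```python
def predict(weight, x):
    """Shared model logic: a one-parameter linear model (hypothetical)."""
    return weight * x


def train(data, weight=0.0, learning_rate=0.01, epochs=200):
    """Training mode: fit the weight with stochastic gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            error = predict(weight, x) - y
            weight -= learning_rate * error * x
    return weight


def evaluate(weight, data):
    """Evaluation mode: mean squared error on a separate data set."""
    return sum((predict(weight, x) - y) ** 2 for x, y in data) / len(data)


def serve(weight, x):
    """Serving mode: on-demand prediction for a single new input."""
    return predict(weight, x)


# Hypothetical data following y = 2 * x.
training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
held_out_data = [(4.0, 8.0), (5.0, 10.0)]

weight = train(training_data)
mse = evaluate(weight, held_out_data)
prediction = serve(weight, 10.0)
print(weight, mse, prediction)
```

Each mode needs different inputs and produces different outputs (labelled data and updated weights, labelled data and metrics, unlabelled input and a prediction), which is one way to see why a single fixed graph may not fit all of them exactly.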