- Taboola Blog
- Tips and Tricks
This post is not about K8S – nor is it about AWS. It is not about containers – nor is it about some new, “cool” technology for managing large-scale applications. Rather, this post is about how we deploy a highly sophisticated Java service, a heavy service that is very actively developed on a daily basis, to 1000s of servers across our 7 data centers around the world. So what’s the problem? Isn’t it enough to take a list of servers, get the version to deploy and run it with an automation tool like ansible? Well, it’s not as simple as it might seem. This service serves Taboola’s recommendations and responds to hundreds of thousands requests per second. The service has to be fast – so fast that its p95 should be below 500 milliseconds per request. Which means we can’t have any downtime at all, or even afford slower […]
The story starts with metrics. Every mature software company needs to have a metric system to monitor resource utilisation. At some point, we noticed under-utilization of spark executors and thier CPUs. Usually, dynamic allocation is used instead of static resource allocation in order to improve CPU utilisation through sharing. In this blog post, we’ll define the problem, share the goals we worked towards and highlight many technical peculiarities regarding dynamic allocation usage along the way. At Taboola, we use Grafana, Prometheus with a Kafka-based pipeline to collect metrics from several data-centers around the world. Metrics at scale is a very interesting topic and involves multiple problems in itself and we have previously covered these in our blog and meetup presentations. Our data platform comprises several services that compute data projections and, importantly, those are long-running processes with long-living spark context. Periodically, when triggered, these services process new chunks of data, […]
In Taboola, we deal with scale, huge scale. A small issue might turn into a disaster in a matter of hours. Re-writing and replacing an existing service with a new one is a real challenge, moreover doing it without causing downtime is SCARY. Reading logs is not an option. Logs are gigantic, unwieldy and span over many machines. It would take hours to combine and analyze them. In this post I will share with you three graphs in Grafana that I think are a must for observing new code. Let’s start… Did I break production? You write your shiny code, you (even) test it, but, how would you verify that you didn’t break the production environment? Luckily, we use Grafana, and this actually makes a big difference. My plan was to compare old code vs. new in Grafana, but, where to start? You have Grafana… let’s use it! Frankly, I […]
At Taboola, we work daily on improving our Deep-Learning-based content-recommendation model. We use it to suggest personalized news articles and ads to hundreds of millions users a day, so naturally we must stick to state-of-the-art deep learning modeling methods. But our job doesn’t end there – analyzing our results is a must too, and then we sometimes return to our data science roots and apply some very basic techniques. Let’s lay such a problem out. We are investigating a deep model that behaves rather strangely: it wins over our default model for what looks like a random group of advertisers, and loses for another group. This behavior is stable in the day to day, so it looks like there might be some inherent advertisers qualities (what we’ll call – campaign features) to blame for this. You can see a typical model behavior for 4 campaigns below. So we hypothesize that […]
Sometimes we need to test urgent features fast. It has to be within a very short timeframe, when there is not enough time to run a full test plan for that feature. This might occur on different occasions. When not having enough manpower in QA to cover a full test plan for a feature. New special demands from an important client right before the release deadline. Product management needs new adjustments before the developer deploying a new product version. It can also happen when a client, team lead or PM wants a new feature and it should have been done YESTERDAY! It can also happen actively. Running every once in a while a wide post-production test, or dedicating limited time for a bug hunt. We at the Taboola Video Solution department call it “Search for a Bug Thursday”. This unplanned development might end up launching a “half baked” product. It […]
At Taboola, our goal is to predict whether users will click on the ads we present to them. Our models use all kinds of features, yet the most interesting ones tend to be related to the users’ history. Understanding how to use these features well can have a huge impact on the model’s personalization capabilities, due to the user-specific knowledge they hold. User history features vary strongly between different users; for example, one popular feature is user categories – the topics a user had previously read. An example for such a list might look like this – {“sports”, “business”, “news”}. Each value in these lists is categorical and they have multiple entries, so we name them Multi-Categorical features. Multi-Categorical lists can have any number of values per user – which means our model must handle both very long lists and completely empty lists (for new users). Supplying inputs of unknown length […]
Synchronized Clocks: Someone has to be blamed… I blame Ariel. He took one look at the five analog wall clocks set to timezones for various offices and said “I hate that”. I looked to see why. Yes, all the second has were wildly different and the minute hands were all slightly off too. I’d been thinking about what my next electronics project would be. “How hard could this be?” Goals Thinking about what I’d want this to do I came up with a few goals: Correct and accurate time via NTP (Network Time Protocol) No two clocks should differ more than 100ms, they should all appear to tick at the same instant. Battery operation. About one year between battery changes. Automatic Daylight Savings Time adjustments Remember clock position/state on power loss Inexpensive – less than $20 in parts, not including the clock & batteries. Initial Research Started with […]
A few years ago, one of my friends suggested me to become a cybersecurity teacher in high school once a week as part of a program called Gvahim. I have not planned that it will contribute to my professional career, but I find a lot of analogies to my day to day role. I hope you will enjoy a different angle of management 101 guidelines. Program overview The program’s goals were to increase the knowledge of high school students in cybersecurity and increase the number of girls who study computer science. For three years in the program, students studied about Assembly, networks and operating systems, with an emphasis on security. Unlike traditional materials learned in high school, the lessons in the program put an emphasis on self-learning. The first two semesters were dedicated to learning the theoretical background using self-reading and small coding exercises. The last semester of the […]
In the following article, I describe how we came up with a way to improve the chances that our SDK library gets smoothly integrated in our customers’ Applications and reduce issues when going to production. The main idea is to take a number of significant clients’ applications and replace your existing SDK code with a new code, allowing you to see how the apps perform before you release a new SDK version. Why releasing a reliable SDK is so important Developing an SDK for mobile apps is very different from developing a standalone app. You can think of an SDK as a guest in someone else’s house. You need to behave, you can’t put your legs on the table or wipe your hands on the sofa (well, in most countries you can’t). So what I mean is that you can’t interfere with the app’s normal behavior, break some flows or […]
Delivering good product to live environment requires big effort from R&D. Under the software development life cycle, we can find 6 basic phases: Understanding the requirements, design, coding, testing, deployment (incl. A/B test, if necessary) and maintenance. But how can we measure product quality? By its stability? Scalability? Easy to maintain? Bug free code? There are probably many definitions for what is a good product, but in my opinion, the two foundation stones are product behavior & functionality as defined (be aligned with the product manager’s requirements), and zero critical bugs. The product can serve many goals, but if it doesn’t achieve the main one, it might not have a reason to exist. Naturally, customers are always expecting high quality from the product, so before releasing it to production QA should make sure that indeed critical bugs don’t exist. In order to respect these two, both R&D and QA should […]