During the pandemic, most companies quickly adapted and moved to a work-from-home model, as a sudden necessity of the lockdown restrictions introduced by efforts to combat the spread of COVID-19.
By Ariel Pisetzky and Tarek Shama Taken for Granted Taken for granted. That’s the way most users and even techies think of DNS. Or more precisely, they just don’t think of it at all. DNS is one of those things that for most users is a solved problem. You have a server with very reliable and stable software that can run for a very long time with little maintenance. The resolvers even have a nice built in failover mechanism for a secondary server. So, what more is there to say about this subject? Performance. Performance with DNS services has been seen as a geographical issue for years now. Yes, there have been paid DNS services that are faster at the DNS search level itself, especially if you are talking about complex records that have logic attached to them. Yet, the popular discussion is mostly around the global DNS providers. In […]
If you fall, fall right – a tale of SRE critical incident management By Yehuda Levi, Tal Valani, Ariel Pisetzky & Eli Azulai Imagine this scenario – your data center is down. 1500 servers are down. Each server needs to be handled and monitored and the responsibility for each should be divided between all teammates. Each team is looking after the status of their services. Client facing services are impacted. New information keeps flowing in from different channels and the status of the outage and servers keep changing. How to get the list of the server affected? How to put it all in one place? How to assign responsibility for each? What is the status of each server? How can the internal clients receive ongoing status updates? What happens if a server was intentionally down before the incident? What happens if a more complex issue occurs and the time to […]
Failure. I need to talk about failure, and not any failure, my failure. I need to share it with everyone in the production group, everyone in R&D. My team, my peers, my managers. The meeting will start in just a few minutes and I am under fire to explain what went wrong, how I failed the organization and how we need to be better. General George S. Patton Jr. said “The test of success is not what you do when you are on top. Success is how high you bounce when you hit the bottom.” There is a lot to learn from that saying, and not only for people. Successful systems need to bounce back from a failure and do it well. IT systems need to be able to endure a catastrophic event and just dust it off. This is what we expect of our production systems in Taboola, […]
So, the firefighters are in your data center, there is no electricity, and the pager is more like a DDoS attack on your phone than anything informative. You look at your watch, multiple thoughts running through your head. Why me? Why now? What was the last DR test result? How do you pull the team out and through this IT catastrophe and survive to write about it? This is my story, my personal fight with the IT “Murphy laws” and how we can all benefit from it. It was a Friday, one you know you need to be extra careful with. It’s always the end of the work week or smack in the middle of the night. (No IT catastrophe ever happens when it’s convenient to you, now does it? They always cluster and bunch around the most difficult times.) Anyway, it’s the end of the day Friday and multiple […]
MySQL Slave Replication Optimization Written by Yossi Kalif & Ariel Pisetzky MySQL in Taboola So you love MySQL – what do you know, so do we here at Taboola. We spend a lot of our time with MySQL building our infrastructure to provide over 30 billion recommendations a day on over 3 billion web pages. In this blog post we would like to share how we optimized our MySQL to replicate faster over WAN connections so that we would not need to wake up at night and fix things. Oh, and it also helped us speed things up, so when we do have issues, they resolve faster. So, if your infrastructure has MySQL and you have replication, this blog post is for you. TL;DR – at the bottom of the post Taboola operates in multiple data centers around the world, and at the time of the post, we have […]
As VP of IT at Taboola, my teams and I are overwhelmed with logs, pinned down by the rate and volume of them. The job of the Production Site Reliability Engineering (SRE) team in Taboola is to keep the technology running smoothly and bring in as many insights as we can from the system, making sure that any and every technical issue (that isn’t self healing or contained) is dealt with quickly. We also support this torrent of incoming data to make sure that any insights that can be gleaned from this data are found. With over 1B users discovering what’s interesting and new through the Taboola Feed, we can’t drop the ball or stop thinking about our logs, log management, where to store them and how to process them. This is the challenge of processing over one million lines of logs every second. To address this challenge, we engaged […]
There are few events that impact global internet traffic like the World Cup. People stopped what they were doing for it—here’s what happened.
Synthetic monitoring is something that we all do. It’s almost something that you don’t think about. You set up a monitor and it just tells you if the service is up or down, most times with just a simple GET. There are the giants in this field (lately consolidated under the Keynote brand as part of AppDynamics) and the new comers like Catchpoint, ThousandEyes, Pingdom (now part of SolarWinds) and WorldPing. All solutions have the same basic concept, pull website information from different agents around the world and provide visibility for the web site operator on uptime, response times and other metrics. But what happens with you have a failure, and no alert? These tools have become so widespread and have such long usage history, that it almost seems pointless to compare. This is a solved problem, no? Just take the cheapest one out there and you’re done. Here at […]