About 8 months ago my team and I were facing the challenge of building our first Deep Learning infrastructure. One of my team members (a brilliant data scientist) was working on a prototype for our first deep model. The time arrived to move forward to production. I was honored to lead this effort.

Our achievements: we built an infrastructure that ranks over 600K items/sec, and our deep models have beaten the previous models by a large margin. This pioneering project has led the way for the subsequent Deep Learning projects at Taboola.

So the prototype was ready, and I was wondering: how do we go from a messy script to a production-ready framework? If you are setting up a deep model pipeline, this post is for you.

This blog post focuses on the training infrastructure; the inference infrastructure is not covered.


This post assumes you have basic knowledge of:

  • Python
  • Training Machine Learning models
  • TensorFlow
  • Docker

Let’s take one step back and think. Ideally, such a training infrastructure should support:

  • Comfortable development environment
  • Portability among environments
  • Run on CPU/GPU
  • Simple & Flexible configuration
  • Support Multiple Network Architectures
  • Packaging the model
  • Hyper Parameter Tuning

One mantra to keep in mind:

‘Speedy Gonzales Time to Market’ == ‘no over engineering!’ == ‘Keep it simple’

Enough with the baloney, let’s get down to business:

Technology stack

BigQuery, Google Storage (GS), Python, Pandas, TensorFlow, Docker, and some other nice Python libraries: fabric, docker-py, pyformance & spacy

Comfortable development environment

This makes the whole team move faster, plus putting a new team member into action is super easy. To achieve this we did 2 things:

The first was using a small dataset for sanity training: a JSON file with 2K rows. It is used both for testing on a developer machine and for automated sanity tests, which makes it quick to test your code changes.

The second is a project management script (just one big Python script).

We implemented a simple command line tool using ‘fabric’ (a Python library). Here are a few example commands:

$ fab --list

Available commands:

    init_env      initializes Python virtual environment & installs 3rd party
    test          runs unit tests
    train         runs sanity training
    train_all     runs sanity training on all model variants
    docker_build  builds snapshot Docker image
    docker_test   same as test, only from inside the Docker image
    docker_train  same as train, only from inside the Docker image

The docker-py library was very useful for the Docker-related commands.

CI tasks were also added to this script (used by Jenkins build agent).

High level flow

[Figure: high level flow]

Dataset is generated from raw data tables in BigQuery and stored as a JSON file in GS.

Then the training process kicks in: download the dataset, train the model, and export the model back to GS (to some other path).

Fit dataset into memory

Not always possible, but it was for us. Our dataset contains about 5M rows (a JSON file of roughly 4GB). The JSON file is loaded into a pandas DataFrame for pre-processing, then split into train & test. This approach is simple to implement and keeps the training run time short. If the dataset does not fit into memory, a quick (maybe temporary) solution could be subsampling it.
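To make this step more concrete, here is a minimal sketch of loading, subsampling and splitting with pandas. It is not our production code: the newline-delimited JSON layout, the `load_and_split` helper and all of its parameters are assumptions made for the example. The file is read in chunks so that subsampling happens before the full dataset is held in memory.

```python
import pandas as pd


def load_and_split(path, subsample_ratio=1.0, test_ratio=0.2,
                   chunk_rows=100000, seed=42):
    """Load a JSON-lines file, subsample it, and split into train/test."""
    # Assumption: one JSON object per line, which lets pandas read in chunks.
    chunks = pd.read_json(path, lines=True, chunksize=chunk_rows)

    # Keep a random fraction of every chunk (a different seed per chunk).
    sampled = [
        chunk.sample(frac=subsample_ratio, random_state=seed + i)
        for i, chunk in enumerate(chunks)
    ]
    df = pd.concat(sampled, ignore_index=True)

    # Shuffle the rows, then cut the frame into test and train parts.
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(df) * test_ratio)
    return df.iloc[n_test:], df.iloc[:n_test]
```

Calling `load_and_split("dataset.json", subsample_ratio=0.3)` would return a train and a test DataFrame built from roughly 30% of the rows; the file name is hypothetical.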

Portability among environments

No more “It works on my machine”! By using Docker, training runs the same on the developer machine, automation and cloud environments. Training input & output is stored in GS, no matter where the training process ran.

Running on CPU/GPU

  • Install both TensorFlow GPU and CPU versions (in the project management script: use the GPU for production training and the CPU for development and testing)
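  • Use an NVIDIA Docker base image. Note that the CUDA version should match the TensorFlow version (see more instructions on TensorFlow’s website)
FROM nvidia/cuda:8.0-cudnn5-devel

# Assumption: MY_TF_VER is passed at build time (docker build --build-arg MY_TF_VER=<version>)
ARG MY_TF_VER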

RUN pip install virtualenv==15.0.3

RUN virtualenv --system-site-packages /venv-cpu
RUN virtualenv --system-site-packages /venv-gpu

RUN chmod +x /venv-cpu/bin/activate
RUN chmod +x /venv-gpu/bin/activate

COPY ./requirements.txt /tmp/requirements.txt

RUN /venv-cpu/bin/pip install -r /tmp/requirements.txt
RUN /venv-gpu/bin/pip install -r /tmp/requirements.txt

RUN /venv-gpu/bin/pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-$MY_TF_VER-cp27-none-linux_x86_64.whl
RUN /venv-cpu/bin/pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-$MY_TF_VER-cp27-none-linux_x86_64.whl
  • Use the nvidia-docker command line tool, a thin wrapper around the Docker command line that enables the Docker container to use the GPU

Simple & flexible configuration

Flexible for research, live models & hyper parameter tuning. Configuration should be simple to reason about. I have seen too many configuration mechanisms get out of hand. I think we found a fair trade-off between flexibility and simplicity:

First, no hierarchies! (keep it as simple as possible).

For example:

Hierarchies are Bad:

   -- pre_processing:
      -- subsample_ratio = 0.3
   -- training_loop:
      -- batch_size = 124
      -- epochs = 100
   -- hyper_param:
      -- learning_rate = 0.1

Flat is Good:

-- subsample_ratio = 0.3
-- batch_size = 124
-- epochs = 100
-- learning_rate = 0.1

Config values are set in 3 levels:

Level 1 – NetConfig class:

class NetConfig:

	def __init__(self,


As you can see, default values are hard coded as part of the configuration class.

Level 2 – variant_config.json:


	"learning_rate" : 0.001



	"graph": "deep_v2",

	"batch_size": 124


Here we define our model variants. Each variant overrides only the non-default configuration values.

The configuration file is held with the code, and baked into the Docker image. A new Docker image is built for every configuration change.

Level 3 – environment variable:

nvidia-docker run  -e batch_size=64 ….

Using an environment variable enables overriding a specific configuration value for a single training run. We use it for research and hyper parameter tuning.
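The three levels can be sketched as follows. This is an illustration under stated assumptions rather than our actual implementation: the default values, the `load_config` helper, and the layout of variant_config.json (a mapping from variant name to its overrides) are all hypothetical.

```python
import json
import os


class NetConfig(object):
    """Level 1: flat config with hard-coded defaults (values are illustrative)."""

    def __init__(self, graph="deep_v1", batch_size=128, epochs=100,
                 learning_rate=0.1, subsample_ratio=1.0):
        self.graph = graph
        self.batch_size = batch_size
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.subsample_ratio = subsample_ratio


def load_config(variant, variants_path="variant_config.json", environ=None):
    """Resolve a config: defaults, then variant overrides, then env overrides."""
    environ = os.environ if environ is None else environ
    config = NetConfig()

    # Level 2: assumed file layout is {"<variant name>": {"<key>": value}}.
    with open(variants_path) as f:
        overrides = json.load(f).get(variant, {})
    for key, value in overrides.items():
        if not hasattr(config, key):
            raise KeyError("Unknown config key: %s" % key)
        setattr(config, key, value)

    # Level 3: environment variables arrive as strings, so cast each one to
    # the type of the value it replaces (int, float or str).
    for key, current in list(vars(config).items()):
        if key in environ:
            setattr(config, key, type(current)(environ[key]))
    return config
```

Because the configuration is flat, each level is just a set of key/value overrides applied on top of the previous one, which keeps the precedence rule easy to reason about.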

Support Multiple Network Architectures

[Figure: training flow]

Supporting multiple network architectures is a key concept for us. It enables many team members to work in parallel on different architectures (and set A/B tests on live traffic).

So how do we support multiple network architectures? Let’s take a closer look at the ‘build network graph’ step:

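def build_model_graph(net_config, dataset_shapes, tf_session):
    # Dispatch on the 'graph' config value to the matching graph builder.
    if net_config.graph == 'deep_v1':
        return build_graph_deep_v1(net_config, dataset_shapes, tf_session)
    elif net_config.graph == 'rnn_v1':
        return build_graph_rnn_v1(net_config, dataset_shapes, tf_session)
    # Fail loudly instead of silently returning None for an unknown graph name.
    raise ValueError('Unknown graph: %s' % net_config.graph)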

‘build_graph_deep_v1’ & ‘build_graph_rnn_v1’ are just static methods that contain the TensorFlow code for defining the graph.

‘build_model_graph’ returns a NetOps instance, holding the TensorFlow ops for the training loop:

class NetOps:
    def __init__(self,

Note that ‘build_model_graph’ has 3 inputs: net_config (discussed earlier), dataset_shapes and tf_session. The dataset_shapes object is created during the preprocessing step. It contains feature vector information that is used to build the network graph. For instance, the cardinality of a categorical feature is used to define the size of the embedding lookup table.

class DatasetShapes:
	def __init__(self,

The tf_session is just an instance of the tf.Session class.
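To illustrate how dataset_shapes feeds the graph builders, here is a small sketch that runs without TensorFlow. Everything in it is an assumption made for the example: the field names (`categorical_cardinality`, `embedding_dim`), the stub builder that only reports shapes, and the dictionary-based dispatch, which is an alternative to the if/elif chain shown earlier.

```python
from collections import namedtuple

# Hypothetical stand-ins for the real classes, reduced to what the sketch needs.
NetConfig = namedtuple("NetConfig", ["graph", "embedding_dim"])
DatasetShapes = namedtuple("DatasetShapes", ["categorical_cardinality"])


def embedding_table_shapes(net_config, dataset_shapes):
    """One lookup table per categorical feature: (cardinality, embedding_dim)."""
    return {
        feature: (cardinality, net_config.embedding_dim)
        for feature, cardinality in dataset_shapes.categorical_cardinality.items()
    }


def build_graph_deep_v1(net_config, dataset_shapes, tf_session):
    # The real function would create TensorFlow variables and ops here;
    # this stub only reports the embedding table shapes it would allocate.
    return {"graph": "deep_v1",
            "embeddings": embedding_table_shapes(net_config, dataset_shapes)}


# Dict-based dispatch: adding an architecture means adding one entry.
GRAPH_BUILDERS = {"deep_v1": build_graph_deep_v1}


def build_model_graph(net_config, dataset_shapes, tf_session):
    if net_config.graph not in GRAPH_BUILDERS:
        raise ValueError("Unknown graph: %s" % net_config.graph)
    builder = GRAPH_BUILDERS[net_config.graph]
    return builder(net_config, dataset_shapes, tf_session)
```

With a categorical feature of cardinality 5000 and an embedding dimension of 16, the stub reports a lookup table of shape (5000, 16), which is the shape the real builder would pass to TensorFlow.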

Packaging the Model

Packaging the model’s TensorFlow files inside a zip, together with some extra files, helps with debugging and monitoring the training process. What extra files did we store?

  • TensorBoard files
  • Graph image of train & test offline metrics
  • Offline metrics CSV:
      > a row for each epoch
      > all kinds of offline metrics, such as train & test error & loss
  • De facto configuration – the configuration actually used to train the model
  • Dataset metadata – for example dataset size
  • Other things you may find useful for troubleshooting & debugging
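A packaging step along these lines could look like the sketch below (Python 3, standard library only). The `package_model` helper, the file names inside the zip and the shape of the `epoch_metrics` rows are assumptions made for the example, not our actual layout.

```python
import csv
import io
import json
import os
import zipfile


def package_model(model_dir, zip_path, config, epoch_metrics, dataset_metadata):
    """Zip the exported model together with files useful for debugging."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Everything under model_dir (model files, TensorBoard event files).
        for root, _dirs, files in os.walk(model_dir):
            for name in files:
                full_path = os.path.join(root, name)
                rel_path = os.path.relpath(full_path, model_dir)
                zf.write(full_path, os.path.join("model", rel_path))

        # The configuration the model was actually trained with.
        zf.writestr("config.json", json.dumps(config, indent=2, sort_keys=True))
        zf.writestr("dataset_metadata.json",
                    json.dumps(dataset_metadata, indent=2))

        # Offline metrics: one CSV row per epoch.
        if epoch_metrics:
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=list(epoch_metrics[0]))
            writer.writeheader()
            writer.writerows(epoch_metrics)
            zf.writestr("metrics.csv", buf.getvalue())
```

Note that `zip_path` should point outside `model_dir`, otherwise the archive would try to include itself.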

Hyper parameter tuning

Hyper parameter tuning is critical. A big improvement in model performance can be gained by tuning parameters such as batch size, learning rate, embedding dimension, hidden layer depth and width, and so on.

We started by implementing a simple Python script that runs a random search on the hyper parameter space. It runs sequentially using a single GPU. We just ran it for days. To get faster and better results we moved to the cloud, running in parallel on multiple machines & GPUs.
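A sequential random search of this kind can be sketched in a few lines. The search space, the value ranges and the `train_fn` callback (which is assumed to train one model with the given parameters and return its test loss) are hypothetical and only serve the example.

```python
import random

# Hypothetical search space: each entry draws one value from its range.
SEARCH_SPACE = {
    "batch_size": lambda rng: rng.choice([64, 128, 256, 512]),
    # Sample the learning rate on a log scale between 1e-4 and 1e-1.
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "embedding_dim": lambda rng: rng.choice([8, 16, 32, 64]),
    "hidden_layers": lambda rng: rng.randint(1, 4),
}


def random_search(train_fn, n_trials, seed=0):
    """Run n_trials sequentially and return the best (params, loss) pair."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Draw every hyper parameter independently for this trial.
        params = {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        loss = train_fn(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss
```

Since the trials are independent, the same loop parallelizes naturally: each sampled parameter set can be handed to a separate training run, for example through the environment variable overrides described earlier.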

All in all, we had lots of fun working on this project (at least I did). We ended up with a relatively simple and straightforward infrastructure that is used by the whole team and supports between 10 and 15 live A/B tests running in parallel.

That is it! I hope you find it useful.

Special thanks to my team: Yedid, Aviv, Dan, Stavros, Efrat and Yoel.
