The Yhat Blog

machine learning, data science, engineering

How Yhat Does Cloud Balancing: A Case Study

by Ryan J. O'Neil

If you're reading this, you're probably aware that we at Yhat offer a public sandbox where anyone can try out our ScienceOps product without charge. Giving potential customers the chance to try out and evaluate our products is an important tool for generating new business. What may not be obvious are the logistical challenges behind such an offering. This post will address some of those challenges, along with operational modeling techniques we use to solve them. Specifically, we'll show how we keep such a free service running in the cloud at reasonable cost.

First, a bit about ScienceOps

At its core, ScienceOps is a system for running models written in Python or R. These models are defined by the ScienceOps users and, as far as the ScienceOps platform is concerned, consist of arbitrary chunks of code. Each deployed model is run inside its own virtualized container, which takes a finite amount of disk space, memory, and CPU.

In our ScienceOps Cloud offering, we associate users atomically with servers. That means all models for a given user live on the same server. If a model is running, it consumes system resources. If it hasn't been used for over 30 days, it is marked as sleeping, and does not consume resources. Sleeping models can be reawakened at user request.
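As a rough sketch of that sleep rule (the data structure and function names here are illustrative, not ScienceOps internals):

```python
from datetime import datetime, timedelta

SLEEP_AFTER = timedelta(days=30)  # inactivity threshold from the post

def mark_sleeping(models, now=None):
    """Partition models into running and sleeping based on last use.

    `models` maps model name -> datetime of the last request. The
    structure is hypothetical; it just mirrors the rule described above.
    """
    now = now or datetime.utcnow()
    running, sleeping = [], []
    for name, last_used in models.items():
        (sleeping if now - last_used > SLEEP_AFTER else running).append(name)
    return running, sleeping
```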

Through empirical observation, we've established that we can run at most $MAX_{models}$ models on one m3.2xlarge instance in Amazon Web Services, where $MAX_{models}$ is a constant. As new users create accounts in ScienceOps, they are assigned to the least busy server at the time. A new ScienceOps instance is spun up when all the existing servers have $MAX_{models}$ running models.
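A minimal sketch of that assignment policy (the capacity value and naming scheme are placeholders; the real provisioning logic is more involved):

```python
MAX_MODELS = 50  # empirically determined per-server capacity (placeholder)

def assign_user(server_loads):
    """Pick a server for a new user: the least busy existing server,
    or a fresh instance if every server is at capacity.

    `server_loads` maps server name -> count of running models.
    """
    candidates = [s for s, n in server_loads.items() if n < MAX_MODELS]
    if candidates:
        return min(candidates, key=server_loads.get)
    # All servers full: spin up a new instance (naming scheme is made up).
    new_server = 'server%03d' % (len(server_loads) + 1)
    server_loads[new_server] = 0
    return new_server
```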

Many users who sign up for ScienceOps Cloud are transient. They create an account, try out some of the examples in our documentation, and don't log in again. After 30 days, their models are put to sleep and the server capacity they were using becomes available. This results in our running more servers than we actually need. If left unchecked, we can actually end up running orders of magnitude more servers than are required.

Other users are not so transient. They upload models which continue to get requests many months later. These users should continue to experience the same level of service that they've grown to expect.

Let's state our goals

It's apparent that we have an opportunity to shut down cloud instances, but the right path to doing so may not be obvious. Stating our goals up front will help us direct our efforts more effectively.

Goal 1: Reduce the expense of running cloud servers

This means shutting down servers.

There are two challenges here. First, there's an upper bound on the number of models a server can run, so we can't simply move everyone onto one server. Second, and more problematic, each user's data and running models must reside on a single server, so we need to split users across servers intelligently.

Goal 2: Minimize risk and human time due to rebalancing

Once we know the number of servers we need, we want to minimize the number of required migrations. Migrating users means risk and, more importantly, man hours. We'd like to do as few migrations as possible to achieve our target number of servers.

Goal 3: Balance sleeping models across servers

Running models have direct costs in terms of server resources, and they go to sleep after 30 days of inactivity. Sleeping models cost almost nothing, but each one has a small probability of being woken up in the future. We'd like sleeping models distributed roughly evenly across the remaining servers, so that future wake-ups place a similar load on each of them.

Goal programming

We've listed out our goals in order of importance. Unfortunately, they represent very different things. The number of servers, number of migrations, and balance of models are related, but not directly. For instance, there could be any number of ways to migrate users to the minimal number of servers.

To handle this, we adopt a goal programming approach using an integer programming model. This sort of modeling is a bit different from what you've seen on our blog in the past in that it is deterministic. That is, our model assumes that we know the state of the system we are modeling and the outcomes of any actions we take. This is reasonable since we are in control of the servers.
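The goal-programming recipe we follow is lexicographic: optimize the most important goal first, then lock in its optimal value as a constraint while optimizing the next, and so on. Stripped of the integer-programming machinery, the pattern looks like this (a toy brute-force sketch over an explicit list of candidate solutions; names are illustrative):

```python
def lexicographic_optimize(solutions, goals):
    """Narrow a solution set goal by goal.

    `goals` is an ordered list of functions to minimize. After each
    goal we keep only the solutions achieving its optimal value, so
    later goals can never degrade earlier ones.
    """
    for goal in goals:
        best = min(goal(s) for s in solutions)
        solutions = [s for s in solutions if goal(s) == best]
    return solutions

# Toy example: pairs scored by two competing criteria. The first goal
# ties (2, 9) with (2, 3); the second goal breaks the tie.
result = lexicographic_optimize([(2, 9), (2, 3), (3, 1)],
                                [lambda s: s[0], lambda s: s[1]])
```

In the integer-programming version below, "keep only the solutions achieving its optimal value" becomes a constraint added to the model between solves.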

Unfortunately, this simplifying assumption does not make our models easier to solve. Quite the opposite, in fact. The models we'll be solving are actually NP-Hard. But we're not going to let that stand in our way.

When faced with a particularly challenging NP-Hard
problem, first try making yourself look big.

Input data

Before we can structure and code a model for the above goals, we need to decide what our input data will look like. Without going into detail about how we obtain the data, we structure it as shown below. Each server has a list of users currently assigned to it. Each user has a number of running models and a number of sleeping models. You can download a small test file of synthetic data to use with the code in this post here.

    # Server name
    'server001': [
        # (user, running models, sleeping models)
        ['user001', 2,  0],
        ['user002', 0,  4],
        ['user003', 5,  1],
        ['user004', 0, 10],
        ['user005', 8,  1],
        # ...
    ],
    'server002': [
        # ...
    ],
    # ...


If you want to test this code yourself, you need two pieces of software: The GNU Linear Programming Kit (GLPK) and the PuLP Python library. You can install GLPK with apt-get if you're on Linux or brew if you're on a Mac, and use pip to install PuLP.
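For reference, the installation steps look roughly like this (package names vary by platform and may have changed since this post was written):

```shell
# Debian/Ubuntu (the GLPK package name can differ by release)
sudo apt-get install glpk-utils

# macOS with Homebrew
brew install glpk

# PuLP from PyPI
pip install pulp
```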

Goal 1: Minimize required number of servers

We start with a model for our first goal. Its input consists of the existing servers, their assigned users, and counts of running and sleeping models for each user. Its output gives us the minimum number of required servers, denoted $z^*$. Minimizing $z$ is our model's objective.

We define binary variables $y_s \in \{0, 1\}$ for each server $s$. If $y_s = 1$ then we keep server $s$ on. Otherwise, we shut it down. With $y$ in hand, we define our objective function.

$$\min z = \sum_s y_s$$

Next we define binary variables $x_{s,u} \in \{0, 1\}$ for each server $s$ and user $u$. If $x_{s,u} = 1$ then user $u$ should be assigned to server $s$. This means we move user $u$ to $s$ if she is on a different server, or leave her on $s$ if she is already assigned to it. We add a set of constraints stating that each user must be on exactly one server.

$$\sum_s x_{s,u} = 1\, \forall\, u$$

And no user can be assigned to a server that isn't kept on.

$$x_{s,u} \le y_s\, \forall\, u, s$$

Finally, we can't assign more running models to any server than our known capacity, $MAX_{models}$. For each user, we denote $\alpha_u$ as that user's count of running models and $\beta_u$ as the count of sleeping models.

$$\sum_u \alpha_u x_{s, u} \le MAX_{models}\, \forall\, s$$

The astute reader may notice that this is a form of the capacitated facility location problem. We code up this model using PuLP.

import json

# Load input data. From this we will create a list of all users in the
# system and lookup tables for the running and sleeping model counts.
servers = json.load(open('cloud-data.json'))
users = []
running = {}
sleeping = {}
for user_list in servers.values():
    for user, running_models, sleeping_models in user_list:
        users.append(user)
        running[user] = running_models
        sleeping[user] = sleeping_models

# Verify that the data looks right.
for user in users[:5]:
    print '%s: %d running, %d sleeping' % (user, running[user], sleeping[user])

This gives the following output:

user013: 2 running, 1 sleeping
user014: 0 running, 2 sleeping
user015: 0 running, 4 sleeping
user016: 1 running, 4 sleeping
user017: 0 running, 3 sleeping

Code for our first goal looks like:

from itertools import product
import pulp

MAX_models = 50

# Create a new model and set it to minimize its objective.
prob = pulp.LpProblem('Cloud Rebalancer', pulp.LpMinimize)

# Define our decision variables.
# y[s] == 1 if we continue using server s, 0 otherwise.
y = {}
for s in servers:
    y[s] = pulp.LpVariable(s, cat='Binary')

# x[s, u] == 1 if user u is on server s, 0 otherwise.
x = {}
for s, u in product(servers, users):
    x[s, u] = pulp.LpVariable('%s-%s' % (s, u), cat='Binary')

# Add constraints to the system.
# Each user must live on exactly one server.
for u in users:
    prob += sum(x[s, u] for s in servers) == 1

# A user can only be on a server if the server is up.
for s, u in x:
    prob += x[s, u] <= y[s]

# Server capacity constraints.
for s in servers:
    prob += sum(running[u] * x[s, u] for u in users) <= MAX_models

# Goal 1: minimize the number of running servers.
prob += sum(y.values())

status = prob.solve(pulp.GLPK())
if status != pulp.LpStatusOptimal:
    print 'Failed to optimize goal 1. PuLP returned status %d.' % status
else:
    num_servers = int(sum(pulp.value(v) for v in y.values()))
    print 'Servers required:', num_servers

This outputs the minimum number of required servers.

Servers required: 2

Goal 2: Minimize the number of user migrations

Now that we know our minimum number of servers, we can add that as a constraint to our model. $z^*$ is the same as num_servers in the code.

$$\sum_s y_s \le z^*$$

The objective for our second goal is to minimize the required number of user migrations while respecting all the constraints from the first goal and keeping the number of servers minimal. Note that each user is currently on a server, so we only count them as migrated if they end up on a different server after rebalancing. We denote $U_s$ as the set of users currently residing on server $s$.

$$\min w = \sum_{\left(s,u\right):\, u \notin U_s} x_{s,u}$$

# Goal 2: minimize the number of user migrations.
prob += sum(y.values()) <= num_servers

# A migration is a user assigned to a server other than her current one.
# servers[s] holds (user, running, sleeping) triples, so compare names.
migrations = pulp.LpVariable('migrations', lowBound=0)
prob += migrations >= sum(x[s, u] for s, u in x
                          if u not in [row[0] for row in servers[s]])
prob += migrations

status = prob.solve(pulp.GLPK())
if status != pulp.LpStatusOptimal:
    print 'Failed to optimize goal 2. PuLP returned status %d.' % status
else:
    num_migrations = pulp.value(migrations)
    print 'User migrations required:', num_migrations

This gives us the minimum required number of migrations to get us down to 2 servers.

User migrations required: 82.0

Goal 3: Balance sleeping models across servers

The first two goals have immediate operational impact and cost savings. The third goal is intended to hedge against too many models being woken up on a single server. Let's take a look at how many running and sleeping models will be on each server after migration, according to the current solution.

def workload(x, running, sleeping):
    workload_running = {}
    workload_sleeping = {}
    for s, u in x:
        if pulp.value(x[s, u]) > 0.5:
            try:
                workload_running[s] += running[u]
                workload_sleeping[s] += sleeping[u]
            except KeyError:
                workload_running[s] = running[u]
                workload_sleeping[s] = sleeping[u]

    return workload_running, workload_sleeping

workload_running, workload_sleeping = workload(x, running, sleeping)
print 'Running servers post-migration:'
for s in workload_running:
    print '%s: %d running models, %d sleeping models' % (s, workload_running[s], workload_sleeping[s])

This gives the output:

Running servers post-migration:
server006: 41 running models, 66 sleeping models
server007: 48 running models, 98 sleeping models

We'd like to spread out the sleeping models a little better, while keeping the same number of total servers, migrations, and models running on each server. If $w^*$ is our optimal number of migrations (stored in num_migrations), then we can constrain our model the same way we did for the last goal.

$$\sum_{\left(s,u\right):\, u \notin U_s} x_{s,u} \le w^*$$

To simplify the problem, we force servers on or off, depending on the output of goal 2.

$$y_s = y_s^*\, \forall s$$

We add a constraint for each server stating that we won't increase its running workload. Denote $r_s^*$ as the optimal running workload for server $s$, shown in the Python code as workload_running[s].

$$\sum_u \alpha_u x_{s, u} \le r_s^*\, \forall\, s$$

Finally we introduce a new variable, $v$, representing the maximum number of sleeping models on any of our remaining servers. Our objective is simply to minimize $v$. This keeps the same servers on and the same number of migrations, but balances the sleeping models across the remaining servers.

$$\min v$$ $$v \ge \sum_u \beta_u x_{s, u}\, \forall\, s$$

The code below adds these final constraints to the model and re-optimizes it.

# Goal 3: balance sleeping models across remaining servers.
# Keep our migration count as low as possible. As before, servers[s]
# holds (user, running, sleeping) triples, so compare user names.
prob += sum(x[s, u] for s, u in x
            if u not in [row[0] for row in servers[s]]) <= num_migrations

# Set servers on and off based on our current plan.
servers_off = []
for s in servers:
    if pulp.value(y[s]) > 0.5:
        prob += y[s] == 1
    else:
        prob += y[s] == 0
        servers_off.append(s)

# Don't add any more running models to any of the remaining servers.
workload_running, workload_sleeping = workload(x, running, sleeping)
for s, val in workload_running.items():
    prob += sum(running[u] * x[s, u] for u in users) <= val

# Minimize the maximum sleeping workload.
max_sleep = pulp.LpVariable('max_sleep', lowBound=0)
for s in workload_running:
    prob += max_sleep >= sum(sleeping[u] * x[s, u] for u in users)
prob += max_sleep

status = prob.solve(pulp.GLPK(msg=False))
if status != pulp.LpStatusOptimal:
    print 'Failed to optimize goal 3. PuLP returned status %d.' % status

# Return structured data for migrating users.
workload_running, workload_sleeping = workload(x, running, sleeping)
print 'Running servers post-balancing:'
for s in workload_running:
    print '%s: %d running models, %d sleeping models' % (s, workload_running[s], workload_sleeping[s])

Our final configuration looks like this:

Running servers post-balancing:
server006: 41 running models, 82 sleeping models
server007: 48 running models, 82 sleeping models

As we can see, we have the same servers kept on, and they are running the same number of models. The only thing that is different is the number of sleeping models per server. This looks pretty good, so we'll call it a day as far as the model goes.

Interpreting the output

A version of the model is available here for anyone who wants to run it. There are two components to the output that we can take action on.

  1. Which servers to turn off.
  2. Which users to migrate, and where to migrate them.

The code below shows what that output looks like. We know from the final output of the model that we can expect server006 to have 41 running models after migration, server007 to have 48, and both to have 82 sleeping models.

migrations = {}
for s, u in x:
    # Migration means that a user ends up on a new server. servers[s]
    # holds (user, running, sleeping) triples, so compare user names.
    if pulp.value(x[s, u]) > 0.5 and u not in [row[0] for row in servers[s]]:
        migrations[u] = s

print 'Servers to turn off:', servers_off
print '%d users to migrate (showing first 10):' % len(migrations)
for u, s in migrations.items()[:10]:
    print '\t%s -> %s' % (u, s)

This outputs a server rebalancing plan of the form:

Servers to turn off: [u'server002', u'server003', u'server001', u'server004', u'server005']
82 users to migrate (showing first 10):
    user040 -> server007
    user041 -> server007
    user068 -> server007
    user025 -> server007
    user044 -> server006
    user045 -> server007
    user046 -> server006
    user047 -> server007
    user048 -> server006
    user063 -> server007


Estimating very conservatively, these techniques cut our entire cloud expenses by 57%. Our modeling approach works reasonably well for the current incarnation of ScienceOps Cloud. Since user models don't expire for 30 days, we don't need to rebalance very often. If the cloud rebalancing model takes several hours to solve, that's not really a problem for us.

Right now, we can create an optimal plan for server rebalancing in about 6 minutes. The resulting migrations take much more time than that. Keeping in mind that these are NP-Hard problems, we expect our model solution time to increase as we get more users with running models. However, if we consider the set of users with running models as a queue, we can also expect that number to reach a steady state eventually. Should the system become complicated enough that we can't create a rebalancing plan in a timely manner anymore, we'll revisit the problem and write another post.

Hire us!

Do you think you'd benefit from an operational analysis, but don't know where to start? Yhat has you covered. We have lots of experience identifying and solving tough problems with big impacts. Drop us a line!
