# Reducing your R memory footprint by 7000x

#### by Greg

R is a notoriously memory-heavy language. I don't necessarily think this is a
bad thing--R wasn't built to be super performant, it was built for analyzing
data! That said, some implementation patterns are quite... redundant. As an
example, I'm going to show you how you can prune a 330 MB `glm` to 45 KB
without losing significant functionality.

*Let's trim the R fat*

### Le Model

Our model is going to be super simple. We're just going to build a logistic
regression model that predicts whether or not a record from the `iris` dataset
belongs to the `setosa` species. A normal version of this model would look like
this:
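A minimal sketch of such a model, assuming a binomial `glm` that uses all four measurements as predictors. The exact formula is an assumption on my part, so treat the printed size as illustrative:

```r
# Assumption: logistic regression on all four iris measurements.
# setosa is perfectly separable from the other species, so glm() tends to
# warn about fitted probabilities of 0 or 1; we silence that for the demo.
fit <- suppressWarnings(
  glm(Species == "setosa" ~ Sepal.Length + Sepal.Width +
        Petal.Length + Petal.Width,
      data = iris, family = binomial)
)

# Size of the fitted object on the plain 150-row dataset
print(object.size(fit), units = "Kb")
```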

But we're going to intentionally make our model bigger. Much, much bigger. Like
big data big. To do that, we'll randomly sample 500,000 rows from `iris` to
make `iris.big` and then retrain our model.

I realize this isn't the best way to sample data when building a model, but it'll serve our purposes just fine :).
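Roughly, the resampling and refit could look like the sketch below. The seed and the formula are assumptions, so the size it reports may not match the figure quoted in this post exactly.

```r
set.seed(42)  # assumption: any seed works; it's only here for reproducibility

# Draw 500,000 rows from iris, with replacement
idx <- sample(nrow(iris), 500000, replace = TRUE)
iris.big <- iris[idx, ]

# Refit the same (assumed) model on the inflated data
fit <- suppressWarnings(
  glm(Species == "setosa" ~ Sepal.Length + Sepal.Width +
        Petal.Length + Petal.Width,
      data = iris.big, family = binomial)
)

print(object.size(fit), units = "Mb")
```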

330 MB seems a little large for a simple `glm`. Let's see what underlying data
we can strip away.

### Where's the bloat?

First things first. Let's take a look at all of the variables in `fit` and see
what's causing all this mayhem.
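One way to do that is to measure each component of the fitted object with `object.size`. This is a sketch under a few assumptions: the formula is my guess, and it uses a smaller sample than the 500,000 rows above so it runs quickly.

```r
set.seed(42)
n <- 50000  # assumption: smaller than the post's 500,000 to keep this quick
iris.big <- iris[sample(nrow(iris), n, replace = TRUE), ]
fit <- suppressWarnings(
  glm(Species == "setosa" ~ Sepal.Length + Sepal.Width +
        Petal.Length + Petal.Width,
      data = iris.big, family = binomial)
)

# Size of every component of the fitted object, in bytes
sizes <- vapply(names(fit),
                function(nm) as.numeric(object.size(fit[[nm]])),
                numeric(1))

# Largest components first, converted to megabytes
print(round(sort(sizes, decreasing = TRUE) / 1024^2, 2))
```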

Ok so there's no way that we need to save all of this data just to make a prediction (after all, it's just coefficients right!?!). Let's see what we can get away with chopping.

### But if we delete stuff, won't that break things?

At Yhat, we're all about making predictions using R and Python models. So for
R models, what we're really concerned with is being able to successfully call
the `predict` function on `fit`.

I'm going to set aside a validation/test set of predictions that I can use later
to make sure that my modified `fit` is still working correctly.

```
expected.results <- predict(fit, newdata=iris)
```

### Just take a little off the top

I started looking through the heavy variables and found that most of them were
either stored training data (`data`, `y`) or diagnostic data for the model
(`fitted.values`, `linear.predictors`, `residuals`). A hunch told me that the
model didn't actually need any of these to make a prediction.

Let the carnage begin. We'll start by deleting some of the largest variables.
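Here's a sketch of that first pass, limited to the five components named above. The formula and the reduced sample size are assumptions, so the sizes it prints won't line up with the numbers in this post.

```r
set.seed(42)
n <- 50000  # assumption: smaller than the post's 500,000 to keep this quick
iris.big <- iris[sample(nrow(iris), n, replace = TRUE), ]
fit <- suppressWarnings(
  glm(Species == "setosa" ~ Sepal.Length + Sepal.Width +
        Petal.Length + Petal.Width,
      data = iris.big, family = binomial)
)

# Reference predictions from the untouched model
expected.results <- predict(fit, newdata = iris)
size.before <- object.size(fit)

# Drop the stored training data and the per-row diagnostics
for (nm in c("data", "y", "fitted.values", "linear.predictors", "residuals")) {
  fit[[nm]] <- NULL
}

# predict() on new data should give the same answers as before
stopifnot(isTRUE(all.equal(expected.results, predict(fit, newdata = iris))))

print(size.before, units = "Mb")
print(object.size(fit), units = "Mb")
```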

Ok, 45 MB isn't bad and we're still getting valid results from our `predict`
call! But I'm a little greedy. I want to eliminate the 46 MB that's still
plaguing us from the `qr` variable.

Unfortunately, I found the pesky `qr` object **COULD NOT** be removed from
`fit`... entirely. However, when you remove the `qr$qr` variable (I know it's a
ridiculous name), things seem to be ok.
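A sketch of that step, with the same assumed formula and a reduced sample size. The `try()` call shows what happens when the whole `qr` component goes missing; I'd expect an error there, but I've left the result unasserted since it may depend on your R version.

```r
set.seed(42)
n <- 50000  # assumption: smaller than the post's 500,000 to keep this quick
iris.big <- iris[sample(nrow(iris), n, replace = TRUE), ]
fit <- suppressWarnings(
  glm(Species == "setosa" ~ Sepal.Length + Sepal.Width +
        Petal.Length + Petal.Width,
      data = iris.big, family = binomial)
)
expected.results <- predict(fit, newdata = iris)

# Removing the whole qr component is expected to break predict()
broken <- fit
broken$qr <- NULL
attempt <- try(predict(broken, newdata = iris), silent = TRUE)
print(class(attempt))

# Dropping only the big matrix inside it keeps the small pieces
# (rank, pivot, ...) that the rest of the object still carries
size.before <- object.size(fit$qr)
fit$qr$qr <- NULL

stopifnot(isTRUE(all.equal(expected.results, predict(fit, newdata = iris))))

print(size.before, units = "Mb")
print(object.size(fit$qr), units = "Kb")
```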

### The finale

Alright so I've managed to nuke 80% of the underlying data in my model. How big is it now?

That's right. We've managed to reduce our model by a factor of **7000**. All
while not losing what we deem to be "core functionality"!
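If you want to try the whole thing in one go, a helper along these lines could work. `strip_glm` is a hypothetical name, and the list of components goes a bit beyond the ones named earlier (`model`, `weights`, `prior.weights` and `effects` are my assumption about what else `predict` on new data can live without). The sample size is reduced too, so the ratio it prints will differ from the one above.

```r
# Hypothetical helper (not a base R function): drop what predict() on
# new data doesn't appear to need
strip_glm <- function(fit) {
  drop <- c("data", "model", "y", "fitted.values", "linear.predictors",
            "residuals", "weights", "prior.weights", "effects")
  for (nm in drop) {
    fit[[nm]] <- NULL
  }
  fit$qr$qr <- NULL  # keep the rest of qr, just not the big matrix
  fit
}

set.seed(42)
n <- 50000  # assumption: smaller than the post's 500,000 to keep this quick
iris.big <- iris[sample(nrow(iris), n, replace = TRUE), ]
fit <- suppressWarnings(
  glm(Species == "setosa" ~ Sepal.Length + Sepal.Width +
        Petal.Length + Petal.Width,
      data = iris.big, family = binomial)
)

slim <- strip_glm(fit)

# Same predictions, much smaller object
stopifnot(isTRUE(all.equal(predict(fit, newdata = iris),
                           predict(slim, newdata = iris))))
ratio <- as.numeric(object.size(fit)) / as.numeric(object.size(slim))
print(round(ratio))
```

Keep in mind that a stripped model like this is only good for `predict` on new data; anything that relies on the removed diagnostics will likely stop working.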

### Final thoughts

Thanks to Harlan Harris who originally gave us the idea for this post! If you're interested in reading more about this topic, check out these resources: