Paula Deen X Machine

For my final project in a Data Science course at General Assembly, I set out to machine-learn the best (worst) recipe derived from the infamous queen of gluttonous eats – Paula Deen.


The first step was collecting data. I scoured Food Network’s website for the top 50 dessert recipes and placed them into a spreadsheet.

Many of these recipes included ingredients such as boxed cake mix and Krispy Kreme donuts (one recipe required 24 as a base… a base!). I intentionally left out the desserts that required pre-made ingredients, as they (1) contained unknown amounts of ingredients, dependent on brand, and (2) substituted fat and sugar content in recipes, skewing the totals to show less than the recipes actually contained.

For the curious, the top rated recipe was Baked French Toast Casserole with Maple Syrup, with over 2,360 reviews (as of August 2013) and an average 5-star rating. Here it is in its full-sized glory:

[Photo: the Baked French Toast Casserole]

Nutritional breakdown: 374 grams of fat and 606 grams of carbohydrates, more than half of which is pure sugar. Total calorie count of 6200 calories – or over 1000 calories per serving. But seriously, who eats just one serving?

I normalized the ingredients into grams (from cups, ounces and teaspoons). Here are the aggregate sums for the top 8 ingredients in grams:

[Chart: aggregate gram totals for the top 8 ingredients]
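The conversion step can be sketched as follows – the gram-per-cup densities here are approximate kitchen values I'm assuming for illustration, not the exact figures from my spreadsheet:

```python
# Sketch of the unit normalization step: converting recipe measures to grams.
# The densities below are approximate kitchen values (assumptions), not the
# exact conversion figures used in the original spreadsheet.
GRAMS_PER_CUP = {"flour": 125, "sugar": 200, "butter": 227, "brown sugar": 220}
TBSP_PER_CUP = 16
TSP_PER_TBSP = 3
GRAMS_PER_OUNCE = 28.35

def to_grams(ingredient, amount, unit):
    """Convert a measured amount of an ingredient into grams."""
    if unit == "g":
        return amount
    if unit == "oz":
        return amount * GRAMS_PER_OUNCE
    # Volume units: convert to cups first, then apply the ingredient's density.
    cups = {"cup": 1, "tbsp": 1 / TBSP_PER_CUP,
            "tsp": 1 / (TBSP_PER_CUP * TSP_PER_TBSP)}[unit] * amount
    return cups * GRAMS_PER_CUP[ingredient]

print(to_grams("sugar", 2, "cup"))   # -> 400 (g)
print(to_grams("butter", 8, "oz"))   # -> ~226.8 (g)
```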

Of note: despite Paula Deen’s reputed love for butter, you can see that sugar is much more significant in her recipes, at a near 2:1 ratio. Gram for gram, flour is also a much more significant player than butter. Things get a little different when you combine all three types of sugar used in her recipes: granulated, powdered, and brown.

[Chart: aggregate gram totals with all three sugars combined]

Total sugar count is a significant 20,420 g – or around 45 POUNDS of sugar. Total sugar outweighs the butter count by a near 3:1 margin. For perspective, that averages about 1 lb of sugar per recipe and only 10 tbsp of butter. The average serving (~8 per recipe) of a Paula Deen dessert contains only 1.25 tbsp of butter – about what I put on a piece of bread while waiting for my food to come at a restaurant.
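Those figures check out with a quick back-of-the-envelope calculation (assuming ~14.2 g per tablespoon of butter):

```python
# Sanity-checking the sugar and butter totals quoted above.
GRAMS_PER_POUND = 453.59
GRAMS_PER_TBSP_BUTTER = 14.2  # approximate value, assumed for illustration
N_RECIPES = 50

total_sugar_g = 20420
print(total_sugar_g / GRAMS_PER_POUND)               # ~45 lb of sugar overall
print(total_sugar_g / N_RECIPES / GRAMS_PER_POUND)   # ~0.9 lb per recipe

butter_per_recipe_tbsp = 10
total_butter_g = butter_per_recipe_tbsp * GRAMS_PER_TBSP_BUTTER * N_RECIPES
print(total_sugar_g / total_butter_g)                # ~2.9, a near 3:1 margin
```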

In other words, Ms. Deen’s love for butter may have been greatly exaggerated.


While more data is typically better for machine learning, this data set presented a challenge slightly different from most data science problems. Maybe you can spot what it is. Recipe names run down the left, ingredients across the top:

[Spreadsheet: the recipe-by-ingredient matrix, mostly empty cells]

The data set is incredibly sparse. Each time I added a recipe, any ingredient not found in the previous recipes added a new column with n−1 empty cells. (More data is almost always better, but you have to make sure you are getting more information per data point!) This makes sense: recipes are particular in their selection of ingredients. Many say that a recipe is complete when nothing can be taken out of the equation. A data scientist might say that recipe models are often overfit.

Another issue I had to work around was the imputation of missing values. Do I go with NA values or with zeros? Each ingredient was simultaneously a binary variable (does it exist in the recipe?) and a continuous one (how much of it?).
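Here's a toy sketch of both problems – the recipes and amounts are made up, but the shape of the data is the same:

```python
# A minimal illustration of the sparsity problem: each new recipe tends to
# introduce ingredients the others lack, so the recipe-by-ingredient matrix
# fills up with missing values. (Toy numbers, not the real data set.)
recipes = {
    "gooey butter cake": {"flour": 250, "butter": 230, "sugar": 400},
    "pecan pie":         {"sugar": 300, "eggs": 150, "pecans": 200},
    "banana pudding":    {"sugar": 150, "bananas": 450, "milk": 480},
}
columns = sorted({ing for r in recipes.values() for ing in r})
matrix = {name: [r.get(c) for c in columns] for name, r in recipes.items()}

cells = len(recipes) * len(columns)
missing = sum(row.count(None) for row in matrix.values())
print(f"{missing}/{cells} cells empty")  # 12/21 -> ~57% sparse

# The imputation choice: filling absent ingredients with 0 g keeps the
# variable continuous; keeping None treats presence/absence as a separate,
# binary question.
matrix_zero = {name: [v or 0 for v in row] for name, row in matrix.items()}
```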

Cursory linear regressions often showed wildly varying results, including one that called for 15 grams of nutmeg for every gram of flour. Nonsense!

A Quick Baking Lesson

Baking is, for the most part, a play on six ingredients: flour, eggs, sugar, butter, time, and heat. They are the core components of almost all baking recipes. So, to address this issue, I narrowed the research down to the basic four – flour, eggs, sugar, and butter – looking for relationships only where both ingredients existed. This gave results that were much easier to work with.

[Plot: butter ~ sugar scatter with regression line]

Here’s a linear regression comparing butter ~ sugar. As sugar increases, butter typically increases in a fairly consistent manner. The linear regression summary:

summary(lm(butter~sugar))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  97.4202    29.2095   3.335  0.00195 **
sugar         0.2068     0.0584   3.541  0.00110 **
---
Multiple R-squared:  0.2531,    Adjusted R-squared:  0.233
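The project itself was done in R, but plugging the fitted coefficients into the regression equation is straightforward in any language – for example:

```python
# Using the fitted coefficients from the R summary above to predict butter
# from sugar: butter ≈ intercept + slope * sugar.
INTERCEPT, SLOPE = 97.4202, 0.2068  # from summary(lm(butter ~ sugar))

def predict_butter(sugar_g):
    """Predicted grams of butter for a given grams of sugar, per the fit."""
    return INTERCEPT + SLOPE * sugar_g

print(round(predict_butter(400)))  # -> 180 (g of butter for 400 g of sugar)
```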

Many would argue that this is a very low R-squared value, unfit to draw conclusions from. However, consider these:

[Images: a red velvet cake and a red velvet cupcake]
A red velvet cake and a cupcake. If one were to run a linear regression on recipes for red velvet cakes of various sizes, it would show a very high R-squared fit – why? Because they are the exact same recipe, just scaled! R-squared shows similitude, which would be great if we were making red velvet cakes, but not if we are trying to suss out a recipe from a set. In this application, a low R-squared is acceptable. Rather, we are looking for trends and relationships between the ingredients. With a p-value of 0.001096 and a strong visual correlation, this one qualifies.

Here’s another look at the butter ~ sugar relationship, with two new trend lines drawn in:

[Plot: butter ~ sugar with two additional trend lines]

The trend lines share the slope of the original line of best fit, offset by +2 and −1 standard deviations. Within each band, the points fit much more closely, suggesting a much stronger relationship. The result? There are two types of Paula Deen desserts: (1) one with lots of sugar and butter and (2) another with even more sugar and butter.

Since this is a Paula Deen recipe, let’s opt for the one with more butter and sugar.

Butter ~ Flour and Eggs ~ Sugar gave good (though not tight) correlations:
[Plots: butter ~ flour and eggs ~ sugar regressions]

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
butter      120.8508    43.5229   2.777   0.0105 *
flour.AP      0.3853     0.1489   2.587   0.0162 *
Multiple R-squared:  0.2181,    Adjusted R-squared:  0.1855 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
eggs.whole   78.8844    28.1557   2.802  0.00868 **
sugar         0.1666     0.0557   2.992  0.00540 **
Multiple R-squared:  0.224,     Adjusted R-squared:  0.199

Not as good a p-value as butter ~ sugar, but pretty solid.

Here are the relationships between the ratios of the four ingredients.

relationships

Sugar had a mean and median of about 400 grams. Applying the linear regressions along the above relationships, the recipe base was as follows:

  • 400g sugar
  • 238g butter
  • 100g eggs (rounded down from 122, as an egg only comes in 50g packages)
  • 278g flour

Now what?

The Parable of the Four Blind Men

There’s a Buddhist parable that goes something like this:

Four blind men encountered an elephant in the jungle. When they returned to town, the king, fascinated by their story of such a great beast, asked them, “Tell me, what sort of thing is this elephant?”

The blind man who touched its trunk said the elephant is like a snake. The blind man who touched its leg said it is like a tree. The blind man who touched the tip of its tail said it is like a brush. The blind man who touched its ears said it is like large leaves from a tree.

[Image: blind monks examining an elephant]

Upon hearing this from one another, the men began arguing, each insisting that he was right and the others were wrong.

Because the blind men have never seen an elephant whole, they cannot properly describe the elephant. They can only describe it in relation to what they have previously experienced. Similarly, the king, never having seen the elephant himself, must aggregate the information given to him by the blind men to paint a mental picture of what this animal is like. This is the naive Bayes classification algorithm: You submit your disparate pieces of information, and the computer aggregates and learns from this, giving you a probability-based result at the end. The more data you have, the better your result is, much like how the king might have a better idea of what an elephant looks like had there been 20, or even 100 blind men.

We’re making the computer ask itself:

If this recipe has 400 grams of sugar and 240 grams of butter, then the closest thing I’ve encountered that has the most similarity to this is…
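Under the hood, a Gaussian naive Bayes classifier does exactly this aggregation. Here's a minimal hand-rolled sketch on toy data (the real model was trained in R on the full ingredient-ratio matrix):

```python
import math
from collections import defaultdict

# Hand-rolled Gaussian naive Bayes: each feature (here, an ingredient ratio)
# independently reports how "cookie-like" or "pie-like" a recipe looks, and
# the class log-probabilities aggregate those reports. Toy training data.
train = [
    ("cookie", [0.30, 0.25]),  # [sugar ratio, butter ratio]
    ("cookie", [0.35, 0.22]),
    ("pie",    [0.15, 0.10]),
    ("pie",    [0.20, 0.08]),
]

def fit(data):
    """Estimate per-class feature means, variances, and priors."""
    grouped = defaultdict(list)
    for label, feats in data:
        grouped[label].append(feats)
    stats = {}
    for label, rows in grouped.items():
        means = [sum(col) / len(col) for col in zip(*rows)]
        vars_ = [max(sum((x - m) ** 2 for x in col) / len(col), 1e-6)
                 for col, m in zip(zip(*rows), means)]
        stats[label] = (means, vars_, len(rows) / len(data))
    return stats

def predict(stats, feats):
    """Pick the class with the highest log posterior under the Gaussian model."""
    def log_posterior(label):
        means, vars_, prior = stats[label]
        ll = math.log(prior)
        for x, m, v in zip(feats, means, vars_):
            ll += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return ll
    return max(stats, key=log_posterior)

model = fit(train)
print(predict(model, [0.33, 0.24]))  # -> cookie
```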

Using only the four ingredients, and their ratios to the total sum of ingredients, I got an accuracy of about 70% on the collected data set:

            actual
predicted    bar bread candy cheesecake cookie custard fruit pie
  bar          3     1     0          0      3       0     0   1
  bread        1    11     0          0      0       1     0   0
  candy        0     1     4          0      1       0     0   0
  cheesecake   0     1     0          5      1       0     0   0
  cookie       0     2     1          0      5       0     0   0
  custard      0     0     0          0      0       1     0   0
  fruit        0     0     0          0      0       0     2   0
  pie          0     0     0          0      0       1     0   4

Now let’s classify this badboy.

> table(predict(clf2, asdf[,-10]), asdf[,10]) # give me the results!

             PDx
  bar          0
  bread        0
  candy        0
  cheesecake   0
  cookie       1
  custard      0
  fruit        0
  pie          0

It’s a cookie! Or at least it thinks it’s a cookie…

To validate, I ran a k-nearest neighbor classifier in tandem, and it only returned a success rate of about 35%. Here’s why:

[Plot: butter ~ sugar scatter, colored by dessert type]

kNN identifies classes by grouping points by their Cartesian coordinates. Granted, the above illustration only shows two variables, but it’s easy to see that the types of desserts do not form groups very well. If you think of the above plot as a schoolyard, the jocks are hanging out with the nerds and mingling with the new students. It’s hard to say which group someone belongs to if no one is hanging out in a specific area of the schoolyard.
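For reference, here's what a minimal kNN classifier looks like – toy data again, with one out-of-place point to illustrate why overlapping classes hurt it:

```python
import math
from collections import Counter

# A minimal k-nearest-neighbor classifier, the approach that only hit ~35%
# here: label a point by majority vote among the k closest training points
# in Euclidean space. Toy data standing in for the real feature set.
train = [
    ([0.30, 0.25], "cookie"),
    ([0.35, 0.22], "cookie"),
    ([0.15, 0.10], "pie"),
    ([0.32, 0.12], "pie"),  # a pie sitting near cookie territory: the problem
]

def knn_predict(point, data, k=3):
    """Majority vote among the k nearest neighbors of `point`."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict([0.33, 0.20], train))  # -> cookie
```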

A 70% accuracy would be good enough.

But we can’t bake the recipe as is – a 4-ingredient recipe would not be representative of Paula Deen’s desserts, as every PD recipe contained more than 4 ingredients. So there needed to be a way to identify the other ingredients.

With this known classification, the dataset was reduced to contain only recipes with the “cookie” classification. This reduced the ingredient list dramatically, leaving only about 20 total ingredients. Let’s look at the breakdown of proteins, carbs, and fats of the recipe and compare them with others.

		mean	median	PDx	Control
Protein		6%	5%	5%	5%		
Fats		23%	22%	20%	29%
Carbs		52%	53%	59%	53%	
Sugars		32%	37%	38%	35%
Sum/Whole	81%	81%	84%	87%
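Here's roughly how such a breakdown can be computed from the gram amounts – the per-gram macro fractions below are rough USDA-style approximations I'm assuming for illustration, not the exact values behind the table:

```python
# Macro-nutrient breakdown of the derived recipe base. The per-gram macro
# fractions are rough approximations (assumptions), not the table's sources.
MACROS = {  # (protein, fat, carbs) per gram of ingredient
    "sugar":  (0.00, 0.00, 1.00),
    "butter": (0.01, 0.81, 0.00),
    "eggs":   (0.13, 0.10, 0.01),
    "flour":  (0.10, 0.01, 0.76),
}
recipe = {"sugar": 400, "butter": 238, "eggs": 100, "flour": 278}

total = sum(recipe.values())
for i, name in enumerate(["protein", "fat", "carbs"]):
    grams = sum(amount * MACROS[ing][i] for ing, amount in recipe.items())
    print(f"{name}: {grams / total:.0%}")
```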

PDx is the current iteration of the recipe. Control is the classic Nestle Toll House recipe. Our recipe had far too many carbs and sugars and needed to be rounded out with more fat, without significantly increasing the other macronutrients.

Of the PD cookie corpus, only a few ingredients would fit: cream cheese, peanut butter, and shredded coconut. I discounted the cream cheese, as it was used as a sandwich filling, and came up with a ratio that brought the recipe much closer to the corpus. By adding 150g of peanut butter and 93g (a cup) of shredded coconut, the new ratios came out much closer:

		mean	median	PDx	NEWPDx	Control
Protein		6%	5%	5%	7%	5%		
Fats		23%	22%	20%	25%	29%
Carbs		52%	53%	59%	53%	53%	
Sugars		32%	37%	38%	35%	35%
Sum/Whole	81%	81%	84%	85%	87%

Great. I mixed all the ingredients using the typical butter/sugar creaming technique and added a pinch of baking powder and baking soda to help the cookies rise. A word count of the recipe set declared I should bake these at 350°F. So, I did.
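The "word count" step, sketched – scan the recipe texts for oven temperatures and take the most frequent one (toy strings standing in for the 50 scraped recipes):

```python
import re
from collections import Counter

# Picking a baking temperature by counting temperature mentions across the
# recipe texts. The strings below are made-up stand-ins for the real corpus.
recipes = [
    "Preheat oven to 350 degrees F and bake 12 minutes.",
    "Bake at 350 degrees until golden.",
    "Preheat the oven to 325 degrees F.",
]
temps = Counter(re.findall(r"\b(3\d{2})\b", " ".join(recipes)))
print(temps.most_common(1)[0][0])  # -> 350
```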

[Photo: the finished cookies]

The result was a crispy ring around a chewy center. A little on the sweet side but rounded out nicely by the peanut butter.

For validation, I searched the web for similar recipes and discovered that this recipe was not very far off from Cook’s Illustrated’s very own recipe – which was probably the best validation I could get. So, without further ado, the recipe:

  • 2 cups of sugar (1 cup brown sugar, 1 cup white)
  • 2 ¼ cups of flour
  • 2 eggs
  • 2 sticks of butter
  • ½ cup of peanut butter (crunchy for texture)
  • 1 cup of shredded coconut (optional)
  • 1 tsp baking powder
  • 1 tsp baking soda
  • 1 tsp vanilla extract

If you want to try Cook’s Illustrated’s recipe, add 150g more peanut butter, and slightly more flour (about 30g). It’s a pretty legit cookie!