My Letter to Babson’s Class of 2011

We all got here today in our own unique ways. Before Babson, I had no idea what a balance sheet was. I worked as a graphic artist. I created posters, illustrations, and letterheads for a living, and I loved it.

Yet, despite my love for the creative arts, I got fed up being poor and unrecognized as an artist, so I came to Babson to live the glamorous life of an undervalued, underpaid, and underfunded entrepreneur. And one of the most important things I’ve learned here was how different disciplines are often interconnected.

It turns out my two passions are not so different – entrepreneurs and artists have quite a lot in common. I believe there are many lessons entrepreneurs can learn from artists, and vice versa.

So, what do entrepreneurs and artists have in common?

For starters, artists and entrepreneurs both... love... being... miserable. The rewards are smaller, the hours are longer, and most of what we make is terrible. Our prime motivators are being behind on bills, not having seen our friends in weeks, unpredictability, and stress. And the only way we can tell if we’ve accomplished anything is when other people tell us how bad it is…

Yet, every morning, we get up and we do our work because this is what we love. We love the late-night epiphanies fueled by sleep deprivation. We love being forced to think creatively and on our feet. And when we get rejected – which we all invariably will – we come up with ten new ways to get rejected again. Misery loves company, and that company is full of poets, painters and entrepreneurs.

Now, part of why artists and entrepreneurs love their craft is that both are journeys of self-discovery. Entrepreneurship, like art, is the entrepreneur’s interpretation of the world around them, packaged and presented to the world. It’s intensely personal – our most genuine and honest work. It requires us to understand ourselves intimately, or risk being seen as frauds. It separates the amateurs from the professionals. It’s unyielding and uncompromising, and we will have it no other way.

But finally, and most importantly, entrepreneurs and artists share a desire to innovate. Just as the true artist knows there can be only one Picasso, the true entrepreneur knows that making the next Facebook adds nothing new to the world. Artists and entrepreneurs are similar because we have an inexplicable, intense spark within us to create something from nothing. We know we will be miserable. And we know that we will be judged and criticized. Yet… we do it because we all want to make our mark on human history.

So to everyone, I’d like to propose a challenge: Let’s all become true artists and entrepreneurs. Let’s all rid ourselves of preconceived notions about what we should do or what we can be. And let’s all dare to be different. Let’s not be the next, or be the best – let’s be the first, let’s be the only.

Finding Latent Influencers in Yelp’s Social Network

Social media allows us to connect and share information and opinions faster and more easily than ever. But as our social networks get larger, it becomes increasingly difficult to determine the people that matter.

On social networks like Twitter, retweets are treated as a measure of influence; on Facebook, many put stock in the number of followers or friends connected to an account. While these act as good proxies for the amount of attention an account can attract, they don’t capture how those numbers affect engagement or sway opinions.

The most retweeted photo of all time. Did it sell any Galaxy S4s? Or did it only reinforce the opinions of those who already own one? How can you tell?

Using Yelp’s academic dataset of every business and review for the Phoenix, AZ market, I set out to find these people of influence in the data.

To define the effect of influencers, I searched for a cause-and-effect link within the Yelp reviews – one I could confidently attribute to an outside influence – in order to establish a baseline. Fortunately, we have Gordon Ramsay to thank for this giant meltdown on his show Kitchen Nightmares:


Since Yelp “forgot” to include this business in its dataset, I had to download the individual HTML pages and parse them with BeautifulSoup.
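A minimal sketch of what that parsing step could look like, assuming the pages were saved locally; the file names and CSS selectors here are illustrative placeholders (Yelp’s markup has changed over the years), not the exact ones I used:

```python
import glob
from bs4 import BeautifulSoup

reviews = []
# Hypothetical file names for the hand-downloaded review pages.
for path in sorted(glob.glob("amys_baking_company_page_*.html")):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Placeholder selectors: pull the star rating and date out of each review block.
    for block in soup.select("div.review"):
        rating = block.select_one("meta[itemprop=ratingValue]")
        date = block.select_one("meta[itemprop=datePublished]")
        if rating and date:
            reviews.append({"stars": float(rating["content"]), "date": date["content"]})

print(len(reviews), "reviews parsed")
```

Here are the results: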

[Chart: review scores and review frequency over time for Amy’s Baking Company]

The red line marks the air date of Kitchen Nightmares for Amy’s Baking Company in Scottsdale, AZ. Immediately afterward, there’s a clear change in not only the review scores but also the frequency of reviews. So it is fair to say that an influencer can affect both a business’s Yelp score and its review frequency soon after the influencer acts, and that the change in score or review frequency is at least correlated with, if not caused by, that action. By studying the user reviews that occur just before a radical change in a business’s review scores and frequencies, I hoped to find influencers, or at least measures of influence.

But because average scores quickly converge once a large number of reviews have been posted, I used a rolling average to look for sudden changes, with a 7-day window to account for weekly seasonality. To look for variance along the rolling average, I used a rolling variance over 20 reviews. Here’s why:

[Chart: running average review score converging toward the final average as reviews accumulate]

Over a set of 500 Yelp reviews, it takes only about 15-20 reviews before the running average converges to within 0.1 star of the final average score. A 20-review window therefore eliminates the large variances that happen simply because the sample is small and varied, as it is at the beginning of a business’s lifecycle.
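A minimal sketch of those rolling statistics, assuming a pandas DataFrame with one row per review and (my own, assumed) datetime “date” and numeric “stars” columns:

```python
import pandas as pd

def rolling_signals(reviews: pd.DataFrame) -> pd.DataFrame:
    """Compute the rolling statistics used to spot sudden changes in a business's reviews."""
    reviews = reviews.sort_values("date").set_index("date")
    out = pd.DataFrame(index=reviews.index)
    # 7-day rolling mean of scores, to smooth over weekly seasonality.
    out["rolling_mean"] = reviews["stars"].rolling("7D").mean()
    # Rolling variance over the last 20 reviews, so the noisy first handful of
    # reviews (before the average converges) doesn't register as a big swing.
    out["rolling_var"] = reviews["stars"].rolling(20).var()
    return out
```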

Plotted as a time series, large inflection points in this variance are easy to pick out. At the base of each of these inflection points, I collected a sample of 15 users and kept a tally. All businesses within this dataset have at least 150 reviews, so each sample represents at most 10% of a business’s total reviews.

[Chart: review scores plotted as the change in variance of scores. I collected the users at the base of the biggest peak.]
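A sketch of how that sampling could work, reusing the rolling_signals() frame from above and an assumed “user_id” column; the peak-finding via scipy is a stand-in for reading the inflection points off the chart by eye:

```python
import numpy as np
from scipy.signal import find_peaks

def users_at_biggest_variance_peak(signals, reviews, n_users=15):
    """Return the reviewers posting just before the largest jump in rolling variance."""
    var = signals["rolling_var"].dropna()
    peaks, _ = find_peaks(var.values)              # candidate inflection points
    if len(peaks) == 0:
        return []
    biggest = peaks[np.argmax(var.values[peaks])]  # index of the largest peak
    peak_time = var.index[biggest]
    # The n_users reviews immediately preceding the peak form the sample to tally.
    before = reviews[reviews["date"] <= peak_time].sort_values("date")
    return before["user_id"].tail(n_users).tolist()
```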

In addition to changes in the moving average rating, changes in review frequency were also tracked. Using a Bayesian approach, I ran Markov chain Monte Carlo (MCMC) simulations to estimate the most probable inflection point (tau) in each business’s review frequency. At each tau, I collected a sample of 10 users on either side to account for the random, stochastic nature of MCMC. Additionally, if the inflection point was found to be at the very beginning or end of the chronologically ordered review set, I used the next most likely inflection point to collect my samples. The model follows Cam Davidson-Pilon’s PyMC example for finding switch points in text-messaging data (much thanks to him for that code); a rough sketch of it is below.
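A minimal sketch of the switch-point model, assuming daily review counts for one business as input; the synthetic data and variable names are mine, and it uses the PyMC 2.x API from Davidson-Pilon’s book rather than the current PyMC release:

```python
import numpy as np
import pymc as pm  # PyMC 2.x

# Stand-in data: daily review counts with a rate change partway through.
count_data = np.r_[np.random.poisson(2, 60), np.random.poisson(8, 40)]
n = len(count_data)
alpha = 1.0 / count_data.mean()

lambda_1 = pm.Exponential("lambda_1", alpha)            # review rate before the switch
lambda_2 = pm.Exponential("lambda_2", alpha)            # review rate after the switch
tau = pm.DiscreteUniform("tau", lower=0, upper=n - 1)   # the inflection point

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n)
    out[:tau] = lambda_1   # before tau, reviews arrive at rate lambda_1
    out[tau:] = lambda_2   # afterwards, at rate lambda_2
    return out

observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
model = pm.Model([observation, lambda_1, lambda_2, tau])

mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)                 # draw samples, discard burn-in
tau_samples = mcmc.trace("tau")[:]
print("most probable tau:", np.bincount(tau_samples).argmax())
```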

[Chart: review frequency over time for one business]

Where is the inflection in review frequency here?

When tallied up, the frequency-inflection model showed rather inconclusive results: the most commonly appearing user showed up only about 9 times in a set of 500 reviews. That user is also an employee of Yelp who has written several thousand reviews. Given how many reviews she has written, the probability of her showing up ~2% of the time is fairly high and can be attributed to mere chance.

In retrospect, this makes a lot of sense. For one, MCMC is stochastic, meaning much of it comes down to random chance unless the model is well tuned. Furthermore, MCMC placed a lot of the inflection points at the very beginning of the dataset, suggesting that reviews were written fairly steadily for many businesses. But I think my biggest mistake was fitting *one* inflection point rather than several. In the graphic above, the largest inflection occurs around the halfway mark; had two or three taus (inflection points) been fit, perhaps the model would have found less dramatic but more meaningful inflections, such as the one about three-quarters of the way to the right. A two-tau version of the earlier sketch is outlined below.
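For what it’s worth, a sketch of that two-switch-point variant under the same assumptions, reusing count_data, n, and alpha from the sketch above (again PyMC 2.x, and again purely illustrative):

```python
import numpy as np
import pymc as pm  # PyMC 2.x

tau_1 = pm.DiscreteUniform("tau_1", lower=0, upper=n - 1)
tau_2 = pm.DiscreteUniform("tau_2", lower=tau_1, upper=n - 1)  # constrained to follow tau_1
lambda_1 = pm.Exponential("lambda_1", alpha)
lambda_2 = pm.Exponential("lambda_2", alpha)
lambda_3 = pm.Exponential("lambda_3", alpha)

@pm.deterministic
def lambda_(tau_1=tau_1, tau_2=tau_2, l1=lambda_1, l2=lambda_2, l3=lambda_3):
    out = np.zeros(n)
    out[:tau_1] = l1          # rate before the first inflection
    out[tau_1:tau_2] = l2     # rate between the two inflections
    out[tau_2:] = l3          # rate after the second inflection
    return out

observation = pm.Poisson("obs", lambda_, value=count_data, observed=True)
mcmc = pm.MCMC(pm.Model([observation, lambda_1, lambda_2, lambda_3, tau_1, tau_2]))
mcmc.sample(40000, 10000)
```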

On the other hand, the samples taken at the review score inflections showed some interesting results:

Five users appeared more than 50 times in a set of 3,300+ total users and 11,000+ reviews, in a Zipfian distribution. In striking contrast to the last example, this is much less likely to be due to chance. At the far end of the long tail is someone named Sarah.

[Chart: user appearance counts at score inflections, showing a long-tailed (Zipfian) distribution]

Sarah appears an astonishing 75 times across my dataset of 500 businesses, and she has reviewed only 187 of them – in other words, she shows up at an inflection point for 40% of the businesses she has reviewed. What’s more, 86% of her reviews are within ±0.5 stars of the business’s final average score. Since she quit Yelping in 2009, that means she was able to fairly accurately predict a business’s aggregate Yelp score five years in advance. And considering that users can only score in 1.0-star increments, while aggregates are measured in 0.5-star increments, that’s kind of a big deal.

But what’s more fascinating is that Sarah bucks the trend of what we believe influencers to be. Her average review gets only about 12 “funny”, “useful”, or “cool” votes in total from other users. She doesn’t have thousands of Yelp friends (only 60), she hasn’t written a ton of reviews (about 450 total – she doesn’t even break the top 1,000 users by reviews written for Phoenix), and she has only 15 total “fans” of her reviews. (Though she was Yelp “Elite” while she was an active user.)

While I don’t have any evidence to suggest causation, the correlation is pretty strong – strong enough to at least suggest that she is a “trend setter” who reviews businesses just before a sudden change in public opinion. Bottom line: had she continued to use Yelp, her reviews would have been worth watching very closely.

What I’d like to do next is see whether there are other Sarahs in other markets, such as San Francisco and New York. Once they have been identified, perhaps clustering and classification algorithms could be used to find out what makes these people different from your average Yelper.

Maybe we can start quantifying the effects of online influencers on other social networks, too.

 

Machine Learning Isn’t That Hard, or What Data Scientists REALLY Do

Machine learning is not complicated. No really, it isn’t! I bet you can do machine learning without even opening your calculator app. Simply study this line of numbers for no more than a few seconds:

* 0000111100001111

Got it? Compare that line to these four new lines:

1. 0000111100001110
2. 0101000011110000
3. 1111000011110000
4. 0000000000000000

Which line is the example most similar to?

If you said Line 1, then you understand the fundamentals of most machine learning algorithms. That’s all machine learning is: turning data into patterns and making predictions based upon those patterns. In this example, you can say with more than 90% confidence that your prediction is correct because 15 out of 16 digits match up (as long as no one digit is more important than any other in your prediction).
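That eyeball comparison boils down to counting matching digits. Here is a minimal sketch of the same nearest-neighbour idea; the function name and printout are mine:

```python
def similarity(a: str, b: str) -> float:
    """Fraction of positions where two equal-length digit strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

example = "0000111100001111"
lines = ["0000111100001110",
         "0101000011110000",
         "1111000011110000",
         "0000000000000000"]

# Predict by picking the line that matches the example in the most positions.
for i, line in enumerate(lines, start=1):
    print(f"Line {i}: {similarity(example, line):.0%} of digits match")
best = max(range(len(lines)), key=lambda i: similarity(example, lines[i])) + 1
print("Most similar: Line", best)   # Line 1, with 15 of 16 digits matching
```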

In this case, that was supervised learning – supervised because you coerced your answer (also known as a prediction) to fit into an existing or known pattern.

Now, consider the four lines again, WITHOUT the original example as context:

1. 0000111100001110
2. 0101000011110000
3. 1111000011110000
4. 0000000000000000

If you were asked to group these four lines into THREE distinct groups (Group X, Y, and Z), you might organize them like this:

Group X: Line 1
Group Y: Line 2 and Line 3
Group Z: Line 4

Group Y (Lines 2 and 3), though not exactly alike, have enough in common to be grouped together. Because you are not trying to fit each line into an existing system, but are instead coming up with your own convention for them, this is an example of unsupervised learning.

If I ask you to group them into 2 distinct groups, it gets a little bit tougher – will you group them by the total sum of each line? If you do, Lines 1 (sum = 7), 2 (sum = 6), and 3 (sum = 8) would form one group, and Line 4 would be its own group. But if you look at the total number of 0s per line, you might group Lines 1, 2, and 4 together (they are more than 50% 0s). Or you might group them in a completely different way that I hadn’t thought of. Without more information on what those 1s and 0s represent, it’s hard to decide which grouping makes the most sense, as the sketch below illustrates.
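A quick sketch of how those two choices of feature produce different groupings (the group labels here are mine):

```python
lines = ["0000111100001110",
         "0101000011110000",
         "1111000011110000",
         "0000000000000000"]

# Feature 1: the sum of each line (i.e. how many 1s it contains).
sums = [line.count("1") for line in lines]                 # [7, 6, 8, 0]
by_sum = {"nonzero_sum": [i + 1 for i, s in enumerate(sums) if s > 0],
          "zero_sum":    [i + 1 for i, s in enumerate(sums) if s == 0]}

# Feature 2: the fraction of 0s; "mostly 0s" means strictly more than half.
zero_frac = [line.count("0") / len(line) for line in lines]
by_zeros = {"mostly_zeros": [i + 1 for i, z in enumerate(zero_frac) if z > 0.5],
            "the_rest":     [i + 1 for i, z in enumerate(zero_frac) if z <= 0.5]}

print(by_sum)    # {'nonzero_sum': [1, 2, 3], 'zero_sum': [4]}
print(by_zeros)  # {'mostly_zeros': [1, 2, 4], 'the_rest': [3]}
```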

That’s what data scientists do: they take their knowledge of statistics, math, coding, and domain expertise (or work with other experts) to make these sorts of tough decisions.

It gets a little bit tougher when the rows of numbers get longer and the entire collection of rows gets larger. The complexity rises further when the numbers begin to represent ideas or categories (1 for blue, 2 for red, 3 for green…) or are given different weights (instead of 1 or 0, we can use the total weight in pounds). That’s where computers come into play. But to be honest, computers aren’t strictly necessary for machine learning; they just make things a whole lot faster.

The Importance of Breathing

Society has taught us to prize our neocortex and advanced cognition while suppressing the parts of our brains that control our passions, functions, and actions. We are told to sit down, shut up, and listen to the smart man on TV or in the books. So is it any surprise that we are stressed out, neurotic, and feel caged?

Work on developing the body, on controlling (not suppressing) your reflexes and habits, and on learning how to observe (not necessarily react), and you will become a stronger person.

The first step is learning how to breathe.