Lots of good stuff on predicting using increasingly larger and larger sets of data.
Here’s a brain teaser, as phrased by wikipedia (spoilers):
> You have two indistinguishable envelopes that each contain money. One contains twice as much as the other. You may pick one envelope and keep the money it contains. You pick at random, but before you open the envelope, you are offered the chance to take the other envelope instead.
This should remind you of the Monty Hall problem, where you are asked if you would switch after making a decision. So, I ran a simple Monte Carlo simulation to see what would happen:
def run(simulations=1000): env1, env2 = 0 env2 = 0 for i in range(simulations): envelopes = [random.randint(2,10)*2] select = random.randint(0,1) if select==1: envelopes.append(envelopes/2) elif select==0: envelopes.append(envelopes*2) env1 += envelopes env2 += envelopes return env1, env2, env2*1.0/env1
# plotting the simulation results sim1, sim2 = ,  for i in range(1000): a, b, c = run(1000) sim1.append(a) sim2.append(b) hist(sim1) hist(sim2)
In this case, which envelope you choose first doesn’t matter. You aren’t sure which one has more money, and there’s no distinguishable way to tell, therefore, the odds of you picking the envelope with more money is 50/50. Let’s call that the *first* envelope, meaning the one you choose first, just to have a point of reference.
Because one envelope always has twice the amount of the other with 50/50 odds, I flip a coin. If the coin is heads (1), the second envelope has 1/2 of the amount in envelope one; if it’s tails (0), then the second envelope has 2x the amount. The odds of the second envelope (the one you switch to) either have twice the money, or half the money. I run 1000 simulations and keep the tally for the first and second envelope. The end result is the sum of the contents for 1000 results of envelope A and envelope B.
Here are some results of the simulations:
In : run(10000) Out: (119290, 150884, 1.2648503646575573) In : run(10000) Out: (119894, 148105, 1.2352995145712045) In : run(10000) Out: (120490, 152672, 1.2670927047887792)
Let’s run 1000 simulations of the 1000 results and plot each resulting tally.
Blue is first envelope, green is second. If you did this experiment 1000 times, you would probably have ~$11,000 if you stick with the first envelope, and over ~$13500 if you decided to switch envelopes. This went completely against my initial thoughts, so I checked the code again:
if select==1: envelopes.append(envelopes/2) elif select==0: envelopes.append(envelopes*2)
Ah ha! I thought (mistakenly). I’m multiplying the second one and dividing the first. What if I switched the signs?
if select==1: envelopes.append(envelopes*2) elif select==0: envelopes.append(envelopes/2)
I should have expected that. Because the choice to multiply or divide is random, it didn’t matter, as roughly the same amount of enveloped get either doubled or halved regardless of order.
So, here’s what we know so far:
- Switching seems to increase expected payout
- The expected payout increases by a factor of about 25%
- There is enough of a difference in the distributions to accept that switching has a significant effect
It helps to think about the expected payouts on a vector. Assuming the first envelope has $10 in it, the situation would look like this:
If you switch, the envelope will either contain $20 or $5:
It’s plain to see that the potential gain on switching is significantly larger than the loss. Gaining 100% is much more significant in absolute terms than losing 50%. Therefore, expected value of switching – over time, on average – comes to $12.50 or +2.50 (marked by X) which is splitting the difference between -50% ($5 envelope) or +100% ($20)
It pays to switch. But here’s the paradox: if the act of switching increases your expected payout by 25%, then couldn’t it make sense to constantly switch, therefore perpetually increasing the payout?
No! When you switch a second time, there are two possible cases:
Case 1, Switching back to the larger denomination: If you switch to the higher payout, and you switch back, you go from $10 to $20 to $10, but since you (unknowingly) had $10 to begin with, the expected additional payout of switching back is always 0.
Case 2, Switching back to the smaller denomination: If you switch to the smaller payout, and you switch back, you go from $10 to $5 to $10, but since you (unknowingly) had $10 to begin with, the expected additional payout of switching back is always 0.
Therefore, the act of switching again has no additional effect on the payout. Which implies that it is not the act of switching that increases value. Full stop.
There is certainly an opportunity to increase your payout, but the probability of switching to it is about 50/50 – the same odds of picking it in the first place. It is completely random and therefore no way to foolproof method to increase your payout. The validity of the choice can only be demonstrated after the envelope is opened. While the expected value may increase, the actual contents of the envelopes do not change.
So back to the philosophical debate of choice. How do you know which is the right one to pick? Flip a coin. You’ll be 100% right 50% of the time.
Half the money I spend on advertising is wasted; the trouble is I don't know which half -- John Wanamaker
I’ve often been told that the best odds in Vegas are at the craps table, with only about a 1% edge going to the house. Yet, I don’t know of any player that went to Vegas with $5000 to come home with $4950. They’re either up much more, or down much less. Ironically, expected value should rarely be expected. The healthiest, and most sensible, thing to do is to think about as my friend JVS did, “Just give me one of the envelopes and I’ll be happy. Either way, it’s more money than I have now.” And he’s right. The actual payout in relation to what you had before will always be +X, because no matter what envelope you choose, you will always have +X more dollars than you’ve had before. Trying to optimize for something that is completely random will just drive you insane…
I saw a pretty perplexing question having to do with probability the other day
> How do you randomly generate a number from 1:7 with a six-sided die?
It’s a pretty devious question. Each face number 1:6 has exactly 1/6 probability to show up. How do you convert this to a 1/7 probability? I thought about this for a while. I wrestled with common factors, common multiples and really long combinatorial examples drawn by hand. I even googled it. The answer I found proved to be lengthy:
There are a bunch of options. Here's an easy one: roll the die twice, keeping track of the first and second roll. There are 36 outcomes: (1,1), (1,2), (1,3), ..., (6,6) (The first number is what you got on the first roll; the second is what you got on the second.) If you get a (6,6), just re-roll the die twice again until you get a non-(6,6). Now there are 35 equally-likely outcomes, so divide them into 7 groups of 5 corresponding to the 7 choices among which you want to choose. Here's an easy calculation that will do it. Suppose you roll a (x, y) with x = number on first roll and y = number on second roll. First calculate the following number: N = 6(x-1) + y.
I hate complicated solutions. Here’s another method (in python, of course):
import random # import this to use the random feature dicerolls = [  for i in range(7) ] for i in range(len(dicerolls)): dicerolls[i] = [ int(random.random()*6) for number in range(500)] # print out winners and identify my random number from 1:7 winner = [sum(i) for i in dicerolls ] print winner print max(winner), "- random number:", winner.index(max(winner))+1
How it works: Roll N*7 times. Each result gets put into one of 7 lists, like so (when N = 13):
Because each individual roll is independent and random, it doesn’t matter how many times you roll, as long as each list is given the same amount of rolls. What’s important is that there are a sufficient number of rolls to prevent ties (in my code, I go with 2^7). After N rolls, you can sum the contents of the list, and the largest position is the outcome of your randomly generated number:
Run with N=500, here are the sums of each list:
[1236, 1207, 1246, 1233, 1273, 1248, 1298] 1298 - random number: 7
In this case, since the 7th list has the highest sum, your randomly generated number between 1:7 is 7. Easy!
I discovered an interesting property about the Monty Hall problem.
When the answer to the Monty Hall problem is explained, often people say “the probability of the car being behind the door if you switch doubles.” Not necessarily true! Let’s look at what happens when there are multiple doors using my handy dandy Monty Hall Monte Carlo class:
When there are three doors (blue), yes, the probability of winning goes from 1/3 to 2/3 (double) by always switching. However, if you use more doors, the probability of winning the car by switching shrinks! At 7 doors, there’s hardly any difference in win percentage whether you switching, hold or randomly switch.
Here’s another view of it.
The bottom dot on each line represents never switching, the center represents random switching. You can see that no matter the number of doors, the odds of winning the car are are always higher if you always switch the door you choose. However, the advantage seems to decay quite quickly, almost to the point of exhibiting no advantage near 7 doors.
UPDATE: Fellow Zipfian Alum Alex Mentch Explains:
With 3 doors, your initial pick has a 1/3 chance of having the car and the other two doors have 2/3. Remove one of those other two doors and the remaining door has the 2/3 chance.
At 4 doors, your initial pick has 1/4 and the other 3 have 3/4. Remove one door and the 3/4 chance is shared amongst 2 doors, so 3/8 each. At 5 doors, your initial pick has 1/5 and the other 4 have 4/5.
Remove one door and the remaining doors have 4/5 * 1/3 = 4/15.
> Reservoir Sampling? What’s that?
Reservoir sampling is randomly pulling out a known number of examples from an unknown (or very large) pool of streaming items.
An example: Let’s say you get a lot of emails per day. I mean a LOT. So many that you cannot possibly read all of them. You don’t how many you will get each day, but it’s a lot. To try to get your email inundation under control, you decide that you’re only going to read a few randomly chosen emails and ignore the rest - cuz you’re saucy and that’s how you roll.
How will you go about this? You could choose to read just the first X-number of emails you receive, but that just means everyone will just bombard you in the morning. You could just pick a few emails during your lunch and whenever you go to the bathroom, but you’ll could end up stuck in an email reading vortex and read many more than you intended to.
No, what you want to do is pick a number (K), and have that many emails ready for you to read at night when you are about to go to bed. No more, no less. And you’re going to do it with python. And because you don’t want to play favorites, you will choose randomly what emails to read. Let’s go with K = 3. And for the sake of confusion, let’s forego 0-indexing.
Here’s the code to do it (thanks, Larry):
def reservoir( stream_of_incoming_emails, K ): emails =  N = 0 for email in stream_of_incoming_emails: N += 1 if len( emails ) < K: result.append( email ) else: s = int(random.random() * N) if s < K: result[ s ] = item return result
The first three emails are rather easy to deal with. Three open slots, three emails:
if len( emails ) < K: result.append( email )
Let me introduce some concepts:
The set is the set of emails you decide to read at the end of the collection. Then there’s the discard pile and the memory. The memory is where the newest incoming email begins. Note that an email can only occupy either memory or the discard pile, never both at the same time. Furthermore, when the email is discarded, it only exists theoretically – the computer has deleted all trace of it, but has kept a record that it has existed via the in-place operator N+=1. One of the reasons reservoir sampling is powerful is because it can handle an infinite number of items keeping only the pertinent items, eliminating the problem of running out of memory/space. By keeping count of the total number of emails, you can compute a probability space, which allows you to assign a fair probability.
The next email rolls in. This is the schematic that occurs before any decisions are made: When the 4th email (email D) rolls in memory, it can occupy one of three spaces in your set, or be discarded, for a possible 4 outcomes. Because there are four possible positions, each space occupies a 1/4 chance of being taken up. The following code randomly generates the Position ID for the new card for us:
s = int(random.random() * N) if s < K: result[ s ] = item
If the number generated is 4, the new email will be discarded, because Position4 is within the discard pile: If the number generated is 2, the new email will occupy Position2, and email B will be discarded like so: Since the number was generated randomly and each card has an equal probability of being discarded (1/4), this gives a fair and equal chance to all cards of being discarded. The important part to remember is that though the email is discarded, the record of its existence is kept. Here’s what happens when N==8 and random()*N==1: When N==8, the new email has an equal 1/8 chance to take up any of the slots (5 in the discard pile), which means each card in the set has a 1/8 chance to be discarded. When the next card comes (N+=1), the probabilities for each slot being taken up is 1/9. Then 1/10 and so on…
Let’s look a little bit later down the road, when N=24 and random()*N == 8: Twenty-four slots in the probability space, with the new card taking up Position 8 – in the discard pile. Repeat ad nauseum until you are ready to read your collection of emails.
What are some real world applications to this? Website A/B testing comes to mind. If you have thousands of visitors a minute, it’s hard to A/B test because you can get a needlessly large sample very quickly. There’s also a cost of doing A/B testing – as the nature of A/B testing implies that one could be more profitable than the other and you’re leaving money on the table. Furthermore, getting many samples on say, a Tuesday morning, will be different than getting samples on a Saturday evening. Reservoir sampling allows you to get a randomly distributed sample with a fixed size over the course of any time period without risking much opportunity costs and very low computational overhead.
Or, you can just make time to give 3 e-mails 110% of your attention every night.
But this is only half of the problem, the other half will address the question: “don’t the cards from the original set have a worse chance of survivorship because they have to survive N-number of potential deletions?”
Short answer: no. See you next time!