Up and At ‘Em With D3.js

Part 2: Here Comes the Data

Here’s an adapted version of the infamous iris dataset. Save it to the same directory your HTML file is in.

Note: For security reasons, most browsers won’t let D3 load local files straight from disk. If you are running Mac OS X, there’s an easy fix:

  • open your terminal
  • navigate to the directory with the HTML files and
  • execute python -m SimpleHTTPServer (on Python 3, the equivalent is python3 -m http.server)

Then, you may point a web browser to http://localhost:8000/filename.html, and your D3’d HTML file should be visible. You can also use a network share or install something like nginx to serve your files.
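If you’d rather start the server from a script than from the command line, here is a minimal sketch. It assumes Python 3.7 or later, and make_server is a hypothetical helper name of my own, not something from D3 or the Python docs. It only builds the server; the commented line at the bottom is what actually starts serving.

```python
# Minimal sketch of a local file server (assumes Python 3.7+).
# make_server is a hypothetical helper name used only for illustration.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_server(port=8000, directory="."):
    """Build (but do not start) a server for the files in `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # Bind to localhost only, so the files are not exposed to the network.
    return HTTPServer(("127.0.0.1", port), handler)


# To serve until you press Ctrl+C, run:
# make_server(8000).serve_forever()
```

With the server running in the directory that holds your HTML file and iris.csv, http://localhost:8000/filename.html should load as described above.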

To the original dataset, I’ve added headers to the CSV file because this is approximately three hundred times easier than learning how to manipulate and add column names programmatically. In a text editor, I added the following as the first line (no spaces after the commas, or the spaces become part of the column names):

sepalL,sepalW,petalL,petalW,species

This allows me to call up the data via column name with a function. For example:

.append("circle")
  .attr({
    "r": function(d) { return d.sepalL }
    })

This appends (adds) circles to my SVG canvas, each with a radius equal to that row’s value in the data column sepalL. We’ll get deeper into these calls in the next chapter. Just hold on to the knowledge that

function(whatever) { return whatever.columnName }

will return the columnName value for each row of the data.

Now, let’s load up the data in D3.

<html>
  <head>
    <script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
  </head>
  <body>
    <script type="text/javascript">
      var color = d3.scale.category10()
      var charts = d3.select("body")
        .append("svg")
        .attr({
          "height": 300,
          "width": 600 });
      d3.csv('iris.csv', function(information){
        console.log(information)
        // PLACEHOLDER
      })
    </script>
  </body> 
</html>

Simple, right? Because I called a console.log with the variable information, you should see an array with 150 items in your console. Each row should be converted into a single item in an array in a format like this:

0: Object
  petalL: "1.4"
  petalW: "0.2"
  sepalL: "5.1"
  sepalW: "3.5"
  species: "Iris-setosa"

D3 is smart enough to manipulate data on its own*, so you can worry more about what to do with the data, rather than parsing it.

* … most of the time. You can see here that the numeric values have been read in as strings, not numbers, so convert them (for example with +d.sepalL) before doing any math with them.

Line by line:

d3.csv('iris.csv', function(information){
  // d3, i want you to access the file 'iris.csv' and do stuff with it,
  // i will tell you what to do with it later. furthermore,
  // the contents from this command should be referred to as "information"

 console.log(information)
   // pull "information" that comes from iris.csv into my console in case I want a better look
   //// PLACEHOLDER

})
  // this last bit is a very important part to keep track of.
  // All of your d3 code that uses the data MUST sit inside the d3.csv() callback,
  // before the closing }). d3.csv() loads the file asynchronously, so code
  // written outside the callback will not have access to the data!

Within the d3.csv() scope, you can now do things like:

console.log(information[0].sepalL)

This returns the sepalL value of the first (0th) element of information. If you want to look at all the values of sepalL in the console, it gets a little trickier. I resolved it by pushing each value to an array:

var sepalLengths = []
  // "row" is one item of information; a separate name avoids shadowing the array
information.forEach(function(row) {
  sepalLengths.push(row.sepalL)
})

Today we’ve learned:

  • how to spin up a basic HTTP server in Python
  • how to use d3.csv() to load up data
  • how to view the data with console.log()

Up and At ‘Em With D3.js

Preface

For me, one of the most exciting things going on right now in data is D3.js (Data-Driven Documents), a JavaScript library written by mountain-man and beard aficionado Mike Bostock that makes it easier to create lightweight, interactive online documents with data (examples). As a data guy with a degree in visual communications, I find the possibilities for this library endless. Yet, as a non-web developer, I had to learn JavaScript while learning D3, using a mix of books, online tutorials and the D3 documentation. (I also learned what they mean by “hacking” – as I hacked a lot of pieces here and there together to get something working.)

I’ve made some good progress, but I noticed the self-education method typically caters to two clusters of people: 1) hardcase newbies or 2) those already doing JS for a living. These tutorials either use very simple test cases or complex JavaScript jargon.

> var mydata = [1,2,3,4]
// who procedurally develops viz for 4 data points?!

OR

> When the CSV data is available, the specified callback will be invoked with the parsed rows as the argument. If an error occurs, the callback function will instead be invoked with null. An optional accessor function may be specified, which is then passed to d3.csv.parse;

I’m writing this (and further tutorials) to cater to people who are like me: somewhere between newbie and expert, with some knowledge of coding, some HTML knowledge, and an understanding of how datasets generally work. The point is to set up a mental framework upon which you can understand and build your own D3 projects, rather than to provide an in-depth analysis of what SVGs do or explain JavaScript fundamentals (go here for that). I want you to get up and running in D3 as quickly as possible, while explaining enough of it so you fundamentally understand the process and can make your own changes as they pertain to your D3 project. I’ve attempted to demystify D3 as much as I can, and I will put lots and lots of comments explaining step by step, in plain English, what each call means. There will be mistakes, and I encourage you to call me out when you see them. Tweet me @adailyventure.

Part 1: Setting Up

Start with this block of code in a new .html file. Some HTML tags are required because D3 operates using objects in the document (learn more about the Document Object Model/DOM here). You attach your D3 canvas and document to parts of the HTML, which allows you to place D3 objects where you want them. This file sets up a basic HTML template and lets the browser know that we are using d3.v3.min.js as a library.

<html>
  <head>
    <script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
  </head>
  <body>
    <script type="text/javascript">
    // code here
    </script>
  </body>
</html>

Note: You can use the URL http://d3js.org/d3.v3.js instead, but the minified version (d3.v3.min.js, which has all unnecessary spaces and formatting stripped out) loads faster because it’s a smaller file. Take a look at the un-minified version if you want to try to understand what’s going on behind the scenes and/or are suffering bouts of insomnia.

The way I use D3 is through plotting an SVG in the HTML document and drawing my data points and shapes inside, using the SVG as a canvas.

Where it says // code here, between the <script> and </script> tags, invoke the canvas:

var color = d3.scale.category10()
var charts = d3.select("body")
  .append("svg")
  .attr({
      "height": 300,
      "width": 600 });

Note: Notice how new lines begin with . – this is typical d3 syntax through which you chain instructions, rather than writing new code for every little change. Also, the tabs and carriage returns are unnecessary. They are there so the code is easier to read. You could just as easily write:

  // LONG FORM CHAINING 
var one = d3.select("body");
var two = one.append("svg");
var three = two.attr('height', 300);
var four = three.attr('width', 600);
  // CHAINING ON ONE LINE
var charts = d3.select("body").append("svg").attr("height", 300).attr("width", 600);

You can combine multiple attributes in dictionary form:

baka = { key: value, key1: value, key2: value };
...
.attr(baka)
  // OR 
.attr({'key': 'value', 'blurb': 4})

D3 is pretty agnostic on which formatting method you prefer. People often say it’s best to pick one formatting method and stick with it because it makes your code easier to troubleshoot, but I go with whatever style has me typing fewer characters.

Now, line by line (let’s skip var color=d3.scale for now):

var charts =
  // here i am creating a new variable called charts, 
  // so I don't have to reference the initial SVG repeatedly
  // next time, instead of typing up
  // d3.select("body").append("svg").attr().append('circle') 
  // to draw a circle, i can now simply do
  // charts.append('circle')
d3.select("body")
  // d3 - i want you to find the "body" tag in the browser DOM 

.append("svg") 
  // and within the "body" tag, place an svg at the end (aka append) 

.attr({ "height": 300, "width": 600 }) 
  // with the attributes height of 300 and width of 600 (pixels) 
  // this 300 and 600 can be replaced with your variables 
  // somenumber = 400/5*78+5
  ... 
  // .attr({'height': somenumber})

Hello World! You have successfully drawn an SVG in-browser. Save as an .html file and load it up in your browser. It should look like a blank white page.

It may not seem like much, but for data viz, this is the equivalent of the Egyptians inventing papyrus, the invention of the brush, and the industrial revolution all rolled into one. The next few steps (which we’ll go over in later posts) aren’t that much harder, either – which is what makes D3 so exciting!

Your code should look like this. My comments are prefaced with //:

<html>
  <head>
    <script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
  </head>
  <body>
    <script type="text/javascript">
      var color = d3.scale.category10()
      var charts = d3.select("body")
        // initiate a variable 'charts' which is also a call for
        // d3 to select the 'body' tag of the HTML (line 5)
        .append("svg")
          // at the end of the body tag, append an SVG
        .attr({
          "height": 300,
          "width": 600 });
            // with attributes: height: 300, width: 600
            // in this example, attrs are defined in-line as a dictionary
    </script>
  </body> 
</html>

Today we’ve learned how to:

  • set up documents for D3
  • chain instructions
  • select DOM elements using D3, and
  • create an SVG using JavaScript and D3

Next time, we’ll cover how to incorporate data sets into D3.

The Two Envelopes Problem

Here’s a brain teaser, as phrased by Wikipedia (spoilers):

> You have two indistinguishable envelopes that each contain money. One contains twice as much as the other. You may pick one envelope and keep the money it contains. You pick at random, but before you open the envelope, you are offered the chance to take the other envelope instead.

This should remind you of the Monty Hall problem, where you are asked if you would switch after making a decision. So, I ran a simple Monte Carlo simulation to see what would happen:

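import random
# assumes matplotlib is installed; in a pylab session hist is already available
from matplotlib.pyplot import hist

def run(simulations=1000):
  # running totals for the first and second envelope
  env1 = 0
  env2 = 0
  for i in range(simulations):
    # first envelope: an even amount between 4 and 20
    envelopes = [random.randint(2,10)*2]
    # coin flip: 1 halves the second envelope, 0 doubles it
    select = random.randint(0,1)
    if select==1:
      envelopes.append(envelopes[0]/2)
    elif select==0:
      envelopes.append(envelopes[0]*2)
    env1 += envelopes[0]
    env2 += envelopes[1]
  return env1, env2, env2*1.0/env1

# plotting the simulation results
sim1, sim2 = [], []
for i in range(1000):
  a, b, c = run(1000)
  sim1.append(a)
  sim2.append(b)
hist(sim1)
hist(sim2)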

In this case, which envelope you choose first doesn’t matter. You aren’t sure which one has more money, and there’s no way to tell them apart, so the odds of you picking the envelope with more money are 50/50. Let’s call that the *first* envelope, meaning the one you choose first, just to have a point of reference.

Because one envelope always has twice the amount of the other with 50/50 odds, I flip a coin. If the coin is heads (1), the second envelope has 1/2 of the amount in envelope one; if it’s tails (0), then the second envelope has 2x the amount. In other words, the second envelope (the one you switch to) has either twice the money or half the money. I run 1000 simulations and keep a tally for the first and second envelope. The end result is the summed contents of the first envelope and of the second envelope across those 1000 runs.

Here are some results of the simulations:

In [44]: run(10000)
Out[44]: (119290, 150884, 1.2648503646575573)
In [45]: run(10000)
Out[45]: (119894, 148105, 1.2352995145712045)
In [46]: run(10000)
Out[46]: (120490, 152672, 1.2670927047887792)

Let’s run 1000 simulations of the 1000 results and plot each resulting tally.

[Figure: histograms of the simulated tallies for the first and second envelope]

Blue is the first envelope, green is the second. If you did this experiment 1000 times, you would probably have ~$11,000 if you stuck with the first envelope, and over $13,500 if you decided to switch envelopes. This went completely against my initial thoughts, so I checked the code again:

if select==1:
  envelopes.append(envelopes[0]/2)
elif select==0:
  envelopes.append(envelopes[0]*2)

Ah ha! I thought (mistakenly). I’m multiplying the second one and dividing the first. What if I switched the signs?

if select==1:
  envelopes.append(envelopes[0]*2)
elif select==0:
  envelopes.append(envelopes[0]/2)

[Figure: histograms of the simulated tallies after swapping the two operations]

I should have expected that. Because the choice to multiply or divide is random, it didn’t matter, as roughly the same number of envelopes get either doubled or halved regardless of order.

So, here’s what we know so far:

  • Switching seems to increase expected payout
  • The expected payout increases by about 25%
  • There is enough of a difference in the distributions to accept that switching has a significant effect

It helps to think about the expected payouts on a number line. Assuming the first envelope has $10 in it, the situation would look like this:

0---------10---------20------>

If you switch, the envelope will either contain $20 or $5:

0----5----10---------20------>

It’s plain to see that the potential gain from switching is significantly larger than the loss. Gaining 100% is much more significant in absolute terms than losing 50%. Therefore, the expected value of switching – over time, on average – comes to $12.50, or +$2.50 (marked by X), which is the midpoint between -50% (the $5 envelope) and +100% (the $20 envelope): 0.5 × $5 + 0.5 × $20 = $12.50.

0----5------{X}------20------>
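The arithmetic behind that X can be sketched in a few lines of Python. The function name expected_after_switch is my own for illustration, and it bakes in the same assumption as the simulation: the other envelope is equally likely to hold half or double the first one.

```python
def expected_after_switch(first_amount):
    """Expected contents of the other envelope, assuming a 50/50 chance
    that it holds half or double the amount in the first envelope."""
    half = first_amount / 2.0
    double = first_amount * 2.0
    return 0.5 * half + 0.5 * double


first = 10
expected = expected_after_switch(first)
print(expected)                    # 12.5
print((expected - first) / first)  # 0.25, i.e. the +25% seen in the simulation
```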

It pays to switch. But here’s the paradox: if the act of switching increases your expected payout by 25%, then wouldn’t it make sense to switch constantly, perpetually increasing the payout?

No! When you switch a second time, there are two possible cases:

Case 1, switching back to the smaller denomination: If you switched to the higher payout and then switch back, you go from $10 to $20 to $10. Since you (unknowingly) had $10 to begin with, the net payout of switching and switching back is always 0.

Case 2, switching back to the larger denomination: If you switched to the smaller payout and then switch back, you go from $10 to $5 to $10. Since you (unknowingly) had $10 to begin with, the net payout of switching and switching back is always 0.

Therefore, the act of switching again has no additional effect on the payout. Which implies that it is not the act of switching that increases value. Full stop.

There is certainly an opportunity to increase your payout, but the probability of switching to it is about 50/50 – the same odds as picking it in the first place. It is completely random, and therefore there is no foolproof method to increase your payout. The validity of the choice can only be demonstrated after the envelope is opened. While the expected value may increase, the actual contents of the envelopes do not change.
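One way to see this in code is to change the setup of the simulation. The sketch below is an assumption of mine, not the simulation from earlier: both amounts are fixed before you choose, and you pick one of the two at random. The name run_fixed_pair and the seed parameter are hypothetical.

```python
import random


def run_fixed_pair(simulations=100000, seed=0):
    """Tally 'keep' vs 'switch' when both amounts are fixed up front."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    keep_total, switch_total = 0, 0
    for _ in range(simulations):
        small = rng.randint(2, 10)     # smaller amount
        pair = [small, small * 2]      # the other envelope holds double
        pick = rng.randint(0, 1)       # your random first choice
        keep_total += pair[pick]
        switch_total += pair[1 - pick]
    return keep_total, switch_total, float(switch_total) / keep_total


keep, switch, ratio = run_fixed_pair()
print(keep, switch, ratio)
```

Under this setup the ratio should hover around 1.0 rather than 1.25, which fits the point that the contents of the envelopes don’t change when you switch.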

So back to the philosophical debate of choice. How do you know which is the right one to pick? Flip a coin. You’ll be 100% right 50% of the time.

Half the money I spend on advertising is wasted; the trouble is I don't know which half 
-- John Wanamaker

I’ve often been told that the best odds in Vegas are at the craps table, with only about a 1% edge going to the house. Yet, I don’t know of any player that went to Vegas with $5000 and came home with $4950. They’re either up much more, or down much more. Ironically, expected value should rarely be expected. The healthiest, and most sensible, thing to do is to think about it as my friend JVS did: “Just give me one of the envelopes and I’ll be happy. Either way, it’s more money than I have now.” And he’s right. The actual payout in relation to what you had before will always be +X, because no matter what envelope you choose, you will always have +X more dollars than you had before. Trying to optimize for something that is completely random will just drive you insane…

Randomly Picking 1-7 from a 6-Sided Die

I saw a pretty perplexing question having to do with probability the other day:

> How do you randomly generate a number from 1:7 with a six-sided die?

It’s a pretty devious question. Each face number 1:6 has exactly 1/6 probability to show up. How do you convert this to a 1/7 probability? I thought about this for a while. I wrestled with common factors, common multiples and really long combinatorial examples drawn by hand. I even googled it. The answer I found proved to be lengthy:

There are a bunch of options.  Here's an easy one:  roll the die
twice, keeping track of the first and second roll.  There are 36 outcomes:

  (1,1), (1,2), (1,3), ..., (6,6)

(The first number is what you got on the first roll; the second is
what you got on the second.)

If you get a (6,6), just re-roll the die twice again until you get a
non-(6,6).

Now there are 35 equally-likely outcomes, so divide them into 7 groups
of 5 corresponding to the 7 choices among which you want to choose.

Here's an easy calculation that will do it.  Suppose you roll a (x, y)
with x = number on first roll and y = number on second roll.  First 
calculate the following number:

  N = 6(x-1) + y.
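The quoted answer stops at N here. As a rough sketch of how the roll-twice idea could be finished in Python: the last step, mapping N onto 1-7 with a modulo, is my own assumption and not part of the quote, and roll_1_to_7 is a hypothetical name.

```python
import random


def roll_1_to_7(rng=random):
    """Roll two dice, re-roll on (6,6), and map the result onto 1-7."""
    while True:
        x = rng.randint(1, 6)  # first roll
        y = rng.randint(1, 6)  # second roll
        if (x, y) == (6, 6):
            continue           # discard (6,6) and roll both dice again
        n = 6 * (x - 1) + y    # 1..35, each value equally likely
        # Assumption: split 1..35 into 7 groups of 5 using the remainder.
        return (n - 1) % 7 + 1


print([roll_1_to_7() for _ in range(10)])
```

Since 35 = 7 × 5, every remainder collects exactly five of the 35 outcomes, so each number from 1 to 7 should come up equally often.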

I hate complicated solutions. Here’s another method (in Python, of course):

import random
# import this to use the random feature

# seven empty lists, one per candidate number
dicerolls = [ [] for i in range(7) ]
for i in range(len(dicerolls)):
  # int(random.random()*6) gives 0-5, i.e. the die faces shifted down by one
  dicerolls[i] = [ int(random.random()*6) for number in range(500) ]
# print out winners and identify my random number from 1:7
winner = [ sum(i) for i in dicerolls ]
print(winner)
print("%d - random number: %d" % (max(winner), winner.index(max(winner))+1))

How it works: Roll N*7 times. Each result gets put into one of 7 lists, like so (when N = 13):

1: [1,3,3,1,4,1,4,3,4,5,5,6,6]
2: [5,3,4,4,2,3,4,2,3,3,2,4,3]
3: [5,2,4,6,2,3,4,2,3,2,4,5,4]

7: [2,4,2,3,4,5,1,6,2,4,6,2,4]

Because each individual roll is independent and random, it doesn’t matter how many times you roll, as long as each list is given the same number of rolls. What’s important is that there are enough rolls to make ties unlikely (the code above uses 500 per list). After N rolls per list, you sum the contents of each list, and the position of the list with the largest sum is your randomly generated number:

Run with N=500, here are the sums of each list:

[1236, 1207, 1246, 1233, 1273, 1248, 1298]
1298 - random number: 7

In this case, since the 7th list has the highest sum, your randomly generated number between 1:7 is 7. Easy!
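To get a feel for whether this really behaves like a fair 1-in-7 pick, here is a small sketch that repeats the method many times and tallies the winners. The names sum_method and tally are mine, it uses random.randint(1, 6) for the die faces, and it uses only 50 rolls per list (an arbitrary choice to keep it quick).

```python
import random


def sum_method(rolls_per_list=50, rng=random):
    """One draw: sum 7 lists of die rolls and return the winning position."""
    sums = [sum(rng.randint(1, 6) for _ in range(rolls_per_list))
            for _ in range(7)]
    # index() returns the first match, so a tie goes to the earliest list.
    return sums.index(max(sums)) + 1


def tally(draws=2000, seed=0):
    """Count how often each number from 1 to 7 wins over many draws."""
    rng = random.Random(seed)
    counts = [0] * 7
    for _ in range(draws):
        counts[sum_method(rng=rng) - 1] += 1
    return counts


print(tally())
```

Each count should land somewhere near draws/7. Because ties go to the earliest list, fewer rolls per list means a slight tilt toward the lower numbers, which is one more reason to roll plenty of times.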