Up and At ‘Em With D3.js – Pt. 3: Let’s get plotting 


Part 3: Plotting the Data

Now, with the foundations out of the way, we can make a scatterplot with the iris data. Let’s get right into the code:

var color = d3.scale.category10()
var charts = d3.select("body")
  .append("svg")
  .attr({
      "height": 300,
      "width": 600 });
d3.csv('iris.csv', function(information){
// console.log(information)
// generating circles
  charts.selectAll("circles")
    .data(information)
    .enter()
    .append("circle")
      .attr("cx", function(d, i) { return i*5 } )
      .attr("cy", function(d) { return d.sepalL*2} )
      .attr("r", 5)
      .style("fill", function(d) { return color(d.species) })
}) // CODE GOES BEFORE THIS BRACKET

Which should generate something like this:

[Image: the resulting plot]

It may not be what you expected, but let’s try to extrapolate what’s going on here, line by line.

charts.selectAll("circles")

This is a little confusing at first. You haven’t drawn any circles yet, so why are you calling selectAll on things that don’t exist? The string you pass to selectAll is a selector that D3 uses to look for matching elements inside charts. Nothing matches “circles” yet, so D3 hands back an empty selection instead of an error, and the next few calls fill it. On this first draw, any selector that matches nothing behaves the same way; you could write “eggs” if you wanted to. The selector does not name the group, though: to edit these circles later, you select them by what they actually are, with selectAll("circle").

You: Which came first? The chicken or the egg?
D3: Never mind that. Here, have an egg.

Next lines:

.data(information)

Binds the array information to that (still empty) selection, one data point per future circle.

.enter()

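Gives you a placeholder for every data point that does not yet have an element on charts, the SVG canvas we drew in part 1. Nothing has been drawn yet, so that means every data point. This might seem a little unnecessary, but it exists because there is also an .exit() method, which covers elements that no longer have a data point.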
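A rough way to picture which data points .enter() hands back is sketched below in plain JavaScript. It is a simplified model of a by-index join, not D3's implementation, and enterData is a hypothetical helper made up for this illustration.

```javascript
// Simplified model of a by-index data join; NOT D3's implementation.
// enterData is a hypothetical helper used only for this illustration.
function enterData(data, existingCount) {
  // Data points beyond the existing elements have nothing to bind to,
  // so they end up in the "enter" group.
  return data.slice(existingCount);
}

var data = ["a", "b", "c", "d"];

console.log(enterData(data, 0).length); // 4: empty canvas, every data point enters
console.log(enterData(data, 3).length); // 1: three circles exist, one data point enters
```

The first call corresponds to the situation in this post: the canvas is empty, so every row of the CSV gets a new circle.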

.append("circle")

This is the command that makes it all happen. I’m telling D3 to start drawing circles using each individual data point from start to finish, in the order that it is given. It can be all manner of shapes, each with its own unique attributes. For circles, you need a minimum of 3 attributes:

  • circle’s x-position (cx);
  • circle’s y-position (cy); and
  • the radius of the circle (r)

The values of these attributes come from any source you tell it to, in this case, a function. Here, I’m asking for the iterator (i), more commonly called the index, for each item in the data array. It is the position at which a value is encountered: the first value has iterator 0, the second has iterator 1, etc.

“For each instance of a circle, draw its center at x-position of i * 5 pixels.”

.attr("cx", function(d, i) { return i*5 } )

Circle-0 has 0 (the iterator) * 5 == 0
Circle-1 has 1 * 5 == 5
Circle-2 has 2 * 5 == 10.
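The same arithmetic can be checked outside the browser. This is a minimal sketch in plain JavaScript, with no D3 involved; cxFor is a hypothetical helper that mirrors the function passed to .attr("cx", ...), and the three rows are stand-ins for the loaded data.

```javascript
// Hypothetical helper mirroring function(d, i) { return i * 5 }.
// D3 passes the data point first and the index second; d is unused here.
function cxFor(d, i, spacing) {
  return i * spacing;
}

// Stand-in rows (only the shape matters for this example).
var rows = [{ sepalL: "5.1" }, { sepalL: "4.9" }, { sepalL: "4.7" }];

var cx5 = rows.map(function (d, i) { return cxFor(d, i, 5); });
var cx20 = rows.map(function (d, i) { return cxFor(d, i, 20); });

console.log(cx5);  // [0, 5, 10]
console.log(cx20); // [0, 20, 40]
```

With a radius of 5, each circle is 10 pixels wide, so centers that are only 5 pixels apart will overlap.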

With that understanding, you can start to see why the circles are so close together:

[Image: dots]

Let’s try i*10:

[Image: the plot with i*10]

i*20:

[Image: the plot with i*20]

.attr("cy", function(d) { return d.sepalL*2} )

Here, instead of using the iterator i, I’m using the data point d. By returning d.sepalL, I’m using the value in column sepalL to determine the circle’s y-position. There isn’t much variance in the y-location of the circles because the data looks like this: [“5.1”, “4.9”, “4.7”, “4.6”, “5.0”, “5.4”… ]

Since we are using real data to plot out positions, the results are often unpredictable as we can’t always be sure of what the data looks like. We’ll get into how to account for this in Pt 4.
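To see how little these values move the circles, here is a minimal sketch in plain JavaScript. It uses only the six values quoted above and needs no D3.

```javascript
// The six sepalL values quoted above, as the strings d3.csv() produces.
var sepalL = ["5.1", "4.9", "4.7", "4.6", "5.0", "5.4"];

// "5.1" * 2 works because * coerces the string to a number.
var cy = sepalL.map(function (v) { return v * 2; });

// Distance between the highest and lowest circle, in pixels.
var spread = Math.max.apply(null, cy) - Math.min.apply(null, cy);

console.log(cy);
console.log(spread.toFixed(1)); // "1.6"
```

On a canvas that is 300 pixels tall, these six circles sit within about 2 pixels of each other vertically, which is why the plot looks like a flat band.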

.attr("r", 5)

Draw circles with radius of 5. And finally, the colors:

.style({'fill': function(d) {return color(d.species)}})

Here, I’ve set ‘fill’ to equal the return of color(d.species). At the top of my code, I used var color=d3.scale.category10(). The function color(d.species) then assigns each different value of d.species (Iris-setosa, Iris-versicolor, Iris-virginica) a different color from the category10() palette, in the order the values are first encountered. In this case Iris-setosa gets assigned the first color of d3.scale.category10(), Iris-versicolor gets the second and Iris-virginica gets the third. Of course, you can always write your own functions here.
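The "first value seen gets the first color" behavior can be sketched in a few lines of plain JavaScript. This is a simplified stand-in for the idea, not D3's implementation; makeOrdinal is a hypothetical name, and the palette holds the first three category10 colors.

```javascript
// Simplified stand-in for an ordinal color scale; NOT D3's implementation.
// makeOrdinal is a hypothetical name used only for this illustration.
function makeOrdinal(palette) {
  var seen = []; // values in order of first appearance
  return function (value) {
    var idx = seen.indexOf(value);
    if (idx === -1) { // new value: give it the next slot
      seen.push(value);
      idx = seen.length - 1;
    }
    return palette[idx % palette.length];
  };
}

// First three colors of the category10 palette.
var color = makeOrdinal(["#1f77b4", "#ff7f0e", "#2ca02c"]);

console.log(color("Iris-setosa"));     // "#1f77b4"
console.log(color("Iris-versicolor")); // "#ff7f0e"
console.log(color("Iris-setosa"));     // "#1f77b4" again
console.log(color("Iris-virginica"));  // "#2ca02c"
```

Because the assignment depends on the order of appearance, sorting the CSV differently would change which species gets which color.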

.style({
  'fill': function(d) { 
    if (d.species == 'Iris-setosa') {
      return 'red'
    }
  }
})

This makes all of the setosa dots red. The function returns nothing for the other species, so those dots keep the default fill (black).

So there you have it. When saved to an HTML file, it should look like this. You can View Source if you want to look at my code.

Today, we covered:

  • D3.js syntax
  • Basic SVG declarations
  • Loading data from an external CSV source
  • Drawing circles and using attributes and functions to style them.

Next time, we’ll go over how to un-bunch the data to derive more meaning.

Still a little confused? Check out this tributary so you can get your hands dirty in a simplified version of the code.

Up and At ‘Em With D3.js – Pt. 2


Part 2: Here Comes the Data

Here’s an adapted version of the infamous iris dataset. Save it to the same directory your HTML file is in.

Note: For security, browsers do not let a page opened directly from your disk (a file:// URL) load other local files, so d3.csv() cannot read iris.csv that way. If you are running Mac OS X, there’s an easy fix:

  • open your terminal
  • navigate to the directory with the HTML files and
  • execute python -m SimpleHTTPServer (on Python 3, the equivalent is python3 -m http.server)

Then, you may point a web browser to http://localhost:8000/filename.html, and your D3’d HTML file should be visible. You can also use a network share or install something like nginx to serve your files.

To the original dataset, I’ve added headers to the CSV file because this is approximately three hundred times easier than learning how to manipulate and add column names programmatically. In a text editor, I added the following to the first line (with no spaces after the commas, because d3.csv() would treat a space as part of the column name):

sepalL,sepalW,petalL,petalW,species

This allows me to call up the data via column name with a function. For example:

.append("circle")
  .attr({
    "r": function(d) { return d.sepalL }
    })

Will allow me to append (add) circles to my SVG canvas with a radius equal to each item in the data column sepalL. We’ll get deeper into these calls in the next chapter. Just hold on to the knowledge that

function(d) { return d.columnName }

will return the value in columnName for each row, one row at a time.

Now, let’s load up the data in D3.

<html>
  <head>
    <script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
  </head>
  <body>
    <script type="text/javascript">
      var color = d3.scale.category10()
      var charts = d3.select("body")
        .append("svg")
        .attr({
          "height": 300,
          "width": 600 });
      d3.csv('iris.csv', function(information){
        console.log(information)
        // PLACEHOLDER
      })
    </script>
  </body> 
</html>

Simple, right? Because I called a console.log with the variable information, you should see an array with 150 items in your console. Each row should be converted into a single item in an array in a format like this:

0: Object
  petalL: "1.4"
  petalW: "0.2"
  sepalL: "5.1"
  sepalW: "3.5"
  species: "Iris-setosa"

D3 is smart enough to manipulate data on its own*, so you can worry more about what to do with the data, rather than parsing it.

* … most of the time. You can see here that the number values have been saved as a string, and not a number
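If the strings become a problem, one common fix is to convert them yourself. The sketch below is plain JavaScript run on a hand-written copy of the row shown above; numericColumns is a name made up for this example.

```javascript
// A hand-written copy of the first row, with values as strings.
var row = {
  petalL: "1.4", petalW: "0.2",
  sepalL: "5.1", sepalW: "3.5",
  species: "Iris-setosa"
};

// Why it matters: + on a string concatenates instead of adding.
console.log(row.sepalL + 1); // "5.11"

// Hypothetical list of the columns that should be numeric.
var numericColumns = ["sepalL", "sepalW", "petalL", "petalW"];

// Unary + converts each string to a number in place.
numericColumns.forEach(function (key) {
  row[key] = +row[key];
});

console.log(typeof row.sepalL);  // "number"
console.log(typeof row.species); // "string"
```

Multiplication (as in d.sepalL*2) happens to coerce strings for you, which is why the plotting code works without this step; addition and comparisons are where unconverted strings tend to bite.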

Line by line:

d3.csv('iris.csv', function(information){
  // d3, i want you to access the file 'iris.csv' and do stuff with it,
  // i will tell you what to do with it later. furthermore,
  // the contents from this command should be referred to as "information"

 console.log(information)
   // pull "information" that comes from iris.csv into my console in case I want a better look
   //// PLACEHOLDER

})
  // this last bit is a very important part to keep track of. 
  // All of your d3 code that uses the data MUST go before the closing }) of d3.csv()
  // otherwise, the code you write will be out of scope
  // (and may run before the file has finished loading),
  // and d3 will not have access to the data!
  // this is only needed when you are calling an external file
  // if you create a variable myDataSet=[1,2,3,4,5,6,7,100]
  // then you do not need d3.csv()

Within the d3.csv() scope,  you can now do things like:

console.log(information[0].sepalL)

This returns the sepalL value of the 0th element (the first row) of information. If you want to look at all the values of sepalL in the console, it gets a little trickier. I resolved it by pushing each value to an array:

var sepalLengths = []
information.forEach(function(row) {
  sepalLengths.push(row.sepalL)
})
console.log(sepalLengths)
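The same column can also be collected with the built-in Array map method, which builds the new array in one step. A minimal sketch, assuming two hand-written stand-in rows in place of the loaded CSV:

```javascript
// Two stand-in rows shaped like the objects d3.csv() produces.
var information = [
  { sepalL: "5.1", sepalW: "3.5", petalL: "1.4", petalW: "0.2", species: "Iris-setosa" },
  { sepalL: "4.9", sepalW: "3.0", petalL: "1.4", petalW: "0.2", species: "Iris-setosa" }
];

// map() returns a new array; unary + turns each string into a number.
var sepalLengths = information.map(function (row) {
  return +row.sepalL;
});

console.log(sepalLengths); // [5.1, 4.9]
```

Inside the real d3.csv() callback, only the map call is needed, since information is already supplied.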

Today we’ve learned:

  • how to spin up a basic HTTP server in Python
  • basic understanding of how to use d3.csv() to load up data
  • viewing the data with a console.log()

Next time, let’s start plotting these numbers!

Up and At ‘Em With D3.js

Preface

For me, one of the most exciting things going on right now in data is D3.js (Data-Driven Documents), a Javascript library written by mountain-man and beard aficionado Mike Bostock that makes it easier to create lightweight, interactive online documents with data (examples). As a data guy with a degree in visual communications, the possibilities for this library seem endless. Yet, as a non-web developer, this meant that I had to learn JavaScript while learning D3, using a mix of books, online tutorials and the D3 documentation. (I also learned what they mean by “hacking” – as I hacked a lot of pieces here and there together to get something working.)

I’ve made some good progress, but I noticed the self-education method typically caters to two clusters of people: 1) hardcase newbies or 2) those already doing JS for a living. These tutorials either use very simple test cases or complex JavaScript jargon.

> var mydata = [1,2,3,4]
// who procedurally develops viz for 4 data points?!

OR

> When the CSV data is available, the specified callback will be invoked with the parsed rows as the argument. If an error occurs, the callback function will instead be invoked with null. An optional accessor function may be specified, which is then passed to d3.csv.parse;

I’m writing this (and further tutorials) to cater to people who are like me: somewhere between newbie and expert, with some knowledge of coding, some HTML knowledge, and an understanding of how datasets generally work. The point is to set up a mental framework upon which you can understand and build your own D3 projects, rather than to give an in-depth analysis of what SVGs do or explain JavaScript fundamentals (go here for that). I want you to get up and running in D3 as quickly as possible, while explaining enough of it so you fundamentally understand the process and can make your own changes as they pertain to your D3 project. I’ve attempted to demystify D3 as much as I can, and I will put lots and lots of comments explaining step by step, in plain English, what each call means. There will be mistakes, and I encourage you to call me out when you see them. Tweet me @adailyventure.

Part 1: Setting Up

Start with this block of code in a new .html file. Some HTML tags are required because D3 draws the visualization using the document object model (DOM). For example, you can tell D3 to draw a chart at the <body> tag or a <div> tag — aka the ‘document objects’. This allows precise placement anywhere in the document – which in turn, allows more creative freedom. Learn more about the DOM here.

<html>
  <head>
    <script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
  </head>
  <body>
    <script type="text/javascript">
    // code here
    </script>
  </body>
</html>

Note: You can use the url http://d3js.org/d3.v3.js instead, but the minified version (d3.v3.min.js has all unnecessary spaces and formatting stripped out) loads faster because it’s a smaller file. Take a look at the un-minified version if you want to try and understand what’s going on behind the scenes and/or are suffering bouts of insomnia.

The way I use D3 is through plotting an SVG in the HTML document and drawing my data points and shapes inside, using the SVG as a canvas.

Where it says //code here, between the <script> and </script> tag, invoke the canvas:

var color = d3.scale.category10()
var charts = d3.select("body")
  .append("svg")
  .attr({
      "height": 300,
      "width": 600 });

Note: Notice how new lines begin with . – this is typical d3 syntax through which you chain instructions, rather than writing new code for every little change. Also, the tabs and line breaks are unnecessary. They are there so the code is easier to read. You could just as easily write:

  // LONG FORM CHAINING 
one = d3.select("body");
two = one.append("svg");
three = two.attr('height', 300);
four = three.attr('width', 600);
  // CHAINING ON ONE LINE
charts = d3.select("body").append("svg").attr("height", 300).attr("width", 600);
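Chaining works because each call returns an object that the next call can act on. The toy below shows the mechanism in plain JavaScript; it is not D3's implementation, and ToySelection is a hypothetical name used only for this illustration.

```javascript
// Toy stand-in for a selection; NOT D3's implementation.
// ToySelection is a hypothetical name used only to illustrate chaining.
function ToySelection(tag) {
  this.tag = tag;
  this.attrs = {};
}

// Returning `this` is what makes .attr(...).attr(...) possible.
ToySelection.prototype.attr = function (name, value) {
  this.attrs[name] = value;
  return this;
};

var svg = new ToySelection("svg").attr("height", 300).attr("width", 600);

console.log(svg.attrs); // { height: 300, width: 600 }
```

In D3 itself, .attr() hands back the same selection it was called on, while .append() hands back a new selection containing the appended element, which is why the order of calls in a chain matters.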

You can combine multiple attributes in dictionary form:

baka = { key: value, key1: value, key2: value };
...
.attr(baka)
  // OR 
.attr({'key': 'value', 'blurb': 4})

D3 is pretty agnostic about which formatting method you prefer. People often say it’s best to pick one formatting method and stick with it because it makes your code easier to troubleshoot, but I go with whatever style has me typing fewer characters.

Now, line by line (let’s skip var color=d3.scale for now):

var charts =
  // here i am creating a new variable called charts, 
  // so I don't have to reference the initial SVG repeatedly
  // next time, instead of typing up
  // d3.select("body").append("svg").attr().append('circle') 
  // to draw a circle, i can now simply do
  // charts.append('circle')
d3.select("body")
  // d3 - i want you to find the "body" tag in the browser DOM 

.append("svg") 
  // and within the "body" tag, place an svg at the end (aka append) 

.attr({ "height": 300, "width": 600 }) 
  // with the attributes height of 300 and width of 600 (pixels) 
  // this 300 and 600 can be replaced with your variables 
  // somenumber = 400/5*78+5
  ... 
  // .attr({'height': somenumber})

Hello World! You have successfully drawn an SVG in-browser. Save as an .html file and load it up in your browser. It should look like this: a blank white page.

It may not seem like much, but for data viz, this is the equivalent of the Egyptians inventing papyrus, the invention of the brush, and the industrial revolution all rolled into one. The next few steps (which we’ll go over in later posts) aren’t that much harder, either – which is what makes D3 so exciting!

Your code should look like this. My comments prefaced with //:

<html>
  <head>
    <script type="text/javascript" src="http://d3js.org/d3.v3.min.js"></script>
  </head>
  <body>
    <script type="text/javascript">
      var color = d3.scale.category10()
      var charts = d3.select("body")
        // initiate a variable 'charts' which is also a call for
        // d3 to select the 'body' tag of the HTML (line 5)
        .append("svg")
          // at the end of the body tag, append an SVG
        .attr({
          "height": 300,
          "width": 600 });
            // with attributes: height: 300, width: 600
            // in this example, attrs are defined in-line as a dictionary
    </script>
  </body> 
</html>

Today we’ve learned how to:

  • set up documents for D3
  • chain method calls
  • select DOM elements using D3, and
  • create an SVG using JavaScript and D3

Next time, we’ll cover how to incorporate data sets into D3.

The Two Envelopes Problem

Here’s a brain teaser, as phrased by Wikipedia (spoilers):

> You have two indistinguishable envelopes that each contain money. One contains twice as much as the other. You may pick one envelope and keep the money it contains. You pick at random, but before you open the envelope, you are offered the chance to take the other envelope instead.

This should remind you of the Monty Hall problem, where you are asked if you would switch after making a decision. So, I ran a simple Monte Carlo simulation to see what would happen:

import random

def run(simulations=1000):
  env1, env2 = 0, 0
  for i in range(simulations):
    envelopes = [random.randint(2,10)*2]
    select = random.randint(0,1)
    if select==1:
      envelopes.append(envelopes[0]/2)
    elif select==0:
      envelopes.append(envelopes[0]*2)
    env1 += envelopes[0]
    env2 += envelopes[1]
  return env1, env2, env2*1.0/env1
# plotting the simulation results
from matplotlib.pyplot import hist  # assumed; needed if pylab is not already loaded
sim1, sim2 = [], []
for i in range(1000):
  a, b, c = run(1000)
  sim1.append(a)
  sim2.append(b)
hist(sim1)
hist(sim2)

In this case, which envelope you choose first doesn’t matter. You aren’t sure which one has more money, and there’s no way to tell them apart, so the odds of picking the envelope with more money are 50/50. Let’s call the one you choose first the *first* envelope, just to have a point of reference.

Because one envelope always has twice the amount of the other with 50/50 odds, I flip a coin. If the coin is heads (1), the second envelope has 1/2 of the amount in envelope one; if it’s tails (0), then the second envelope has 2x the amount. So the second envelope (the one you switch to) has either twice the money or half the money, with equal odds. I run 1000 simulations and keep the tally for the first and second envelope. The end result is the sum of the contents over 1000 draws for the first envelope and for the second envelope.

Here are some results of the simulations:

In [44]: run(10000)
Out[44]: (119290, 150884, 1.2648503646575573)
In [45]: run(10000)
Out[45]: (119894, 148105, 1.2352995145712045)
In [46]: run(10000)
Out[46]: (120490, 152672, 1.2670927047887792)

Let’s run 1000 simulations of the 1000 results and plot each resulting tally.

[Figure: histograms of the simulated totals for the first and second envelopes]

Blue is the first envelope, green is the second. If you did this experiment 1000 times, you would probably have ~$11,000 if you stuck with the first envelope, and ~$13,500 if you decided to switch envelopes. This went completely against my initial thoughts, so I checked the code again:

if select==1:
  envelopes.append(envelopes[0]/2)
elif select==0:
  envelopes.append(envelopes[0]*2)

Ah ha! I thought (mistakenly). I’m multiplying the second one and dividing the first. What if I switched the signs?

if select==1:
  envelopes.append(envelopes[0]*2)
elif select==0:
  envelopes.append(envelopes[0]/2)

[Figure: the same histograms after swapping the multiply and divide branches]

I should have expected that. Because the choice to multiply or divide is random, it didn’t matter, as roughly the same number of envelopes get either doubled or halved regardless of order.

So, here’s what we know so far:

  • Switching seems to increase expected payout
  • The expected payout increases by about 25%
  • There is enough of a difference in the distributions to accept that switching has a significant effect

It helps to think about the expected payouts on a number line. Assuming the first envelope has $10 in it, the situation would look like this:

0---------10---------20------>

If you switch, the envelope will either contain $20 or $5:

0----5----10---------20------>

It’s plain to see that the potential gain on switching is significantly larger than the loss. Gaining 100% is much more significant in absolute terms than losing 50%. Therefore, the expected value of switching (over time, on average) comes to $12.50, or +$2.50 (marked by X), which is the midpoint between the $5 envelope (-50%) and the $20 envelope (+100%).

0----5------{X}------20------>
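The arithmetic behind that X can be written out as a short calculation. This is a minimal sketch in JavaScript using the $10 example above; the variable names are made up for the illustration.

```javascript
// Amounts from the example: you hold $10; the other envelope is $5 or $20.
var held = 10;
var outcomes = [
  { amount: held / 2, probability: 0.5 },
  { amount: held * 2, probability: 0.5 }
];

// Expected value = sum of (amount * probability) over the outcomes.
var expected = outcomes.reduce(function (sum, o) {
  return sum + o.amount * o.probability;
}, 0);

console.log(expected);        // 12.5
console.log(expected - held); // 2.5
console.log(expected / held); // 1.25
```

The ratio of 1.25 lines up with the third number returned by run() in the simulation output earlier.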

It pays to switch. But here’s the paradox: if the act of switching increases your expected payout by 25%, then couldn’t it make sense to constantly switch, therefore perpetually increasing the payout? 

No! When you switch a second time, there are two possible cases:

Case 1, Switching back from the larger denomination: If you switch to the higher payout, and you switch back, you go from $10 to $20 to $10. Since you (unknowingly) had $10 to begin with, the net payout of switching and then switching back is always 0.

Case 2, Switching back from the smaller denomination: If you switch to the smaller payout, and you switch back, you go from $10 to $5 to $10. Since you (unknowingly) had $10 to begin with, the net payout of switching and then switching back is always 0.

Therefore, the act of switching again has no additional effect on the payout. Which implies that it is not the act of switching that increases value. Full stop.

There is certainly an opportunity to increase your payout, but the probability of switching to it is about 50/50, the same odds as picking it in the first place. It is completely random, and therefore there is no foolproof method to increase your payout. The validity of the choice can only be demonstrated after the envelope is opened. While the expected value may increase, the actual contents of the envelopes do not change.
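One way to test this conclusion is to model the setup differently: fix both amounts first (x and 2x), then pick one at random. The sketch below does that in JavaScript. It is an alternative model offered as an assumption, not a port of the Python code above, and simulate and randomInt are hypothetical names.

```javascript
// Sketch: both amounts are fixed first (x and 2x), then one is picked at random.
// simulate and randomInt are hypothetical names for this illustration.
function randomInt(lo, hi) {
  return lo + Math.floor(Math.random() * (hi - lo + 1));
}

function simulate(trials) {
  var keepTotal = 0;
  var switchTotal = 0;
  for (var t = 0; t < trials; t++) {
    var small = randomInt(2, 10) * 2; // same range as the Python code
    var pair = [small, small * 2];    // the two sealed envelopes
    var pick = randomInt(0, 1);       // your random first choice
    keepTotal += pair[pick];          // payout if you stay
    switchTotal += pair[1 - pick];    // payout if you switch
  }
  return { keep: keepTotal, switched: switchTotal, ratio: switchTotal / keepTotal };
}

var result = simulate(100000);
console.log(result.keep, result.switched, result.ratio.toFixed(2));
```

Under this setup the ratio should come out close to 1, which is consistent with the point that switching by itself adds nothing. The earlier simulation behaves differently because it fixes the amount in the first envelope and then doubles or halves it to create the second.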

So back to the philosophical debate of choice. How do you know which is the right one to pick? Flip a coin. You’ll be 100% right 50% of the time.

Half the money I spend on advertising is wasted; the trouble is I don't know which half 
-- John Wanamaker

I’ve often been told that the best odds in Vegas are at the craps table, with only about a 1% edge going to the house. Yet, I don’t know of any player that went to Vegas with $5000 and came home with $4950. They’re either up by much more, or down by much more. Ironically, expected value should rarely be expected. The healthiest, and most sensible, thing to do is to think about it as my friend JVS did: “Just give me one of the envelopes and I’ll be happy. Either way, it’s more money than I have now.” And he’s right. The actual payout in relation to what you had before will always be +X, because no matter what envelope you choose, you will always have X more dollars than you had before. Trying to optimize for something that is completely random will just drive you insane…