Sankey Diagrams

Data Viz
Sankey
Author

Tim Anderson

Published

February 16, 2021

Visualizing Flow Through a System or Channel

A Sankey diagram is a type of flow diagram that visualizes the movement or flow of resources, such as energy, materials, money, or data, between different stages or categories. The distinguishing feature of a Sankey diagram is the use of arrows or paths, where the width of each arrow is proportional to the quantity of flow it represents. This makes it an effective tool for showing proportions, pathways, and efficiencies.

In my experience, this type of visualization has been very useful for showing the various routes to market in a multi-tier sales strategy.


Key Features of Sankey Diagrams

  1. Flow Representation:

    • Sankey diagrams show flows (e.g., energy, materials, or costs) between source and destination nodes.

    • Each flow’s width represents its magnitude, making it easy to see relative quantities.

  2. Nodes and Links:

    • Nodes: Represent categories, stages, or entities (e.g., energy sources, budget categories).

    • Links: Arrows connecting nodes represent the flow between them, with width proportional to the amount.

  3. Directionality:

    • Flows often move left-to-right or top-to-bottom, indicating direction or progression.

  4. Color-Coding:

    • Different colors can represent distinct categories or flows, aiding interpretation.

Setting up in R

The ‘networkD3’ package does a great job of building these diagrams…but the challenge, in my experience, is formatting the required data frame properly.

The supporting library needs three inputs: source, target, and value. The key is to align the three vectors so that the same index in source, target, and value always describes the same flow.

The diagram below depicts a hypothetical business division that sells its product to seven hypothetical customer segments. To reach these customers, it has a commercial group and a consumer group. The commercial group sells through three distributors and three resellers.

The consumer group sells through four retailers.

The division also has a relatively small website direct business that sells to end users as well.

Again, the key is in setting up the source, target, and value vectors.

Each link in the flow diagram needs three data points: the node it’s coming from, the node it’s going to, and the value that’s moving between them.

In the example below:

source_vector[1] = “Total Sales”

target_vector[1] = “Commercial Route”

value_vector[1] = 60

Those three values define the upper left flow. Each subsequent flow is defined in the vectors as well…again the trick is making sure everything is in alignment if you’re building by hand.
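One way to sidestep the alignment problem when typing flows in by hand is to write each flow as a single row and let R split the rows into columns. This is only a sketch: the `flow()` helper and the `_demo` names are hypothetical (they are not part of networkD3), and the three rows are simply the first three flows of the example.

```r
# Hypothetical helper: one call per flow keeps source, target, and value together
flow <- function(source, target, value) {
  data.frame(source = source, target = target, value = value,
             stringsAsFactors = FALSE)
}

# First three flows of the example, written one row at a time
links_by_row <- rbind(
  flow("Total Sales", "Commercial Route", 60),
  flow("Total Sales", "Consumer Route",   35),
  flow("Total Sales", "Direct",            5)
)

# The aligned vectors fall out of the columns
source_vector_demo <- links_by_row$source
target_vector_demo <- links_by_row$target
value_vector_demo  <- links_by_row$value

print(links_by_row)
```

Because each flow lives on one line, adding or deleting a flow cannot shift one vector out of step with the other two.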

# Libraries
suppressMessages(library(networkD3))
suppressMessages(library(dplyr))

# Normally these vectors would be built in code; they are written out by hand here to illustrate the format

source_vector <- c("Total Sales", "Total Sales", "Total Sales",                             # From Top of the funnel
                   "Commercial Route", "Commercial Route", "Commercial Route",              # From Commercial Route Start
                   "Consumer Route", "Consumer Route", "Consumer Route", "Consumer Route",  # From Consumer Route Start
                   
                   "Distributor A","Distributor B", "Distributor C", "Distributor C",       # From Distributors
                   
                   "Reseller X", "Reseller Y", "Reseller Z", "Reseller Y",                  # From Resellers
                   "Reseller X", "Reseller Y", "Reseller Z", "Reseller Y",
                   
                   "Retailer A", "Retailer B", "Retailer C", "Retailer D",                  # From Retailers
                   "Retailer A", "Retailer B", "Retailer C", "Retailer D",
                   
                   "Direct", "Direct")                                                      # From Direct



target_vector <- c("Commercial Route", "Consumer Route", "Direct",           # To the top-level routes
                   "Distributor A","Distributor B", "Distributor C",         # To Commercial Distributors
                   "Retailer A", "Retailer B", "Retailer C", "Retailer D",   # To Consumer Retailers
                   
                   "Reseller X", "Reseller Y", "Reseller Z", "Reseller Y",    # To Commercial Resellers
                   
                   # To Commercial customer segments (from Resellers)
                   "Comm'l Cust Seg 1", "Comm'l Cust Seg 2", "Comm'l Cust Seg 3", "Comm'l Cust Seg 2",
                   "Comm'l Cust Seg 3", "Comm'l Cust Seg 1", "Comm'l Cust Seg 1", "Comm'l Cust Seg 3",
                   
                   # To Consumer customer segments (from Retailers)
                   "Consumer Cust Segment 1", "Consumer Cust Segment 2", "Consumer Cust Segment 3", "Consumer Cust Segment 4",
                   "Consumer Cust Segment 2", "Consumer Cust Segment 3", "Consumer Cust Segment 4", "Consumer Cust Segment 1",
                   
                   # To Consumer customer segments (from Direct)
                   "Consumer Cust Segment 3", "Consumer Cust Segment 4")


value_vector <- c( 60, 35, 5,
                   30, 20, 10, 
                   15, 10, 5, 5,
                   
                   30, 20, 5, 5,
                   
                   15, 10, 2, 3,
                   15, 10, 3, 2,
                   
                   7, 5, 2, 3,
                   8, 5, 3, 2, 
                   
                   3, 2)


# A connection data frame is a list of flows with intensity for each flow
links <- data.frame(
  source=source_vector, 
  target=target_vector, 
  value=value_vector
  )
 
# From these flows, build a node data frame that lists every entity involved in the flow
nodes <- data.frame(
  name = c(as.character(links$source),
           as.character(links$target)) %>% unique()
)
 
# networkD3 expects links to reference nodes by zero-based index, not by name as in the links data frame.
# match() returns 1-based positions, hence the -1.
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1
 
# Make the Network
p <- sankeyNetwork(Links = links, Nodes = nodes,
              Source = "IDsource", Target = "IDtarget",
              Value = "value", NodeID = "name", 
              sinksRight=FALSE,
              fontSize = 12)
p
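Color-coding by tier can be layered onto the same kind of call. The sketch below is a stand-alone example on a trimmed-down set of flows: it assumes a hypothetical `group` column (the tier labels are illustrative, not something networkD3 defines) and passes it to the `NodeGroup` argument of `sankeyNetwork()`. The labels avoid spaces to keep the group names simple.

```r
# Small stand-alone subset of the flows above
links_demo <- data.frame(
  source = c("Total Sales", "Total Sales", "Commercial Route", "Consumer Route"),
  target = c("Commercial Route", "Consumer Route", "Distributor A", "Retailer A"),
  value  = c(60, 35, 30, 15),
  stringsAsFactors = FALSE
)

nodes_demo <- data.frame(
  name = unique(c(links_demo$source, links_demo$target)),
  stringsAsFactors = FALSE
)

# Hypothetical tier labels: nodes that share a label share a color
nodes_demo$group <- ifelse(
  grepl("^Distributor|^Retailer", nodes_demo$name), "channel_partner",
  ifelse(grepl("Route$", nodes_demo$name), "route", "total")
)

# Zero-based node indices, as networkD3 expects
links_demo$IDsource <- match(links_demo$source, nodes_demo$name) - 1
links_demo$IDtarget <- match(links_demo$target, nodes_demo$name) - 1

# Build the widget only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  p_grouped <- networkD3::sankeyNetwork(
    Links = links_demo, Nodes = nodes_demo,
    Source = "IDsource", Target = "IDtarget",
    Value = "value", NodeID = "name",
    NodeGroup = "group",          # color nodes by tier instead of by name
    sinksRight = FALSE, fontSize = 12
  )
  # Print p_grouped (or let the document renderer do it) to see the diagram
}
```

With the default color scale, each distinct value of `group` gets its own color, so both routes share one color and both channel partners share another.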

Advantages of Sankey Diagrams

  1. Proportional Clarity:
    The width of flows conveys proportional quantities, making comparisons intuitive.

  2. Complex Data Simplified:
    Sankey diagrams simplify the visualization of multi-stage processes or systems with multiple flows.

  3. Identifies Key Pathways:
    Helps focus attention on significant flows or bottlenecks, such as energy losses or high expenses.

  4. Visually Engaging:
    The combination of proportional flows and color coding makes Sankey diagrams appealing and easy to interpret.


Limitations of Sankey Diagrams

  1. Data Complexity:
    If the data has too many nodes or flows, the diagram can become cluttered and hard to read.

  2. Lack of Time Representation:
    A Sankey diagram shows a single snapshot of flows; it does not show how they change over time.

  3. Requires Accurate Data:
    Proportional flows rely on precise quantities; inaccuracies can mislead viewers.

  4. Limited Categories:
    Works best with a manageable number of categories. Overuse of nodes and links can overwhelm the audience.
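Because the picture is only as trustworthy as the numbers behind it, a quick sanity check before plotting can help. The sketch below assumes that every intermediate node should pass along exactly what it receives (true of the sales example above, but not of every system, such as one with losses). `check_flow_balance()` is a hypothetical helper, not a networkD3 function.

```r
# Hypothetical helper: compare what flows into and out of each intermediate node
check_flow_balance <- function(links, tolerance = 1e-9) {
  outflow <- tapply(links$value, links$source, sum)    # total leaving each source
  inflow  <- tapply(links$value, links$target, sum)    # total arriving at each target
  middle  <- intersect(names(outflow), names(inflow))  # nodes that both receive and send
  data.frame(
    node     = middle,
    inflow   = as.numeric(inflow[middle]),
    outflow  = as.numeric(outflow[middle]),
    balanced = abs(as.numeric(inflow[middle]) - as.numeric(outflow[middle])) < tolerance,
    stringsAsFactors = FALSE
  )
}

# A few flows taken from the example: "Direct" both receives and sends
balance_demo <- data.frame(
  source = c("Total Sales", "Total Sales", "Direct", "Direct"),
  target = c("Commercial Route", "Direct",
             "Consumer Cust Segment 3", "Consumer Cust Segment 4"),
  value  = c(60, 5, 3, 2),
  stringsAsFactors = FALSE
)

balance <- check_flow_balance(balance_demo)
print(balance)
```

Running the same helper on a full links data frame should flag any node whose inflow and outflow disagree, which is usually a sign of a mistyped value or a misaligned vector.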


Best Practices for Using Sankey Diagrams

  1. Simplify Where Possible:
    Focus on the most relevant flows and combine smaller, less significant flows into an “Other” category if necessary.

  2. Color Consistency:
    Use consistent colors for categories or types of flows to avoid confusion.

  3. Add Context:
    Include labels, legends, and annotations to ensure the audience understands the diagram’s meaning.

  4. Use for Comparisons:
    Highlight areas of inefficiency, such as large resource losses, or showcase opportunities for optimization.
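The first practice, folding minor flows into an “Other” category, can be sketched in a few lines of base R. The helper name `lump_small_flows()`, the threshold of 5, and the demo numbers are all assumptions for illustration; the right cutoff depends on your data.

```r
# Hypothetical helper: relabel the target of small flows, then merge duplicates
lump_small_flows <- function(links, threshold, other_label = "Other") {
  small <- links$value < threshold
  links$target[small] <- other_label
  # Sum any flows that now share the same source and target
  aggregate(value ~ source + target, data = links, FUN = sum)
}

# Made-up demo flows from a single retailer to four segments
lump_demo <- data.frame(
  source = c("Retailer A", "Retailer A", "Retailer A", "Retailer A"),
  target = c("Segment 1", "Segment 2", "Segment 3", "Segment 4"),
  value  = c(7, 8, 2, 3),
  stringsAsFactors = FALSE
)

lumped <- lump_small_flows(lump_demo, threshold = 5)
print(lumped)
```

This version only relabels targets, so it is safest on the last tier of a diagram; relabeling an intermediate node would disconnect the flows that leave it.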