
Big Data Visualization Using Datashader in Python


A few months ago, I wrote an article on my favorite Python viz tools: HoloViz. Many people are interested in learning more about Datashader, the big data visualization tool in the HoloViz family. I absolutely love Datashader and how it creates meaningful visualizations of large datasets very quickly. So in this article, I am going to walk you through a simple Datashader example and explain how Datashader works and why it is fast.

Why is big data visualization hard?

From my understanding, there are two main obstacles to visualizing big data.

  • The first is speed. If you were to plot the 11 million data points from my example below using your regular Python plotting tools, it would be extremely slow and your Jupyter kernel would most likely crash.
  • The second is image quality. Even if it doesn’t crash and you are willing to wait, most plotting libraries will simply keep drawing each new data point as a circle or other shape on top of each other, which will result in over-plotting. Even adding alpha transparency for overlapping points won’t always help in this situation. Imagine that you have a lot of points displaying on top of each other on an image: what you see will be a blob, and it will be very hard to extract information from this blob.
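To see why alpha transparency alone does not always rescue an over-plotted image, here is a small sketch of my own (not taken from any plotting library). It assumes standard "over" compositing, in which n identical markers of opacity alpha stacked on one pixel give a combined opacity of 1 - (1 - alpha)^n. The function name `combined_opacity` is hypothetical.

```python
def combined_opacity(alpha: float, n_points: int) -> float:
    """Opacity of n_points identical markers stacked with 'over' compositing."""
    return 1.0 - (1.0 - alpha) ** n_points


# With alpha = 0.1, a pixel covered by 100 points is already almost fully
# opaque, so it looks the same as a pixel covered by 10,000 points.
for n in (1, 10, 100, 10_000):
    print(n, round(combined_opacity(0.1, n), 4))
```

Under this assumption, any region denser than a few dozen points per pixel saturates, which is exactly the blob described above.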

Datashader provides elegant and seemingly magic solutions to these two obstacles. Next, I will show you an example and peek into the magic involved.

Big data visualization using Datashader — An example

This example comes from the NYC Taxi data example on pyviz.org. For the full example, please see https://examples.pyviz.org/nyc_taxi.

  • Import needed packages

* You might need to conda install the packages that you don’t have in your conda environment first.

import holoviews as hv, pandas as pd, colorcet as cc
from holoviews.element.tiles import EsriImagery
from holoviews.operation.datashader import datashade
hv.extension('bokeh')

  • Read in data

For the very largest files, you will want to use a distributed processing library like Dask with Datashader, but here we have a Parquet file with “only” 11 million records, which Datashader can easily handle on a laptop using Pandas without any special computing resources. Here we’ll load in two columns representing taxi drop-off locations.

  • Plotting

Here we plot our data using Datashader. It took only four lines of code and six milliseconds on my laptop to plot the 11 million rows of data, overlaid on a map of the New York area:

If you were running it live, you would then be able to zoom in to any region of this map, with the plot dynamically updating to use the full resolution for that zoom level.

How does Datashader work?

Fig 1. Datashader pipeline (Image from datashader.org with permission).

Datashader turns your data into a plot using a five-step pipeline. The Datashader docs illustrate how the pipeline works in each of the steps — projection, aggregation, transformation, colormapping, and embedding. I’m going to break down my previous example into these small steps so that we can see exactly what Datashader is doing under the hood.

Let’s first import the underlying Datashader functions so we can run through the individual steps:

import datashader as ds
import datashader.transfer_functions as tf

  • Projection
canvas = ds.Canvas(plot_width=900, plot_height=480)

First, we define a 2D canvas with width and height for the data to be projected onto. The canvas defines how many pixels we would like to see in the final image, and optionally defines the x_range and y_range that will map to these pixels. Here the data ranges to plot are not set in the Canvas, so they will be filled in automatically in the next step from the max and min of the data x and y values in the dataframe. The canvas defines what the projection will be, but for speed each point is actually projected during the aggregation step.
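As a rough illustration of what "projection" means here, the sketch below maps a data coordinate to a pixel column in plain Python. It is not Datashader's implementation: the helper `to_pixel`, the sample coordinates, and the choice to place the upper edge of the range in the last pixel are all my own assumptions.

```python
def to_pixel(value: float, lo: float, hi: float, n_pixels: int) -> int:
    """Map a data coordinate with lo <= value <= hi to an index in [0, n_pixels - 1]."""
    if hi <= lo:
        raise ValueError("range must satisfy lo < hi")
    scaled = (value - lo) / (hi - lo) * n_pixels
    # Assumption: the upper edge of the range belongs to the last pixel.
    return min(int(scaled), n_pixels - 1)


# Hypothetical x coordinates standing in for one column of the dataframe.
xs = [-8240000.0, -8230000.0, -8220000.0]
x_range = (min(xs), max(xs))  # taken from the data when not set on the canvas
cols = [to_pixel(x, *x_range, 900) for x in xs]
print(cols)  # [0, 450, 899]
```

The same mapping applied to the y values gives the pixel row, so a canvas of 900 by 480 pixels fixes the size of the output grid regardless of how many points there are.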

  • Aggregation

After we define the projected canvas, we project each point into the two-dimensional output grid and aggregate the results per pixel. Datashader supports many options for such aggregation, but in this example, we simply count how many data points are projected into each pixel, by iterating through the data points and incrementing the pixel where that point lands. The result in this case is a two-dimensional histogram counting dropoffs per pixel:
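The counting described above can be sketched in a few lines of plain Python. This is only a conceptual stand-in that I wrote for illustration (Datashader's own version is compiled and far faster); the function `count_per_pixel` and the toy points are hypothetical. In Datashader itself, I believe the corresponding call has the form `canvas.points(df, 'x', 'y', agg=ds.count())`, where `'x'` and `'y'` are placeholder column names.

```python
def count_per_pixel(points, x_range, y_range, width, height):
    """Return a height x width grid of point counts (a 2D histogram)."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # points outside the canvas ranges are ignored
        # Project the point to a pixel, then increment that pixel's counter.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


points = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9), (0.95, 0.85), (0.92, 0.99), (5.0, 5.0)]
grid = count_per_pixel(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
print(grid)  # [[2, 0], [0, 3]]
```

Note that the work is a single pass over the points, and the result has a fixed size (here 2 by 2) no matter how many points went in.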

  • Transformation (optional)

The result from the previous step is now a fixed-size grid, no matter how large the original dataset was. Once the data is in this grid, we can do any kind of transformation we like on it, such as selecting only a certain range of counts, masking the data based on the result of other datasets or values, etc. Here, the dropoff data ranges from zero in some pixels to tens of thousands in others, and if we try to plot the grid directly we would see only a few hotspots. To make all the different levels visible as in the image above, the data is transformed using the image-processing technique “histogram equalization” to reveal the distribution of the counts rather than their absolute values. Histogram equalization is actually folded into the colormapping step below, but we can do explicit transformations at this stage if we want, such as squaring the counts:
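To give a feel for histogram equalization, here is a deliberately simplified, rank-based version in plain Python. It is an assumption-laden sketch, not the algorithm Datashader uses internally: each nonzero count is replaced by the fraction of nonzero pixels whose count is less than or equal to it, so the output depends on the ordering of the counts rather than their absolute size. The function name `equalize` and the toy grid are hypothetical.

```python
from bisect import bisect_right


def equalize(grid):
    """Replace each nonzero count by the fraction of nonzero pixels <= it."""
    values = sorted(v for row in grid for v in row if v > 0)
    if not values:
        return [[0.0 for _ in row] for row in grid]
    n = len(values)
    return [[bisect_right(values, v) / n if v > 0 else 0.0 for v in row]
            for row in grid]


# Counts spanning several orders of magnitude, like the dropoff data.
counts = [[0, 1, 2], [3, 50, 20000]]
print(equalize(counts))  # [[0.0, 0.2, 0.4], [0.6, 0.8, 1.0]]
```

On a linear scale the pixel with 20000 dropoffs would dominate and the counts 1, 2, and 3 would be indistinguishable from zero; after equalization the levels are spread evenly across the output range.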

  • Colormapping

Next, we can render the binned grid data to the corresponding pixels of an image. Each bin value is mapped into one of the 256 colors defined in a colormap, either by linear interpolation or with an automatic transformation (e.g. by calling the log function on each value, or as here using histogram equalization). Here we’re using the “fire” colormap from Colorcet, which starts at black for the lowest counts (1 and 2 dropoffs) and goes through red for higher values (in the hundreds) and then yellow for even higher values (in the thousands) and finally white for the highest counts per pixel (in the tens of thousands in this case). We set the background to black to better visualize the data.
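The linear-interpolation case can be sketched as follows. The ramp built here (black, red, yellow, white) is a crude stand-in of my own for the real "fire" colormap from Colorcet, and the helpers `make_ramp` and `to_color` are hypothetical names. In Datashader itself, I believe this step corresponds to a call such as `tf.shade(agg, cmap=cc.fire)` followed by `tf.set_background(img, 'black')`.

```python
def make_ramp(stops, n=256):
    """Build n RGB colors by linearly interpolating between the given stops."""
    colors = []
    segments = len(stops) - 1
    for i in range(n):
        t = i / (n - 1) * segments      # position along the chain of stops
        k = min(int(t), segments - 1)   # index of the segment containing t
        f = t - k                       # fraction of the way through it
        a, b = stops[k], stops[k + 1]
        colors.append(tuple(round(a[c] + (b[c] - a[c]) * f) for c in range(3)))
    return colors


def to_color(value, vmin, vmax, cmap):
    """Linearly map a value in [vmin, vmax] to one of the colormap entries."""
    if vmax == vmin:
        return cmap[0]
    frac = max(0.0, min(1.0, (value - vmin) / (vmax - vmin)))
    return cmap[round(frac * (len(cmap) - 1))]


BLACK, RED, YELLOW, WHITE = (0, 0, 0), (255, 0, 0), (255, 255, 0), (255, 255, 255)
ramp = make_ramp([BLACK, RED, YELLOW, WHITE])
print(to_color(0.0, 0.0, 1.0, ramp), to_color(1.0, 0.0, 1.0, ramp))
```

Feeding equalized values in the range 0 to 1 into `to_color` gives the combined effect described above: low counts land near black and the highest counts land on white.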

  • Embedding

As you can see, Datashader only renders the data, not any axes, colorbars, or similar features you’d expect in a full plot. To get those features that help you interpret the data, we can embed the images generated by Datashader into a plot. The easiest way is to use HoloViews, which is a high-level plotting API that provides the flexibility to use either Matplotlib, Bokeh, or Plotly as the backend. Here is an example of using HoloViews to define a “points” object and then rendering all the points with Datashader. It uses the alternative method `rasterize` instead of `datashade`, so that Bokeh is in charge of the transformation and colormapping steps, which allows hover and colorbars to work.

Why is Datashader crazy fast?

First, we need to talk about the original data format. Datashader is so fast that reading in the data is usually the slowest step, particularly if your original data is a bunch of JSON files or CSV files. The Parquet file format is usually a good choice for columnar data like the dropoff points, because it is compact, quick to load in, efficiently reads in only the columns and ranges you need, and supports distributed and out-of-core operation when appropriate.

Second, with the right input file formatting, we can investigate the next most expensive task, which is the combined projection+aggregation step. This step requires calculating values for each of the millions of data points, while all subsequent calculations use the final fixed-size grid and are thus much faster. So, what does Datashader do to make this step fast?

  • Datashader’s aggregation calculations are written in Python but then just-in-time compiled into wicked-fast machine code using Numba. E.g., here is the code where the counting per bin is happening.
  • The example above uses a CPU, but Datashader + Numba also supports CUDA cudf dataframes as a drop-in replacement for Pandas dataframes, which runs even faster if you have a GPU.
  • Datashader can also parallelize its pipeline (code example) so that you can make use of all the computing cores you have available, scaling to even larger datasets and giving even faster results.
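Part of the reason the counting step parallelizes so well is that partial grids can simply be added together. The sketch below is my own illustration of that structure using only the Python standard library; it is not how Datashader is implemented, and plain Python threads will not actually run this faster. The names `count_chunk` and `merge` and the synthetic points are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 4, 4


def count_chunk(chunk):
    """Count one chunk of points into a HEIGHT x WIDTH grid over the unit square."""
    grid = [[0] * WIDTH for _ in range(HEIGHT)]
    for x, y in chunk:
        col = min(int(x * WIDTH), WIDTH - 1)
        row = min(int(y * HEIGHT), HEIGHT - 1)
        grid[row][col] += 1
    return grid


def merge(grids):
    """Sum partial grids cell by cell."""
    return [[sum(g[r][c] for g in grids) for c in range(WIDTH)]
            for r in range(HEIGHT)]


# Deterministic synthetic points in [0, 1) x [0, 1), purely illustrative.
points = [((i * 37 % 100) / 100, (i * 61 % 100) / 100) for i in range(1000)]
chunks = [points[i:i + 250] for i in range(0, len(points), 250)]

# Each chunk is aggregated independently; only the small grids are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(count_chunk, chunks))

total = merge(partial)
print(total == count_chunk(points))  # chunked and single-pass results agree
```

Because each worker only has to return a small fixed-size grid, the cost of combining results stays tiny even when the chunks themselves hold millions of points.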

Because Datashader is so fast, we can actually visualize big data interactively, dynamically redrawing whenever we zoom or pan. Here is an example where you can view the NYC Taxi data interactively in a Panel dashboard. My favorite example on ship traffic illustrates that even though all you see is a pixelated image that Datashader renders, you can still inspect individual data points and understand what you are seeing. The other examples at examples.pyviz.org show Datashader for much larger files, up to billions of points on an ordinary laptop.

Where can you learn more about Datashader?

In all, this article shows you an example of using Datashader to visualize 11 million rows of coordinate data and explains why Datashader is able to produce meaningful visualizations so quickly. One thing worth noting is that Datashader can be used for any type of data, not just geographical points as in the above example. I highly recommend checking out the many great examples at Datashader.org.

Acknowledgment

Thank you so much Jim Bednar for your guidance and feedback on this article.

References

https://examples.pyviz.org/nyc_taxi/nyc_taxi.html

https://datashader.org/getting_started/Pipeline.html

http://numba.pydata.org/


Ph.D. | Senior Data Scientist @ Anaconda | Twitter @ sophiamyang | All views are my own
