While Spot employs a wide variety of languages, the majority of our applications are written in C++ and Python. Much of our ad hoc data analysis is performed in Python, using its many libraries such as NumPy and Pandas. In the past, we’ve blogged about our efforts in Data Analysis and Visualization with a variety of tools.
In July, I attended the SciPy conference in Austin, TX. The conference, which focuses on scientific applications of open-source Python, is now in its 14th year. While it has grown significantly over the past few years, it is still relatively small, with 600 participants in 2015. The conference had several tracks, including Data Science, Visualization, Computational Biology, Astrophysics, and Oceanography.
I found the conference compelling in several ways. First, much of the conference focused on practical problem solving. The vast majority of the talks illustrated real-world solutions or concrete functionality. Second, the atmosphere was very open and inclusive. I attribute this partially to a more academic focus, but largely to the open-source nature of the majority of the tools. On a related note, the conference didn’t take itself too seriously, which was evident in the “lightning rounds”, a series of 5-minute presentations by various individuals. While these could well have degenerated into vendor pitches, they were instead a mix of clever solutions, outright hacks, feature descriptions, and even a technical recipe for generating wedding paperwork. The conference also hosted a 2-day open-source hackathon/sprint over the weekend, allowing attendees to contribute to open-source packages with in-person guidance from their creators.
While many of the tracks/talks focused on very specific domains and applications, a few non-mathematical areas raised general interest. These included visualization, performance, and collaboration.
Python has a huge selection of visualization libraries. Matplotlib, perhaps the best known, was created in 2002 and is still under active development. In fact, the Matplotlib team demonstrated new styles and interactive capabilities. A newer framework, Bokeh, provides web-based visualization and supports Python, R, Julia, and Scala. It also includes a server supporting data streaming and filtering. The VisPy library is built on OpenGL and can therefore use GPUs for performance. The VisPy presentation demonstrated the scalability of the library, including simultaneous streaming of 10,000 plots of 2,000 samples each (see below). On the financial front, this isn’t that far from what traders look at every day. There was general excitement at the conference regarding these tools. Who doesn’t like a nice picture, right?
That’s a lot of plots!
On the performance front, the conference featured a tutorial and presentations on the long-established Cython, but Numba was generating the most buzz. Numba provides a just-in-time (JIT) compiler for Python, built on LLVM. With minimally invasive code tags (function decorators), Numba can provide orders-of-magnitude speedups for calculations, approaching Fortran- or C-level performance. It is NumPy-aware and can leverage GPUs. One can even ship Numba functions to remote execution engines.
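To give a feel for how non-intrusive those tags are, here is a minimal sketch, assuming Numba is installed. The function `harmonic_sum` is a made-up example of mine rather than anything shown at the conference, and the fallback decorator is only there so the snippet still runs (without any speedup) when Numba is missing.

```python
try:
    # Assumption: Numba is installed; njit compiles the function in "nopython" mode.
    from numba import njit
except ImportError:
    # Fallback: a do-nothing decorator so the sketch still runs as plain Python.
    def njit(func):
        return func


@njit
def harmonic_sum(n):
    """Sum 1/k for k = 1..n with an explicit loop.

    Tight numeric loops like this are slow in plain Python and are
    the kind of code a JIT compiler is meant to speed up.
    """
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / k
    return total


if __name__ == "__main__":
    # The first call triggers compilation; later calls reuse the compiled code.
    print(harmonic_sum(1000000))
```

The only change to the original function is the one-line decorator, which is what makes this approach attractive for existing code. How much faster it actually runs depends on the workload, so it is worth timing before and after.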
On the computational front, I attended the Machine Learning with Scikit-Learn tutorial, which was very well prepared and presented. I also attended some lectures on computing frameworks, including Dask and Xray. Dask provides distributed task scheduling and supports very large data sets via disk buffering. Xray provides n-dimensional, labeled variants of Pandas data structures.
So, ultimately, what was the takeaway for me and for Spot? I believe the technologies I mentioned above are worth further investigation. On the visualization front, I definitely want to look more at Bokeh and VisPy (if only for the OpenGL abstractions). Time will tell how Spot leverages them relative to products like Tableau, but I think they can be a good fit for ad hoc projects. Numba could provide near-term benefit for Spot applications in a fairly non-intrusive way. Regarding Jupyter, Spot is always looking for ways to improve collaboration and increase efficiency. I think a Jupyter server, potentially leveraging AWS/Docker, could provide a good tool for collaboration and documentation. We’ve built up our own package management around pip, easy_install, and virtualenv, but I heard some good things about conda. While Spot employs distributed computing heavily, it mostly takes the form of home-grown C++ applications targeted at real-time or near-real-time use. However, there may be opportunities to leverage open-source infrastructure like Dask.
All of the SciPy 2015 tutorial and presentation videos are freely available online. I encourage you to take a look.