During Lent 2021, I stepped away from writing and social media. Winter is over and I’m looking forward to a lot of writing in the months ahead.
This month’s T-SQL topic by Steve Jones is on using Jupyter Notebooks. I first saw them in use at the PASS conference in 2019 (remember PASS?). What I thought was cool was the ability to run code and save the results too. You can even have text, images, etc., just like a paper notebook.
I’ve been a SQL Server DBA for most of my data career. Some of you have wondered what the future looks like for DBAs, and data science is something I’ve wanted a reason to get into. I don’t remember exactly how I first heard about RAPIDS from NVIDIA; I probably stumbled across it by following them on Twitter.
NOTE!!! Don’t miss the free conference running all week starting 12Apr21. They will have a lot of new announcements and plenty of free training resources too: GTC.
What Is RAPIDS?
From their website, “The RAPIDS suite of open source software libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Licensed under Apache 2.0, RAPIDS is incubated by NVIDIA® based on extensive hardware and data science experience. RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.”
The teams at NVIDIA have been working really hard to keep the Python syntax familiar from existing libraries while having the code run on the GPU(s). Here is their release roadmap; version 0.18 is what I’m using.
GPU stands for Graphics Processing Unit. Anybody with an NVIDIA graphics card (Pascal or higher) can run RAPIDS. Right now, it only runs on Linux; I failed to get RAPIDS running on Windows Subsystem for Linux (WSL) because I only have a Maxwell-based GPU on my gaming rig. Once I get a new video card, I hope to revisit running RAPIDS on WSL. If you are wondering, I tried to follow the steps both here and here.
Here is a video talking about how Wal-Mart is using RAPIDS.
Dual Boot Using An External SSD Drive
It just so happens I received a new laptop for work and it has a Pascal GPU! But alas, this is a work system so I can’t install Windows Insider Preview versions. I didn’t want to risk fiddling with partitioning the internal drives either.
Lo and behold! I stumbled across this video while researching how to install Ubuntu on an external drive. Since I had all of the parts, I didn’t have to buy anything. NOTE!!! The video is over a year old and some of the steps have changed. Just be sure to stick with Ubuntu 18.04; don’t upgrade to 20.04.
I have been running this for a few weeks now without any problems. My work laptop is unaffected and I can learn RAPIDS using this configuration albeit with a limited amount of GPU memory.
Installing SQL Server on Linux
Sticking with the instructions for Ubuntu 18.04, I installed SQL Server on Linux. I also installed Azure Data Studio for Linux, then downloaded and restored the AdventureWorks databases. I then ran through several steps to get the Python drivers installed. This site was very helpful (remember to stick to the 18.04 instructions) and I was able to get things running from within a Jupyter notebook. Finally!
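For anyone who wants to check the driver setup from a notebook cell, here is a minimal sketch of what a connection test could look like. It is not the exact code I ran. It assumes the pyodbc package and Microsoft's "ODBC Driver 17 for SQL Server" are installed, and the server, database name, login and password shown are placeholders to replace with your own.

```python
# Sketch only: build an ODBC connection string and (optionally) ask the
# server for its version. All connection values below are placeholders.

try:
    import pyodbc  # assumption: installed alongside the Microsoft ODBC driver
except ImportError:
    pyodbc = None  # lets the string-building part run without the driver


def build_connection_string(server, database, user, password,
                            driver="ODBC Driver 17 for SQL Server"):
    """Assemble a semicolon-separated ODBC connection string."""
    parts = {
        "DRIVER": "{" + driver + "}",
        "SERVER": server,
        "DATABASE": database,
        "UID": user,
        "PWD": password,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


def server_version(conn_str):
    """Connect, run SELECT @@VERSION, and return the version text."""
    if pyodbc is None:
        raise RuntimeError("pyodbc is not installed")
    conn = pyodbc.connect(conn_str, timeout=5)  # timeout = login timeout in seconds
    try:
        return conn.cursor().execute("SELECT @@VERSION").fetchone()[0]
    finally:
        conn.close()


# Hypothetical values: a local instance and a restored AdventureWorks database.
conn_str = build_connection_string(
    "localhost", "AdventureWorks", "sa", "<your password>"
)
```

Calling `server_version(conn_str)` in the next cell should return the SQL Server version banner if the driver and login are set up correctly.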
Running Jupyter Notebooks Using Docker and RAPIDS Container
Click on the link [http://localhost:8888] in a browser like Firefox and the notebook will open.
Since I’m focused on leveraging existing data skills to learn GPU things, I was interested in things that had a SQL syntax. Note there are several notebooks covering many of the tools within RAPIDS which can be found in the container.
(caveat: the following examples are from https://app.blazingsql.com which will be shutting down and will be re-branded as something else? Time will tell.)
What follows are several screenshots from me copying code from their site and running it locally on my machine. These examples use NY taxi data (46MB).
What is so neat and cool to me is to be able to run queries and plot data without a whole lot of headache or software installation.
SQL Server -> Pandas -> cuDF
The component known as BlazingSQL is an in-memory data analytics tool which uses the CUDA dataframe (cuDF); it doesn’t persist data. By using SQL Server as the data source, one can keep using all of the tools many of us are familiar with. The following screenshots show how I connect to SQL Server on Linux using Python inside of a notebook and then load a GPU dataframe, then show some metrics about it.
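In case the screenshots are hard to read, here is a rough sketch of the same flow: query, load into pandas, copy to a cuDF dataframe. It is an approximation and not my exact notebook code. To keep it runnable anywhere, an in-memory sqlite3 database with a made-up three-row table stands in for SQL Server; in the real setup the connection would come from something like pyodbc. If cuDF is not importable (no RAPIDS container), the sketch simply stays on the CPU with pandas.

```python
import sqlite3

import pandas as pd

try:
    import cudf  # available inside the RAPIDS container
except ImportError:
    cudf = None  # no GPU stack here, so fall back to plain pandas


def load_query(conn, sql):
    """Run a query over a DB-API connection and return a pandas DataFrame."""
    return pd.read_sql(sql, conn)


def to_gpu(pdf):
    """Copy a pandas DataFrame to the GPU if cuDF is importable, else return it unchanged."""
    return cudf.from_pandas(pdf) if cudf is not None else pdf


# Stand-in for SQL Server: a tiny in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SalesOrderHeader (SalesOrderID INTEGER, TotalDue REAL)")
conn.executemany(
    "INSERT INTO SalesOrderHeader VALUES (?, ?)",
    [(1, 10.5), (2, 20.0), (3, 7.25)],
)

pdf = load_query(conn, "SELECT SalesOrderID, TotalDue FROM SalesOrderHeader")
gdf = to_gpu(pdf)       # cuDF dataframe on the GPU, or pandas on the CPU
gdf.info(verbose=True)  # column names, dtypes and memory usage
conn.close()
```

The point of routing everything through pandas is that `cudf.from_pandas` accepts whatever `pd.read_sql` returns, so the SQL Server side does not need to know anything about the GPU.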
These are the results from running gdf.info(verbose=True) from above:
I know this was a whirlwind tour. Jupyter notebooks have made it easier to query data, see results, and do Pythonic stuff. RAPIDS is getting a lot of attention because running on the GPU can be much faster than CPU-based tools like SQL Server. I hope to write more about RAPIDS and its use in data science.