Dask is a Python package that provides advanced parallelism for analytics, enabling performance at scale for the tools you love. People think it’s magic: drop it in and it scales. Dropping it in will mostly work, but it will not scale well!
We would like to share what we’ve learned about using Dask to scale dataframe computations, so that you can avoid making the same mistakes.
This is a talk about scaling Pandas using Dask, given by Krishan Bhasin at Dask Summit 2021.
What is the Dask Summit?
The Dask Distributed Summit is where users, contributors, and newcomers can share experiences to learn from one another and grow together. The Dask Distributed Summit provides content, information, and learning opportunities for attendees of all levels of Dask familiarity and expertise.
What is Dask?
Dask is a free and open-source library for parallel computing in Python. Dask is a community project maintained by developers and organizations.
Share your feedback with us on this scaling Pandas talk and let us know:
- Did you find this talk on scaling Pandas using Dask helpful?
- What is your experience with scaling Pandas?
Learn more at summit.dask.org and dask.org
KEY MOMENTS
00:00:00 Scaling Pandas Using Dask
00:00:16 About Krishan Bhasin
00:00:59 Why This Talk?
00:01:38 Overview of Session
00:01:59 Dask Recap
00:03:37 A Closer Look at Dask Dataframe
00:04:33 A Closer Look at Distributed Scheduler
00:05:41 Submitting Work to a Cluster
00:08:05 Dask Learnings Part 0 - Just Don't
00:09:16 Dask Learnings Part 1 - You Can't Improve What You Can't See
00:10:34 Use The Dashboard
00:15:01 Dask Learnings Part 2 - Understand and Leverage Dask's Principles
00:23:39 Dask Learnings Part 3 - If It's Broke or Missing, Fix It!
00:29:10 Q & A