Scaling Pandas Using Dask: How to Avoid All My Mistakes | Krishan Bhasin | Dask Summit 2021

Dask | 6,846 views | 4 years ago

Dask is a Python package that provides advanced parallelism for analytics, enabling performance at scale for the tools you love. People think it’s magic: drop it in and it scales. This will mostly work, but it will not scale well! We would like to share what we’ve learned about using Dask to scale dataframes and computations, so that you can avoid making the same mistakes.

This is a talk about scaling Pandas using Dask by Krishan Bhasin at Dask Summit 2021.

What is the Dask Summit?
The Dask Distributed Summit is where users, contributors, and newcomers can share experiences to learn from one another and grow together. It provides content, information, and learning opportunities for attendees of all levels of Dask familiarity and expertise.

What is Dask?
Dask is a free and open-source library for parallel computing in Python. It is a community project maintained by developers and organizations.

Share your feedback with us on this scaling Pandas talk and let us know:
- Did you find this talk on scaling Pandas using Dask helpful?
- What is your experience with scaling Pandas?

Learn more at summit.dask.org and dask.org

KEY MOMENTS
00:00:00 Scaling Pandas Using Dask
00:00:16 About Krishan Bhasin
00:00:59 Why This Talk?
00:01:38 Overview of Session
00:01:59 Dask Recap
00:03:37 A Closer Look at Dask Dataframe
00:04:33 A Closer Look at Distributed Scheduler
00:05:41 Submitting Work to a Cluster
00:08:05 Dask Learnings Part 0 - Just Don't
00:09:16 Dask Learnings Part 1 - You Can't Improve What You Can't See
00:10:34 Use The Dashboard
00:15:01 Dask Learnings Part 2 - Understand and Leverage Dask's Principles
00:23:39 Dask Learnings Part 3 - If It's Broke or Missing, Fix It!
00:29:10 Q & A
