Join our Meetup group:
https://www.meetup.com/data-umbrella
Kimberly Fessel: Polars for Data Analysis in Python
## Resources
- Repo: bit.ly/DUPolars (https://github.com/kimfetti/Conferences/tree/master/DataUmbrella_2024)
- Slides: https://github.com/kimfetti/Conferences/blob/master/DataUmbrella_2024/DataUmbrella_2024_KFessel_Deck.pdf
- Kimberly Fessel's YouTube: https://www.youtube.com/c/kimberlyfessel
- Video: "Intro to Rust": https://youtu.be/7E8nLExn3WI
- Polars on GitHub: https://github.com/pola-rs/polars
- Save the date: Nov 19, 2024, for another Polars event, "Understanding Polars Expressions When You're Used to pandas"
- Marco Gorelli video: Polars, Narwhals and Pandas: https://youtu.be/kPtUPe5Egak
## About the Event
Discover Polars, a high-performance DataFrame library for data analysis in Python. Built on Rust, Polars is designed for speed and efficiency and can outperform pandas, Dask, and even PySpark. Explore features such as lazy evaluation, memory efficiency, and automatic multi-threading, designed to handle large datasets with ease.
In this session, you'll learn practical techniques for data manipulation and advanced transformations. We will demonstrate Polars' syntax and capabilities, making it accessible even if you’re new to Polars. Join us to elevate your Python data analysis to the next level.
This presentation covers:
- Section 1: What is Polars and how does it compare to pandas?
- Section 2: Getting Started with Polars in Python
- Section 3: Advanced Data Analysis with Polars
- Section 4: Should you Switch to Polars?
## Timestamps
00:00 Data Umbrella introduction
04:15 Kim begins talk; about Kimberly Fessel
06:17 What is Polars?
07:40 Polars is built on Rust
08:55 Polars development and adoption
09:46 Key features of Polars (speed)
13:11 Lazy evaluation
15:56 Q&A: What architecture were the speed tests run on?
16:50 Q&A: How is it able to achieve multi-threading since Python has an interpreter lock?
17:35 Polars vs Others (pandas, PySpark, SQL), syntax
19:15 Polars compared with pandas: similarities
20:06 Polars compared with pandas: differences (index names, parallelism, lazy evaluation, better syntax)
21:40 Polars compared with Dask
22:55 Polars compared with Apache Spark
24:30 Getting started with Polars in Python: installation, Jupyter notebook
27:20 break: fix tech issue
28:22 back to Jupyter notebook with Python
30:20 Q: Can you clarify what size datasets Polars is suited for?
31:17 Q: Does Polars have a “Series” concept as in pandas?
31:37 Q: Understanding multi-processing in Polars
32:30 back to Jupyter notebook: exploring a Polars dataframe (sampling and more)
33:20 select() columns
34:30 filter() data
36:45 Adding new columns; computing new columns
38:40 .alias() new column name
39:50 Polars can create columns in parallel for speedier calculations
41:32 Advanced operations in Polars (missing data, joining dataframes, analysis)
42:38 Notebook 2: Advanced Data Analysis with Polars
43:00 lazy evaluations: pl.scan_csv()
46:10 Advanced examples of Polars code
49:33 Q: Does Polars have the pandas feature inplace=True?
50:14 Q: What are the options for plotting data in Polars?
50:35 Explore more advanced options: transformations, SQL queries, user-defined functions
51:25 Data visualization, plotting in Polars
53:32 Converting to pandas
54:30 Should you switch to Polars?
54:40 Other data source options: Excel, Parquet, JSON, database
54:58 Working with large files: lazy evaluation, streaming data
55:51 Advanced operations
56:17 When should I use Polars?
58:32 Resources
01:00:00 Q: Can we use Polars as input data into scikit-learn, or do we need to convert to pandas or NumPy first?
01:00:00 Upcoming meetup on Nov 19, 2024: Understanding Polars Expressions When You Are Used to pandas (https://www.meetup.com/data-umbrella/events/)
01:01:22 Q: Do you now use Polars exclusively or do you still use pandas?
01:02:45 Q: When you make a copy of a dataframe, is it still a shallow copy?
01:03:05 Q: Any disadvantages to using Polars?
01:04:15 Q: Should I learn Polars instead of pandas?
## About the Speaker
Kimberly Fessel is a data scientist and the founder of Dr Kim Data. She and her company specialize in technical instruction, machine learning, and data visualization. Kimberly has over a decade of experience educating groups and individuals in corporate settings, at universities, via online platforms, and as director of a data science bootcamp. Her educational YouTube channel features videos about data science in Python and has over 20,000 subscribers. Kimberly holds a PhD in applied mathematics from Rensselaer Polytechnic Institute and expects the publication of her first book, Head First SQL, 2nd Edition, in 2026.
- LinkedIn: https://www.linkedin.com/in/kimberlyfessel/
#DataScience #DataAnalysis