Building Robust ETL Pipelines with Apache Spark - Xiao Li

Databricks, 56,629 views, 8 years ago

Stable and robust ETL pipelines are a critical component of the data infrastructure of modern enterprises. ETL pipelines ingest data from a variety of sources; they must handle incorrect, incomplete, or inconsistent records and produce curated, consistent data for consumption by downstream applications. In this talk, we'll take a deep dive into the technical details of how Apache Spark "reads" data and discuss how Spark 2.2's flexible APIs, support for a wide variety of data sources, state-of-the-art Tungsten execution engine, and ability to provide diagnostic feedback to users make it a robust framework for building end-to-end ETL pipelines.

Overview:
1) What's an ETL Pipeline?
2) Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces
3) New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)

View slides: https://www.slideshare.net/databricks/building-robust-etl-pipelines-with-apache-spark

Related articles:
Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
https://databricks.com/blog/2016/12/08/integrating-apache-airflow-databricks-building-etl-pipelines-apache-spark.html
Writing Data Engineering Pipelines in Apache Spark on Databricks
https://databricks.com/blog/2016/09/06/writing-data-engineering-pipelines-in-apache-spark-on-databricks.html

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner
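
The overview item "Extract: Dealing with Dirty Data (Bad Records or Files)" can be illustrated with a small sketch. Spark's JSON and CSV readers accept a `mode` option with the values `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`, and by default keep unparseable input in a `_corrupt_record` column. The code below is plain Python rather than Spark, and it is not how Spark implements this; the function name `parse_records` and the sample data are hypothetical. It only mimics the three behaviors on line-delimited JSON (in permissive mode Spark also sets the other fields to null, which this sketch skips).

```python
import json

VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def parse_records(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Parse JSON lines, treating malformed input according to `mode`.

    Hypothetical helper that imitates the semantics of Spark's reader
    modes; it is an illustration, not Spark's implementation.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")

    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            rows.append(record)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            if mode == "FAILFAST":
                # Abort the whole read on the first bad record.
                raise ValueError(f"malformed record: {line!r}") from err
            if mode == "PERMISSIVE":
                # Keep the row and preserve the raw text for later inspection.
                rows.append({corrupt_column: line})
            # DROPMALFORMED: silently skip the bad record.
    return rows


# Hypothetical sample: the second line is truncated and cannot be parsed.
sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": ',
    '{"id": 3, "name": "c"}',
]

permissive = parse_records(sample, "PERMISSIVE")
dropped = parse_records(sample, "DROPMALFORMED")
```

Keeping the raw text of a bad record, as the permissive mode does, lets a pipeline route those rows to a quarantine location for diagnosis instead of losing them or failing the whole job.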