End to End ETL pipeline in AWS - Redshift, PySpark, Glue, EMR, Hudi, Airflow #aws #awstutorial #etl

ETL SQL 3,559 2 months ago

This is an AWS Data Engineering crash course video in which I explain data warehouse & data lake development in AWS. I give an overview of Amazon Redshift, AWS Glue, Apache Hudi, Amazon EMR & Managed Airflow (MWAA). After the overview, I demo how to use these services together to build a data lake & data warehouse in AWS.

There are 5 exercises in the video, and you can download the sample data & code to practice with.

In the first exercise, I show how you can use an AWS Glue crawler to parse an input file & create a table in the Glue Catalog.
In the second exercise, we use an AWS Glue PySpark application to load data from CSV files into data lake Hudi tables.
In the third exercise, we use Amazon EMR to read a data lake Hudi table and create an analytics Hudi table. If you follow the medallion architecture, this is like reading silver layer data & transforming it into the gold layer.
In the fourth exercise, I show how you can read Hudi tables directly in Amazon Redshift & create snapshot tables so the analytics dataset can be consumed.
Finally, in the fifth exercise, we use Managed Airflow to orchestrate and run the end-to-end pipeline covering the steps mentioned earlier.

If you are an AWS beginner, I am sure you will learn a lot from this video. However, this is not an AWS zero-to-hero masterclass. It is more of a crash course in which I wanted to share how you can quickly build solutions in AWS using the popular services.

I have referred to the following additional videos. Do check these videos as well to get a better understanding.
Amazon Redshift for beginners: https://youtu.be/dmsuzIOzmIs
AWS DataLake for beginners: https://youtu.be/m-WEGgYq25c

Feel free to reach out to me as well: [email protected]

If you wish to download the presentation slides, sample data files & source code for the AWS Glue job, Amazon EMR PySpark application, Amazon Redshift SQL script & Managed Airflow DAG code used in the crash course video, check the link below:
https://mailchi.mp/45b9673b727b/aws-data-engineering-crash-course

Are you interested in attending 1-1 training on AWS with me? Send an email to [email protected] with the heading "1-1 AWS Training session" & I will get back to you with details about our initial introduction meeting.

Video timeline:
00:00 Introduction
01:39 Datawarehouse Migration Projects
04:12 Creating data lakes in AWS
08:38 Amazon Redshift overview
11:00 AWS Glue overview
13:50 Apache Hudi overview
15:46 Amazon EMR overview
17:39 Managed Airflow overview
19:45 Demo AWS Console
23:00 Exercise 1 (Glue crawlers)
28:20 Exercise 2 (Glue pyspark)
46:30 Exercise 3 (Amazon EMR)
01:02:20 Exercise 4 (Amazon Redshift)
01:16:56 Exercise 5 (MWAA end to end pipeline)

Do like, share, comment & subscribe to the channel if you are new here. Cheers
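For anyone who wants a feel for Exercise 2 before downloading the code, here is a minimal sketch of loading CSV files into a Hudi table from PySpark. This is not the code used in the video. The function names, the table name, the column names (customer_id, updated_at) and the paths are all hypothetical, and it assumes the Spark session (in a Glue job or elsewhere) already has Hudi available.

```python
from typing import Dict, Optional


def hudi_write_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: Optional[str] = None,
    operation: str = "upsert",
) -> Dict[str, str]:
    """Build Hudi datasource write options (hypothetical helper, not from the video)."""
    options = {
        "hoodie.table.name": table_name,
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }
    if partition_field:
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
    return options


def load_csv_to_hudi(spark, csv_path: str, target_path: str, options: Dict[str, str]) -> None:
    """Read CSV files and append them to a Hudi table.

    Assumes `spark` is a SparkSession with the Hudi libraries on its classpath.
    """
    df = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(csv_path)
    )
    df.write.format("hudi").options(**options).mode("append").save(target_path)


# Example options for a hypothetical "customers" table.
opts = hudi_write_options("customers", "customer_id", "updated_at")
```

Inside a job you would then call something like load_csv_to_hudi(spark, "s3://your-bucket/input/", "s3://your-bucket/datalake/customers/", opts), with the bucket paths replaced by your own.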
