
Google Dremel: Interactive Analysis of Web-Scale Datasets

Gaurav Sen 75,971 5 months ago

This ten-page Google research paper has inspired hundreds of software systems. Relevant for senior software and data engineers, the Dremel paper explains how Google performs ad-hoc queries on massive data sets. For example: finding the number of people who live in Bengaluru, speak English, and are interested in Competitive Programming.

The problem is: how do you run these queries efficiently and reliably?

We take advantage of the fact that most statistics queries are aggregates: sum, min, max, and count. Dremel stores the data in a column-oriented format, which is great for running aggregate queries because a query reads only the columns it needs. It runs these queries like an SQL database (where the optimizer chooses an execution path). The difference is, Dremel can run on thousands of nodes!

To scale to that many nodes, Dremel uses interesting algorithms like Finite Automata and a version of Segment Trees (yeah, you read it right). You will find more details in the paper.

Dremel has been called one of the most influential papers in recent times (it's been an inspiration for systems like Apache Impala). It also won the Test of Time award at VLDB 2020. Definitely worth a read.

References:
Compression Algorithm: https://blog.x.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet
Paper: https://www.vldb.org/pvldb/vol13/p3461-melnik.pdf

Try the course below to learn more about system design.
System Design Course: https://interviewready.io/learn/system-design-course/

Cheers!

00:00 Paper Background
00:37 What is Dremel
03:23 Features
06:08 High-level Architecture
10:26 Approximations
12:20 Columnar Storage
15:47 Final thoughts

#ResearchPaper #Google #SystemDesign
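
To make the Bengaluru example from the description concrete, here is a minimal Python sketch of a count query over column-oriented data that is split across several nodes. It is a toy illustration, not Dremel's actual implementation: the shard layout, the field names (`city`, `language`, `interest`), and the functions `make_shard`, `leaf_count`, and `root_count` are all hypothetical. Each "leaf" scans only the three columns the query needs and returns a partial count, and the "root" adds the partial counts together.

```python
from typing import Dict, List

# Hypothetical column-oriented layout: one list per field.
# Index i across all lists describes person i in that shard.
Columns = Dict[str, List[str]]


def make_shard(cities: List[str], languages: List[str], interests: List[str]) -> Columns:
    """Build one toy shard of column-oriented data (assumed layout)."""
    return {"city": cities, "language": languages, "interest": interests}


def leaf_count(shard: Columns, city: str, language: str, interest: str) -> int:
    """Partial count computed by one hypothetical leaf node.

    Only the three columns named in the query are read.
    """
    rows = zip(shard["city"], shard["language"], shard["interest"])
    return sum(1 for c, l, i in rows if c == city and l == language and i == interest)


def root_count(shards: List[Columns], city: str, language: str, interest: str) -> int:
    """Combine the partial counts from every leaf.

    COUNT can be merged by addition, which is why it can be spread across nodes.
    """
    return sum(leaf_count(s, city, language, interest) for s in shards)


shards = [
    make_shard(
        ["Bengaluru", "Mumbai"],
        ["English", "English"],
        ["Competitive Programming", "Chess"],
    ),
    make_shard(
        ["Bengaluru", "Bengaluru"],
        ["English", "Hindi"],
        ["Competitive Programming", "Competitive Programming"],
    ),
]

total = root_count(shards, "Bengaluru", "English", "Competitive Programming")
print(total)  # prints 2
```

The same pattern works for sum, min, and max, since each of them can be computed per shard and then merged at the root.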
