RDDs, DataFrames and Datasets in Apache Spark - NE Scala 2016

InfoQ · 119,661 views · 9 years ago

Traditionally, Apache Spark jobs have been written using Resilient Distributed Datasets (RDDs), a Scala Collections-like API. RDDs are type-safe, but they can be problematic: it's easy to write a suboptimal job, and RDDs are significantly slower in Python than in Scala. DataFrames address some of these problems, and they're much faster, even in Scala; but DataFrames aren't type-safe, and they're arguably less flexible. Enter Datasets: a type-safe, object-oriented programming interface that works with the DataFrames API, provides some of the benefits of RDDs, and can be optimized via the Catalyst optimizer. This talk will briefly recap RDDs and DataFrames, introduce the Datasets API, and then, through a live demonstration, compare the performance of all three against the same non-trivial data source.

Talk by Brian Clapper, March 4th, 2016. http://www.nescala.org/
Produced by NewCircle - Spark Training & Resources: https://newcircle.com
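A minimal sketch of the contrast the abstract describes, written against the Spark 2.x `spark-sql` API (not from the talk itself; the `Person` data and app name are invented for illustration). It shows the same filter expressed three ways: the RDD and Dataset versions are checked at compile time, while the DataFrame version refers to the column by a string and so fails only at runtime if the name is wrong. Requires a Spark dependency on the classpath, so it is not runnable standalone:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the example
case class Person(name: String, age: Long)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-df-ds-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // enables toDF/toDS and the $"col" syntax

    val data = Seq(Person("Alice", 30), Person("Bob", 15))

    // RDD: type-safe lambdas, but opaque to the Catalyst optimizer
    val rdd = spark.sparkContext.parallelize(data)
    val rddAdults = rdd.filter(_.age >= 18)        // compile-time checked

    // DataFrame: optimized by Catalyst, but columns are untyped;
    // a typo in "age" would surface only at runtime
    val df = data.toDF()
    val dfAdults = df.filter($"age" >= 18)

    // Dataset: typed objects *and* Catalyst optimization
    val ds = data.toDS()
    val dsAdults = ds.filter(_.age >= 18)          // compile-time checked

    dsAdults.show()
    spark.stop()
  }
}
```

The design point the talk explores is this trade-off: RDD lambdas are arbitrary JVM code the optimizer cannot inspect, while DataFrame/Dataset expressions describe *what* to compute, letting Catalyst reorder and optimize the plan.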
