Data Versioning and Reproducible ML with DVC and MLflow

Databricks 22,468 4 years ago

Machine learning development involves comparing models and storing the artifacts they produce. We often compare several algorithms to select the most efficient ones, and we assess different hyperparameters to fine-tune each model. Git helps us store multiple versions of our code, but we also need to keep track of the datasets we use. This matters not only for audit purposes but also for assessing the performance of models developed at a later time.

Git is a standard code versioning tool in software development. It can be used to store datasets, but it is not an optimal solution, particularly for large files. An alternative is Data Version Control (DVC). Despite its name, it is not just a data versioning tool: it also enables model and pipeline tracking. It runs on top of Git, which makes it easy to learn for Git users. At the same time, it overcomes the limitations of storing big files by storing them remotely (e.g. Azure, S3) and keeping only their metadata in Git.

MLflow is a tool that integrates easily with your model code and can track dependencies, model parameters, metrics, and artifacts. Every run is linked with its corresponding Git commit. Once the model is trained, MLflow can package it in different flavors (e.g. Python/R function, H2O, Spark, TensorFlow…), ready to be deployed. DVC also works alongside Git. While MLflow helps you manage the machine learning lifecycle, DVC helps you manage your datasets.

In this tutorial, we will learn how to leverage the capabilities of these tools. We will go through a toy ML project and look at sample code showing how to increase the reproducibility of individual steps.

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business. Read more here: https://databricks.com/product/unified-data-analytics-platform See all the previous Summit sessions: https://databricks.com/sparkaisummit/north-america/sessions
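The description above says DVC stores large files remotely and keeps only their metadata in Git. The sketch below illustrates that general idea using only the Python standard library. It is not DVC's implementation or file format: the `track` and `restore` functions, the `.pointer.json` suffix, and the local `store` directory (standing in for remote storage such as S3 or Azure) are all hypothetical names chosen for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, store: Path) -> Path:
    """Copy data_file into store under its hash and write a small pointer file.

    Hypothetical helper: the pointer file is the only thing that would be
    committed to Git; the data itself lives in `store`.
    """
    digest = file_digest(data_file)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, store / digest)
    meta = {
        "path": data_file.name,
        "sha256": digest,
        "size": data_file.stat().st_size,
    }
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, store: Path) -> Path:
    """Recreate the data file next to its pointer and verify its hash."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(store / meta["sha256"], target)
    if file_digest(target) != meta["sha256"]:
        raise ValueError(f"hash mismatch for {target}")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        dataset = root / "train.csv"
        dataset.write_bytes(b"x,y\n1,2\n")
        pointer = track(dataset, root / "store")
        dataset.unlink()  # simulate a fresh checkout that has only the pointer
        print(restore(pointer, root / "store").read_bytes())
```

Because the pointer file is a few lines of text, each Git commit can record exactly which dataset version it was trained on without the repository growing with the data.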
