In today's video, we'll learn about fuzzy string matching (also known as approximate string matching) and how to perform it in Python.
A common use case for fuzzy string matching is joining two datasets. Perhaps the datasets have a variable in common, but the same information is written slightly differently in each one (e.g., “Amazon” vs. “Amazon.com, Inc”). How can we determine whether two such values refer to the same thing? We can use fuzzy string matching, a popular Natural Language Processing (NLP) technique!
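To make the join use case concrete, here is a minimal sketch that uses only difflib from Python's standard library. The company names, the helper name best_match, and the 0.5 cutoff are made up for illustration; they are not taken from the video's dataset, and a real join would need a cutoff tuned to the data.

```python
import difflib

def best_match(name, candidates, cutoff=0.5):
    """Return the candidate most similar to `name`, or None if none reaches the cutoff."""
    # Compare in lowercase so differences in capitalization don't lower the score,
    # but keep a mapping back to the original spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# Hypothetical key columns from two datasets we would like to join
left = ["Amazon", "Microsoft"]
right = ["Amazon.com, Inc", "Microsoft Corporation", "Apple Inc."]

for name in left:
    print(name, "->", best_match(name, right))
```

Each name in the first list is paired with its closest counterpart in the second list, and names with no sufficiently similar counterpart come back as None instead of being forced into a bad match.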
We'll start with a conceptual overview of fuzzy string matching, and then look at some examples in Python using several different algorithms. We’ll use fuzzywuzzy, polyfuzz, and difflib, three popular Python packages for this task (difflib is part of Python's standard library). The string matching algorithms implemented in these packages include Levenshtein Distance (sometimes called "Edit Distance") and Gestalt Pattern Matching (sometimes called "Ratcliff/Obershelp Pattern Matching").
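For a feel of what these two algorithms compute, here is a small sketch. The levenshtein function is a plain textbook implementation written for illustration (it is not the code that fuzzywuzzy or polyfuzz run internally), and gestalt_ratio is a hypothetical helper name wrapping difflib.SequenceMatcher, which the Python documentation describes as similar to Ratcliff/Obershelp "gestalt pattern matching".

```python
from difflib import SequenceMatcher

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]

def gestalt_ratio(a, b):
    """Similarity in [0, 1]: 2 * (matched characters) / (total characters)."""
    return SequenceMatcher(None, a, b).ratio()

print(levenshtein("kitten", "sitting"))            # 3
print(gestalt_ratio("Amazon", "Amazon.com, Inc"))  # 12/21, about 0.57
```

Note that the two measures point in opposite directions: a smaller Levenshtein distance means the strings are more alike, while a larger ratio means they are more alike.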
The code, slides, and dataset used in this video can be found here: https://github.com/melissavanbussel/YouTube-Tutorials/tree/main/fuzzy_string_matching
The dataset originated from Kaggle: https://www.kaggle.com/code/leandrodoze/fuzzy-string-matching-with-hotel-rooms/data
The blog post about PolyFuzz referenced in the video is located here: https://towardsdatascience.com/string-matching-with-bert-tf-idf-and-more-274bb3a95136
If you like this video, please subscribe to my channel so that I can continue to make content like this! 😊
0:00 - Overview of fuzzy string matching
3:49 - Fuzzy string matching in Python
9:53 - Using the difflib package
16:32 - Using the fuzzywuzzy package
19:58 - Using the polyfuzz package