In this tutorial, we take messy text data and wrangle it into a clean form suitable for analysis. You will learn to think through and implement data preparation tasks using the R statistical language. Along the way, you will use functions from the tidyverse collection of packages that help you work efficiently with data.
Some concepts covered:
- install package (tidyverse)
- read a comma-separated (CSV) file (read_csv)
- how to spot a pattern in text data (regex)
- replace text (str_replace_all)
- clean column names (clean_names from the janitor package)
- separate column into multiple columns
- visualize data (ggplot)
- write a regular expression pattern for extracting text
- create new columns using the mutate function
- use mutate and across functions to transform existing columns
- remove extra spaces (str_squish)
- capitalize text (str_to_title)
- extract a date-time stamp from text with the lubridate package
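To give a feel for how these pieces fit together, here is a minimal sketch in R. It is not the tutorial's actual code: the inline CSV text, the column names (`Call Title`, `Desc Text`), the regular expressions, and the new column names are all made up for illustration, and the real Kaggle file may be laid out differently. It assumes the tidyverse and janitor packages are installed and that readr is version 2.0 or later (for `I()` literal input).

```r
# install.packages(c("tidyverse", "janitor"))  # one-time setup
library(tidyverse)   # dplyr, tidyr, stringr, readr, ggplot2, ...
library(lubridate)   # attached automatically by tidyverse >= 2.0.0
library(janitor)

# Hypothetical toy data standing in for the real file.
# With a real file you would pass its path to read_csv() instead.
csv_text <- "Call Title,Desc Text
EMS: back  pains,Station 332; 2015-12-10 @ 17:10:52;
Fire:  gas-odor/leak,Station 345; 2015-12-10 @ 17:29:21;
Traffic: vehicle accident -,2015-12-11 @ 08:05:00;
"
raw <- read_csv(I(csv_text), show_col_types = FALSE)

clean <- raw %>%
  clean_names() %>%                       # `Call Title` -> call_title, etc.
  # split "EMS: back pains" into two columns at the colon
  separate(call_title, into = c("category", "reason"), sep = ":\\s*") %>%
  # replace hyphens and slashes with spaces
  mutate(reason = str_replace_all(reason, "[-/]", " ")) %>%
  # collapse repeated spaces and trim both ends, in several columns at once
  mutate(across(c(category, reason), str_squish)) %>%
  # capitalize each word
  mutate(reason = str_to_title(reason)) %>%
  # pull the "YYYY-MM-DD @ HH:MM:SS" stamp out of the free text and parse it
  mutate(
    stamp     = str_extract(desc_text, "\\d{4}-\\d{2}-\\d{2} @ \\d{2}:\\d{2}:\\d{2}"),
    called_at = ymd_hms(str_replace(stamp, " @ ", " "))
  )

# Bar chart of calls per category; print(p) displays it
p <- ggplot(clean, aes(x = category)) +
  geom_bar()
```

Each step maps onto one of the concepts listed above, so you can run the pipeline one line at a time and inspect the intermediate result to see what each function changes.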
Note: The data used is a subset of the emergency data available on Kaggle.
**Downloads**
Data: https://www.kaggle.com/mchirico/montcoalert/data
Code & Data: https://drive.google.com/file/d/1AyBMHXA5cv1bKX_Nd2WlxAQIBOLKKYd4/view?usp=sharing