6 Data Cleaning and Transformation
This part covers standardizing messy datasets, deduplication, edit rule validation, and exception reporting. These solutions turn raw data into analysis-ready inputs.
Note: Published Solutions
The following solutions have been developed through the clinic. Additional solutions will be added as they are completed.
- Stack and Check Spreadsheets: a Shiny app for stacking multiple Excel/CSV files and checking for differences in column structure
Have a data cleaning or transformation problem? Submit it to the clinic or open a GitHub Issue.
6.1 What Belongs Here
Solutions in this part address tasks involving data preparation and quality:
- Data standardization to normalize inconsistent formats, codes, or naming conventions across sources
- Deduplication using deterministic or probabilistic matching to identify and resolve duplicate records
- Edit rule validation to automatically check data against defined business rules (e.g., diagnosis date precedes treatment date, age within valid range)
- Exception reporting to flag records that fail validation checks and generate actionable reports
- Data reshaping to pivot, merge, split, or restructure datasets for downstream analysis
- Missing data handling with standardized imputation or flagging strategies
- Geocoding and address standardization using free tools to prepare location data for mapping
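
To make the edit rule validation and exception reporting items above more concrete, here is a minimal Python sketch. It is not taken from any published clinic solution. The record fields (`id`, `age`, `diagnosis_date`, `treatment_date`), the 0 to 120 age range, the choice to allow same-day diagnosis and treatment, and the helper name `find_exceptions` are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records; the field names and values are illustrative only.
records = [
    {"id": 1, "age": 54, "diagnosis_date": date(2021, 3, 1), "treatment_date": date(2021, 4, 15)},
    {"id": 2, "age": 130, "diagnosis_date": date(2020, 6, 10), "treatment_date": date(2020, 7, 1)},
    {"id": 3, "age": 47, "diagnosis_date": date(2022, 9, 20), "treatment_date": date(2022, 8, 5)},
]

# Each edit rule is a (name, predicate) pair. The predicate returns True
# when the record passes. The thresholds here are assumed, not prescribed.
EDIT_RULES = [
    ("age_in_range", lambda r: 0 <= r["age"] <= 120),
    # Same-day diagnosis and treatment is treated as valid (an assumption).
    ("diagnosis_before_treatment", lambda r: r["diagnosis_date"] <= r["treatment_date"]),
]


def find_exceptions(records, rules):
    """Return one exception row per (record, failed rule) combination."""
    exceptions = []
    for rec in records:
        for name, passes in rules:
            if not passes(rec):
                exceptions.append({"id": rec["id"], "rule": name})
    return exceptions


exceptions = find_exceptions(records, EDIT_RULES)

# A very small exception report: one line per failed check.
for exc in exceptions:
    print(f"record {exc['id']} failed {exc['rule']}")
```

Keeping the rules in a plain list separates the business logic from the reporting loop, so adding a new check does not require changing how exceptions are collected or reported.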