7 The Problem

Public health teams frequently distribute Excel-based data collection forms to multiple sites, facilities, or respondents. A common workflow looks like this: a program coordinator creates a standardized spreadsheet template, emails it to ten county offices, and asks each office to fill in their data and return it. Sometimes the spreadsheets are stored in a shared folder instead. Either way, the coordinator eventually needs to combine all responses into a single dataset for analysis.

This process breaks down in predictable ways:

  • Column headers get renamed. One site changes “date_of_diagnosis” to “dx_date.” Another adds a space. A third capitalizes it differently.
  • Columns are added or removed. A site inserts an extra column for local tracking, or deletes a column they consider irrelevant.
  • Sheet names differ. The template uses “Data Entry” but a respondent renames the tab to “Sheet1.”
  • Manual merging is tedious and error-prone. Copy-pasting rows from ten spreadsheets into a master file can take hours and risks misaligned columns, dropped records, or duplicated data.

The coordinator may then try to analyze the combined data in Excel or R, only to discover that the structural inconsistencies have created problems downstream: mismatched column counts, unexpected blanks, or silently misaligned values.

ImportantWhy Not Use a Proper Data Collection Tool?

If you have the option, a structured data collection platform like REDCap, KoBoToolbox, or even Google Forms is almost always preferable to distributing Excel templates. These tools enforce field types, validation rules, and consistent structure at the point of entry, eliminating the column mismatch problem entirely.

However, many public health teams still end up with spreadsheets for practical reasons:

  • No access to REDCap or similar tools. Not every organization has a REDCap instance, and setting one up requires server infrastructure and IRB coordination.
  • Speed of setup. Creating an Excel template and emailing it takes minutes. Configuring a REDCap instrument with branching logic and validation can take days.
  • Existing systems export spreadsheets. Cancer registries, laboratory information systems, EHR extracts, and surveillance platforms often produce Excel or CSV exports as their standard output format. The data was never “collected” in a spreadsheet; it was generated by another system.
  • External partners use Excel. When collecting data from facilities, providers, or agencies outside your organization, you often cannot dictate the tool. You get what you get.
  • One-time or ad-hoc collections. For a quick data call that will only happen once, the overhead of setting up a formal data collection instrument may not be justified.

This tool is designed for those situations: when spreadsheets are the reality, whether by choice or by circumstance.

7.1 The Scale of the Problem

Consider a state cancer registry that collects annual data submissions from 88 county health departments using a standardized Excel template. Each year, staff must:

  1. Download or receive 88 completed workbooks
  2. Open each one to verify the correct sheet exists
  3. Check that column headers match the template
  4. Copy and paste (or append) the data into a master file
  5. Troubleshoot the files that do not match

At 5 to 10 minutes per file, this is 7 to 15 hours of manual work, repeated every reporting cycle. And that estimate assumes nothing goes wrong.

8 The Solution

The Stack & Check Spreadsheets app is a Shiny web application that automates this entire workflow. Upload your spreadsheets, click a button, and get:

  • A combined dataset with a source file column tracking which record came from which file
  • A column comparison matrix showing exactly which columns appear in which files, making structural differences immediately visible
  • A file issues report listing any files with missing sheets, read errors, or empty data
  • An interactive pivot table for exploring the combined data
  • A branded Excel download containing all analysis sheets

The app supports .xlsx, .xls, and .csv files, and can handle mixed file types in a single upload.

8.1 Why This App Exists

Over the past 20 years, I have watched colleagues across public health programs wrestle with this exact task: merging a pile of returned spreadsheets into a single, usable dataset. I have also received enough requests to perform this merge myself that the pattern became impossible to ignore. The task is universal, the pain is real, and the solution is straightforward to automate.

I built this app so that I can share it with colleagues who have little or no programming experience. Rather than walking someone through an R script or doing the merge for them, I can point them to this tool and let them handle it independently.

NoteA Note on GUI Apps and Automation

In general, I do not recommend using GUI applications for automated workflows. If a task is recurring, the better investment is a scripted pipeline that runs without manual interaction. However, a point-and-click app like this one earns its place in two situations:

  • Truly one-off requests. Sometimes you receive a batch of spreadsheets once and will never repeat the exercise. Setting up a project directory, writing a script, and configuring outputs is not justified for a single use.
  • Quick pre-project checks. Before committing to a full data pipeline, you may want to take a first look at received files: Are the columns consistent? Are there empty submissions? How many records came back? This app provides that structural overview in seconds, helping you decide how to proceed.

For anything that will be repeated on a schedule, build a script.

8.2 Why Local-First Sharing?

Shiny applications can be shared in several ways: as hosted web applications on shinyapps.io, on a Shiny Server, or as source files that users run locally. I explicitly chose local-first distribution for this app because of the data it handles.

Public health spreadsheets frequently contain protected health information (PHI), personally identifiable information (PII), or other sensitive data. Hosting this app as a public webpage would invite users to upload that data to an external server, which creates unnecessary risk and may violate organizational data governance policies. By distributing the app as source code that runs on the user’s own machine, no data ever leaves the local environment.

TipHosting with Posit Connect

If your organization wants to make this app available to a team internally, Posit Connect is a publishing platform that supports Shiny applications with enterprise features:

  • Password authentication
  • SSL support
  • Administrator tools
  • Priority support

Posit Connect allows you to publish Shiny applications, R Markdown reports, dashboards, Jupyter Notebooks, and more in a centralized, access-controlled environment. You can publish directly from the RStudio IDE and configure flexible security policies to control who can access the app.

This is a good option for teams working in a for-profit or regulated setting where centralized access control and audit trails are required. Learn more at posit.co/products/enterprise/connect.

8.3 User Story

As a public health data coordinator who collects Excel-based data from multiple sites, I want a tool that automatically combines spreadsheets and flags structural differences, so that I can spend my time on analysis instead of copy-pasting and troubleshooting column mismatches.

8.4 GPS (Given-Person-Should)

Given a set of Excel workbooks returned by multiple respondents using a common template, the data coordinator should upload them to the Stack & Check app and review the column comparison matrix to identify and resolve structural inconsistencies before combining the data for analysis.

9 Getting Started

9.1 Install Requirements

The app requires R (version 4.1 or later) and the following packages:

install.packages(c(
  "shiny",
  "bslib",
  "readxl",
  "dplyr",
  "janitor",
  "openxlsx",
  "DT",
  "tidyr",
  "purrr",
  "tibble",
  "rpivotTable"
))

RStudio Desktop (free) is the recommended IDE for running the app, though any R environment will work.

9.2 Download the App

The app is available in the Public Health Automation Clinic GitHub repository:

  1. Download the app folder from the repository:
  2. Clone the full repository (recommended if you use Git):
git clone https://github.com/andre-inter-collab-llc/Public-Health-Automation-Clinic.git

The app files are located at apps/stack-and-check-spreadsheets/ within the repository.

9.3 Run the App

Open app.R in RStudio and click Run App, or run the following from an R console:

shiny::runApp(here::here("apps", "stack-and-check-spreadsheets"))

The app will open in your default web browser. Everything runs locally on your machine; no data leaves your computer.

Screenshot of the Stack and Check Spreadsheets app About page showing the app description and feature overview.

The About page introduces the app, explains its purpose, and lists the key features.

10 How to Use the App

10.1 Step 1: Upload Files

Click Upload Spreadsheets in the sidebar and select one or more Excel (.xlsx, .xls) or CSV (.csv) files. You can upload a mix of file types in a single batch.

Screenshot of the empty Workbench tab with the file upload control and processing options visible in the sidebar.

The Workbench tab before any files are uploaded, showing the file upload area and processing controls.

10.2 Step 2: Select Sheet Name

For Excel files, the app reads all available sheet names across uploaded workbooks and presents them in a dropdown. Select the sheet containing the data you want to combine. CSV files do not require sheet selection.

10.3 Step 3: Set the Header Row

If your spreadsheets have title rows, instructions, or logos above the actual column headers, set the Header Row to the row number where headers appear. The default is row 1.

Screenshot showing five uploaded Excel files listed in the sidebar with sheet name dropdown and header row selector.

After uploading five county spreadsheets, the sidebar shows the selected sheet name and header row settings.

10.4 Step 4: Process Files

Click Process Files. The app will:

  1. Read data from each file using the selected sheet name and header row
  2. Clean and standardize column names (lowercase, underscores, consistent formatting)
  3. Add a source_file column to track the origin of each record
  4. Combine all data into a single dataset
  5. Build a column comparison matrix
  6. Log any file issues (missing sheets, read errors, empty data)

A progress bar shows status during processing.

10.5 Step 5: Review Results

The Workbench tab provides four views:

10.5.1 Combined Data

The full merged dataset with all records from all files. Use the column filters to search and sort. The source_file column identifies which file each record came from.

Screenshot of the Combined Data tab displaying a merged dataset with columns from all uploaded files and a source_file tracking column.

The Combined Data view showing all records merged into a single table with the source_file column.

10.5.2 Column Comparison

The Column Comparison tab contains three sub-views, designed to scale from a handful of files to hundreds:

10.5.2.1 Discrepancies Only

The default and most compact view. Shows only columns that are not present in every file. For each mismatched column you can see:

  • How many files contain it and how many are missing it
  • The percentage of files where it appears
  • The exact filenames that are missing the column

If all files have identical column structures, this view shows a green “no discrepancies” message and an empty table. This is the view to check first: if it is empty, your data is structurally consistent and you can proceed directly to analysis.

Screenshot of the Discrepancies Only sub-tab showing a table of columns with mismatches across files, including file counts and missing file names.

The Discrepancies Only view highlighting columns that are not present in every file.

10.5.2.2 Full Matrix

A complete presence/absence matrix with files as rows and column names as headers:

  • (teal) indicates the column is present in that file
  • (red) indicates the column is missing from that file
  • The columns_present and columns_missing summary columns appear at the end of each row
  • The file column is frozen so it stays visible while scrolling horizontally
  • Files with more missing columns sort to the top

This layout scales better than column-per-file because typical datasets have 10 to 50 columns (manageable as table headers) while file counts can grow to hundreds or thousands (which work well as scrollable rows). For very large uploads, use the Excel download to filter and search the matrix in Excel.

Screenshot of the Full Matrix sub-tab showing a file-by-column presence/absence grid with teal check marks and red X marks.

The Full Matrix view with files as rows and columns as headers, using check marks and X marks to show presence and absence.

10.5.2.3 File Profiles

A per-file summary showing:

Column Meaning
Rows Number of data rows in the file
Columns Number of columns in the file
Expected Number of “expected” columns (those present in more than half of all files)
Missing How many expected columns are absent from this file
Missing Columns Semicolon-separated list of which expected columns are missing
Extra How many columns this file has that are not in the expected set
Extra Columns Semicolon-separated list of non-standard columns added to this file

Files with more missing or extra columns sort to the top, making it easy to spot outliers that need manual correction before analysis.

Screenshot of the File Profiles sub-tab showing a per-file summary table with rows, columns, expected, missing, and extra column counts.

The File Profiles view summarizing each file’s row count, column count, and any missing or extra columns.

10.5.3 File Issues

This tab is separate from column discrepancies. It lists files that could not be processed at all:

Issue Type Meaning
Missing Sheet The selected sheet name was not found in the workbook
Read Error The file could not be read (corrupt, password-protected, or unsupported format)
Empty Data The file or sheet contained no data rows after the header

If all files processed successfully, this tab shows “All files processed successfully.” Files listed here are excluded from the combined dataset and from the column comparison.

Screenshot of the File Issues tab showing a table of files with processing problems such as missing sheets or read errors.

The File Issues tab listing files that could not be processed, with issue type descriptions.

10.5.4 Explore Data

An interactive pivot table (rpivotTable) for ad-hoc exploration of the combined data. Drag columns into rows, columns, or filters to create cross-tabulations, charts, and summaries without writing code.

Screenshot of the Explore Data tab showing the rpivotTable interface with drag-and-drop column fields for building cross-tabulations.

The Explore Data tab with an interactive pivot table for ad-hoc analysis of the combined dataset.

10.6 Step 6: Download

Click Download Combined XLSX to export a formatted Excel workbook with five sheets:

  1. Combined Data: the full merged dataset
  2. Discrepancies: only the mismatched columns (empty if all files match)
  3. Column Matrix: the full file-by-column presence/absence matrix
  4. File Profiles: per-file row/column counts with missing and extra columns listed
  5. File Issues: the processing issue log

The workbook uses professional formatting with styled headers. The Discrepancies and File Profiles sheets are particularly useful for filtering and searching in Excel when dealing with hundreds of files.

11 Supported File Formats

Format Extension Notes
Excel (modern) .xlsx Full support including sheet selection
Excel (legacy) .xls Full support including sheet selection
CSV .csv Sheet selection is not applicable

Mixed file types can be uploaded together. For example, you can upload eight .xlsx files and two .csv files in the same batch.

12 Tips and Best Practices

TipPreventing Problems at the Source

The best way to handle column mismatches is to prevent them. When distributing data collection templates:

  • Lock the header row in Excel (Review > Protect Sheet, allow only data entry in the data range)
  • Use data validation on key columns to constrain inputs
  • Name your sheets consistently and include instructions not to rename them
  • Include a “do not modify” note in the template header

The Stack & Check app is most valuable when prevention was not possible, which, in practice, is most of the time.

TipLarge File Uploads

The app does not impose a file size limit. However, uploads exceeding approximately 250 MB total may cause slow performance or out-of-memory errors depending on your machine’s available RAM. For very large collections of spreadsheets, consider processing them in batches and combining the downloaded outputs.

NoteLocal-First

This app runs entirely on your local machine. No data is uploaded to external servers. This makes it suitable for use with protected health information (PHI) and other sensitive data, subject to your organization’s policies for running local software.