Data Documentation

How RMP Map processes, deduplicates, and displays EPA Risk Management Program data.

Data Provenance

The data displayed on RMP Map originates from the U.S. EPA Risk Management Program database and follows this chain of custody:

  1. U.S. EPA collects Risk Management Plans from ~12,500 facilities under Section 112(r) of the Clean Air Act
  2. Data Liberation Project obtains the database through FOIA requests and publishes it as a SQLite database under CC BY-SA 4.0
  3. RMP Map imports this data into PostgreSQL with minimal transformation, applying deduplication logic only at display time

Current Data Version

  • Data version: 3
  • DLP export date: 2026-02-05
  • Data coverage through: 2025-12-30

What We Import

We import the Data Liberation Project's SQLite database into PostgreSQL with minimal transformation:

  • Direct table copy — All 50+ tables copied row-for-row
  • No merging or deduplication — All submission records preserved
  • No schema changes — Column names and types match source
  • Null normalization — Empty strings converted to NULL (PostgreSQL best practice)
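
The null normalization step can be pictured with a short sketch. This is an illustration only, not RMP Map's import code: the function name normalize_nulls and the FacilityStr2 column are assumptions made for the example, and Python's None stands in for SQL NULL.

```python
from typing import Any, Dict


def normalize_nulls(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with empty strings replaced by None (SQL NULL)."""
    return {col: (None if val == "" else val) for col, val in row.items()}


# Hypothetical row; "FacilityStr2" is an assumed column name for illustration.
row = {"FacilityName": "Example Plant", "FacilityStr2": "", "FacilityCity": "Philadelphia"}
clean = normalize_nulls(row)
print(clean["FacilityStr2"])  # None
```

Only exact empty strings are converted here; values such as 0 or a single space are left untouched, which keeps the transformation minimal.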

The One Enrichment: Geocoding Validation

We maintain a separate facility_geocoding table that validates facility-reported coordinates against state bounding boxes. This table is additive—it doesn't modify source data, just supplements it. Coordinates that don't match the facility's stated location are flagged for address-based geocoding.
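
A bounding-box check of this kind might look like the sketch below. The box values are a rough, illustrative rectangle around Pennsylvania, and the names BoundingBox, STATE_BOXES, and needs_address_geocoding are hypothetical; the actual facility_geocoding logic and its thresholds are not documented here.

```python
from typing import NamedTuple, Optional


class BoundingBox(NamedTuple):
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float


# Rough, illustrative box for Pennsylvania; not the values RMP Map uses.
STATE_BOXES = {"PA": BoundingBox(39.7, 42.3, -80.6, -74.6)}


def needs_address_geocoding(state: str, lat: Optional[float], lon: Optional[float]) -> bool:
    """True when coordinates are missing or fall outside the state's box."""
    box = STATE_BOXES.get(state)
    if box is None or lat is None or lon is None:
        # Assumption: anything that cannot be checked is flagged.
        return True
    inside = box.min_lat <= lat <= box.max_lat and box.min_lon <= lon <= box.max_lon
    return not inside


print(needs_address_geocoding("PA", 39.95, -75.17))  # False: inside the box
print(needs_address_geocoding("PA", 39.95, 75.17))   # True: longitude sign flipped
```

A flipped longitude sign is the kind of reporting error such a check would catch, while leaving the source row itself unchanged.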

Data Limitations & Reporting Issues

RMP Map displays only what we receive. The data shown here is not guaranteed to be comprehensive. Facilities may have incomplete records, missing information, or reporting errors in their original RMP submissions to the EPA.

If You Find an Issue

Type of Issue | Who to Contact | Examples
Our calculations or display | rmpmap@drexel.edu | Wrong accident counts, duplicate facilities showing, incorrect deduplication, map display errors
Source data quality | U.S. EPA RMP Program | Missing facilities, wrong addresses, incorrect chemical lists, missing accidents, outdated information

Confidential Business Information (CBI)

Certain RMP data fields are withheld from public release because facilities may claim them as Confidential Business Information (CBI) under 40 CFR Part 2. In the public dataset we receive, the following fields are entirely empty across all facilities:

  • Worst-case release scenarios — quantity released, distance to toxic/flammable endpoint, affected residential population, nearby schools, hospitals, and other public receptors
  • Alternative release scenarios — similar fields for more realistic release modeling

The database tables for these scenarios exist and contain row-level linkages to regulated chemicals, but the key impact fields (distance, quantity, population) are null across all 237,000+ rows. This means we cannot display worst-case or alternative release scenario details, even though the RMP regulation requires facilities to report them. The data is collected by EPA but not included in the public data release.

If you need worst-case scenario data for a specific facility, you may be able to obtain it through a FOIA request to EPA or by contacting the facility's Local Emergency Planning Committee (LEPC).

We do not modify source data. If a facility's information appears incorrect in our system, it's almost certainly incorrect in the EPA's database as well. We faithfully reproduce what we receive; corrections must be made at the source.

We're happy to hear about potential data issues so we can verify whether they originate in our processing or in the source data itself. Contact us at rmpmap@drexel.edu and we'll investigate.

Understanding the Data Structure

The EPA RMP database has a key structural characteristic that can create apparent duplicates if not handled correctly.

Two Types of Facility IDs

ID Type | Description | Behavior
FacilityID | Internal submission ID | New ID assigned with every RMP submission
EPAFacilityID | Canonical facility identifier | Stable across submissions for the same physical facility

When a facility submits a new RMP (every 5 years, or when changes occur), it receives a new FacilityID but retains its same EPAFacilityID.

Example: A Facility with Multiple Submissions

# Same physical facility, three RMP submissions
EPAFacilityID: 100000193471 (stable - identifies the facility)
# Each submission gets a new FacilityID:
FacilityID 51234 (submitted 2015)
FacilityID 67891 (submitted 2018)
FacilityID 78902 (submitted 2022)

The Accident History Problem

Each RMP submission includes the facility's complete accident history. This means the same accident appears in multiple rows with different AccidentHistoryID values (one per submission).

Why This Matters

If we naively counted accidents by AccidentHistoryID: 3 submissions × 3 accidents = 9 accident records (wrong!)

If we deduplicate by the actual accident event: 3 unique accidents = 3 accidents (correct)
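
The arithmetic above can be reproduced with a small worked example. The accident dates, times, and AccidentHistoryID values below are invented for illustration; only the FacilityID values come from the example earlier in this section.

```python
# Three accidents, each re-reported in three submissions with fresh IDs.
# Dates and times are made up for the example.
accidents = [("2011-03-02", "08:15"), ("2012-07-19", "14:30"), ("2014-11-05", "22:00")]

rows = []
next_id = 1
for facility_id in (51234, 67891, 78902):
    for accident_date, accident_time in accidents:
        rows.append({
            "AccidentHistoryID": next_id,
            "FacilityID": facility_id,
            "AccidentDate": accident_date,
            "AccidentTime": accident_time,
        })
        next_id += 1

# Naive: one count per AccidentHistoryID.
naive_count = len({r["AccidentHistoryID"] for r in rows})
# Deduplicated: one count per (date, time) event.
unique_count = len({(r["AccidentDate"], r["AccidentTime"]) for r in rows})

print(naive_count, unique_count)  # 9 3
```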

Methodology: How We Handle the Data

We apply deduplication logic at display time, not at import. This preserves the source data while presenting a coherent view to users.

Strategy: EPAFacilityID as Canonical Identifier

Throughout the application, we treat EPAFacilityID as the "real" facility identifier:

  • URLs use EPAFacilityID: /facilities/100000193471
  • Search counts unique EPAFacilityID values
  • Map markers are deduplicated by EPAFacilityID

Facility Deduplication: Latest Submission Wins

When displaying facility information, we use PostgreSQL's DISTINCT ON to select the most recent submission:

SELECT DISTINCT ON ("EPAFacilityID")
  "FacilityName", "FacilityStr1", "FacilityCity", ...
FROM tbls1facilities
WHERE "EPAFacilityID" = '100000193471'
ORDER BY "EPAFacilityID", "ReceiptDate" DESC

This returns one row per facility—the most recently submitted data.
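
For readers less familiar with DISTINCT ON, the same "latest submission wins" rule can be sketched in Python. This is an in-memory illustration, not the application's code: the helper name latest_per_facility and the exact receipt dates are assumptions, and the sketch assumes every row has a ReceiptDate.

```python
from datetime import date
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def latest_per_facility(rows: Iterable[Row]) -> Dict[str, Row]:
    """Keep, for each EPAFacilityID, the row with the newest ReceiptDate."""
    latest: Dict[str, Row] = {}
    for row in rows:
        key = row["EPAFacilityID"]
        if key not in latest or row["ReceiptDate"] > latest[key]["ReceiptDate"]:
            latest[key] = row
    return latest


# Receipt dates are invented; only the years mirror the earlier example.
submissions = [
    {"EPAFacilityID": "100000193471", "FacilityID": 51234, "ReceiptDate": date(2015, 6, 1)},
    {"EPAFacilityID": "100000193471", "FacilityID": 78902, "ReceiptDate": date(2022, 6, 1)},
    {"EPAFacilityID": "100000193471", "FacilityID": 67891, "ReceiptDate": date(2018, 6, 1)},
]

current = latest_per_facility(submissions)["100000193471"]
print(current["FacilityID"])  # 78902
```

One difference worth knowing: PostgreSQL sorts NULLs first under DESC unless NULLS LAST is specified, so a row with a missing ReceiptDate would be selected by the SQL query, whereas this sketch would raise an error on such a row.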

Accident Deduplication: Unique Events by Date + Time

Since the same accident appears in multiple submissions, we identify unique accidents by their date and time:

-- Get unique accidents for a facility
SELECT DISTINCT ON ("AccidentDate", "AccidentTime")
  "AccidentHistoryID", "AccidentDate", "AccidentTime", ...
FROM tbls6accidenthistory
WHERE "FacilityID" IN (
  -- all FacilityIDs for this EPAFacilityID
  SELECT "FacilityID" FROM tbls1facilities
  WHERE "EPAFacilityID" = '100000193471'
)
ORDER BY "AccidentDate" DESC, "AccidentTime"

Why Date + Time?

We use (AccidentDate, AccidentTime) as the deduplication key because:

  1. It identifies the real-world event — Two records with the same date/time at the same facility are the same accident
  2. It's always populated — Unlike other fields that may vary between submissions
  3. It's stable — The accident date doesn't change across submissions

Note: AccidentHistoryID is not a good deduplication key because each submission generates new IDs for the same historical accidents.

Summary Table

Entity | Raw Database | Display Logic | Dedup Key
Facilities | Multiple rows per physical facility | Show most recent submission | EPAFacilityID + latest ReceiptDate
Accidents | Same accident in multiple submissions | Count/show unique events | (AccidentDate, AccidentTime) per facility
Chemicals | Listed per submission | Show from most recent submission | N/A (inherited from facility dedup)

What You See

Facility Detail Page

  • Current facility info (from most recent submission)
  • Count of unique accidents (deduplicated)
  • List of all RMP submissions for this facility (for transparency)
  • List of accidents (deduplicated by date/time)

Search Results

  • One result per physical facility (by EPAFacilityID)
  • Accurate accident counts (deduplicated)
  • Facility name/address from most recent submission

Map View

  • One marker per physical facility
  • Popup shows current name and accurate accident counts

Design Philosophy

  1. Preserve source data — We don't modify or delete any data from the DLP database
  2. Deduplicate at display time — Apply logic when querying, not when importing
  3. Transparent about submissions — Users can see all RMP submissions for a facility
  4. Use stable identifiers — EPAFacilityID over FacilityID, date/time over AccidentHistoryID

This approach lets us:

  • Update data by simply re-importing (no complex merge logic)
  • Provide historical submission views if needed in the future
  • Maintain data provenance back to the source

Using the API

RMP Map provides a free, public API for programmatic access to this data. The API returns JSON and requires no authentication.

Key Points

  • Deduplication is applied — API responses use the same deduplication logic described above
  • EPAFacilityID is canonical — Use this ID to reference facilities, not FacilityID
  • Accident counts are accurate — Counts reflect unique events, not submission records
  • GeoJSON support — Request format=geojson for map-ready data

Common Endpoints

/api/search — Search facilities with filters
/api/facilities/:id — Get facility details by EPAFacilityID
/api/accidents/:id — Get accident details
/api/facilities/geo — All facilities as GeoJSON

For complete endpoint documentation, parameters, and examples, see the full API documentation.
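
As a rough sketch of calling these endpoints from Python using only the standard library: the host in BASE_URL is a placeholder rather than the real one, the helper names are hypothetical, and whether /api/search accepts format=geojson is an assumption. Consult the full API documentation for actual parameters.

```python
import json
from typing import Any, Dict
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://rmpmap.example"  # placeholder host, not the real one


def facility_url(epa_facility_id: str) -> str:
    """Build the detail URL for a facility, keyed by EPAFacilityID."""
    return f"{BASE_URL}/api/facilities/{quote(epa_facility_id, safe='')}"


def search_url(params: Dict[str, str]) -> str:
    """Build a search URL; parameter names are assumptions except format=geojson."""
    return f"{BASE_URL}/api/search?{urlencode(params)}"


def fetch_json(url: str) -> Any:
    """Fetch a URL and decode the JSON body (no authentication is required)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


print(facility_url("100000193471"))
print(search_url({"format": "geojson"}))
```

With the real host substituted in, fetch_json(facility_url("100000193471")) would return the decoded JSON for that facility.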

API Metadata

All API responses include a _meta block with data provenance information:

{
  "_meta": {
    "source": "U.S. EPA Risk Management Program via Data Liberation Project",
    "license": "CC BY-SA 4.0",
    "disclaimer": "Data as reported by facilities to EPA"
  }
}

See API Terms of Use for rate limits and attribution requirements.
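
One way to use the _meta block is to build an attribution line from it. The sketch below parses a payload shaped like the example above; the attribution function and its output format are illustrative assumptions, not a format required by the API Terms of Use.

```python
import json

# Sample payload containing only the _meta block shown in this section.
response_text = """
{
  "_meta": {
    "source": "U.S. EPA Risk Management Program via Data Liberation Project",
    "license": "CC BY-SA 4.0",
    "disclaimer": "Data as reported by facilities to EPA"
  }
}
"""


def attribution(payload: dict) -> str:
    """Build a one-line credit from the _meta block, tolerating missing keys."""
    meta = payload.get("_meta", {})
    source = meta.get("source", "unknown source")
    license_name = meta.get("license", "unknown license")
    return f"Source: {source} ({license_name})"


payload = json.loads(response_text)
print(attribution(payload))
```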