Data Documentation

How RMP Map processes, deduplicates, and displays EPA Risk Management Program data.

Data Provenance

The data displayed on RMP Map originates from the U.S. EPA Risk Management Program database and follows this chain of custody:

1. U.S. EPA collects Risk Management Plans from ~12,500 facilities under Section 112(r) of the Clean Air Act
2. Data Liberation Project obtains the database through FOIA requests and publishes it as a SQLite database under CC BY-SA 4.0
3. RMP Map imports this data into PostgreSQL with minimal transformation, applying deduplication logic only at display time

Current Data Version

Data version: 3
DLP export date: 2026-02-05
Data coverage through: 2025-12-30

What We Import

We import the Data Liberation Project's SQLite database into PostgreSQL with minimal transformation:

Direct table copy — All 50+ tables copied row-for-row
No merging or deduplication — All submission records preserved
No schema changes — Column names and types match source
Null normalization — Empty strings converted to NULL (PostgreSQL best practice)
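
As a rough illustration of the null-normalization step, the sketch below converts empty strings to None (which a PostgreSQL driver would write as NULL) and leaves every other value alone. The function name and the sample row are hypothetical, `FacilityStr2` is an assumed column name, and leaving whitespace-only strings untouched is an assumption of this sketch rather than a statement about the actual import code.

```python
from typing import Any, Dict


def normalize_nulls(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` with empty-string values replaced by None."""
    return {col: (None if val == "" else val) for col, val in row.items()}


# Hypothetical source row; FacilityStr2 is an assumed column name.
source_row = {"FacilityName": "Example Plant", "FacilityStr2": "", "FacilityCity": "Philadelphia"}
normalized = normalize_nulls(source_row)
print(normalized)
```

Only the exact empty string is treated as missing here, so numeric zeros and other falsy values pass through unchanged.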

The One Enrichment: Geocoding Validation

We maintain a separate facility_geocoding table that validates facility-reported coordinates against state bounding boxes. This table is additive—it doesn't modify source data, just supplements it. Coordinates that don't match the facility's stated location are flagged for address-based geocoding.
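
The bounding-box check described above might look something like the following. This is a minimal sketch, not the production logic: the function name is hypothetical, the Pennsylvania box uses rough illustrative values, and treating missing coordinates or an unknown state as "needs address-based geocoding" is an assumption.

```python
from typing import Dict, Optional, Tuple

# (min_lat, max_lat, min_lon, max_lon); rough illustrative values, not authoritative.
STATE_BOUNDS: Dict[str, Tuple[float, float, float, float]] = {
    "PA": (39.7, 42.3, -80.6, -74.6),
}


def needs_address_geocoding(state: str, lat: Optional[float], lon: Optional[float]) -> bool:
    """True when reported coordinates are missing or fall outside the state's box."""
    bounds = STATE_BOUNDS.get(state)
    if bounds is None or lat is None or lon is None:
        return True  # assumption: nothing to validate against, so flag it
    min_lat, max_lat, min_lon, max_lon = bounds
    return not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon)


# A point in Philadelphia passes; the same point with a dropped minus sign is flagged.
print(needs_address_geocoding("PA", 39.95, -75.16))
print(needs_address_geocoding("PA", 39.95, 75.16))
```

A bounding box is a coarse test: it catches swapped or sign-flipped coordinates cheaply, but a point can sit inside the box and still be in a neighboring state.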

Data Limitations & Reporting Issues

RMP Map displays only what we receive. The data shown here is not guaranteed to be comprehensive. Facilities may have incomplete records, missing information, or reporting errors in their original RMP submissions to the EPA.

If You Find an Issue

Type of Issue	Who to Contact	Examples
Our calculations or display	rmpmap@drexel.edu	Wrong accident counts, duplicate facilities showing, incorrect deduplication, map display errors
Source data quality	U.S. EPA RMP Program	Missing facilities, wrong addresses, incorrect chemical lists, missing accidents, outdated information

Confidential Business Information (CBI)

Certain RMP data fields are withheld from public release because facilities may claim them as Confidential Business Information (CBI) under 40 CFR Part 2. In the public dataset we receive, the following fields are entirely empty across all facilities:

Worst-case release scenarios — quantity released, distance to toxic/flammable endpoint, affected residential population, nearby schools, hospitals, and other public receptors
Alternative release scenarios — similar fields for more realistic release modeling

The database tables for these scenarios exist and contain row-level linkages to regulated chemicals, but the key impact fields (distance, quantity, population) are null across all 237,000+ rows. This means we cannot display worst-case or alternative release scenario details, even though the RMP regulation requires facilities to report them. The data is collected by EPA but not included in the public data release.

If you need worst-case scenario data for a specific facility, you may be able to obtain it through a FOIA request to EPA or by contacting the facility's Local Emergency Planning Committee (LEPC).

We do not modify source data. If a facility's information appears incorrect in our system, it's almost certainly incorrect in the EPA's database as well. We faithfully reproduce what we receive; corrections must be made at the source.

We're happy to hear about potential data issues so we can verify whether they originate in our processing or in the source data itself. Contact us at rmpmap@drexel.edu and we'll investigate.

Understanding the Data Structure

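The EPA RMP database stores one facility record per submission rather than one per physical facility. A facility that has filed several plans therefore shows up as apparent duplicates unless its records are grouped correctly.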

Two Types of Facility IDs

ID Type	Description	Behavior
`FacilityID`	Internal submission ID	New ID assigned with every RMP submission
`EPAFacilityID`	Canonical facility identifier	Stable across submissions for the same physical facility

When a facility submits a new RMP (at least every 5 years, or sooner when changes occur), it receives a new FacilityID but keeps the same EPAFacilityID.

Example: A Facility with Multiple Submissions

# Same physical facility, three RMP submissions
EPAFacilityID: 100000193471 (stable - identifies the facility)

# Each submission gets a new FacilityID:
FacilityID 51234 (submitted 2015)
FacilityID 67891 (submitted 2018)
FacilityID 78902 (submitted 2022)
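
The relationship in the example above can be sketched in a few lines of Python: grouping submission rows by EPAFacilityID recovers the set of FacilityIDs that belong to one physical facility. The IDs are the ones from the example; the `submitted` key is a hypothetical field name used only for this illustration.

```python
from collections import defaultdict
from typing import Dict, List

# The three submissions from the example above.
submissions = [
    {"EPAFacilityID": "100000193471", "FacilityID": 51234, "submitted": 2015},
    {"EPAFacilityID": "100000193471", "FacilityID": 67891, "submitted": 2018},
    {"EPAFacilityID": "100000193471", "FacilityID": 78902, "submitted": 2022},
]

# Map each stable EPAFacilityID to the submission-level FacilityIDs filed under it.
by_facility: Dict[str, List[int]] = defaultdict(list)
for sub in submissions:
    by_facility[sub["EPAFacilityID"]].append(sub["FacilityID"])

print(dict(by_facility))
```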

The Accident History Problem

Each RMP submission includes the facility's complete accident history. This means the same accident appears in multiple rows with different AccidentHistoryID values (one per submission).

Why This Matters

If we naively counted accidents by AccidentHistoryID: 3 submissions × 3 accidents = 9 accident records (wrong!)

If we deduplicate by the actual accident event: 3 unique accidents = 3 accidents (correct)
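
The arithmetic above can be reproduced with a small, made-up dataset. In this sketch the accident dates and times are invented for illustration, and the sequential AccidentHistoryID values simply stand in for the fresh IDs each submission would receive.

```python
from itertools import count

# Three invented accident events, each re-reported in all three submissions.
accident_events = [("2011-03-14", "08:30"), ("2012-07-02", "14:05"), ("2014-11-20", "23:45")]
next_id = count(1)
rows = [
    {"AccidentHistoryID": next(next_id), "FacilityID": fid, "AccidentDate": d, "AccidentTime": t}
    for fid in (51234, 67891, 78902)
    for d, t in accident_events
]

# Naive: every AccidentHistoryID looks like its own accident.
naive_count = len({r["AccidentHistoryID"] for r in rows})
# Deduplicated: one accident per distinct (date, time) pair.
unique_count = len({(r["AccidentDate"], r["AccidentTime"]) for r in rows})

print(naive_count, unique_count)  # 9 3
```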

Methodology: How We Handle the Data

We apply deduplication logic at display time, not at import. This preserves the source data while presenting a coherent view to users.

Strategy: EPAFacilityID as Canonical Identifier

Throughout the application, we treat EPAFacilityID as the "real" facility identifier:

URLs use EPAFacilityID: /facilities/100000193471
Search counts unique EPAFacilityID values
Map markers are deduplicated by EPAFacilityID

Facility Deduplication: Latest Submission Wins

When displaying facility information, we use PostgreSQL's DISTINCT ON to select the most recent submission:

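-- One row per facility: the submission with the latest ReceiptDate
SELECT DISTINCT ON ("EPAFacilityID")
  "FacilityName", "FacilityStr1", "FacilityCity", ...
FROM tbls1facilities
WHERE "EPAFacilityID" = '100000193471'
-- NULLS LAST keeps a submission with no ReceiptDate from outranking dated ones
ORDER BY "EPAFacilityID", "ReceiptDate" DESC NULLS LAST;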

This returns one row per facility—the most recently submitted data.
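
For readers working with exported rows rather than SQL, the same "latest submission wins" rule can be approximated in Python. This is a sketch under stated assumptions: `latest_submissions` is a hypothetical helper, the receipt dates are invented, dates are assumed to be ISO-formatted strings (so string comparison matches chronological order), and a missing ReceiptDate is treated as oldest.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def latest_submissions(rows: Iterable[Row]) -> Dict[str, Row]:
    """Keep, per EPAFacilityID, the row with the greatest ReceiptDate."""
    latest: Dict[str, Row] = {}
    for row in rows:
        key = row["EPAFacilityID"]
        current = latest.get(key)
        # Assumption: a missing ReceiptDate compares as oldest (empty string).
        if current is None or (row["ReceiptDate"] or "") > (current["ReceiptDate"] or ""):
            latest[key] = row
    return latest


# Invented receipt dates for the three example submissions, plus one undated row.
rows = [
    {"EPAFacilityID": "100000193471", "FacilityID": 51234, "ReceiptDate": "2015-06-01"},
    {"EPAFacilityID": "100000193471", "FacilityID": 78902, "ReceiptDate": "2022-06-01"},
    {"EPAFacilityID": "100000193471", "FacilityID": 67891, "ReceiptDate": "2018-06-01"},
    {"EPAFacilityID": "100000193471", "FacilityID": 99999, "ReceiptDate": None},
]
current = latest_submissions(rows)["100000193471"]
print(current["FacilityID"])  # 78902
```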

Accident Deduplication: Unique Events by Date + Time

Since the same accident appears in multiple submissions, we identify unique accidents by their date and time:

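-- Get unique accidents for a facility across all of its submissions
SELECT DISTINCT ON ("AccidentDate", "AccidentTime")
  "AccidentHistoryID", "AccidentDate", "AccidentTime", ...
FROM tbls6accidenthistory
WHERE "FacilityID" IN (
  -- every submission-level FacilityID that belongs to this EPAFacilityID
  SELECT "FacilityID" FROM tbls1facilities
  WHERE "EPAFacilityID" = '100000193471'
)
-- the final sort key makes the surviving row per event deterministic
ORDER BY "AccidentDate" DESC, "AccidentTime", "AccidentHistoryID" DESC;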

Why Date + Time?

We use (AccidentDate, AccidentTime) as the deduplication key because:

It identifies the real-world event — Two records with the same date/time at the same facility are the same accident
It's always populated — Unlike other fields that may vary between submissions
It's stable — The accident date doesn't change across submissions

Note: AccidentHistoryID is not a good deduplication key because each submission generates new IDs for the same historical accidents.
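
The date-and-time key can also be applied outside the database. The sketch below assumes the rows have already been restricted to one facility; `unique_accidents` is a hypothetical helper, the sample rows are invented, and keeping the row with the highest AccidentHistoryID for each event is an arbitrary tie-break chosen for this example.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def unique_accidents(rows: List[Row]) -> List[Row]:
    """Keep one row per (AccidentDate, AccidentTime), newest event first."""
    by_event: Dict[Tuple[str, str], Row] = {}
    for row in rows:
        key = (row["AccidentDate"], row["AccidentTime"])
        kept = by_event.get(key)
        # Tie-break (assumption): prefer the highest AccidentHistoryID.
        if kept is None or row["AccidentHistoryID"] > kept["AccidentHistoryID"]:
            by_event[key] = row
    return sorted(
        by_event.values(),
        key=lambda r: (r["AccidentDate"], r["AccidentTime"]),
        reverse=True,
    )


# Invented rows: IDs 1 and 3 describe the same event reported in two submissions.
rows = [
    {"AccidentHistoryID": 1, "AccidentDate": "2012-03-14", "AccidentTime": "08:30"},
    {"AccidentHistoryID": 2, "AccidentDate": "2014-07-02", "AccidentTime": "14:05"},
    {"AccidentHistoryID": 3, "AccidentDate": "2012-03-14", "AccidentTime": "08:30"},
]
events = unique_accidents(rows)
print([r["AccidentHistoryID"] for r in events])  # [2, 3]
```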

Summary Table

Entity	Raw Database	Display Logic	Dedup Key
Facilities	Multiple rows per physical facility	Show most recent submission	`EPAFacilityID` + latest `ReceiptDate`
Accidents	Same accident in multiple submissions	Count/show unique events	`(AccidentDate, AccidentTime)` per facility
Chemicals	Listed per submission	Show from most recent submission	N/A (inherited from facility dedup)

What You See

Facility Detail Page

Current facility info (from most recent submission)
Count of unique accidents (deduplicated)
List of all RMP submissions for this facility (for transparency)
List of accidents (deduplicated by date/time)

Search Results

One result per physical facility (by EPAFacilityID)
Accurate accident counts (deduplicated)
Facility name/address from most recent submission

Map View

One marker per physical facility
Popup shows current name and accurate accident counts

Design Philosophy

Preserve source data — We don't modify or delete any data from the DLP database
Deduplicate at display time — Apply logic when querying, not when importing
Transparent about submissions — Users can see all RMP submissions for a facility
Use stable identifiers — EPAFacilityID over FacilityID, date/time over AccidentHistoryID

This approach lets us:

Update data by simply re-importing (no complex merge logic)
Provide historical submission views if needed in the future
Maintain data provenance back to the source

Using the API

RMP Map provides a free, public API for programmatic access to this data. The API returns JSON and requires no authentication.

Key Points

Deduplication is applied — API responses use the same deduplication logic described above
EPAFacilityID is canonical — Use this ID to reference facilities, not FacilityID
Accident counts are accurate — Counts reflect unique events, not submission records
GeoJSON support — Request format=geojson for map-ready data

Common Endpoints

/api/search — Search facilities with filters

/api/facilities/:id — Get facility details by EPAFacilityID

/api/accidents/:id — Get accident details

/api/facilities/geo — All facilities as GeoJSON

For complete endpoint documentation, parameters, and examples, see the full API documentation.
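
As a hedged illustration of how the endpoint paths above fit together, the sketch below builds request URLs without sending anything. The host `rmpmap.example` is a placeholder, `api_url` is a hypothetical helper, and passing `format=geojson` to `/api/search` is an assumption; the full API documentation is the authority on which endpoints accept which parameters.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://rmpmap.example"  # placeholder host, not the real one


def api_url(path: str, **params: str) -> str:
    """Join an endpoint path to the base URL and append any query parameters."""
    url = urljoin(BASE_URL, path)
    return f"{url}?{urlencode(params)}" if params else url


# Facility details are addressed by EPAFacilityID, not FacilityID.
detail_url = api_url("/api/facilities/100000193471")
# Assumed usage of the format=geojson option on the search endpoint.
geo_search_url = api_url("/api/search", format="geojson")
print(detail_url)
print(geo_search_url)
```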

API Metadata

All API responses include a _meta block with data provenance information:

{
  "_meta": {
    "source": "U.S. EPA Risk Management Program via Data Liberation Project",
    "license": "CC BY-SA 4.0",
    "disclaimer": "Data as reported by facilities to EPA"
  }
}
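
A client might read the _meta block to build an attribution line, roughly as sketched here. The response text is the example above embedded as a string, `attribution_line` is a hypothetical helper, and the fallback strings are assumptions for responses that lack a field.

```python
import json
from typing import Any, Dict

# The example response from above, embedded as a string for illustration.
response_text = """
{
  "_meta": {
    "source": "U.S. EPA Risk Management Program via Data Liberation Project",
    "license": "CC BY-SA 4.0",
    "disclaimer": "Data as reported by facilities to EPA"
  }
}
"""


def attribution_line(payload: Dict[str, Any]) -> str:
    """Format the source and license from a response's _meta block."""
    meta = payload.get("_meta", {})
    source = meta.get("source", "unknown source")
    license_name = meta.get("license", "license not stated")
    return f"{source} ({license_name})"


line = attribution_line(json.loads(response_text))
print(line)
```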

See API Terms of Use for rate limits and attribution requirements.