This repo scrapes, normalises, and validates historic English league tables from Wikipedia so the resulting JSON can be embedded in other projects or visualisations.
- A supported Wikipedia scraping workflow that can resume after interruptions.
- Utilities to merge overlapping sources, verify season integrity, and minify the resulting datasets.
- Jest unit + integration tests focused on the active Wikipedia pipeline.
`wikipedia/` is the actively supported ingestion path. `rsssf/`, `scripts/csv-data/`, and older reference exports should be treated as legacy tooling unless you are intentionally doing archive work.
- The overview scraper (`node wikipedia/cli/index.js overview`) is now the primary maintained Wikipedia dataset flow across the full historical range.
- The promotion/relegation scraper (`node wikipedia/cli/index.js build`) remains available as a legacy/historical fallback for classic Football League season pages.
- Node.js >= 20
- pnpm >= 8 (declared via `packageManager`)
- macOS/Linux shell or Windows WSL for the scraping scripts
Install dependencies once:

```shell
pnpm i
```

- Generate Wikipedia data

  ```shell
  # Primary maintained flow: overview parser across the full supported range
  node wikipedia/cli/index.js overview --start 1888 --end 2024 --output ./data-output --include-war-placeholders
  ```

- Merge and normalise

  ```shell
  node wikipedia/data/combine-output-files.js --output ./data-output/all-seasons.json \
    ./data-output/wiki_overview_tables_by_season.json
  ```

- Validate and test

  ```shell
  node wikipedia/data/verify-football-data.js --fail-on-issues ./data-output
  pnpm test:integration
  ```

- Minify for distribution (optional)

  ```shell
  node scripts/minify-json.js ./data-output/all-seasons.json
  ```
All commands are resumable. If you stop a scraper with Ctrl+C, progress already written to `data-output` stays intact.
The default maintained dataset workflow now uses the overview parser end to end. The promotion scraper is still useful for legacy comparison work and fixture repair, but it is no longer the main checked-in data path.
```shell
# Setup repo, install deps
pnpm i

# Generate data
node wikipedia/cli/index.js overview --start 1888 --end 2024 --output ./data-output --force-update --include-war-placeholders

# Combine data into all-seasons file
node wikipedia/data/combine-output-files.js --output ./data-output/all-seasons.json ./data-output/wiki_overview_tables_by_season.json

# Verify the generated data
node wikipedia/data/verify-football-data.js --fail-on-issues ./data-output
pnpm test:integration

# If all is good, finally minify data ready for external use
node scripts/minify-json.js ./data-output/all-seasons.json
node scripts/minify-json.js ./data-output/wiki_overview_tables_by_season.json
```

When `data-output/wiki_promotion_relegations_by_season.json` needs to be refreshed for historical comparison or legacy fixture coverage, rebuild it from code instead of patching individual seasons by hand:
```shell
pnpm wiki:build:promotion
pnpm wiki:minify:promotion
pnpm test:integration:promotion
```

For a single-season repair while preserving the checked-in dataset shape, use the same command with a narrow range and keep `--ignore-war-years` enabled. Example for the 1919–20 edge season:
```shell
node wikipedia/cli/index.js build --start 1919 --end 1919 --output ./data-output --force-update --ignore-war-years
node scripts/minify-json.js ./data-output/wiki_promotion_relegations_by_season.json
pnpm test:integration:promotion
```

- `data/` – raw reference files and one-off exports.
- `data-output/` – canonical JSON outputs grouped by source (e.g. `data-output/rsssf`).
- `scripts/` – helper utilities such as `minify-json.js` plus older one-off generators.
- `wikipedia/` – the main scraper, parsers, and FootballData models.
- `rsssf/` – legacy RSSSF parsing experiments.
- `shared/`, `club_names.json` – shared helpers and canonicalised club naming.
Run `node wikipedia/cli/index.js <command> [options]` to build FootballData-format JSON directly from Wikipedia tables.
| Command | Purpose | Default output |
|---|---|---|
| `build` | Legacy/historical promotion-relegation scraper for classic Football League season pages, mainly Tier 1 and Tier 2. | `data-output/wiki_promotion_relegations_by_season.json` |
| `overview` | Primary maintained parser. Reads overview pages (e.g. “2015–16 in English football”) and captures all listed tiers. | `data-output/wiki_overview_tables_by_season.json` |
| `combined` | Legacy bridge command: run `build` first, then backfill missing seasons with `overview`. | Both files above, reusing the same `--output` directory. |
Common flags across commands:
| Flag | Default | Description |
|---|---|---|
| `-s, --start <year>` | varies | First season (inclusive). |
| `-e, --end <year>` | varies | Final season (inclusive). |
| `-o, --output <dir>` | `./data-output` | Directory that will contain the JSON file(s). |
| `-u, --update-only` | `false` | Skip seasons that already contain data on disk. |
| `-f, --force-update` | `false` | Ignore cached entries and rebuild everything. |
| `--ignore-war-years` | `false` | Skip WWI/WWII suspension years entirely. |
| `--include-war-placeholders` | `false` | Emit metadata-only wartime placeholder seasons in overview output. |
Each run saves season-by-season progress immediately, so reruns are fast. The `combined` command exists for legacy mixed-source rebuilds, but the checked-in maintained path is now `overview`.
Tip: for the checked-in overview dataset we now run `overview` across the full supported range. Keep `build` around for legacy comparisons, targeted fixture repair, and classic-season parser regressions.
`node rsssf/cli.js scrape [options]` converts RSSSF HTML into the same FootballData schema. This path is kept for legacy/archive work and is not the primary maintained workflow.
| Option | Description |
|---|---|
-u, --url <url> |
One or more RSSSF page URLs to fetch. Repeat for multiple seasons. |
-f, --from-file <file> |
Parse saved HTML instead of fetching over the network (repeatable). |
-s, --start <year> / -e, --end <year> |
Generate season URLs using the default template (https://www.rsssf.org/engpaul/FLA/{seasonSlug}.html). Requires both flags to be provided. |
--url-template <template> |
Custom season URL template – supports {seasonSlug}, {startYear}, {endYear}, {seasonSlugUnderscore}, etc. |
-o, --output <path> |
JSON output path. Multiple sources treat this as a directory; range scraping writes an aggregate file under data-output/rsssf. |
--pretty |
Pretty-print instead of minified JSON. |
--save-html <path> |
Persist the raw HTML alongside the JSON (single file or directory depending on the context). |
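Template expansion is mechanical: each `{placeholder}` is replaced with a value derived from the season's start year. A sketch, assuming the placeholder names from the table above (the helper names are made up and the real CLI may derive them differently):

```javascript
// Hypothetical helpers showing how RSSSF URL template placeholders could expand.
// Placeholder names come from the option table above; this is not the repo's code.
function seasonValues(startYear) {
  const endYear = startYear + 1;
  // e.g. 1908 -> "1908-09"; the end year keeps only its last two digits.
  const seasonSlug = `${startYear}-${String(endYear % 100).padStart(2, '0')}`;
  return {
    startYear,
    endYear,
    seasonSlug,
    seasonSlugUnderscore: seasonSlug.replace('-', '_'),
  };
}

function fillTemplate(template, startYear) {
  const values = seasonValues(startYear);
  return template.replace(/\{(\w+)\}/g, (_, key) => values[key]);
}

const url = fillTemplate('https://www.rsssf.org/engpaul/FLA/{seasonSlug}.html', 1908);
console.log(url); // https://www.rsssf.org/engpaul/FLA/1908-09.html
```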
Range mode continually updates `data-output/rsssf/rsssf_promotion_relegations_by_season.json` and guards against partial data loss by saving after each season (even when interrupted).
```shell
# Pretty-print one season to stdout
node rsssf/cli.js scrape --url https://www.rsssf.org/engpaul/FLA/1908-09.html --pretty

# Fetch several seasons, write each JSON into data-output/rsssf, and persist HTML copies
node rsssf/cli.js scrape --start 1950 --end 1952 --output ./data-output/rsssf --save-html ./data-output/rsssf/html

# Parse existing HTML exports (useful for offline work)
node rsssf/cli.js scrape --from-file ./rsssf-cache/1960-61.html --from-file ./rsssf-cache/1961-62.html
```

- `wikipedia/data/combine-output-files.js` – merge multiple FootballData JSON files, drop war-year placeholders, prefer the richest tier record for each season, and show a grouped “missing seasons” summary. Use `--include-empty` to keep placeholder entries and `--compact` for minified JSON.
- `wikipedia/data/compare-football-data.js` – compare two FootballData JSON files and report season, tier, table, outcome-list, and metadata changes between releases. Pass `--json` for machine-readable output, or `--markdown` for a release-note-friendly summary.
- `scripts/minify-json.js` – shrink JSON files in place or alongside (`foo.min.json`) so they are ready for publishing.
- `wikipedia/data/verify-football-data.js` – lint FootballData exports for empty tiers, duplicate teams, stat mismatches, or promotion/relegation inconsistencies. Pass `--fail-on-issues` to exit non-zero when anomalies exist.
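The “prefer the richest tier record” rule in `combine-output-files.js` can be pictured as: when two sources supply the same season and tier, keep the record with more table rows. This is an illustrative reduction of the idea, not the script's actual heuristic, and the sample data is made up:

```javascript
// Illustrative sketch of "prefer the richest tier record for each season".
// The real combine-output-files.js may weigh more fields than table length.
function mergeSeasons(a, b) {
  const merged = {};
  for (const source of [a, b]) {
    for (const [season, tiers] of Object.entries(source)) {
      merged[season] = merged[season] ?? {};
      for (const [tierKey, record] of Object.entries(tiers)) {
        const existing = merged[season][tierKey];
        // Keep whichever record carries the larger table.
        const richer =
          !existing || (record.table?.length ?? 0) > (existing.table?.length ?? 0);
        if (richer) merged[season][tierKey] = record;
      }
    }
  }
  return merged;
}

// Hypothetical overlapping sources for the same season and tier.
const fromBuild = {
  '1920-21': { tier1: { season: '1920-21', table: [{ team: 'Burnley' }] } },
};
const fromOverview = {
  '1920-21': {
    tier1: {
      season: '1920-21',
      table: [{ team: 'Burnley' }, { team: 'Manchester City' }],
    },
  },
};

const merged = mergeSeasons(fromBuild, fromOverview);
console.log(merged['1920-21'].tier1.table.length); // 2 (overview record wins)
```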
- The final contract is the merged file: `data-output/all-seasons.json`.
- Every FootballData export may include a top-level `metadata` object with release provenance: `schemaVersion`, `generator`, `generatedAt`, `gitSha`, `sourceFiles`, `buildOptions`.
- Each season contains a `seasonInfo` summary object plus one or more `tierN` objects.
- `seasonInfo` is not a league table. It is a season-level summary that currently stores `season`, `promoted`, `relegated`, and source metadata such as `seasonSlug`, `sourceUrl`, or `tableCount`.
- `seasonInfo.promoted` means clubs moving into the top flight for the following season; `seasonInfo.relegated` means clubs leaving the top flight at the end of that season.
- Every `tierN` entry is an object with `season`, `table`, `promoted`, and `relegated`.
- Every `tierN` entry now carries a single `metadata` object: `source`, `sourceUrl`, `seasonSlug`, `leagueId`, `title`, `tableIndex`, `tableCount`, `tierKey`.
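Put together, a single season entry might look like the following. The field names follow the list above; the concrete values (club names, points, `leagueId`, the `source` string) are illustrative fill-ins, not guaranteed output:

```javascript
// Illustrative season entry matching the field list above. Values are made up
// where the README does not specify them (e.g. leagueId, source, points).
const season = {
  seasonInfo: {
    season: '2015-16',
    promoted: ['Burnley', 'Middlesbrough', 'Hull City'], // into the top flight next season
    relegated: ['Aston Villa', 'Norwich City', 'Newcastle United'], // leaving the top flight
    seasonSlug: '2015-16',
    sourceUrl: 'https://en.wikipedia.org/wiki/2015%E2%80%9316_in_English_football',
    tableCount: 4,
  },
  tier1: {
    season: '2015-16',
    table: [{ position: 1, team: 'Leicester City', points: 81 }],
    promoted: [],
    relegated: ['Aston Villa', 'Norwich City', 'Newcastle United'],
    metadata: {
      source: 'wikipedia-overview', // hypothetical source label
      sourceUrl: 'https://en.wikipedia.org/wiki/2015%E2%80%9316_in_English_football',
      seasonSlug: '2015-16',
      leagueId: 'premier-league', // hypothetical id
      title: '2015–16 Premier League',
      tableIndex: 0,
      tableCount: 4,
      tierKey: 'tier1',
    },
  },
};

// Per the contract above, every tierN record carries these five keys.
for (const key of ['season', 'table', 'promoted', 'relegated', 'metadata']) {
  if (!(key in season.tier1)) throw new Error(`missing ${key}`);
}
```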
```shell
# Build the maintained merged dataset from the overview export
node wikipedia/data/combine-output-files.js --output ./data-output/all-seasons.json \
  ./data-output/wiki_overview_tables_by_season.json

# Run the data lint pass on every JSON file under ./data-output
node wikipedia/data/verify-football-data.js --fail-on-issues ./data-output

# Compare a previous release file against a freshly generated one
node wikipedia/data/compare-football-data.js ./releases/all-seasons-prev.json ./data-output/all-seasons.json

# Generate a markdown release summary
node wikipedia/data/compare-football-data.js --markdown ./releases/all-seasons-prev.json ./data-output/all-seasons.json

# Minify the merged dataset next to its original (writes all-seasons.min.json)
node scripts/minify-json.js ./data-output/all-seasons.json
```

Run the full Jest suite (unit + lightweight parsing checks):
```shell
pnpm test
```

Target just the integration suite (which exercises the supported Wikipedia scrapers end-to-end) when validating new data runs:

```shell
pnpm test:integration
pnpm test:integration:overview   # primary maintained Wikipedia fixtures
pnpm test:integration:promotion  # legacy promotion/relegation fixtures
```

Coverage is available via:

```shell
pnpm test:coverage
```

Every script sets `NODE_OPTIONS=--experimental-vm-modules` automatically so Jest can execute the ESM codebase without extra configuration.
- Keep output directories around; the CLIs skip existing seasons unless `--force-update` is provided, which significantly cuts rerun time.
- `club_names.json` contains canonical spellings that the scrapers rely on when reconciling seasonal data – update it before running the cleaners if you expect new clubs to appear.
- Extend `wikipedia/builders/parse-season-pages.js` or `wikipedia/builders/parse-ext-season-overview-pages.js` if you need extra metadata (attendance, form, etc.); the FootballData schema is intentionally flexible.