
English Football Statistics Data & Scripts

This repo scrapes, normalises, and validates historic English league tables from Wikipedia so the resulting JSON can be embedded in other projects or visualisations. It provides:

  • A supported Wikipedia scraping workflow that can resume after interruptions.
  • Utilities to merge overlapping sources, verify season integrity, and minify the resulting datasets.
  • Jest unit + integration tests focused on the active Wikipedia pipeline.

Supported Scope

  • wikipedia/ is the actively supported ingestion path.
  • rsssf/, scripts/csv-data/, and older reference exports should be treated as legacy tooling unless you are intentionally doing archive work.
  • The overview scraper (node wikipedia/cli/index.js overview) is now the primary maintained Wikipedia dataset flow across the full historical range.
  • The promotion/relegation scraper (node wikipedia/cli/index.js build) remains available as a legacy/historical fallback for classic Football League season pages.

Requirements

  • Node.js >= 20
  • pnpm >= 8 (declared via packageManager)
  • macOS/Linux shell or Windows WSL for the scraping scripts

Install dependencies once:

pnpm i

Quick Start

  1. Generate Wikipedia data
    # Primary maintained flow: overview parser across the full supported range
    node wikipedia/cli/index.js overview --start 1888 --end 2024 --output ./data-output --include-war-placeholders
  2. Merge and normalise
    node wikipedia/data/combine-output-files.js --output ./data-output/all-seasons.json \
      ./data-output/wiki_overview_tables_by_season.json
  3. Validate and test
    node wikipedia/data/verify-football-data.js --fail-on-issues ./data-output
    pnpm test:integration
  4. Minify for distribution (optional)
    node scripts/minify-json.js ./data-output/all-seasons.json

All commands are resumable. If you stop a scraper with Ctrl+C, progress written to data-output stays intact.

Detailed workflow

The default maintained dataset workflow now uses the overview parser end to end. The promotion scraper is still useful for legacy comparison work and fixture repair, but it is no longer the main checked-in data path.

# Setup Repo, Install Deps
pnpm i
# Generate Data
node wikipedia/cli/index.js overview --start 1888 --end 2024 --output ./data-output --force-update --include-war-placeholders
# Combine data into all-seasons file
node wikipedia/data/combine-output-files.js --output ./data-output/all-seasons.json ./data-output/wiki_overview_tables_by_season.json
# Verify the generated data
node wikipedia/data/verify-football-data.js --fail-on-issues ./data-output
pnpm test:integration
# If all is good, finally minify data ready for external use
node scripts/minify-json.js ./data-output/all-seasons.json
node scripts/minify-json.js ./data-output/wiki_overview_tables_by_season.json

Legacy promotion fixture rebuild flow

When data-output/wiki_promotion_relegations_by_season.json needs to be refreshed for historical comparison or legacy fixture coverage, rebuild it from code instead of patching individual seasons by hand:

pnpm wiki:build:promotion
pnpm wiki:minify:promotion
pnpm test:integration:promotion

For a single-season repair while preserving the checked-in dataset shape, use the same command with a narrow range and keep --ignore-war-years enabled. Example for the 1919-20 edge season:

node wikipedia/cli/index.js build --start 1919 --end 1919 --output ./data-output --force-update --ignore-war-years
node scripts/minify-json.js ./data-output/wiki_promotion_relegations_by_season.json
pnpm test:integration:promotion

Project Structure

  • data/ – raw reference files and one-off exports.
  • data-output/ – canonical JSON outputs grouped by source (e.g. data-output/rsssf).
  • scripts/ – helper utilities such as minify-json.js plus older one-off generators.
  • wikipedia/ – the main scraper, parsers, and FootballData models.
  • rsssf/ – legacy RSSSF parsing experiments.
  • shared/, club_names.json – shared helpers and canonicalised club naming.

Wikipedia CLI (wiki-league)

Run node wikipedia/cli/index.js <command> [options] to build FootballData-format JSON directly from Wikipedia tables.

  • build – Legacy/historical promotion-relegation scraper for classic Football League season pages, mainly Tier 1 and Tier 2. Default output: data-output/wiki_promotion_relegations_by_season.json
  • overview – Primary maintained parser. Reads overview pages (e.g. “2015–16 in English football”) and captures all listed tiers. Default output: data-output/wiki_overview_tables_by_season.json
  • combined – Legacy bridge command: runs build first, then backfills missing seasons with overview. Writes both files above, reusing the same --output directory.

Common flags across commands:

  • -s, --start <year> (default varies) – First season (inclusive).
  • -e, --end <year> (default varies) – Final season (inclusive).
  • -o, --output <dir> (default ./data-output) – Directory that will contain the JSON file(s).
  • -u, --update-only (default false) – Skip seasons that already contain data on disk.
  • -f, --force-update (default false) – Ignore cached entries and rebuild everything.
  • --ignore-war-years (default false) – Skip WWI/WWII suspension years entirely.
  • --include-war-placeholders (default false) – Emit metadata-only wartime placeholder seasons in overview output.

Each run saves season-by-season progress immediately, so reruns are fast. The combined command exists for legacy mixed-source rebuilds, but the checked-in maintained path is now overview.

Tip: for the checked-in overview dataset we now run overview across the full supported range. Keep build around for legacy comparisons, targeted fixture repair, and classic-season parser regressions.

RSSSF CLI (rsssf-scraper)

node rsssf/cli.js scrape [options] converts RSSSF HTML into the same FootballData schema. This path is kept for legacy/archive work and is not the primary maintained workflow.

  • -u, --url <url> – One or more RSSSF page URLs to fetch. Repeat for multiple seasons.
  • -f, --from-file <file> – Parse saved HTML instead of fetching over the network (repeatable).
  • -s, --start <year> / -e, --end <year> – Generate season URLs using the default template (https://www.rsssf.org/engpaul/FLA/{seasonSlug}.html). Both flags must be provided.
  • --url-template <template> – Custom season URL template; supports {seasonSlug}, {startYear}, {endYear}, {seasonSlugUnderscore}, etc.
  • -o, --output <path> – JSON output path. Multiple sources treat this as a directory; range scraping writes an aggregate file under data-output/rsssf.
  • --pretty – Pretty-print instead of minified JSON.
  • --save-html <path> – Persist the raw HTML alongside the JSON (a single file or directory depending on the context).
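Template expansion substitutes the placeholders above once per season. A sketch of that substitution, using the placeholder names from the option list (the expansion function itself is illustrative, not the CLI's actual code):

```javascript
// Illustrative expansion of a season URL template such as
// https://www.rsssf.org/engpaul/FLA/{seasonSlug}.html
function expandTemplate(template, startYear) {
  const endYear = startYear + 1;
  const seasonSlug = `${startYear}-${String(endYear % 100).padStart(2, '0')}`;
  const values = {
    seasonSlug,
    startYear: String(startYear),
    endYear: String(endYear),
    seasonSlugUnderscore: seasonSlug.replace('-', '_'),
  };
  // Unknown placeholders are left untouched rather than dropped.
  return template.replace(/\{(\w+)\}/g, (match, name) => values[name] ?? match);
}
```

For example, expanding the default template for 1908 would yield https://www.rsssf.org/engpaul/FLA/1908-09.html.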

Range mode continually updates data-output/rsssf/rsssf_promotion_relegations_by_season.json and guards against partial data loss by saving after each season (even when interrupted).

Example invocations

# Scrape one season and emit pretty-printed JSON
node rsssf/cli.js scrape --url https://www.rsssf.org/engpaul/FLA/1908-09.html --pretty

# Fetch several seasons, write each JSON into data-output/rsssf, and persist HTML copies
node rsssf/cli.js scrape --start 1950 --end 1952 --output ./data-output/rsssf --save-html ./data-output/rsssf/html

# Parse existing HTML exports (useful for offline work)
node rsssf/cli.js scrape --from-file ./rsssf-cache/1960-61.html --from-file ./rsssf-cache/1961-62.html

JSON Utilities

  • wikipedia/data/combine-output-files.js – merge multiple FootballData JSON files, drop war-year placeholders, prefer the richest tier record for each season, and show a grouped “missing seasons” summary. Use --include-empty to keep placeholder entries and --compact for minified JSON.
  • wikipedia/data/compare-football-data.js – compare two FootballData JSON files and report season, tier, table, outcome-list, and metadata changes between releases. Pass --json for machine-readable output. Pass --markdown for a release-note-friendly summary.
  • scripts/minify-json.js – shrink JSON files in place or alongside (foo.min.json) so they are ready for publishing.
  • wikipedia/data/verify-football-data.js – lint FootballData exports for empty tiers, duplicate teams, stat mismatches, or promotion/relegation inconsistencies. Pass --fail-on-issues to exit non-zero when anomalies exist.

Exported Data Shape

  • The final contract is the merged file: data-output/all-seasons.json.
  • Every FootballData export may include a top-level metadata object with release provenance:
    • schemaVersion
    • generator
    • generatedAt
    • gitSha
    • sourceFiles
    • buildOptions
  • Each season contains a seasonInfo summary object plus one or more tierN objects.
  • seasonInfo is not a league table. It is a season-level summary that currently stores:
    • season
    • promoted
    • relegated
    • source metadata such as seasonSlug, sourceUrl, or tableCount
  • seasonInfo.promoted means clubs moving into the top flight for the following season.
  • seasonInfo.relegated means clubs leaving the top flight at the end of that season.
  • Every tierN entry is an object with season, table, promoted, and relegated.
  • Every tierN entry now carries a single metadata object:
    • source
    • sourceUrl
    • seasonSlug
    • leagueId
    • title
    • tableIndex
    • tableCount
    • tierKey
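Putting the pieces above together, a single season entry might look roughly like this. This is an illustrative fragment only: the top-level keying by season string is an assumption, the clubs and values are placeholders, and the exact table column set is not specified here:

```json
{
  "metadata": {
    "schemaVersion": "...",
    "generator": "...",
    "generatedAt": "...",
    "gitSha": "..."
  },
  "1908-09": {
    "seasonInfo": {
      "season": "1908-09",
      "promoted": ["Example Promoted FC"],
      "relegated": ["Example Relegated FC"],
      "sourceUrl": "https://..."
    },
    "tier1": {
      "season": "1908-09",
      "table": [
        { "position": 1, "team": "Example FC", "points": 52 }
      ],
      "promoted": [],
      "relegated": ["Example Relegated FC"],
      "metadata": {
        "source": "wikipedia",
        "tierKey": "tier1",
        "tableIndex": 0,
        "tableCount": 4
      }
    }
  }
}
```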

Utility examples

# Build the maintained merged dataset from the overview export
node wikipedia/data/combine-output-files.js --output ./data-output/all-seasons.json \
  ./data-output/wiki_overview_tables_by_season.json

# Run the data lint pass on every JSON file under ./data-output
node wikipedia/data/verify-football-data.js --fail-on-issues ./data-output

# Compare a previous release file against a freshly generated one
node wikipedia/data/compare-football-data.js ./releases/all-seasons-prev.json ./data-output/all-seasons.json

# Generate a markdown release summary
node wikipedia/data/compare-football-data.js --markdown ./releases/all-seasons-prev.json ./data-output/all-seasons.json

# Minify the merged dataset next to its original (writes all-seasons.min.json)
node scripts/minify-json.js ./data-output/all-seasons.json

Testing

Run the full Jest suite (unit + lightweight parsing checks):

pnpm test

Target just the integration suite (which exercises the supported Wikipedia scrapers end-to-end) when validating new data runs:

pnpm test:integration
pnpm test:integration:overview    # primary maintained Wikipedia fixtures
pnpm test:integration:promotion   # legacy promotion/relegation fixtures

Coverage is available via:

pnpm test:coverage

Every script sets NODE_OPTIONS=--experimental-vm-modules automatically so Jest can execute the ESM codebase without extra configuration.
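In practice that means the package.json scripts prefix Jest with the flag. A hypothetical fragment showing the idea (not the repo's actual scripts block):

```json
{
  "scripts": {
    "test": "NODE_OPTIONS=--experimental-vm-modules jest",
    "test:integration": "NODE_OPTIONS=--experimental-vm-modules jest --testPathPattern integration"
  }
}
```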

Additional Notes

  • Keep output directories around; the CLIs skip existing seasons unless --force-update is provided, which significantly cuts rerun time.
  • club_names.json contains canonical spellings that the scrapers rely on when reconciling seasonal data – update it before running the cleaners if you expect new clubs to appear.
  • Extend wikipedia/builders/parse-season-pages.js or wikipedia/builders/parse-ext-season-overview-pages.js if you need extra metadata (attendance, form, etc.); the FootballData schema is intentionally flexible.
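Canonicalisation through club_names.json amounts to resolving scraped name variants to one canonical spelling. A sketch under the assumption that the file maps canonical names to known aliases (the real file's shape may differ, and the entries below are examples):

```javascript
// Illustrative: resolve a scraped club name to a canonical spelling.
// Assumes a { canonicalName: [aliases...] } shape for club_names.json.
const clubNames = {
  'Wolverhampton Wanderers': ['Wolves', 'Wolverhampton W.'],
  'Manchester United': ['Man Utd', 'Newton Heath'],
};

function canonicalise(name) {
  const needle = name.trim().toLowerCase();
  for (const [canonical, aliases] of Object.entries(clubNames)) {
    if (canonical.toLowerCase() === needle) return canonical;
    if (aliases.some((a) => a.toLowerCase() === needle)) return canonical;
  }
  return name.trim(); // unknown clubs pass through unchanged
}
```

Matching case-insensitively on trimmed input keeps the lookup tolerant of the spacing and capitalisation quirks common in scraped tables.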
