
Force pages to be fully re-parsed occasionally
Open, Needs Triage, Public

Description

Caching is hard.

MediaWiki has $wgCacheEpoch, and every row in the page database table has page.page_touched and page.page_links_updated timestamp columns.

For Varnish cache, we force pages to not be older than 30 days because we don't want to serve stale content to users.

When a page is edited, we re-parse the page and update the relevant *links database tables.

However, the current approaches leave gaps:

  • The *links database table updates can be missed, due to job queue issues or flukes
  • Some pages aren't edited for many years, so they don't get re-parsed
  • Application code regularly gets updated, but pages then only get re-parsed "lazily", which in practice usually means only when they're edited

Some of the data integrity issues we've been seeing are mentioned at T87716#2316414.

I would like to investigate using one of these stored timestamps to force any page whose timestamp is more than X days old to be fully re-parsed and regenerated. With any other cache, we would have some kind of eviction policy. With the *links cache, we seem to currently rely on the assumption that incremental updates (e.g., from linked pages being created or deleted) and occasional edits to the page will keep everything in sync. However, on large and small wikis alike, there simply isn't enough edit activity. Or a bug gets introduced for a few months that prevents updates in certain cases. Or the job queue gets overloaded and jobs are manually deleted/de-duplicated in an emergency.
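To make the proposal concrete, here is a minimal sketch of the selection step, assuming page.page_links_updated is the timestamp chosen and that it is stored in the 14-digit YYYYMMDDHHMMSS form MediaWiki uses on MySQL/MariaDB. The function names and the (title, timestamp) row shape are hypothetical and only for illustration; this is not existing MediaWiki code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps use MediaWiki's 14-digit form, e.g. "20230816111800".
MW_TS_FORMAT = "%Y%m%d%H%M%S"


def parse_mw_timestamp(ts):
    """Parse a 14-digit timestamp string into an aware UTC datetime."""
    return datetime.strptime(ts, MW_TS_FORMAT).replace(tzinfo=timezone.utc)


def stale_pages(rows, max_age_days, now=None):
    """Return titles whose links timestamp is missing or older than the cutoff.

    rows is an iterable of (title, page_links_updated) pairs, where the
    timestamp may be None if links were never recorded as updated.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for title, links_updated in rows:
        # A missing timestamp is treated as maximally stale.
        if links_updated is None or parse_mw_timestamp(links_updated) < cutoff:
            stale.append(title)
    return stale
```

Pages returned by such a check would then be queued for a full re-parse. The value of X (here max_age_days) is left open, as in the description above.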

A number of users have developed scripts, such as touch.py, to iterate through lists of pages and null-edit each of them. This is an effective, but hackish and silly, workaround that seems to be awfully discouraged. If we want to prevent null-edit scripts from being run so often, we need to find a way to make pages and their metadata less stale.
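For reference, the workaround described above can also be expressed against the MediaWiki action API's purge module with its forcelinkupdate parameter, rather than through null edits. The sketch below only builds the POST parameters and sends nothing; the helper name and the batch size of 50 are assumptions for illustration, and the actual per-request title limit depends on the account's rights. This is not how touch.py itself works.

```python
def build_purge_requests(titles, batch_size=50):
    """Split titles into batches and build action=purge POST parameters.

    Hypothetical helper: it only prepares the parameter dicts. Each dict
    would be sent as a POST request to the wiki's api.php endpoint.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    titles = list(titles)
    requests_params = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        requests_params.append({
            "action": "purge",
            "format": "json",
            # The API accepts multiple titles separated by "|".
            "titles": "|".join(batch),
            # Ask the server to refresh the *links tables as well.
            "forcelinkupdate": "1",
        })
    return requests_params
```

A script of this kind still runs from the client side, so it has the same drawback as null-edit scripts: it only covers the pages someone remembers to feed it.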

As we go forward, adding new data sources such as arbitrary Wikidata data to our pages, it will be even more important to make sure that we're serving relatively fresh content to users. Forcing the pages to be regenerated occasionally seems like an appropriate solution.

Event Timeline

Yes please. This is a long-standing problem. Let me know if I can support this task in any way (testing or QA; I am not a developer).

Wbm1058 changed the task status from Open to In Progress. Aug 16 2023, 11:18 AM
Wbm1058 claimed this task.

I've noted that for a long time the status of this task has been "Open, Needs Triage".

Searching for the meaning of "triage" on Phabricator, I found that the term is not mentioned on the Bug management/Bug report life cycle page.

Searching Phabricator/Help, I found that "triagers will take care of tasks that have no project set." Krinkle has added the Performance-Team (Radar) project to this task, and they are Watching this task on their board. I guess that means Krinkle is a "triager".

I also see that there is a Triagers project, but members of this project just have permissions for Batch Edit which allows them to perform mass/bulk edits for tasks.

So, while I'm still fuzzy on the details of what constitutes "triage", I'm boldly triaging this task by claiming it, at least until someone tells me I can't do that. Follow the Bot request for approval link in my June 25, 2022 comment above to monitor my progress. I'm moving forward with this as parent task: T157670: "Periodically run refreshLinks.php on production sites" remains stuck on low priority, even after its subtask T159512 was closed as "resolved".

Well now I see the status is "Open, In Progress, Needs Triage" but I'm at a loss for how to remove "Needs Triage". A member of the Triagers project needs to do that?

SD0001 changed the task status from In Progress to Open. Aug 16 2023, 11:39 AM
SD0001 subscribed.

The task is to force the purges at the server level. Using a bot to purge pages through the API is, as noted in the task description, "an effective, but hackish and silly, workaround that seems to be awfully discouraged," although doing it is still welcome until a satisfactory solution to the problem is implemented.

("Needs Triage" refers to no priority rating being set, see https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities.)