T161556: Implement a way to have linter reprocess all pages
Closed, Resolved · Public

Description


Currently, Linter data is updated after a page is edited, which causes changeprop to request the new version of the page from Parsoid. However, sometimes we want to reprocess all pages due to code changes (e.g. T160599) or other errors (e.g. T160573#3135189).

In MediaWiki we have a script that reparses all pages sequentially for this purpose (refreshLinks.php). I was thinking of writing a similar script that would simply make a request to a single Parsoid instance for each page, so that the page gets reparsed and the lint data gets sent back to Linter.
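A minimal sketch of that approach (not the eventual script; it assumes the MediaWiki Action API's `list=allpages` query, Parsoid's `/{domain}/v3/page/html/{title}` endpoint, the `requests` library, and placeholder host/domain values):

```python
# Hypothetical sketch: walk every page title via the Action API and ask a
# single Parsoid instance to re-render it, discarding the returned HTML.
# The Parsoid host and wiki domain below are placeholders.
import requests
from urllib.parse import quote

API = "https://aa.wikipedia.org/w/api.php"   # MediaWiki Action API
PARSOID = "http://localhost:8000"            # internal Parsoid instance
DOMAIN = "aa.wikipedia.org"                  # wiki domain Parsoid expects

def all_titles():
    """Yield every page title from list=allpages, following continuation."""
    params = {"action": "query", "list": "allpages",
              "aplimit": "max", "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def reparse(title):
    """Request the page's HTML from Parsoid so it re-lints the page."""
    encoded = quote(title.replace(" ", "_"), safe="")
    url = f"{PARSOID}/{DOMAIN}/v3/page/html/{encoded}"
    requests.get(url).raise_for_status()

if __name__ == "__main__":
    for title in all_titles():
        reparse(title)
```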

Does that sound like a good plan? Other suggestions?

Event Timeline

The other use case (besides bug fixes) is when we implement code to surface other errors / warnings -- which there are plans for.

mobrovac subscribed.

Does the Parsoid HTML change as a result of code changes made to the linter? If so, you might consider bumping the content type patch version. We have a filter in RB that checks the version and re-renders the page if the versions don't match.

No, the HTML is independent and doesn't change when linter stuff does.

So, if I understand your description right, you basically want to trigger a parse of each page's current revision in Parsoid, but there is no need to update stored or cached content in RESTBase or Varnish. If so, then the htmldumper script should already be very close to what you are looking for. It supports requesting each title in a wiki from an API, and you can run it without storing any of the outputs. The exact API call might need some customization to hit the internal parsoid instance.

Arlolra triaged this task as Medium priority. Mar 28 2017, 2:21 PM

I agree that HTMLDumper might be the way to go. Regarding the URL, I have put a PR up that allows you to specify it from the command line.

Sorry, I forgot to comment here -- I ended up writing a small Python script to do this for now: https://git.legoktm.com/legoktm/parsoid-reparse

Is htmldumper already deployed somewhere?

It is not, but it's easy to set it up on a host with npm, like ruthenium.

This has nothing to do with the links tables, as far as I can see.

Maybe it would be useful to remove all known false positives (modules, CSS, JS) from the database, because the wiki authors cannot do anything to remove them. By the way, at the German Wikivoyage more than 90% of all linter messages are false positives!

Why do you say this? I fixed a lot of such files.

We cannot fix these module, CSS and JS pages because they are free of errors; we cannot fix false positives. And we cannot fix modules anyway because they use another content model.

Please give an example of something that you consider wrong in wikitext but that becomes a false positive in a .js file.

Please have a look at

Most of the files listed here are modules, including MediaWiki:Common.css, and they are not wikitext articles. The modules were used more than 10,000 times and do not show any error in the articles where they are used.

I see. Yes, it would be better without them, but I just fix those.

Btw, if this gets implemented, the problem of anonymous users editing an article just to add "(25 years old)" after a birthday, when all that is really needed is a purge, will be almost solved...

Mentioned in SAL (#wikimedia-operations) [2017-09-04T22:07:35Z] <legoktm> starting script to reparse all pages in parsoid for Linter (python2 parsoid-reparse.py http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed https://aa.wikipedia.org/w/api.php) - T161556

Mentioned in SAL (#wikimedia-operations) [2017-09-07T18:39:19Z] <legoktm> restarted script to reparse all pages in parsoid for Linter (python3 parsoid_reparse.py http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed https://aa.wikipedia.org/w/api.php) - T161556

I fixed some problems with the script that legoktm created and restarted it with a high concurrency level on Tuesday, May 14th. It finished in about 5 days, processing pages on all active wikis (except Commons and Wikidata). A total of about 80M pages were processed. However, I noticed that talk namespaces haven't been processed properly (and maybe other non-article namespaces as well; I need to investigate that). So, for now, I restarted it to process all talk namespace pages on all those wikis. It has been running for about 3 hours now and I expect it to finish in < 24 hours.
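For reference, a hedged sketch of what a bounded-concurrency re-run could look like (it reuses the hypothetical `reparse` helper from the earlier sketch; the worker count is illustrative, not the value actually used):

```python
# Hypothetical: drive the per-page Parsoid requests through a bounded thread
# pool instead of sequentially, so a full-wiki pass finishes in days, not weeks.
from concurrent.futures import ThreadPoolExecutor, as_completed

def reparse_all(titles, workers=50):
    """Reparse every title with up to `workers` concurrent Parsoid requests."""
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its title so failures can be retried later.
        futures = {pool.submit(reparse, t): t for t in titles}
        for fut in as_completed(futures):
            if fut.exception() is not None:
                failures.append(futures[fut])
    return failures
```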

@Elitre FYI about T181007 since that is causing some of the delays in linter updates in the UI. But, once the script I restarted finishes tomorrow OR Services fixes that issue (whichever happens first), the delay should go away.

I fixed some problems with the script that legoktm created and restarted it with a high concurrency level on Tuesday, May 14th.

I think you meant November here.

ssastry claimed this task.