T161556: Implement a way to have linter reprocess all pages
Closed, Resolved · Public

Description


Currently, Linter data is updated after a page is edited, which causes changeprop to request the new version of the page from Parsoid. However, sometimes we want to reprocess all pages due to code changes (e.g. T160599) or other errors (e.g. T160573#3135189).

In MediaWiki we have a script that reparses all pages sequentially for this purpose (refreshLinks.php). I was thinking of writing a similar script that would simply make a request to a single Parsoid instance for each page, so that the page gets reparsed and the lint data gets sent back to Linter.
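A minimal sketch of that approach (not the eventual script; it assumes the MediaWiki Action API's `list=allpages` query, Parsoid's `/{domain}/v3/page/html/{title}` endpoint, the `requests` library, and placeholder host/domain values):

```python
# Hypothetical sketch: walk every page title via the Action API and ask a
# single Parsoid instance to re-render it, discarding the returned HTML.
# The Parsoid host and wiki domain below are placeholders.
import requests
from urllib.parse import quote

API = "https://aa.wikipedia.org/w/api.php"   # MediaWiki Action API
PARSOID = "http://localhost:8000"            # internal Parsoid instance
DOMAIN = "aa.wikipedia.org"                  # wiki domain Parsoid expects

def all_titles():
    """Yield every page title from list=allpages, following continuation."""
    params = {"action": "query", "list": "allpages",
              "aplimit": "max", "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def reparse(title):
    """Request the page's HTML from Parsoid so it re-lints the page."""
    encoded = quote(title.replace(" ", "_"), safe="")
    url = f"{PARSOID}/{DOMAIN}/v3/page/html/{encoded}"
    requests.get(url).raise_for_status()

if __name__ == "__main__":
    for title in all_titles():
        reparse(title)
```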

Does that sound like a good plan? Other suggestions?

Event Timeline

The other use case (besides bug fixes) is when we implement code to surface other errors / warnings -- which there are plans for.

mobrovac subscribed.

Does the Parsoid HTML change as a result of code changes made to the linter? If so, you might consider bumping the content type patch version. We have a filter in RB that checks the version and re-renders the page if the versions don't match.

No, the HTML is independent and doesn't change when linter stuff does.

So, if I understand your description right, you basically want to trigger a parse of each page's current revision in Parsoid, but there is no need to update stored or cached content in RESTBase or Varnish. If so, then the htmldumper script should already be very close to what you are looking for. It supports requesting each title in a wiki from an API, and you can run it without storing any of the outputs. The exact API call might need some customization to hit the internal parsoid instance.

Arlolra triaged this task as Medium priority. Mar 28 2017, 2:21 PM

I agree that HTMLDumper might be the way to go. Regarding the URL, I have put a PR up that allows you to specify it from the command line.

Sorry, I forgot to comment here -- I ended up writing a small Python script to do this for now: https://git.legoktm.com/legoktm/parsoid-reparse

Is htmldumper already deployed somewhere?

It is not, but it's easy to set it up on a host with npm, like ruthenium.

This has nothing to do with the links tables, as far as I can see.

Maybe it would be useful to remove all known false positives (modules, CSS, JS) from the database, because the wiki authors cannot do anything to remove them. By the way, at the German Wikivoyage more than 90% of all linter messages are false positives!

Why do you say this? I fixed a lot of such files.

We cannot fix these module, CSS and JS pages because they are free of errors; we cannot fix false positives. And we cannot fix modules anyway because they use another content model.

Please give an example of something that you consider wrong in wikitext but that becomes a false positive in a .js file.

Please have a look at

Most of the files listed here are modules, including MediaWiki:Common.css, and they are not wikitext articles. The modules were used more than 10,000 times and do not show any error in the articles where they are used.

I see. Yes, it would be better without them, but I just fix those.

Btw, if this gets implemented, the problem of anonymous users editing an article just to add "(25 years old)" after a birthday, when all that is really needed is a purge, will be almost solved...

Mentioned in SAL (#wikimedia-operations) [2017-09-04T22:07:35Z] <legoktm> starting script to reparse all pages in parsoid for Linter (python2 parsoid-reparse.py http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed https://aa.wikipedia.org/w/api.php) - T161556

Mentioned in SAL (#wikimedia-operations) [2017-09-07T18:39:19Z] <legoktm> restarted script to reparse all pages in parsoid for Linter (python3 parsoid_reparse.py http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed https://aa.wikipedia.org/w/api.php) - T161556

I fixed some problems with the script that legoktm created and restarted it with a high concurrency level on Tuesday, May 14th. It finished in about 5 days, processing pages on all active wikis (except Commons and Wikidata). A total of about 80M pages were processed. However, I noticed that talk namespaces haven't been processed properly (and maybe other non-article namespaces as well; I need to investigate that). So, for now, I restarted it to process all talk namespace pages on all those wikis. It has been running for about 3 hours now and I expect it to finish in < 24 hours.
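For reference, a hedged sketch of what a bounded-concurrency re-run could look like (it reuses the hypothetical `reparse` helper from the earlier sketch; the worker count is illustrative, not the value actually used):

```python
# Hypothetical: drive the per-page Parsoid requests through a bounded thread
# pool instead of sequentially, so a full-wiki pass finishes in days, not weeks.
from concurrent.futures import ThreadPoolExecutor, as_completed

def reparse_all(titles, workers=50):
    """Reparse every title with up to `workers` concurrent Parsoid requests."""
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its title so failures can be retried later.
        futures = {pool.submit(reparse, t): t for t in titles}
        for fut in as_completed(futures):
            if fut.exception() is not None:
                failures.append(futures[fut])
    return failures
```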

@Elitre FYI about T181007 since that is causing some of the delays in linter updates in the UI. But, once the script I restarted finishes tomorrow OR Services fixes that issue (whichever happens first), the delay should go away.

I fixed some problems with the script that legoktm created and restarted it with a high concurrency level on Tuesday, May 14th.

I think you meant November here.

ssastry claimed this task.