T364690: Transclusion doesn't work after deletion of source file

Open, Needs Triage, Public, BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  1. File has been uploaded to Commons - https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AКрасный+библиотекарь+%28журнал%29%2C+1923%2C+№+1.pdf
  2. Index has been created in Wikisource - https://ru.wikisource.org/wiki/Индекс:Красный_библиотекарь_(журнал),_1923,_№_1.pdf
  3. Text has been recognised
  4. Index has been transcluded into pages with <pages> tag - https://ru.wikisource.org/wiki/Служебная:Ссылки_сюда?target=Индекс%3AКрасный+библиотекарь+%28журнал%29%2C+1923%2C+№+1.pdf&namespace=&hidelinks=1
  5. File has been deleted from Commons - https://commons.wikimedia.org/w/index.php?title=Special:Log&page=File%3AКрасный+библиотекарь+%28журнал%29%2C+1923%2C+№+1.pdf
  6. Index file and recognised text are still available in Wikisource - https://ru.wikisource.org/wiki/Индекс:Красный_библиотекарь_(журнал),_1923,_№_1.pdf
  7. Transclusion doesn't work after deletion of the source file, for example https://ru.wikisource.org/w/index.php?title=Работа_детских_библиотек_в_Москве_(Каптерева)&oldid=4627379

What happens?:
<pages> doesn't return anything

What should have happened instead?:
<pages> should render the text from the Index even if the source file is absent
(optional for this bug) add a tracking category for indexes without source files

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):
MediaWiki 1.43.0-wmf.4 (2111e6d) - https://ru.wikisource.org/wiki/Служебная:Версия

Other information (browser name/version, screenshots, etc.):

Event Timeline

Reedy renamed this task from Transclusion doesn't work atfer deletion of source file to Transclusion doesn't work after deletion of source file. May 12 2024, 3:07 PM

This is expected: if there is no file backing an index, transclusions won't work.

I agree—this is expected behavior. Moving forward there are a few options:

  1. Re-upload the file. Since it was deleted at Commons for copyright reasons, I recommend against doing that at Commons (though that would solve your technical problem, albeit briefly). That said, since WMF servers are in the USA (including Wikisource Russian), you might ultimately run into issues even if you re-uploaded it to s:ru:Файл:Красный библиотекарь (журнал), 1923, № 1.pdf instead of c:File:Красный библиотекарь (журнал), 1923, № 1.pdf
  2. Use Шаблон:Страница to transclude the pages instead of <pages/> since the individual Страница: wikitext still exists for each page.

In any event, based upon c:Commons:Deletion requests/Красный библиотекарь (журнал), it looks like you are running into a legal issue of copyrights. You should probably give up on hosting that at WMF until the copyrights have expired from a USA point of view (or you can get the copyright holders to publish their content under a WMF acceptable free license such as CC-BY, CC-BY-SA, GPL, LGPL, FAL/LAL, ODC, GFDL, etc.).

Even the second option I listed above is likely not a long-term solution. But you might be able to use such a method to reconstruct the target(s) long enough to download and store the transcriptions locally before they ultimately get deleted for copyright reasons (by an admin/sysop at Wikisource Russian).

As an example, I did Служебная:Изменения/5130429/prev as can be seen at: Библиотека и учащиеся (Смушкова).

The case above is just an example. It could be any file, deleted for any reason. After the file is deleted, the text is no longer displayed even though it still exists.

The case above is just an example. It could be any file, deleted for any reason. After the file is deleted, the text is no longer displayed even though it still exists.

In other situations where there are no legal or other policy issues, I recommend just re-uploading the file (but Commons does not usually delete things without ample cause).

This is expected: if there is no file backing an index, transclusions won't work.

This shouldn't happen, because the <pages/> mechanism works with text in the Page and Index NS, whereas the media file is in the File NS. These are independent spaces.

  1. Use Шаблон:Страница to transclude the pages instead of <pages/> since the individual Страница: wikitext still exists for each page.
  • The "Страница" template is not recommended for use, like the English Template:Page, which carries the warnings "This template has been deprecated. Please see Help:Transclusion instead." and "Use of this template is discouraged. Users should preferentially be using the <pages /> syntax as detailed at Help:Transclusion".
  • This requires replacing the existing <pages/> with a mass of "Страница" templates with a different argument format. Moreover, there can be many dozens of lines of this template, as many as there were index pages in <pages/>, and all this on the mass of pages that were associated with this index. So this is impossible to do manually in practice (for a single issue of a journal or book that means tens to hundreds of index pages, across many pages of the Main NS; in the example mentioned, >120 index pages and >35 related pages). (A rough scripting sketch follows this list.)
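
For what it's worth, such a replacement could at least be generated by a script rather than typed by hand. A minimal sketch in Python; the {{Страница}} parameter form used here is an assumption for illustration, not the template's documented interface:

# Hypothetical helper: expand one <pages index=... from=... to=.../> call
# into individual {{Страница}} calls, one per Page: subpage.
def pages_to_templates(index_name, from_page, to_page):
    lines = []
    for n in range(from_page, to_page + 1):
        # Assumed parameter form; check Шаблон:Страница's docs for the real one.
        lines.append("{{Страница|%s/%d}}" % (index_name, n))
    return "\n".join(lines)

print(pages_to_templates("Красный библиотекарь (журнал), 1923, № 1.pdf", 5, 8))

The output would then be pasted (or bot-substituted) into each affected main-namespace page.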

In addition, there is no tracking category for such errors. We found out about this completely by accident. It is not known how many similar cases exist in this project and in the Wikisources in all other languages.

In any event, based upon c:Commons:Deletion requests/Красный библиотекарь (журнал), it looks like you are running into a legal issue of copyrights.

This example is a complicated situation. The file was deleted over a copyright claim by the magazine's editor. However, under the law editors are not recognized as authors, so deletion on that ground is not legal. But it is useless to argue this on Wikimedia Commons, since it is a separate project with independent administrators who often delete some files out of caution, while a lot of others with gross copyright violations and false license templates remain there and multiply daily.
Also, several articles (a dozen pages of the magazine) do violate copyright. That is the real reason to delete the file.

But it is a bug that all the pages of this magazine's other authors that use <pages/> are now broken.

As an example, I did Служебная:Изменения/5130429/prev as can be seen at: Библиотека и учащиеся (Смушкова).

Here is a similar example from there. Sorry, I deleted your example before I read this message, because it is a copyright violation ("Смушкова").

This shouldn't happen, because the <pages/> mechanism works with text in the Page and Index NS, whereas the media file is in the File NS. These are independent spaces.

But they aren't independent because the Index and Page pages depend on the File (and the Page pages also depend on the Index).

  • The "Страница" template is not recommended for use, like the English Template:Page, which carries the warnings "This template has been deprecated. Please see Help:Transclusion instead." and "Use of this template is discouraged. Users should preferentially be using the <pages /> syntax as detailed at Help:Transclusion".
  • This requires replacing the existing <pages/> with a mass of "Страница" templates with a different argument format. Moreover, there can be many dozens of lines of this template, as many as there were index pages in <pages/>, and all this on the mass of pages that were associated with this index. So this is impossible to do manually in practice (for a single issue of a journal or book that means tens to hundreds of index pages, across many pages of the Main NS; in the example mentioned, >120 index pages and >35 related pages).

I never suggested it was easy, fun or recommended—just possible.

In addition, there is no tracking category for such errors. We found out about this completely by accident. It is not known how many similar cases exist in this project and in the Wikisources in all other languages.

I agree this could use considerably better error reporting/tracking.

This example is a complicated situation. The file was deleted over a copyright claim by the magazine's editor. However, under the law editors are not recognized as authors, so deletion on that ground is not legal. But it is useless to argue this on Wikimedia Commons, since it is a separate project with independent administrators who often delete some files out of caution, while a lot of others with gross copyright violations and false license templates remain there and multiply daily.
Also, several articles (a dozen pages of the magazine) do violate copyright. That is the real reason to delete the file.

Well, it is up to Wikisource Russian whether or not to allow the files to be hosted locally; however, WMF does have some say when it comes to legal issues like copyrights.

One way around this is to censor the PDFs, replacing the infringing pages with blanks, and then re-upload them somewhere (either locally or at Commons).
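
A minimal sketch of that blanking approach, assuming the Python pypdf library (the file names and page numbers are hypothetical):

# Replace the still-copyrighted pages of a PDF with blanks before re-upload.
from pypdf import PdfReader, PdfWriter

COPYRIGHTED = {12, 13, 14}  # hypothetical 1-based numbers of infringing pages

reader = PdfReader("original.pdf")
writer = PdfWriter()
for i, page in enumerate(reader.pages, start=1):
    if i in COPYRIGHTED:
        # A blank page of the same size keeps the page count unchanged,
        # so the existing Page:...pdf/N subpage numbers still line up.
        writer.add_blank_page(width=page.mediabox.width, height=page.mediabox.height)
    else:
        writer.add_page(page)

with open("censored.pdf", "wb") as f:
    writer.write(f)

Keeping the page count identical matters here, because the Page: subpages are numbered by position in the file.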

Alternatively you could break the pages apart into images like JPEGs and upload all the individual pages, skipping the ones still under copyright. I am sure you are aware of the ability to create Index pages across such individual page images.

It would be nice if the guys at Commons somehow notified the wikis consuming their media when they make such deletions, etc.

But it is a bug that all the pages of this magazine's other authors that use <pages/> are now broken.

I won't argue that things are not broken. I would argue this is expected behavior and not a bug.

Here is a similar example from there. Sorry, I deleted your example before I read this message, because it is a copyright violation ("Смушкова").

That is not a problem. I do not really know Russian so I was sort of shooting in the dark anyway (I had just randomly picked a main space article that depended on the Index and some of the Page pages from the ones related to this discussion).

This shouldn't happen, because the <pages/> mechanism works with text in the Page and Index NS, whereas the media file is in the File NS. These are independent spaces.

But they aren't independent because the Index and Page pages depend on the File (and the Page pages also depend on the Index).

No, <pages/> only uses existing text in the Page NS; the file is absolutely not needed for this and is not used (why would it be?). The Page and Index NS only use the file during index-page creation and to visually assist users in proofreading. So <pages/> should continue to display that text, without breaking pages, when the file is deleted.


I never suggested it was easy, fun or recommended—just possible.

This is impossible to do manually in practice. You will not correct the pages of this journal; neither I nor anyone else in our small project will change them to {{страница}}, due to the technical complexity; no one ever will. The pages remain broken. And there is no reason to delete them, because they are proofread in the Page NS and are legal.


I won't argue that things are not broken. I would argue this is expected behavior and not a bug.

This is a massive bug, unfortunately. A lot of pages with free licenses are broken, like this, this, this, etc. Users and administrators don't even know about it, and corrections are technically impossible.
What, then, is a "bug", if not the breakage of a lot of pages, with text disappearing even though it is present in the system and is legal?

I agree the lack of useful error reporting to know when this happens is certainly a bug, but Proofread Page (PRP) is specifically for media-backed transcriptions. As far as I know it has never supported transcription without being backed by some File media, so in that way I would argue this is not a bug. Most Wikisource sites do support other forms of transcription (and even translation, etc.) not involving PRP (in addition to supporting PRP-based ones).

I disagree that <pages/> should be able to emit Page content without the supporting File media. It could perhaps be made to do so, but that would be an additional feature request, not a bug (and I am not convinced it is even the right thing to do). It could only be a bug if it had supported that in the past, and as far as I know it never has. Your wanting it to support such a feature does not make it a bug (otherwise I could make ludicrous claims that it is a bug because it does not provide me monetary income or some other random feature, etc.).


No, <pages/> only uses existing text in the Page NS; the file is absolutely not needed for this and is not used (why would it be?). The Page and Index NS only use the file during index-page creation and to visually assist users in proofreading. So <pages/> should continue to display that text, without breaking pages, when the file is deleted.

That's incorrect: ProofreadPage depends on the File namespace to build the pagelist, figure out the valid Page: namespace pages, and determine the transclusion order of the pages.
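
Roughly, and only as an illustrative sketch of that dependency (this is not the extension's real code; all names are made up):

# The page count reported for the backing File bounds and orders the Page:
# subpages that <pages/> may transclude; with no file there is nothing to
# enumerate.
def resolve_transclusion(index_name, file_page_count, from_page, to_page):
    if file_page_count is None:  # backing file deleted or unreadable
        raise RuntimeError("no backing file for Index:%s" % index_name)
    last = min(to_page, file_page_count)
    return ["Page:%s/%d" % (index_name, n) for n in range(from_page, last + 1)]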


This is a massive bug, unfortunately. A lot of pages with free licenses are broken, like this, this, this, etc. Users and administrators don't even know about it, and corrections are technically impossible.
What, then, is a "bug", if not the breakage of a lot of pages, with text disappearing even though it is present in the system and is legal?

The lack of error reporting could be construed as a bug; the breakage of pages is not a bug and is intended behavior. You're trying to start a car with water and then claiming it is a manufacturing defect.

The correct approach in this kind of scenario is to upload the file to your local wiki (if the local/global policies allow it) with the same name as the now deleted file on commons. If the local/global policies don't allow it, then you are out of luck and Phabricator is not the correct place to dispute that.

That's incorrect: ProofreadPage depends on the File namespace to build the pagelist, figure out the valid Page: namespace pages, and determine the transclusion order of the pages.

Even in situations where the pagelist is entirely determined by the Index (e.g., loose, non-multipage media like a pile of images), I believe there is a dependency on File media.

The lack of error reporting could be construed as a bug; the breakage of pages is not a bug and is intended behavior. You're trying to start a car with water and then claiming it is a manufacturing defect.

I agree. This is definitely a bug, but only as to the lack of proper and useful error reporting under such circumstances, which are actually fairly common and quite hard to detect and find.

The correct approach in this kind of scenario is to upload the file to your local wiki (if the local/global policies allow it) with the same name as the now deleted file on commons. If the local/global policies don't allow it, then you are out of luck and Phabricator is not the correct place to dispute that.

Again I agree. Short of complete removal of all related content (which is always an option, but it has its downsides), I believe there are basically three long-term approaches:

  1. Modify the PDFs to replace the content still under copyright with blank pages/sections and then re-upload them, either locally or at Commons
  2. Break the pages apart and re-upload the pages with non-copyright-infringing material as separate image pages. There are a number of image formats, and again the media can be hosted either locally or at Commons
  3. Forsake the File media and do not use PRP to transcribe the non-copyright-infringing material, thereby hosting the content without it being media-backed.

I personally prefer option one because, as parts of the original document fall out of copyright over time, the media can be re-uploaded repeatedly, slowly adding those parts back until the complete publication can eventually be hosted as-is.

If option one does not seem optimal for Wikisource Russian, then I recommend option three and finally option two. Ultimately it is up to the Wikisource Russian community to decide what is best for it (be that via community leadership or consensus), etc.

FYI: In terms of the error reporting bug from this issue, the following seems to be applicable:

Change #1031631 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[mediawiki/extensions/ProofreadPage@master] Add a tracking category for pagelist tags without associated Files

https://gerrit.wikimedia.org/r/1031631

It seems like that patch, when merged, might solve the issue of finding <pagelist/> usage where the File media cannot be properly processed (in the case of that issue/task it was related to media-hosting problems that could be rectified with purges and null edits, but it should also apply when the backing media is deleted).

Change #1031631 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[mediawiki/extensions/ProofreadPage@master] Add a tracking category for pagelist tags without associated Files

https://gerrit.wikimedia.org/r/1031631

The correct approach in this kind of scenario is to upload the file ...

Without tracking tools, nobody knows which pages are broken. How can all the pages that need new files uploaded be found, across all Wikisource instances? They can only be found by chance.

https://gerrit.wikimedia.org/r/1031631

It seems like that patch, when merged, might solve the issue of finding <pagelist/> usage where the File media cannot be properly processed

The patch has not been accepted so far. Am I wrong, or will the broken pages and their tracking never be fixed?


That's incorrect: ProofreadPage depends on the File namespace to build the pagelist, figure out the valid Page: namespace pages, and determine the transclusion order of the pages.

That's what I said: the file is needed only when creating the pagelist (index) and when proofreading.

The <pages /> tag should only take the text of pages from the local Wikisource database, according to the pre-generated index list. Why would it access an image file in the Commons repository? Sounds like nonsense. If this is indeed the case, then this is a major non-optimization in the code.

@Vladis13: You are correct as long as nobody has reviewed and merged the patch. This generally applies for all and any patches. For the general topic, please see https://www.mediawiki.org/wiki/Gerrit/Code_review/Getting_reviews - thanks.


The patch has not been accepted so far. Am I wrong, or will the broken pages and their tracking never be fixed?

Code review on ProofreadPage is generally fairly slow, for a variety of reasons.


The <pages /> tag should only take the text of pages from the local Wikisource database, according to the pre-generated index list. Why would it access an image file in the Commons repository? Sounds like nonsense. If this is indeed the case, then this is a major non-optimization in the code.

The index is computed based on the image file, or through manually linked pages as a fallback. I do not see why this is a "major non-optimization" or "nonsense".


The index is computed based on the image file, or through manually linked pages as a fallback. I do not see why this is a "major non-optimization" or "nonsense".

You are saying that with EVERY read of a page of a work by the site's readers, the index is regenerated, scanning the file in the Commons repository. This is a time-consuming and resource-intensive operation.

Then it should be like this:

  • Only when the index page (Index NS) is saved does the extension analyze the arguments of the <pagelist /> tag (which defines the mapping: PDF page number = book page number or a literal label), create the index (the mapping; it is short, a banal "[{1: 'image'}, {2: 4}, {3: 5}, …]" according to the number of PDF pages), and save it in the database. That is all.
  • When a reader opens a page of a work, the <pages/> tag works only with the database: 1) it takes the page mapping from the database and the range of page numbers from the arguments in the tag; 2) it takes the text of the pages from the database, cuts out the sections specified in the arguments, and renders the HTML, gluing the page texts together and inserting between them the book page numbers, linked to the Page NS pages, according to the mapping (index). That is all.

Compare the resource costs of generating the index from scratch on every read versus once, when saving the index page. I think the difference is obvious. (A rough sketch of this proposed mechanism follows.)
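
Sketched in Python, the proposal would look something like this (everything here, names and storage API included, is hypothetical; it describes the proposal, not existing ProofreadPage code):

# Runs once per Index save: the only step that needs the file (for its
# page count). Persists the position -> label mapping.
def on_index_save(index_name, pagelist_args, file_page_count, db):
    mapping = {}
    for pos in range(1, file_page_count + 1):
        mapping[pos] = pagelist_args.get(pos, str(pos))  # e.g. 1 -> 'image', 2 -> '4'
    db.store_mapping(index_name, mapping)

# Runs on render: touches only the database, never the file.
def render_pages_tag(index_name, from_page, to_page, db):
    mapping = db.load_mapping(index_name)
    parts = []
    for pos in range(from_page, to_page + 1):
        text = db.load_page_text("Page:%s/%d" % (index_name, pos))
        parts.append("[[Page:%s/%d|%s]]\n%s" % (index_name, pos, mapping[pos], text))
    return "\n".join(parts)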


You are saying that with EVERY read of a page of a work by the site's readers, the index is regenerated, scanning the file in the Commons repository. This is a time-consuming and resource-intensive operation.


The recompilation happens on page save or link update on related pages, not on every render.


The recompilation happens on page save or link update on related pages, not on every render.

And why does the code, on each rendering, recompile the index (mapping) although it does not change, and scan the PDF on Commons, which is not needed since only text from the database is used? And besides the optimization problem, this leads to the mass page breakage under discussion.


And why does the code, on each rendering, recompile the index (mapping) although it does not change, and scan the PDF on Commons, which is not needed since only text from the database is used? And besides the optimization problem, this leads to the mass page breakage under discussion.

As I said before, the compilation does not happen on each render. It happens when a template transcluded on or related to the page changes, OR when the page is purged, OR when the page is edited (basically, ProofreadPage and the underlying parser treat the <pages /> tag like a series of templates, and updates are triggered whenever MediaWiki decides that the templates need updating).

As has been mentioned before, the problem here is an on-wiki licensing dispute that should be resolved on-wiki, not by changing the software to support a non-standard use case.

[When commenting, please remove unneeded quotes (that you don't refer to) to keep things readable. Thanks.]

And why does the code, on each rendering, recompile the index (mapping) although it does not change, and scan the PDF on Commons, which is not needed since only text from the database is used? And besides the optimization problem, this leads to the mass page breakage under discussion.

As I said before, the compilation does not happen on each render. It happens when a template transcluded on or related to the page changes, OR when the page is purged, OR when the page is edited (basically, ProofreadPage and the underlying parser treat the <pages /> tag like a series of templates, and updates are triggered whenever MediaWiki decides that the templates need updating).

Please answer my question above, and don't quibble over words. Whether recompilation happens on rendering or, a couple of times less often, on page saves/transclusions, there is almost no difference. When proofreading a book, a mass of edits/saves is made across tens and hundreds of transcluded pages. And on each of those thousands of edits per book, as you say, the code performs useless duplicate recompilations and HTTPS requests to the PDF on Commons; this leads to the mass page breakage under discussion, across many Wikisources.
Whereas in reality the index page is usually edited only a few times, to create the desired pagelist; nothing would need to be recompiled, and everything would work without errors.

Please answer my question above: why?

As has been mentioned before, the problem here is an on-wiki licensing dispute that should be resolved on-wiki, not by changing the software to support a non-standard use case.

This has nothing to do with licensing (as I said above, pages with free licenses are broken en masse). Your answer looks like an attempt to shift responsibility to other projects and send me away.

If you are not personally interested in this task, or are too lazy to program, then simply do not respond to this discussion. Thank you!


Please answer my question above, and don't quibble over words. Whether recompilation happens on rendering or, a couple of times less often, on page saves/transclusions, there is almost no difference. When proofreading a book, a mass of edits/saves is made across tens and hundreds of transcluded pages. And on each of those thousands of edits per book, as you say, the code performs useless duplicate recompilations and HTTPS requests to the PDF on Commons; this leads to the mass page breakage under discussion, across many Wikisources.
Whereas in reality the index page is usually edited only a few times, to create the desired pagelist; nothing would need to be recompiled, and everything would work without errors.

Please answer my question above: why?

I'm not the one quibbling over words here. Recompilation on cache invalidation is very different from rendering on every page view: one happens every few days, and the other happens every time you view a page.

I don't think the code here is duplicating any work. Every time a page is saved or a related page is updated, there is a chance that the file does not exist or has changed, and that the previously constructed sequence is invalid. ProofreadPage needs to account for this, which it does by checking for the existence of the file, computing the list of pages, and then re-rendering the sequence (or displaying an error if the sequence is invalid).
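
Schematically (a hypothetical simplification for illustration, not MediaWiki's or ProofreadPage's actual caching code):

# The file check and sequence rebuild run only on invalidating events;
# ordinary page views are served from the cached parse.
def needs_recompute(event):
    return event in {"page_saved", "page_purged", "links_update"}

def serve(event, cached_html, recompute):
    # recompute() re-checks the file and rebuilds the sequence (or renders
    # an error); a plain page view just reuses the cached output.
    return recompute() if needs_recompute(event) else cached_html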

As has been mentioned before, the problem here is an on-wiki licensing dispute that should be resolved on-wiki, not by changing the software to support a non-standard use case.

This has nothing to do with licensing (as I said above, pages with free licenses are broken en masse). Your answer looks like an attempt to shift responsibility to other projects and send me away.

If you are not personally interested in this task, or are too lazy to program, then simply do not respond to this discussion. Thank you!

You have failed to provide a valid use case in which a file would need to be deleted after an Index has been created. The examples given in this task are cases where the licenses of the file were disputed on Commons. That should be handled on Commons or via local policies, not by making software changes. If a file is indeed non-free, the transcription simply should not exist at all.

General request: Please read and follow https://www.mediawiki.org/wiki/Code_of_Conduct and https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette and refrain from personal attacks if you would like to be active in Wikimedia Phabricator. Thanks a lot.

I don't think the code here is duplicating any work. Every time a page is saved or a related page is updated, there is a chance that the file does not exist or has changed, and that the previously constructed sequence is invalid. ProofreadPage needs to account for this, which it does by checking for the existence of the file, computing the list of pages, and then re-rendering the sequence (or displaying an error if the sequence is invalid).

That is the functionality of the separate <pagelist/> tag, run when saving the Index page.
The <pages/> tag only needs the existing pages and the pre-compiled <pagelist/> page list from the database. Running book indexing and file scanning in the <pages/> tag achieves nothing but pointless duplicate recompilations, broken pages (as discussed), and load on the server.

When the file changes, nothing in the output of <pages/> changes that depends on the file, since it outputs the text of the pages in the Page NS, taking it from the database according to the attributes defined in the tag (page range and sections); the pages with the specified numbers will still be rendered. So scanning the file from <pages/> is pointless.
For example, the tag <pages index="Book.pdf" from=1 to=10 /> will output pages "Page:Book.pdf/1" through "Page:Book.pdf/10" as saved in the database for the old version of the file, even if the file changes. Right?


You have failed to provide a valid use case in which a file would need to be deleted after an Index has been created. The examples given in this task are cases where the licenses of the file were disputed on Commons. That should be handled on Commons or via local policies, not by making software changes. If a file is indeed non-free, the transcription simply should not exist at all.

Google Translate does not translate your first sentence clearly for me. I gave examples of broken pages with free licenses (at the bottom of T364690#9793607).

Change #1031631 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Add a tracking category for pagelist tags without associated Files

https://gerrit.wikimedia.org/r/1031631