Usage requirement of mo audio element · Issue #1986 · w3c/epub-specs · GitHub
Closed
iherman opened this issue Feb 4, 2022 · 30 comments · Fixed by #2001
Labels
EPUB33 Issues addressed in the EPUB 3.3 revision Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation Topic-MediaOverlays The issue affects media overlays

Comments

@iherman
Member

iherman commented Feb 4, 2022

I am a bit bothered by the usage requirement of the audio element in MO. The text says:

Usage
A REQUIRED child of the par element unless its sibling text element refers to audio or video media, or to textual content intended for rendering via text-to-speech, in which case it is OPTIONAL (see 7.3.2.4 Embedded Media).

As read, this sounds like an untestable requirement to me, unless it is tested at run time. After all, how does a checker know that an HTML file, linked through a text element, is used with text-to-speech? We do not normatively define anything for TTS, after all.

Maybe EPUBCheck found a trick that I am missing (@rdeltour?). Otherwise, it may be cleaner to replace that requirement with something like:

An OPTIONAL child of the par element. If the element is missing, its adjacent text element either refers to audio or video, or is supposed to refer to textual content intended for rendering via text-to-speech.

I think that content-wise we are saying the same thing, but the checker is not under the obligation of checking this.

@iherman iherman added the Topic-MediaOverlays The issue affects media overlays label Feb 4, 2022
@iherman
Member Author

iherman commented Feb 4, 2022

@mattgarrish
Member

mattgarrish commented Feb 4, 2022

After all, how does a checker know that an HTML file, linked through a text element, is used with text-to-speech?

I'm not sure what you're asking by this. It's an instruction to the reading system to render it via TTS when the audio element isn't present and the element doesn't refer to embedded media. If you check what is referenced by the fragment, what would make this complex to test?
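As a rough illustration of what "checking what is referenced by the fragment" could look like in a checker, here is a minimal Python sketch. It is not EPUBCheck's logic. The function names, the `documents` mapping, and the three result labels are assumptions made for this example, and the XHTML is assumed to be well-formed XML that can be looked up by `id`.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"  # namespace used by Media Overlay documents


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def find_by_id(root, fragment):
    """Return the first element whose id attribute equals the fragment, or None."""
    for element in root.iter():
        if element.get("id") == fragment:
            return element
    return None


def classify_pars_without_audio(smil_source, documents):
    """Classify the text target of every par that lacks an audio child.

    `documents` (hypothetical) maps a path, as written in text@src, to XHTML source.
    Returns (src, kind) pairs; kind is 'embedded-media', 'tts-candidate',
    or 'unresolved'.
    """
    results = []
    for par in ET.fromstring(smil_source).iter(SMIL_NS + "par"):
        if par.find(SMIL_NS + "audio") is not None:
            continue  # audio is present, so there is nothing further to decide
        text = par.find(SMIL_NS + "text")
        if text is None:
            continue
        src = text.get("src", "")
        path, _, fragment = src.partition("#")
        target = None
        if path in documents and fragment:
            target = find_by_id(ET.fromstring(documents[path]), fragment)
        if target is None:
            kind = "unresolved"
        elif local_name(target) in {"audio", "video"}:
            kind = "embedded-media"
        else:
            kind = "tts-candidate"
        results.append((src, kind))
    return results
```

A static check like this can only say what kind of element is referenced; whether a Reading System actually speaks the text still has to be verified at playback time.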

The test to verify would be to play back the SMIL file and see if the reading system renders the text content of the referenced element using its own TTS playback. If not, it fails.

We don't define TTS enhancement technologies like SSML, PLS, and CSS3 Speech, but there's still a requirement for Reading Systems to render the content via TTS in media overlays. That's the section the link on TTS goes to.

I agree it would be better to say it's optional except when required by the other element, though. It's always weird to start off saying it's required and then not make it required. But it would be better to stick with the XPath axes terminology ("sibling" not "adjacent").

Maybe we can make this a list to make it easier to read:

@iherman
Member Author

iherman commented Feb 4, 2022

Your proposed text is much better, for sure.

What bothers me, I guess, is that all the other statements and requirements in that section are on the structure of the MO file (element must be present, attributes are optional or not, etc.). I.e., for a given MO file the validity of the MO file can be checked easily, e.g., via an XML schema of some sort (and maybe checking the media type of a referenced file). This statement is an outlier, because no schema or media type check can really check whether the mo is valid.

Maybe we cannot help it, but it is still worth rewriting the sentence as you propose to make things clearer.

@mattgarrish
Member

This statement is an outlier, because no schema or media type check can really check whether the mo is valid.

It's not a simple schema check, but the outlier cases where it might have problems probably aren't too realistic for EPUB, either.

Or put another way, what else widely occurs in the body of an EPUB but audio, video, images and text? That immediately excludes a clear set of HTML elements (header stuff, script, style, etc.).

For the rest, you can likely omit the audio element and some playback behaviour will still occur. It may not be what was intended, but you can't solve intent by schema alone.

And if you want to get fancier and expect other media types might occur (e.g., so you don't blanket accept whatever is in an object or embed tag), the core media types plus video/* types can be used to further check the integrity of what is referenced.
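The media type check mentioned above could be sketched roughly as follows. The set below is an illustrative subset chosen for this example, not the authoritative core media type list, and the function name is hypothetical.

```python
# Illustrative subset only; the EPUB specification holds the authoritative list.
ASSUMED_CORE_MEDIA_TYPES = {
    "audio/mpeg",
    "audio/mp4",
    "image/gif",
    "image/jpeg",
    "image/png",
    "image/svg+xml",
}


def is_expected_media_type(media_type):
    """Accept the assumed core media types plus any video/* type."""
    # Drop parameters such as "; codecs=..." before comparing.
    essence = media_type.split(";", 1)[0].strip().lower()
    return essence in ASSUMED_CORE_MEDIA_TYPES or essence.startswith("video/")
```

A checker could apply this to the media type declared for whatever an object or embed element points to, instead of accepting it unconditionally.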

I personally think the implementation is more complicated than the verification, but I don't believe this feature is expected to be widely used, at least not the TTS part. I expect it's there primarily for DAISY producers, and I don't see this requirement tripping them up.

@marisademeglio
Contributor

I am in favor of making audio officially optional. See my previous comments.

@mattgarrish
Member

I am in favor of making audio officially optional.

I think that makes sense, but should we be more explicit in the text element definition about what it can reference so that it's not legal to reference non-content elements?

If text must reference palpable content, for example, then omitting audio really shouldn't matter.

@iherman
Member Author

iherman commented Feb 8, 2022

I think it would be cleaner for the audio element to say something like:

An OPTIONAL child of the par element. If missing, the text element child of the par element should refer to an audio or video media; otherwise the Reading System should render the textual content via, e.g., text-to-speech.

@mattgarrish I guess your proposal to add a reference to palpable content is to be added to the text element definition. I would agree with that although, again, it is difficult to check for a checker...

@iherman
Member Author

iherman commented Feb 8, 2022

B.t.w., making the audio element optional is in line with the definition of the par element, which refers to the element as optional.

@mattgarrish
Member

I would agree with that although, again, it is difficult to check for a checker...

What bothers me more now is that I was forgetting that SVG support is optional, so we still have to account for it. I don't believe it has any concept of palpable content that we could similarly cite.

(Minor nitpick, but the intro for the text element says: "The text element references an element in the EPUB Content Document." Since an MO doc can reference multiple epub content documents, this should really be "in an EPUB Content Document".)

@iherman
Member Author

iherman commented Feb 9, 2022

For SVG: aren't the only 'palpable' content elements <text>, <textPath> and <tspan>? So if we modify the definition of a text element we could say something like

Identifies the associated fragment of an EPUB Content Document. This fragment must be a palpable content element for XHTML, or one of <text>, <textPath>, or <tspan> for SVG?

That should work...
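For illustration, the SVG half of that restriction could be checked along these lines. This is only a sketch of the proposal as worded above; the function name is hypothetical, and the allowed set is limited to the three elements named in the proposal.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
SVG_TEXT_ELEMENTS = {"text", "textPath", "tspan"}  # the elements named in the proposal


def is_svg_text_target(svg_source, fragment):
    """Return True if the element with the given id is an SVG text, textPath, or tspan."""
    for element in ET.fromstring(svg_source).iter():
        if element.get("id") == fragment:
            # ElementTree tags look like "{namespace}name".
            namespace, _, name = element.tag[1:].partition("}")
            return namespace == SVG_NS and name in SVG_TEXT_ELEMENTS
    return False  # no element carries that id
```

Widening the rule later (for example to grouping elements) would only mean extending the set.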

@mattgarrish
Member

mattgarrish commented Feb 9, 2022

For text content, yes, but the text element can also reference images, audio and video.

That means, at least as I understand it, you can associate audio with just about any component of an svg image outside of script and style. I don't know if those are the only two elements that are non-content or if there are others -- I don't create SVGs so can't evaluate the purpose of all the elements just by looking at the list of names.

@iherman
Member Author

iherman commented Feb 9, 2022

That means, at least as I understand it, you can associate audio with just about any component of an svg image outside of script and style. I don't know if those are the only two elements that are non-content or if there are others -- I don't create SVGs so can't evaluate the purpose of all the elements just by looking at the list of names.

Hm. To be honest, I am not even sure what that would mean: I do not see how a feedback would work on a drawing the same way as the RS would go, say, sentence-by-sentence signalling what the audio reading is just referring to.

My suggestion is that we are getting into a new specification territory, which we should not do.

@mattgarrish
Member

I am not even sure what that would mean

Ya, I think that's the gist of this note from the authoring spec:

In this section, the EPUB Content Document is assumed to be an XHTML Content Document. While EPUB Creators may use Media Overlays with SVG Content Documents, playback behavior might not be consistent and therefore interoperability is not guaranteed.

We might want to look at better explaining this, like being more explicit that there isn't even guaranteed support for playback of SVGs at all.

I do not see how a feedback would work on a drawing the same way as the RS would go, say, sentence-by-sentence signalling what the audio reading is just referring to.

I'm assuming that you could highlight graphical objects. For example, the slices of a pie chart could be described by audio. But I don't recall ever seeing SVG with overlays, and don't know at what level it stops making sense to reference specific graphical elements.

@mattgarrish
Member

mattgarrish commented Feb 9, 2022

It may even make sense to reconsider why we're supporting SVG when we expect there to be interoperability problems due to a lack of detailed playback behaviour.

@marisademeglio
Contributor

I do have an example somewhere from many years ago that shows MO sync with SVG. As you can apply CSS to SVG, it works in the same way as MO + HTML, assuming the author defines styles that make sense for SVG.

But as there wasn't really any interest in digging into this direction, I think we left it in the spec as "possible but mostly uncharted territory"

@iherman
Member Author

iherman commented Feb 10, 2022

I think supporting SVG by explicitly referring to textual elements (as in #1986 (comment)) is a low-hanging fruit spec-wise (no idea if anybody ever implemented it, though). It may also make sense if SVG is used for FXL.

Going beyond that is not obvious. It is much more complicated in practice than text (it would be possible to use one SVG element that describes several slices in a pie chart with a <path> element with complex path data, something that is often done by authoring systems). I would propose that either

  • we keep SVG underspecified; or
  • we define it clearly for textual elements and leave the rest underspecified

In both cases we put a note into the spec making that fact clear and that the SVG related work is left for future work.

@iherman
Member Author

iherman commented Feb 10, 2022

Actually... playing with some SVG, and if we want to make things explicit, then the <g> elements should be added to the mix (for basic grouping).

@iherman
Member Author

iherman commented Feb 10, 2022

FWIW, I created an MO test with SVG (see w3c/epub-tests#118). I hope the test is o.k. from a spec point of view. Thorium does the reading but does not do the highlighting on the SVG file, so that is not entirely o.k. The web version of the Colibrio reader (which otherwise seems to do a good job for MO) seems not to implement spine-level SVG in the first place...

@wareid

@iherman
Member Author

iherman commented Feb 10, 2022

Actually... maybe we are overthinking the SVG issue. The highlight mechanism is CSS based. Whatever can be styled can be used for highlight. In theory, there is no need for even mentioning the concept of palpable elements or SVG's text elements; things are fine as they are.

Well... at least for the Content document. I would think that the A11y spec, or one of the guidelines, should really say that palpable elements should be used as a highlight focus in MO (or something like that). But the content spec seems to be fine to me.

@mattgarrish
Member

Whatever can be styled can be used for highlight.

Is that definable? It can't be whatever accepts a class attribute, since everything accepts a class attribute whether it can be styled or not.

@iherman
Member Author

iherman commented Feb 10, 2022

In HTML we cannot prevent authors from shooting themselves in the foot, i.e., using a class and its styling on something that cannot be styled. That is, actually, an HTML feature... In our case, the class is added to whatever element is pointed to in the SMIL file: the same thing applies. If that element cannot be styled then... well, too bad for the authors' feet. I do not think we should specify what can be referred to from the SMIL file.

If the SVG author (to come back to what triggered this discussion) decides to refer to some complex drawing when the SMIL file makes the RS read some specific text: this should be o.k. And it actually is, per spec.

Obviously, there are A11y issues that might have to be added to some implementation guide; you are much more of an expert than I am on that. But for the content document, I believe our best action is... to do nothing :-)

@marisademeglio
Contributor

Well, MO rendering just applies a class attribute; the "style" part is up to the author. As in my earlier comment, I think it's up to the author whether it makes sense or not. But the spec mechanism works in either case, HTML or SVG.
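To make the "just applies a class attribute" point concrete, here is a small sketch of what a Reading System might do when a par becomes active. The function name and the class value used in the usage note are hypothetical; in a real publication the class name is declared by the author in the package metadata.

```python
import xml.etree.ElementTree as ET


def apply_active_class(root, fragment, active_class):
    """Add active_class to the element whose id equals fragment.

    Works the same way for XHTML and SVG trees because it only touches the
    class attribute. Returns the element, or None if no element has that id.
    """
    for element in root.iter():
        if element.get("id") == fragment:
            classes = element.get("class", "").split()
            if active_class not in classes:
                classes.append(active_class)
            element.set("class", " ".join(classes))
            return element
    return None
```

Calling `apply_active_class(tree, "slice1", "mo-active")` on a parsed SVG document marks a `<path id="slice1">` exactly as it would mark a paragraph; whether the author's CSS makes that visible is a separate question.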

@iherman
Member Author

iherman commented Feb 16, 2022

I wonder how we should close this issue. My feeling is:

(1) we can modify the audio element's definition with something like:

An OPTIONAL child of the par element. If missing, the text element child of the par element SHOULD refer to an audio or video media; otherwise the Reading System SHOULD render the textual content via, e.g., text-to-speech.

This is what I proposed in #1986 (comment) (not sure about the SHOULD vs. should, though).

(2) The definition of the text element should say "in an EPUB Content Document" instead of "in the EPUB Content Document" (see #1986 (comment))

(3) I have the impression that getting into some further specification about palpable elements and its equivalents for SVG would become way more complicated than what it is worth; spec-wise, I think the current text is o.k. Would we want, however, to add a note after the text element definition along the lines of:

Note: this specification places no restriction on the value of the src attribute. Authors should, however, refer to content that can be styled with CSS to make the association with style information effective, i.e., palpable content for XHTML or paths, basic shapes, or text elements in SVG.

I am happy to write up a PR.

@marisademeglio
Contributor

An OPTIONAL child of the par element. If missing, the text element child of the par element SHOULD refer to an audio or video media; otherwise the Reading System SHOULD render the textual content via, e.g., text-to-speech.

Can we just say it's an OPTIONAL child of par and not get into embedded media, as there's a separate section for that?

Also, is it clear enough that this audio should be an audio representation of the corresponding text element? Should that be in the element definition?

I would also move the reading system TTS behavior recommendation to the RS spec, under rendering text.

Everything else you suggested sounds great to me.

@iherman
Member Author

iherman commented Feb 17, 2022

An OPTIONAL child of the par element. If missing, the text element child of the par element SHOULD refer to an audio or video media; otherwise the Reading System SHOULD render the textual content via, e.g., text-to-speech.

Can we just say it's an OPTIONAL child of par and not get into embedded media, as there's a separate section for that?

…and that is also true for the TTS. Specification-wise you are right but, editorially and for readability, I think it is better if these two information are made explicit (with a link to the embedded a/v section, b.t.w.). It may be difficult to find those points otherwise.

Also, is it clear enough that this audio should be an audio representation of the corresponding text element? Should that be in the element definition?

Well... that may have been the intention of MO, but I am not sure this restriction is actually part of the spec. Isn't it perfectly o.k. to have

<par>
    <text src="abc.html#link_to_a_video_element"/>
    <audio src="somemusic.mp3"/>
</par>

which provides a musical background to a video element? My reading is that this is perfectly conformant to the spec (and it should be).

I would also move the reading system TTS behavior recommendation to the RS spec, under rendering text.

I am not sure what you mean. The content document has a section on TTS (§ 7.3.2.5) and so does the RS document (§ 8.3.3). Both sections are very short, obviously, but they are also different. What is the change you propose?

@marisademeglio
Contributor

…and that is also true for the TTS. Specification-wise you are right but, editorially and for readability, I think it is better if these two information are made explicit (with a link to the embedded a/v section, b.t.w.). It may be difficult to find those points otherwise.

Why would we use the definition for the audio element to talk about the text element? It makes the audio element definition less clear.
I actually would have no issue with saying that the MO text element has to refer to TTS-able text or image with alt, and not embedded media at all. So I guess if those embedded media tests don't get enough implementations, meet me back at this issue thread :)

I am not sure what you mean. The content document has a section on TTS (§ 7.3.2.5) and so does the RS document (§ 8.3.3). Both sections are very short, obviously, but they are also different. What is the change you propose?

Ok, I propose two things:

  1. don't talk about TTS in the audio element definition (so: just do what I wrote above, make audio OPTIONAL with no other language about embedded media or TTS)
  2. in the TTS section of the core spec, edit paragraph 2 to read

"
In the case where a text element has no audio sibling, EPUB Creators MUST ensure the text fragment is appropriate for TTS rendering (e.g., contains a textual EPUB Content Document element or has a text fallback).
"

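One possible reading of "appropriate for TTS rendering" in the proposed wording above can be sketched as a simple heuristic. This is an assumption made for illustration, not a definition from the spec: the function name is hypothetical, and the `alt` attribute is treated as the only form of text fallback.

```python
import xml.etree.ElementTree as ET


def is_tts_appropriate(element):
    """Heuristic: the element has speakable text or a text fallback.

    Assumes the alt attribute is the text fallback; other fallback
    mechanisms are deliberately ignored in this sketch.
    """
    if "".join(element.itertext()).strip():
        return True  # the element or its descendants contain text
    return bool(element.get("alt", "").strip())
```

Note that a media element with fallback text inside it would also pass this check, so a real checker would need to combine it with a test on the element type.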
@iherman
Member Author

iherman commented Feb 18, 2022

I actually would have no issue with saying that the MO text element has to refer to TTS-able text or image with alt, and not embedded media at all.

You mean when there is no audio sibling, right? Or do you want to remove the whole of the embedded media section and disallow referencing audio/video from a text element? I am not sure if I like the idea...

So I guess if those embedded media tests don't get enough implementations, meet me back at this issue thread :)

Good point about implementation, of course. I already have a not-yet-fully-finished test on embedded video usage (a PR should come to your review soon) and the implementations are, so far, shall we say poor. But not to have them at all may be premature at this point.

@iherman
Member Author

iherman commented Feb 18, 2022

@marisademeglio I may misunderstand what you said; I went ahead and created the PR #2001. We can finalize the text more easily over there, I think.

@marisademeglio
Contributor

You mean when there is no audio sibling, right? Or do you want to remove the whole of the embedded media section and disallow referencing audio/video from a text element? I am not sure if I like the idea...

I am inclined to disallow it completely. It feels like the purpose of allowing this is to cover all our edge cases, not to provide useful content creation mechanisms or user experiences (I do not know of a single example, even just a test, that uses this mechanism to provide good accessibility). It is also hell on implementations.

But I digress -- I don't think we can yet take the liberty of making such a big change. Though as soon as we can, I'm ready!

Thanks for making the spec edits! I will review shortly.

@iherman
Member Author

iherman commented Feb 19, 2022

Created PR #2001. This issue will be closed if and when that PR is merged.

@mattgarrish mattgarrish added the EPUB33 Issues addressed in the EPUB 3.3 revision label Mar 11, 2022
@mattgarrish mattgarrish added the Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation label Sep 14, 2022