Pronunciation Gap Analysis and Use Cases

W3C First Public Working Draft

This version:
https://www.w3.org/TR/2020/WD-pronunciation-gap-analysis-and-use-cases-20200310/
Latest published version:
https://www.w3.org/TR/pronunciation-gap-analysis-and-use-cases/
Latest editor's draft:
https://w3c.github.io/pronunciation/gap-analysis_and_use-case
Editors:
(Educational Testing Service)
(Pearson)
(Educational Testing Service)
(Educational Testing Service)
(W3C)

Abstract

The objective of the Pronunciation Task Force is to develop normative specifications and best practices guidance, collaborating with other W3C groups as appropriate, to provide for proper pronunciation in HTML content when using text to speech (TTS) synthesis. This document presents the results of the Pronunciation Task Force work on an HTML standard. It includes an introduction with a historical perspective, an enumeration of the core requirements, a listing of approach use cases, and finally a gap analysis. Gaps are defined when a requirement does not have a corresponding use case approach by which it can be authored in HTML.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This is a First Public Working Draft of Pronunciation Explainer by the Accessible Platform Architectures Working Group. It was initially developed by the Pronunciation Task Force. The Pronunciation Task Force decided to merge the previous Pronunciation Gap Analysis and Pronunciation Use Cases into one document and name it Pronunciation Gap Analysis and Use Cases. The previous Pronunciation Gap Analysis and Pronunciation Use Cases will be retired.

To comment, file an issue in the W3C pronunciation GitHub repository. If this is not feasible, send email to public-pronunciation@w3.org (subscribe, archives). Comments are requested by 14 April 2020. In-progress updates to the document may be viewed in the publicly visible editors' draft.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 March 2019 W3C Process Document.

1. Introduction

This section is non-normative.

Accurate, consistent pronunciation and presentation of content spoken by text to speech synthesis (TTS) is an essential requirement in education, communication, entertainment, and other domains. Beyond helping to teach spelling and pronunciation in different languages, TTS has become a vital technology for providing access to digital content on the web, through mobile devices, and now via voice-based assistants. Organizations such as educational publishers and assessment vendors are looking for a standards-based solution to enable authoring of spoken presentation guidance in HTML which can then be consumed by assistive technologies and other applications that utilize TTS for rendering of content. Historically, efforts at standardization (e.g. SSML or CSS Speech) have not led to broad adoption of any standard by user agents, authors or assistive technologies; what has arisen instead is a variety of non-interoperable approaches that meet specific needs for some applications.

A success criterion for pronunciation has been included in WCAG since 2.0, 3.1.6 Pronunciation (AAA), which calls for a mechanism to be available for identifying specific pronunciation of words where meaning of the words, in context, is ambiguous without knowing the pronunciation. The techniques suggested make no reference to any of the speech related standards of W3C.

The W3C has developed two standards pertaining to the presentation of speech synthesis which have reached recommendation status, Speech Synthesis Markup Language (SSML) and the Pronunciation Lexicon Specification (PLS). Both standards are directly consumed by a speech synthesis engine supporting those standards. While a PLS file may be referenced in an HTML page using link rel, there is no known uptake of PLS using this method by assistive technologies. There are technical methods to allow authors to inline SSML within HTML (using namespaces), but such an approach has not been adopted, and anecdotal comments from browser and assistive technology vendors have suggested this is not a viable approach.

The CSS Speech Module is a retired W3C Working Group Note that describes a mechanism by which content authors may apply a variety of speech styling and presentation properties to HTML. This approach has a number of advantages but does not implement the full set of features required for pronunciation. Section 16 of the Note specifically references the issue of pronunciation:

CSS does not specify how to define the pronunciation (expressed using a well-defined phonetic alphabet) of a piece of text within the markup document. A "phonemes" property was described in earlier drafts of this specification, but objections were raised due to breaking the principle of separation between content and presentation (the "phonemes" authored within aural CSS stylesheets would have needed to be updated each time text changed within the markup document). The "phonemes" functionality is therefore considered out-of-scope in CSS (the presentation layer) and should be addressed in the markup / content layer.

While a portion of CSS Speech was demonstrated by Apple in 2011 on iOS with Safari and VoiceOver, it is not presently supported on any platform with any Assistive Technology, and work on the standard has itself been stopped by the CSS working group.

Efforts to address the need for pronunciation standards have also been considered by both assessment technology vendors and the publishing community. Citing the need for pronunciation and presentation controls, the IMS Global Learning Consortium added the ability to author SSML markup, specify PLS files, and reference CSS Speech properties to the Question Test Interoperability (QTI) Accessible Portable Item Protocol (APIP). In practice, QTI/APIP authored content is transformed into HTML for rendering in web browsers. This led to the dilemma that there is no standardized (and supported) method for inlining SSML in HTML, nor is there support for CSS Speech. This resulted in a situation where SSML is the primary authoring model, with assessment vendors implementing a custom method for adding the SSML (or SSML-like) features to HTML using non-standard or data attributes and customized Read Aloud software consuming those attributes for text to speech synthesis. Given the need to deliver accurate spoken presentation, non-standard approaches often include misuse of WAI-ARIA, and novel or contextually non-valid attributes (e.g., label). A particular problem arises when custom pronunciation is applied via misuse of the aria-label attribute: for screen reader users who also rely upon refreshable braille, a hinted pronunciation intended only for a text to speech synthesizer also appears on the braille display.

The attribute model for adding pronunciation and presentation guidance for assistive technologies and text to speech synthesis has demonstrated traction by vendors trying to solve this need. It should be noted that many of the required features are not well supported by a single attribute, as most follow the form of a presentation property / value pairing. Using multiple attributes to provide guidance to assistive technologies is not novel, as seen with WAI-ARIA where multiple attributes may be applied to a single element, for example, role and aria-checked. The EPUB standard for digital publishing introduced a namespaced version of the SSML phoneme and alphabet attributes enabling content authors to provide pronunciation guidance. Uptake by the publishing community has been limited, reportedly due to the lack of support in reading systems and assistive technologies.

In order to further the discussion about the essential need for consistent pronunciation and presentation of content spoken by text to speech synthesis (TTS), the Pronunciation Task Force has compiled the core features, use cases, and gap analysis presented in the following sections of this document.

2. Core Features for Pronunciation and Spoken Presentation

The common spoken pronunciation requirements from the education domain serve as a primary source for the core features and are applicable to many other domains that require accurate pronunciation and presentation. These requirements can be broken down into the following main functions that would support authoring and spoken presentation needs.

2.1 Language

When content is authored in mixed language, a mechanism is needed to allow authors to indicate both the base language of the content as well as the language of individual words and phrases. The expectation is that assistive technologies and other tools that utilize text to speech synthesis would detect and apply the language requested when presenting the text.

2.2 Voice Family / Gender

Content authors may elect to adjust the spoken presentation to provide a gender specific voice to reflect that of the author, or for a character (or characters) in a theatrical presentation of a story. Many assistive technologies already provide user selection of voice family and gender independent of any authored intent.

2.3 Phonetic Pronunciation of String Values

In some cases, words may need to have their phonetic pronunciation prescribed by the content author. This may occur when uncommon words (not supported by text to speech synthesizers) are present, or in cases where word pronunciation will vary based on context, and that context may not be correctly interpreted.

2.4 String Substitution

There are cases where content that is visually presented may require replacement (substitution) with an alternate textual form to ensure correct pronunciation by text to speech synthesizers. In these cases, phonetic pronunciation may be a solution to this need.

2.5 Rate / Pitch / Volume

While end users should have full control over spoken presentation parameters such as speaking rate, pitch, and volume (e.g., WCAG 1.4.2 ), content authors may elect to adjust those parameters to control the spoken presentation for purposes such as a theatrical presentation of a story. Many assistive technologies already provide user controlled speaking rate, pitch, and volume independent of any authored intent.

2.6 Emphasis

In written text, an author may find it necessary to add emphasis to an important word or phrase. HTML supports both semantic elements (e.g., em) and CSS properties which, through a variety of style options, make programmatic detection of authored emphasis difficult (e.g., font-weight: bold). While the emphasis element has existed since HTML 2.0, there is currently no uptake by assistive technologies or read aloud tools of speaking text that is semantically tagged for emphasis with spoken emphasis.

2.7 Say As

While text to speech engines continue to improve in their ability to process text and provide accurate spoken rendering of acronyms and numeric values, there can be instances where uncommon terms or alphanumeric constructs pose challenges.

2.7.1 Presentation of Numeric Values

Precise spoken presentation of numeric values may not always be correctly determined by text to speech engines from context. Examples include speaking a multidigit number as individual digits (100 spoken as "one, zero, zero" instead of "one hundred"), correct reading of year values, and the correct speaking of ordinal and cardinal numbers. Furthermore, some educators may have specific requirements for the spoken presentation of a numeric value which may differ from a TTS engine's default rendering. For example, the Smarter Balanced Assessment Consortium has developed Read Aloud Guidelines to be followed by human readers used by students who may require a spoken presentation of an educational test, which includes specific examples of how numeric values should be read aloud.

2.7.2 Presentation of String Values

Precise presentation of string values may not be determined correctly by text to speech synthesizers. Examples include speaking acronyms as individual letters rather than words, providing a substitute pronunciation for a word based on context (the word "read" may be pronounced as either "reed" or "red"), and providing a phonetic pronunciation of a word to ensure the correct spoken presentation.

2.8 Pausing

Content authors may elect or find it necessary to insert pauses in content. Pauses can be inserted for dramatic effect or may be necessary to render the spoken presentation understandable. For example, pauses inserted between numeric values may be needed to limit the chance of hearing multiple numbers as a single value. One common technique to achieve pausing to date has involved inserting non-visible commas before or after a text string requiring a pause. While this may work in practice for a read aloud TTS tool, it is problematic for screen reader users who may, based on verbosity settings, hear the multiple commas announced, and for refreshable braille users who will have the commas visible in braille. In addition, some tests, such as PARCC, specify spoken presentation requirements in their accessibility guidelines which include inserting pauses before and after emphasized words and mathematical terms.
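
As an illustration of the pausing requirement, the sketch below separates numeric values with SSML break elements instead of hidden commas. The function name joinWithPauses and the default pause length are assumptions made for this example; only the break element and its time attribute come from SSML.

```javascript
// Hypothetical helper: join values with SSML <break> elements so that a
// synthesizer does not run adjacent numbers together.
// Assumes the values are numbers (or strings with no XML-significant characters).
function joinWithPauses(values, pause = '500ms') {
  // Accept only simple time designations such as "500ms" or "3s".
  if (!/^\d+(\.\d+)?(ms|s)$/.test(pause)) {
    throw new RangeError(`Unsupported pause value: ${pause}`);
  }
  const body = values.map(String).join(`<break time="${pause}"/>`);
  return `<speak>${body}</speak>`;
}
```

Because the pauses live in the SSML string handed to the synthesizer, they would not be announced by a screen reader or shown on a braille display in the way hidden commas are.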

3. Use Cases

This section presents 6 use cases which describe specific implementation approaches for introducing pronunciation and spoken presentation authoring markup into HTML5 and are based on an inline or attribute model. These two differing approach models emerged from the work of the Accessible Platform Architectures Pronunciation Task Force and represent a decision point for integrating SSML (or SSML-like characteristics) into HTML. Each of the approaches differs in authoring and consumption models (specifically for assistive technologies). Other approaches may appear in subsequent working drafts.

Successful use cases provide ease of authoring and consumption by assistive technologies and user agents that utilize synthetic speech for spoken presentation of web content. The most challenging aspect of consumption may be alignment of the markup approach with the standard mechanisms by which assistive technologies, specifically screen readers, obtain content via platform accessibility APIs.

3.1 Use Case aria-ssml

3.1.1 Background and Current Practice

A new aria attribute could be used to include pronunciation content.

3.1.2 Goal

Embed SSML in an HTML document.

3.1.3 Target Audience

  • Assistive Technology
  • Browser Extensions
  • Search Engines

3.1.4 Implementation Options

aria-ssml as embedded JSON

When AT encounters an element with aria-ssml, the AT should enhance the UI by processing the pronunciation content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).

I say <span aria-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>.
You say <span aria-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>.

Client will convert JSON to SSML and pass the XML string to a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = convertJSONtoSSML(element.getAttribute('aria-ssml'));
speechSynthesis.speak(msg);
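
The snippet above calls convertJSONtoSSML without defining it. One possible shape for such a helper is sketched here, assuming the attribute value has a single top-level key naming the SSML element and a flat object of attributes. The second parameter (the element's text) and the escapeXML helper are assumptions added for this example; the call in the snippet above passes only the attribute value.

```javascript
// Hypothetical helpers, sketched for illustration only.

// Escape characters that are significant in XML text and attribute values.
function escapeXML(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Turn '{"phoneme":{"ph":"...","alphabet":"ipa"}}' plus the element text
// into a serialized SSML document.
function convertJSONtoSSML(json, text = '') {
  const data = JSON.parse(json);
  const [tag] = Object.keys(data); // assumed: exactly one SSML element per value
  const attributes = Object.entries(data[tag])
    .map(([name, value]) => ` ${name}="${escapeXML(value)}"`)
    .join('');
  return '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis">' +
    `<${tag}${attributes}>${escapeXML(text)}</${tag}>` +
    '</speak>';
}
```

A client might call it as convertJSONtoSSML(element.getAttribute('aria-ssml'), element.textContent). Whether a given speech engine accepts the resulting string as SSML depends on the platform.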

aria-ssml referencing XML by template ID

<!-- ssml must appear inside a template to be valid -->
<template id="pecan">
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</template>

<p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will parse XML and serialize it before passing to a speech API:

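var msg = new SpeechSynthesisUtterance();
// The first element child of the template content is the <speak> element.
var xml = document.getElementById('pecan').content.firstElementChild;
// serialize() stands in for a client-supplied function that returns the element as an XML string.
msg.text = serialize(xml);
speechSynthesis.speak(msg);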

aria-ssml referencing an XML string as script tag

<script id="pecan" type="application/ssml+xml">
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</script>

<p aria-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will pass the XML string raw to a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = document.getElementById('pecan').textContent;
speechSynthesis.speak(msg);

aria-ssml referencing an external XML document by URL

<p aria-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>

Client will pass the string payload to a speech API.

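// Assumes this runs inside an async function (or a module with top-level await)
// and that el is the element carrying the aria-ssml attribute.
var msg = new SpeechSynthesisUtterance();
var response = await fetch(el.getAttribute('aria-ssml'));
msg.text = await response.text();
speechSynthesis.speak(msg);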
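
The four options above place different kinds of values in the same attribute: inline JSON, a fragment reference to an element in the page, or an absolute URL. A client would need to tell these apart before choosing one of the code paths shown. The following is a minimal sketch of such a dispatch step; the function name classifySSMLReference and the returned kind labels are assumptions made for illustration.

```javascript
// Hypothetical helper: decide which implementation option an attribute value uses.
function classifySSMLReference(value) {
  const trimmed = value.trim();

  // Option 1: embedded JSON such as {"phoneme":{...}}.
  if (trimmed.startsWith('{')) {
    try {
      JSON.parse(trimmed);
      return { kind: 'json', value: trimmed };
    } catch (error) {
      // Not valid JSON; fall through to the other checks.
    }
  }

  // Options 2 and 3: "#pecan" refers to a template or script element by ID.
  if (trimmed.startsWith('#')) {
    return { kind: 'id', value: trimmed.slice(1) };
  }

  // Option 4: an absolute URL to an external SSML document.
  // Relative URLs are not handled in this sketch.
  try {
    return { kind: 'url', value: new URL(trimmed).href };
  } catch (error) {
    return { kind: 'unknown', value: trimmed };
  }
}
```

For example, classifySSMLReference('#pecan') yields a kind of 'id' with the value 'pecan', after which the client could look up the element and decide between the template and script handling based on its tag name.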

3.1.5 Existing Work

3.1.6 Problems and Limitations

  • aria-ssml is not a valid aria-* attribute.
  • Some OS/browser combinations do not support the serialized XML usage of the Web Speech API.

3.2 Use Case data-ssml

3.2.1 Background and Current Practice

As an existing attribute, data-* could be used, with some conventions, to include pronunciation content.

3.2.2 Goal

  • Support repeated use within the page context
  • Support external file references
  • Reuse existing techniques without expanding specifications

3.2.3 Target Audience

Hearing users

3.2.4 Implementation Options

data-ssml as embedded JSON

When an element with data-ssml is encountered by an SSML-aware AT, the AT should enhance the user interface by processing the referenced SSML content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).

<h2>The Pronunciation of Pecan</h2>
<p>
I say <span data-ssml='{"phoneme":{"ph":"pɪˈkɑːn","alphabet":"ipa"}}'>pecan</span>.
You say <span data-ssml='{"phoneme":{"ph":"ˈpi.kæn","alphabet":"ipa"}}'>pecan</span>.
</p>

Client will convert JSON to SSML and pass the XML string to a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = convertJSONtoSSML(element.dataset.ssml);
speechSynthesis.speak(msg);

data-ssml referencing XML by template ID

<!-- ssml must appear inside a template to be valid -->
<template id="pecan">
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</template>

<p data-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will parse XML and serialize it before passing to a speech API:

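var msg = new SpeechSynthesisUtterance();
// The first element child of the template content is the <speak> element.
var xml = document.getElementById('pecan').content.firstElementChild;
// serialize() stands in for a client-supplied function that returns the element as an XML string.
msg.text = serialize(xml);
speechSynthesis.speak(msg);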

data-ssml referencing an XML string as script tag

<script id="pecan" type="application/ssml+xml">
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
    You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
    I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak>
</script>

<p data-ssml="#pecan">You say, pecan. I say, pecan.</p>

Client will pass the XML string raw to a speech API.

var msg = new SpeechSynthesisUtterance();
msg.text = document.getElementById('pecan').textContent;
speechSynthesis.speak(msg);

data-ssml referencing an external XML document by URL

<p data-ssml="http://example.com/pronounce.ssml#pecan">You say, pecan. I say, pecan.</p>

Client will pass the string payload to a speech API.

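// Assumes this runs inside an async function (or a module with top-level await)
// and that el is the element carrying the data-ssml attribute.
var msg = new SpeechSynthesisUtterance();
var response = await fetch(el.dataset.ssml);
msg.text = await response.text();
speechSynthesis.speak(msg);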

3.2.5 Existing Work

3.2.6 Problems and Limitations

  • Does not assume or suggest visual pronunciation help for deaf or hard of hearing
  • Use of data-* requires input from AT vendors
  • XML data is not indexed by search engines

3.3 Use Case HTML5

3.3.1 Background and Current Practice

HTML5 includes the XML namespaces for MathML and SVG. Therefore, using either's elements in an HTML5 document is valid. Likewise, including an SSML namespace would allow the valid use of SSML in HTML5. Because SSML's implementation is non-visual in nature, browser implementation could be slow or non-existent without affecting how authors use SSML in HTML. Browsers would treat the element like any other unknown element, as HTMLUnknownElement.

3.3.2 Goal

  • Support valid use of SSML in HTML5 documents
  • Allow visual pronunciation support

3.3.3 Target Audience

  • SSML-aware technologies and browser extensions
  • Search indexers

3.3.4 Implementation Options

SSML

When SSML markup is encountered by an SSML-aware AT, the AT should enhance the user interface by processing the SSML content and passing it to the Web Speech API or an external API (e.g., Google's Text to Speech API).

<h2>The Pronunciation of Pecan</h2>
<p><speak>
  You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
  I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
</speak></p>

3.3.5 Existing Work

3.3.6 Problems and Limitations

SSML is not valid HTML5

3.4 Use Case Custom Element

3.4.1 Background and Current Practice

Embed valid SSML in HTML using custom elements registered as ssml-* where * is the actual SSML tag name (except for p which expects the same treatment as an HTML p in HTML layout).

3.4.2 Goal

Support use of SSML in HTML documents.

3.4.3 Target Audience

  • SSML-aware technologies and browser extensions
  • Search indexers

3.4.4 Implementation Options

ssml-speak: see demo

Only the <ssml-speak> component requires registration. The component code lifts the SSML by getting the innerHTML, removing the ssml- prefix from the interior tags, and passing the result to the Web Speech API. The <p> tag from SSML is not given the prefix because we still want to start a semantic paragraph within the content. The other tags used in the example have no semantic meaning. Tags like <em> in HTML could be converted to <emphasis> in SSML. In that case, CSS styles will come from the browser's default styles or the page author.

<ssml-speak>
  Here are <ssml-say-as interpret-as="characters">SSML</ssml-say-as> samples.
  I can pause<ssml-break time="3s"></ssml-break>.
  I can speak in cardinals.
  Your number is <ssml-say-as interpret-as="cardinal">10</ssml-say-as>.
  Or I can speak in ordinals.
  You are <ssml-say-as interpret-as="ordinal">10</ssml-say-as> in line.
  Or I can even speak in digits.
  The digits for ten are <ssml-say-as interpret-as="characters">10</ssml-say-as>.
  I can also substitute phrases, like the <ssml-sub alias="World Wide Web Consortium">W3C</ssml-sub>.
  Finally, I can speak a paragraph with two sentences.
  <p>
    <ssml-s>You say, <ssml-phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</ssml-phoneme>.</ssml-s>
    <ssml-s>I say, <ssml-phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</ssml-phoneme>.</ssml-s>
  </p>
</ssml-speak>

<template id="ssml-controls">
  <style>
    [role="switch"][aria-checked="true"] :first-child,
    [role="switch"][aria-checked="false"] :last-child {
      background: #000;
      color: #fff;
    }
  </style>
  <slot></slot>
  <p>
    <span id="play">Speak</span>
    <button role="switch" aria-checked="false" aria-labelledby="play">
      <span>on</span>
      <span>off</span>
    </button>
  </p>
</template>

class SSMLSpeak extends HTMLElement {
  constructor() {
    super();
    const template = document.getElementById('ssml-controls');
    const templateContent = template.content;
    this.attachShadow({mode: 'open'})
      .appendChild(templateContent.cloneNode(true));
  }
  connectedCallback() {
    const button = this.shadowRoot.querySelector('[role="switch"][aria-labelledby="play"]')
    const ssml = this.innerHTML.replace(/ssml-/gm, '')
    const msg = new SpeechSynthesisUtterance();
    msg.lang = document.documentElement.lang;
    msg.text = `<speak version="1.1"
      xmlns="http://www.w3.org/2001/10/synthesis"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
      xml:lang="${msg.lang}">
    ${ssml}
    </speak>`;
    msg.voice = speechSynthesis.getVoices().find(voice => voice.lang.startsWith(msg.lang));
    msg.onstart = () => button.setAttribute('aria-checked', 'true');
    msg.onend = () => button.setAttribute('aria-checked', 'false');
    button.addEventListener('click', () => speechSynthesis[speechSynthesis.speaking ? 'cancel' : 'speak'](msg))
  }
}

customElements.define('ssml-speak', SSMLSpeak);
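
The component above removes every occurrence of "ssml-" from the serialized markup, which would also alter ordinary text that happens to contain that string. A narrower variant, sketched here under the assumption that only element names carry the prefix, restricts the replacement to opening and closing tags. The function name stripSSMLPrefix is hypothetical.

```javascript
// Hypothetical helper: remove the ssml- prefix from tag names only,
// leaving text content and attribute values untouched.
function stripSSMLPrefix(html) {
  // Matches "<ssml-name" and "</ssml-name" and keeps the optional slash and the name.
  return html.replace(/<(\/?)ssml-([a-z][a-z0-9-]*)/gi, '<$1$2');
}
```

Inside connectedCallback, the line that builds the ssml constant could then read const ssml = stripSSMLPrefix(this.innerHTML). This does not address HTML elements with unintended SSML semantics, which would still need separate filtering.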

3.4.5 Existing Work

3.4.6 Problems and Limitations

  • Some OS/browser combinations do not support the serialized XML usage of the Web Speech API.
  • Browsers may need to map SSML tags with CSS styles for default user agent styles.
  • Without an extension or AT, only user interaction can start the Web Speech API.
  • Authors or parsing may need to remove HTML content with unintended SSML semantics before serialization.

3.5 Use Case JSON-LD

3.5.1 Background and Current Practice

JSON-LD provides an established standard for embedding data in HTML. Unlike other microdata approaches, JSON-LD helps to reuse standardized annotations through external references.

3.5.2 Goal

Support use of SSML in HTML documents.

3.5.3 Target Audience

  • SSML-aware technologies and browser extensions
  • Search indexers

3.5.4 Implementation Options

JSON-LD

<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@id": "/Pronunciation#WKRP",
  "@type": "TextPronunciation",
  "@language": "en",
  "text": "WKRP",
  "speechToTextMarkup": "SSML",
  "phoneticText": "<say-as interpret-as=\"characters\">WKRP</say-as>"
}
</script>
<p>
  Do you listen to <span itemscope
    itemtype="http://schema.org/TextPronunciation"
    itemid="/Pronunciation#WKRP">WKRP</span>?
</p>
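
To show how a client might connect the itemid on the span to the JSON-LD entry, the sketch below indexes parsed JSON-LD blocks by @id and wraps the phoneticText of a matching entry in a speak element. The function names indexPronunciations and ssmlFor are assumptions made for illustration, as is the choice to ignore entries whose speechToTextMarkup is not "SSML".

```javascript
// Hypothetical helpers, sketched for illustration only.

// Build a lookup table from the text of each application/ld+json script.
function indexPronunciations(jsonLdTexts) {
  const index = new Map();
  for (const text of jsonLdTexts) {
    const entry = JSON.parse(text);
    if (entry['@type'] === 'TextPronunciation' && entry['@id']) {
      index.set(entry['@id'], entry);
    }
  }
  return index;
}

// Return a serialized SSML document for an itemid, or null if none applies.
function ssmlFor(index, itemid) {
  const entry = index.get(itemid);
  if (!entry || entry.speechToTextMarkup !== 'SSML') {
    return null;
  }
  // Only emit xml:lang when the entry declares a language.
  const lang = entry['@language'] ? ` xml:lang="${entry['@language']}"` : '';
  return `<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"${lang}>` +
    `${entry.phoneticText}</speak>`;
}
```

In a browser, the input array could be collected from the textContent of each script element with type application/ld+json, and the itemid read from the span's itemid attribute.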

3.5.5 Existing Work

3.5.6 Problems and Limitations

The TextPronunciation "type" used in the JSON-LD is not part of an established, published schema.

3.6 Use Case Ruby

3.6.1 Background and Current Practice

Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations.

The ruby element guides pronunciation visually, which seems like a natural fit for text-to-speech.

3.6.2 Goal

  • Support use of SSML in HTML documents.
  • Offer visual pronunciation support.

3.6.3 Target Audience

  • AT and browser extensions
  • Search indexers

3.6.4 Implementation Options

ruby with microdata

Microdata can augment the ruby element and its descendants.

<p>
  You say,
  <span itemscope="" itemtype="http://example.org/Pronunciation">
    <ruby itemprop="phoneme" content="pecan">
      pecan
      <rt itemprop="ph">pɪˈkɑːn</rt>
      <meta itemprop="alphabet" content="ipa">
    </ruby>.
  </span>
  I say,
  <span itemscope="" itemtype="http://example.org/Pronunciation">
    <ruby itemprop="phoneme" content="pecan">
      pe
      <rt itemprop="ph">ˈpi</rt>
      can
      <rt itemprop="ph">kæn</rt>
      <meta itemprop="alphabet" content="ipa">
    </ruby>.
  </span>
</p>

3.6.5 Existing Work

3.6.6 Problems and Limitations

  • AT may process annotations as content
  • AT "double reading" words instead of choosing either the content or the annotation
  • Only offers support for a few SSML expressions
  • Difficult to reuse by reference

4. Gap Analysis

Based on the features and use cases described in the prior sections, the following table presents existing speech presentation standards, HTML features, and WAI-ARIA attributes that may offer a method to achieve the requirement for HTML authors. A blank cell for any approach represents a gap in support.

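| Requirement            | HTML | WAI-ARIA | PLS | CSS Speech | SSML |
|------------------------|------|----------|-----|------------|------|
| Language               | Yes  |          |     |            | Yes  |
| Voice Family/Gender    |      |          |     | Yes        | Yes  |
| Phonetic Pronunciation |      |          | Yes |            | Yes  |
| Substitution           |      | Partial  |     |            | Yes  |
| Rate/Pitch/Volume      |      |          |     | Yes        | Yes  |
| Emphasis               | Yes  |          |     | Yes        | Yes  |
| Say As                 |      |          |     |            | Yes  |
| Pausing                |      |          |     | Yes        | Yes  |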

The following sections describe how each of the required features may be met by the use of existing approaches. A key consideration in the analysis is whether a means exists to directly author (or annotate) HTML content to incorporate the spoken presentation and pronunciation feature.

4.1 Language

Allow content authors to specify the language of text contained within an element so that the TTS used for rendering will select the appropriate language for synthesis.

HTML

lang attribute can be applied at the document level or to individual elements. (WCAG) (AT Supported: some)

SSML

Example: <speak> In Paris, they pronounce it <lang xml:lang="fr-FR">Paris</lang> </speak>

4.2 Voice Family/Gender

Allow content authors to specify a specific TTS voice to be used to render text. For example, for content that presents a dialog between two people, a woman and a man, the author may specify that a female voice be used for the woman's text and a male voice be used for the man's text. Some platform TTS services may support a variety of voices, identified by a name, gender, or even age.

CSS

voice-family property can be used to specify the gender of the voice.

Example: { voice-family: male; }

SSML

Using the <voice> element, the gender of the speaker, if supported by the TTS engine, can be specified.

Example: <voice gender="female" >Mary had a little lamb,</voice>

4.3 Phonetic Pronunciation

Allow content authors to precisely specify the phonetic pronunciation of a word or phrase.

PLS

Using PLS, all the pronunciations can be factored out into an external PLS document which is referenced by the <lexicon> element of SSML.

Example: <speak> <lexicon uri="http://www.example.com/movie_lexicon.pls"/>
    The title of the movie is: "La vita è bella" (Life is beautiful),
    which is directed by Roberto Benigni.</speak>

SSML

The following is a simple example of an SSML document. It includes an Italian movie title and the name of the director to be read in US English.

Example: <speak>The title of the movie is:
    <phoneme alphabet="ipa" ph="ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə">"La vita è bella"</phoneme>
    (Life is beautiful), which is directed by
    <phoneme alphabet="ipa" ph="ɹəˈbɛːɹɾoʊ bɛˈniːnji">Roberto Benigni</phoneme>.</speak>

4.4 Substitution

Allow content authors to substitute a text string to be rendered by TTS instead of the actual text contained in an element.

WAI-ARIA

The aria-label and aria-labelledby attributes can be used by an author to supply a text string that becomes the accessible name of the element to which they are applied. This usage effectively provides a mechanism for text substitution that is supported by screen readers. However, it is problematic for one significant reason: for users who use both a screen reader and a refreshable Braille display, the content that is voiced will not match the content that is sent to the Braille display. This mismatch would not be acceptable for some content, particularly assessment content.

SSML

The <sub> element causes the contained word or phrase to be pronounced as a different word or phrase. The substitute pronunciation is given by the alias attribute.


<speak>
    My favorite chemical element is <sub alias="aluminum">Al</sub>,
    but Al prefers <sub alias="magnesium">Mg</sub>.
</speak>
 

4.5 Rate/Pitch/Volume

Allow content authors to specify characteristics, such as rate, pitch, and/or volume of the TTS rendering of the text.

CSS

voice-rate
The ‘voice-rate’ property manipulates the rate of generated synthetic speech in terms of words per minute.

voice-pitch
The ‘voice-pitch’ property specifies the "baseline" pitch of the generated speech output, which depends on the used ‘voice-family’ instance, and varies across speech synthesis processors (it approximately corresponds to the average pitch of the output). For example, the common pitch for a male voice is around 120Hz, whereas it is around 210Hz for a female voice.

voice-range
The ‘voice-range’ property specifies the variability in the "baseline" pitch, i.e. how much the fundamental frequency may deviate from the average pitch of the speech output. The dynamic pitch range of the generated speech generally increases for a highly animated voice, for example when variations in inflection are used to convey meaning and emphasis in speech. Typically, a low range produces a flat, monotonic voice, whereas a high range produces an animated voice.
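
The three properties can be combined on one rule. The fragment below is a sketch under stated assumptions: the class names and text are hypothetical, only keyword values defined by CSS Speech are used, and a user agent without CSS Speech support will ignore the rules.

```html
<style>
  /* Hypothetical classes: a quick, flat aside and a slow, animated headline. */
  .aside    { voice-rate: fast; voice-pitch: low;  voice-range: low; }
  .headline { voice-rate: slow; voice-pitch: high; voice-range: high; }
</style>

<p class="headline">Storm warning issued for the coast.</p>
<p class="aside">Details will follow in the evening bulletin.</p>
```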

SSML

The <prosody> element modifies the volume, pitch, and rate of the tagged speech.


<speak>
    Normal volume for the first sentence.
    <prosody volume="x-loud">Louder volume for the second sentence</prosody>.
    When I wake up, <prosody rate="x-slow">I speak quite slowly</prosody>.
    I can speak with my normal pitch,
    <prosody pitch="x-high"> but also with a much higher pitch </prosody>,
    and also <prosody pitch="low">with a lower pitch</prosody>.
</speak>

4.6 Emphasis

Allow content authors to specify that text content be spoken with emphasis, for example, louder and more slowly. This can be viewed as a simplification of the Rate/Pitch/Volume controls to reduce authoring complexity.

HTML

The HTML <em> element marks text that has stress emphasis. The <em> element can be nested, with each level of nesting indicating a greater degree of emphasis.

The <em> element is for words that have a stressed emphasis compared to surrounding text, which is often limited to a word or words of a sentence and affects the meaning of the sentence itself. Typically this element is displayed in italic type. However, it should not be used simply to apply italic styling; use the CSS font-style property for that purpose. Use the <cite> element to mark the title of a work (book, play, song, etc.). Use the <i> element to mark text that is in an alternate tone or mood, which covers many common situations for italics such as scientific names or words in other languages. Use the <strong> element to mark text that has greater importance than surrounding text.

CSS

voice-stress
The ‘voice-stress’ property manipulates the strength of emphasis, which is normally applied using a combination of pitch change, timing changes, loudness and other acoustic differences. The precise meaning of the values therefore depends on the language being spoken.
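
One way the HTML and CSS approaches could be combined is to map nesting levels of <em> to increasing ‘voice-stress’ values. The mapping below is an illustrative assumption, not something either specification mandates, and it has an effect only where CSS Speech is supported.

```html
<style>
  /* Assumed mapping: one level of <em> is moderate stress, nested <em> is strong. */
  em    { voice-stress: moderate; }
  em em { voice-stress: strong; }
</style>

<p>I already told you I <em>really <em>do</em> like</em> that person.</p>
```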

SSML

The <emphasis> element emphasizes the tagged words or phrases. Emphasis changes the rate and volume of the speech: more emphasis is spoken louder and slower, and less emphasis is spoken quieter and faster.


<speak>
    I already told you I
    <emphasis level="strong">really like</emphasis> that person.
</speak>

4.7 Say As

Allow content authors to specify how text is spoken. For example, content authors would be able to indicate that a series of four numbers should be spoken as a year rather than a cardinal number.

CSS

The ‘speak-as’ property determines in what manner text gets rendered aurally, based upon a predefined list of possibilities.

Note

Speech synthesizers are knowledgeable about what a number is. The ‘speak-as’ property enables some level of control on how user agents render numbers, and may be implemented as a preprocessing step before passing the text to the actual speech synthesizer.
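
A brief sketch of the property in use follows. The class names and content are hypothetical. Note that the values defined by CSS Speech (normal, spell-out, digits, literal-punctuation, no-punctuation) do not include a date or year interpretation, so the four-digit year case described above does not appear to be expressible this way.

```html
<style>
  /* Hypothetical classes for digit-by-digit and letter-by-letter rendering. */
  .digits { speak-as: digits; }
  .spell  { speak-as: spell-out; }
</style>

<p>Your confirmation code is <span class="digits">12345</span>.</p>
<p>The airport code is <span class="spell">SFO</span>.</p>
```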

SSML

The <say-as> element describes how the text should be interpreted. This lets the author provide additional context and eliminate ambiguity about how the TTS engine should render the text. The interpret-as attribute indicates how the text should be interpreted.


<speak>
    Here is a number spoken as a cardinal number: 
    <say-as interpret-as="cardinal">12345</say-as>.
    Here is the same number with each digit spoken separately:
    <say-as interpret-as="digits">12345</say-as>.
    Here is a word spelled out: <say-as interpret-as="spell-out">hello</say-as>
</speak>

4.8 Pausing

Allow content authors to specify pauses before or after content to ensure the desired prosody of the presentation, which can affect the pronunciation of the content that precedes or follows the pause.

CSS

The ‘pause-before’ and ‘pause-after’ properties specify a prosodic boundary (silence with a specific duration) that occurs before (or after) the speech synthesis rendition of the selected element, or if any ‘cue-before’ (or ‘cue-after’) is specified, before (or after) the cue within the aural box model.

Note

Although the functionality provided by this property is similar to the break element from the SSML markup language [SSML], the application of ‘pause’ prosodic boundaries within the aural box model of CSS Speech requires special considerations (e.g. "collapsed" pauses).
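
As a sketch, the rule below requests a strong prosodic boundary before each section heading and a fixed half-second pause after it. The selector, the chosen values, and the content are illustrative assumptions, and the rule is ignored by user agents that do not implement CSS Speech.

```html
<style>
  /* pause-before takes a strength keyword here; pause-after takes an explicit time. */
  h2 { pause-before: strong; pause-after: 500ms; }
</style>

<h2>Question 3</h2>
<p>Which of the following elements is a noble gas?</p>
```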

SSML

The <break> element represents a pause in the speech. The length of the pause is set with the strength or time attribute.


<speak>
    There is a three second pause here <break time="3s"/> 
    then the speech continues.
</speak> 

A. Acknowledgments

This section is non-normative.

The following people contributed to the development of this document.

A.1 Participants active in the Pronunciation TF at the time of publication