Copyright © 2011-2014 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
This is a Recommendation Track snapshot. For the latest updates, possibly including important bug fixes, please see the Draft Community Group Report.
This specification defines WebVTT, the Web Video Text Tracks format. Its main use is for marking up external text track resources in connection with the HTML <track> element.
WebVTT files provide captions or subtitles for video content, and also text video descriptions [MAUR], chapters for content navigation, and more generally any form of metadata that is time-aligned with audio or video content.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
Work on this specification is being undertaken in both the Web Media Text Tracks Community Group and the W3C Timed Text Working Group. The latter works towards a W3C Recommendation for reference purposes with interoperability requirements, while the former maintains a Draft Community Group Report that continues to evolve.
This document was published by the W3C Timed Text Working Group as a First Public Working Draft.
This document is intended to become a W3C Recommendation.
If you wish to make comments regarding this document, please send them to public-tt@w3.org (subscribe, archives) with [webvtt] at the start of your email's subject. All comments are welcome.
Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
This section is non-normative.
The WebVTT (Web Video Text Tracks) format is intended for marking up external text track resources.
The main use for WebVTT files is captioning or subtitling video content. Here is a sample file that captions an interview:
WEBVTT

00:11.000 --> 00:13.000
<v Roger Bingham>We are in New York City

00:13.000 --> 00:16.000
<v Roger Bingham>We're actually at the Lucern Hotel, just down the street

00:16.000 --> 00:18.000
<v Roger Bingham>from the American Museum of Natural History

00:18.000 --> 00:20.000
<v Roger Bingham>And with me is Neil deGrasse Tyson

00:20.000 --> 00:22.000
<v Roger Bingham>Astrophysicist, Director of the Hayden Planetarium

00:22.000 --> 00:24.000
<v Roger Bingham>at the AMNH.

00:24.000 --> 00:26.000
<v Roger Bingham>Thank you for walking down here.

00:27.000 --> 00:30.000
<v Roger Bingham>And I want to do a follow-up on the last conversation we did.

00:30.000 --> 00:31.500 align:end size:50%
<v Roger Bingham>When we e-mailed—

00:30.500 --> 00:32.500 align:start size:50%
<v Neil deGrasse Tyson>Didn't we talk about enough in that conversation?

00:32.000 --> 00:35.500 align:end size:50%
<v Roger Bingham>No! No no no no; 'cos 'cos obviously 'cos

00:32.500 --> 00:33.500 align:start size:50%
<v Neil deGrasse Tyson><i>Laughs</i>

00:35.500 --> 00:38.000
<v Roger Bingham>You know I'm so excited my glasses are falling off here.
This section is non-normative.
Line breaks in cues are honored. User agents will also insert extra line breaks if necessary to fit the cue in the cue's width. In general, therefore, authors are encouraged to write cues all on one line except when a line break is definitely necessary, and to not manually line-wrap for aesthetic reasons alone.
These captions on a public service announcement video demonstrate line breaking:
WEBVTT

00:01.000 --> 00:04.000
Never drink liquid nitrogen.

00:05.000 --> 00:09.000
— It will perforate your stomach.
— You could die.

00:10.000 --> 00:14.000
The Organisation for Sample Public Service Announcements accepts no liability for the content of this advertisement, or for the consequences of any actions taken on the basis of the information provided.
The first cue is simple; it will probably display on a single line. The second will take two lines, one for each speaker. The third will wrap to fit the width of the video, possibly taking multiple lines. For example, the three cues could look like this:
Never drink liquid nitrogen.

— It will perforate your stomach.
— You could die.

The Organisation for Sample Public Service Announcements
accepts no liability for the content of this advertisement,
or for the consequences of any actions taken on the basis
of the information provided.
If the width of the cues is smaller, the first two cues could wrap as well, as in the following example. Note how the second cue's explicit line break is still honored, however:
Never drink
liquid nitrogen.

— It will perforate
your stomach.
— You could die.

The Organisation for Sample
Public Service Announcements
accepts no liability for the
content of this advertisement,
or for the consequences of
any actions taken on the basis
of the information provided.
Also notice how the wrapping is done so as to keep the line lengths balanced.
This section is non-normative.
WebVTT also supports some less frequently used features.
In this example, the cues have an identifier:
WEBVTT

1
00:00.000 --> 00:02.000
That's an, an, that's an L!

transcript credit
00:04.000 --> 00:05.000
Transcribed by Celestials™
This allows a style sheet to specifically target the cues (notice the use of CSS character escape sequences):
::cue(#\31) { color: green; }
::cue(#transcript\ credit) { color: red; }
In this example, each cue says who is talking using voice spans. In the first cue, the span specifying the speaker is also annotated with two classes, "first" and "loud". In the third cue, there is also some italic text (not associated with a specific speaker). The last cue is annotated with just the class "loud".
WEBVTT

00:00.000 --> 00:02.000
<v.first.loud Esme>It's a blue apple tree!

00:02.000 --> 00:04.000
<v Mary>No way!

00:04.000 --> 00:06.000
<v Esme>Hee!</v> <i>laughter</i>

00:06.000 --> 00:08.000
<v.loud Mary>That's awesome!
Notice that as a special exception, the voice spans don't have to be closed if they cover the entire cue text.
Style sheets can style these spans:
::cue(v[voice="Esme"]) { color: blue }
::cue(v[voice="Mary"]) { color: green }
::cue(i) { font-style: italic }
::cue(.loud) { font-size: 2em }
This example shows how to position cues at explicit positions in the video viewport.
WEBVTT

00:00:00.000 --> 00:00:04.000 position:10%,start align:start size:35%
Where did he go?

00:00:03.000 --> 00:00:06.500 position:90% align:end size:35%
I think he went down this lane.

00:00:04.000 --> 00:00:06.500 position:45%,end align:middle size:35%
What are you waiting for?
The cues cover only 35% of the video viewport's width. The first cue has its cue box left aligned at the 10% mark of the video viewport width and the text is left aligned within that box - probably underneath a speaker on the left of the video image. "start" alignment of the cue box is the default for start aligned text, so does not need to be specified in "position". The second cue has its cue box right aligned at the 90% mark of the video viewport width. The same effect can be achieved with "position:55%,start", which explicitly positions the cue box. The third cue has middle aligned text within the same type of cue box as the first cue.
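The arithmetic behind this example can be checked with a short sketch. The function name `cue_box_left_edge` is hypothetical and the sketch only covers horizontal cues; it assumes that the position alignment names the part of the cue box (start edge, middle, or end edge) that is placed at the position percentage.

```python
def cue_box_left_edge(position: float, position_align: str, size: float) -> float:
    """Left edge of a horizontal cue box, as a percentage of the viewport width.

    position_align says which part of the box sits at `position`
    (illustrative helper, not an API defined by this specification).
    """
    if position_align == "start":
        return position
    if position_align == "middle":
        return position - size / 2
    if position_align == "end":
        return position - size
    raise ValueError(f"unknown position alignment: {position_align!r}")


print(cue_box_left_edge(10, "start", 35))  # first cue: 10
print(cue_box_left_edge(90, "end", 35))    # second cue: 55
print(cue_box_left_edge(45, "end", 35))    # third cue: 10
```

The second cue's box starts at 55%, which is why "position:55%,start" gives the same placement, and the third cue's box occupies the same horizontal span as the first.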
This example shows two regions containing rollup captions for two different speakers. Fred's cues scroll up in a region in the left half of the video, Bill's cues scroll up in a region on the right half of the video. Fred's first cue disappears at 12.5sec even though it is defined until 20sec because its region is limited to 3 lines and at 12.5sec a fourth cue appears:
WEBVTT
Region: id=fred width=40% lines=3 regionanchor=0%,100% viewportanchor=10%,90% scroll=up
Region: id=bill width=40% lines=3 regionanchor=100%,100% viewportanchor=90%,90% scroll=up

00:00:00.000 --> 00:00:20.000 region:fred align:left
<v Fred>Hi, my name is Fred

00:00:02.500 --> 00:00:22.500 region:bill align:right
<v Bill>Hi, I'm Bill

00:00:05.000 --> 00:00:25.000 region:fred align:left
<v Fred>Would you like to get a coffee?

00:00:07.500 --> 00:00:27.500 region:bill align:right
<v Bill>Sure! I've only had one today.

00:00:10.000 --> 00:00:30.000 region:fred align:left
<v Fred>This is my fourth!

00:00:12.500 --> 00:00:32.500 region:fred align:left
<v Fred>OK, let's go.
Note that regions are only defined for horizontal cues.
All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.
The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. The key word "OPTIONALLY" in the normative parts of this document is to be interpreted with the same normative meaning as "MAY" and "OPTIONAL". For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]
Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
This specification relies on several other underlying specifications.
The following term is defined in the Encoding standard: [ENCODING]
The following terms are defined in the DOM standard: [DOM]
Document interface
DocumentFragment interface
ProcessingInstruction interface
Text interface
data attribute
localName attribute
namespaceURI attribute
ownerDocument attribute
target attribute
The following terms are defined in the HTML standard: [HTML]
HTMLElement interface
class attribute
lang attribute
title attribute
audio element
video element
TextTrackCue interface
addCue() method
The following term is defined in the Web IDL specification: [WEBIDL]
IndexSizeError
WebVTT cues are HTML text track cues that additionally consist of the following: [HTML]
The cue box of a text track cue is a box within which the text of all lines of the cue is to be rendered.
The position of the cue box within the video frame's dimensions depends on the value of the text track cue text position and the text track cue line position.
A writing direction, either horizontal (a line extends horizontally and is positioned vertically, with consecutive lines displayed below each other), vertical growing left (a line extends vertically and is positioned horizontally, with consecutive lines displayed to the left of each other), or vertical growing right (a line extends vertically and is positioned horizontally, with consecutive lines displayed to the right of each other).
The writing direction is a property of the text inside the cue box which influences the interpretation of the positioning settings of the cue box.
If the writing direction is horizontal, then the line position percentages are relative to the height of the video, and text position and size percentages are relative to the width of the video.
Otherwise, line position percentages are relative to the width of the video, and text position and size percentages are relative to the height of the video.
The writing direction defaults to horizontal.
A boolean indicating whether the line's position is a line position (positioned to a multiple of the line dimensions of the first line of the cue), or whether it is a percentage of the dimension of the video.
Cues whose text track cue snap-to-lines flag is set will be placed within the title-safe area on user agents that use overscan. Cues with the flag unset will be positioned as requested (modulo overlap avoidance if multiple cues are in the same place).
By default, the snap-to-lines flag is set to 'true'.
The line position defines positioning of the cue box.
A line position is either a number giving the position of the lines of the cue, to be interpreted as defined by the writing direction and snap-to-lines flag of the cue, or the special value auto, which means the position is to depend on the other showing tracks.
A text track cue has a text track cue computed line position whose value is that returned by the following algorithm, which is defined in terms of the other aspects of the cue:
If the text track cue line position is numeric, the text track cue snap-to-lines flag of the text track cue is not set, and the text track cue line position is negative or greater than 100, then return 100 and abort these steps.
If the text track cue line position is numeric, return the value of the text track cue line position and abort these steps. (Either the text track cue snap-to-lines flag is set, so any value, not just those in the range 0..100, is valid, or the value is in the range 0..100 and is thus valid regardless of the value of that flag.)
If the text track cue snap-to-lines flag of the text track cue is not set, return the value 100 and abort these steps. (The text track cue line position is the special value auto.)
Let cue be the text track cue.
If cue is not in a list of cues of a text track, or if that text track is not in the list of text tracks of a media element, return −1 and abort these steps.
Let track be the text track whose list of cues the cue is in.
Let n be the number of text tracks whose text track mode is showing and that are in the media element's list of text tracks before track.
Increment n by one.
Negate n.
Return n.
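The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the string `"auto"` stands in for the special value auto, and the hypothetical parameter `showing_tracks_before` is the number of showing text tracks that precede the cue's track in the media element's list, or `None` when the cue is not in a track that belongs to a media element.

```python
from typing import Optional, Union

AUTO = "auto"  # stand-in for the special line position value


def computed_line_position(
    line: Union[float, str],
    snap_to_lines: bool,
    showing_tracks_before: Optional[int] = None,
) -> float:
    numeric = line != AUTO
    # Out-of-range percentages are clamped to 100 when not snapping to lines.
    if numeric and not snap_to_lines and (line < 0 or line > 100):
        return 100
    # Any other numeric value is returned unchanged.
    if numeric:
        return line
    # From here on the line position is auto.
    if not snap_to_lines:
        return 100
    # Cue is not attached to a track of a media element.
    if showing_tracks_before is None:
        return -1
    n = showing_tracks_before
    n += 1      # increment n by one
    return -n   # negate n


print(computed_line_position(150, snap_to_lines=False))    # 100
print(computed_line_position(AUTO, snap_to_lines=True,
                             showing_tracks_before=2))     # -3
```

With two showing tracks ahead of it, an auto-positioned cue lands on the third line from the end, which keeps cues from different tracks from stacking on the same line.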
An alignment for the cue box's line position, one of:
A text track cue has a default text track cue line alignment of start.
The text position defines positioning of the cue box in the direction defined by the writing direction.
The text position is either a number giving the position of the cue box as a percentage value or the special value auto, which means the position is to depend on the text alignment of the cue.
If the cue is not within a region, the percentage value is to be interpreted as a percentage of the video dimensions, otherwise as a percentage of the region dimensions.
A text track cue has a text track cue computed text position whose value is that returned by the following algorithm, which is defined in terms of the other aspects of the cue:
If the text track cue text position is numeric, then return the value of the text track cue text position and abort these steps. (Otherwise, the text track cue text position is the special value auto.)
If the text track cue text alignment is start or left, return 0 and abort these steps.
If the text track cue text alignment is end or right, return 100 and abort these steps.
If the text track cue text alignment is middle, return 50 and abort these steps.
Since the default value of the text track cue text alignment is middle, if there is no text track cue text alignment setting for a cue, the text track cue text position defaults to 50%.
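As a minimal sketch, the computed text position algorithm maps onto a small Python function. The function name and the use of the string `"auto"` for the special value are assumptions made for illustration.

```python
from typing import Union


def computed_text_position(position: Union[float, str], text_align: str = "middle") -> float:
    # A numeric text position is returned unchanged.
    if position != "auto":
        return position
    # Otherwise the position follows the text alignment.
    if text_align in ("start", "left"):
        return 0
    if text_align in ("end", "right"):
        return 100
    if text_align == "middle":
        return 50
    raise ValueError(f"unknown text alignment: {text_align!r}")


print(computed_text_position("auto"))         # 50, the default alignment is middle
print(computed_text_position("auto", "end"))  # 100
print(computed_text_position(10, "end"))      # 10
```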
Even for horizontal cues with right-to-left paragraph direction text, the cue box is positioned from the left edge of the video frame. This allows defining a rendering space template which can be filled with either left-to-right or right-to-left paragraph direction text. If such a cue box template is created with start or end aligned text, it is best to also specify a size since otherwise the text may flip from one side of the video frame to the other.
An alignment for the cue box in the dimension of the writing direction, describing which part of the cue box is aligned to the text position, one of:
A text track cue has a text track cue computed text position alignment whose value is that returned by the following algorithm, which is defined in terms of other aspects of the cue:
If the text track cue text position alignment is not auto, then return the value of the text track cue text position alignment and abort these steps.
If the text track cue text alignment is start or left, return start and abort these steps.
If the text track cue text alignment is end or right, return end and abort these steps.
If the text track cue text alignment is middle, return middle and abort these steps.
Since the text track cue text position always measures from the left of the video (for horizontal cues) or the top (otherwise), the text track cue text position alignment start value varies between left and top for horizontal and vertical cues, but not between left and right even for changing paragraph direction.
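The computed text position alignment can be sketched the same way. The function name is hypothetical, and `"auto"` again stands in for the special value; note that left maps to start and right maps to end regardless of paragraph direction, as described above.

```python
def computed_text_position_alignment(position_align: str, text_align: str = "middle") -> str:
    # An explicit position alignment wins.
    if position_align != "auto":
        return position_align
    # Otherwise derive it from the text alignment.
    if text_align in ("start", "left"):
        return "start"
    if text_align in ("end", "right"):
        return "end"
    if text_align == "middle":
        return "middle"
    raise ValueError(f"unknown text alignment: {text_align!r}")


print(computed_text_position_alignment("auto", "right"))  # end
print(computed_text_position_alignment("start", "end"))   # start
```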
A number giving the size of the cue box, to be interpreted as a percentage of the video, as defined by the writing direction.
By default, the text track cue size is 100%.
An alignment for all lines of text within the cue box, in the dimension of the writing direction and the paragraph direction [BIDI], one of:
By default, the value of the text track cue text alignment is middle aligned.
An optional text track region to which a cue belongs.
The associated rules for updating the text track rendering of WebVTT text track cues are the rules for updating the display of WebVTT text tracks.
When a WebVTT text track cue whose active flag is set has its writing direction, snap-to-lines flag, line position, text position, size, text alignment, region, or text change value, then the user agent must empty the text track cue display state, and then immediately run the text track's rules for updating the display of WebVTT text tracks.
A text track region represents a subpart of the video viewport and provides a rendering area for text track cues.
Each text track region consists of:
An arbitrary string.
A number giving the width of the box within which the text of each line of the containing cues is to be rendered, to be interpreted as a percentage of the video width. Defaults to 100.
A number giving the number of lines of the box within which the text of each line of the containing cues is to be rendered. Defaults to 3.
Two numbers giving the x and y coordinates within the region which is anchored to the video viewport and does not change location even when the region does, e.g. because of font size changes. Defaults to (0,100), i.e. the bottom left corner of the region.
Two numbers giving the x and y coordinates within the video viewport to which the region anchor point is anchored. Defaults to (0,100), i.e. the bottom left corner of the viewport.
One of the following:
For parsing, we also need the following:
A list of zero or more text track regions.
A WebVTT file must consist of a WebVTT file body encoded as UTF-8 and labeled with the MIME type text/vtt. [RFC3629]
A WebVTT file body consists of the following components, in the following order:
The string "WEBVTT".
A WebVTT line terminator consists of one of the following:
A WebVTT metadata header consists of the following components, in the given order:
A WebVTT metadata header name and a WebVTT metadata header value each consist of any sequence of zero or more characters other than U+000A LINE FEED (LF) characters and U+000D CARRIAGE RETURN (CR) characters, except that the entire resulting string must not contain the substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
A WebVTT cue consists of the following components, in the given order:
"-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
A WebVTT cue corresponds to one piece of time-aligned text or data in the WebVTT file, for example one subtitle. The cue payload is the text or data associated with the cue.
A WebVTT cue identifier is any sequence of one or more characters not containing the substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), nor containing any U+000A LINE FEED (LF) characters or U+000D CARRIAGE RETURN (CR) characters.
A WebVTT cue identifier must be unique amongst all the WebVTT cue identifiers of all WebVTT cues of a WebVTT file.
A WebVTT cue identifier can be used to reference a specific cue, for example from script or CSS.
The WebVTT cue timings part of a WebVTT cue consists of the following components, in the given order:
The string "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
The WebVTT cue timings give the start and end offsets of the WebVTT cue. Different cues can overlap. Cues are always listed ordered by their start time.
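To make the shape of a timings line concrete, here is a rough Python sketch that extracts the start and end offsets in seconds. The timestamp pattern (an optional hours field of two or more digits, two-digit minutes and seconds, and three fractional digits) is an assumption inferred from the sample files in the introduction, and `parse_timings` is a hypothetical helper name.

```python
import re

# Assumed timestamp shape: [hh:]mm:ss.ttt
_TS = r"(?:([0-9]{2,}):)?([0-5][0-9]):([0-5][0-9])\.([0-9]{3})"
_TIMINGS = re.compile(rf"^{_TS}[ \t]+-->[ \t]+{_TS}")


def _to_seconds(hours, minutes, seconds, frac) -> float:
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(frac) / 1000


def parse_timings(line: str) -> tuple:
    """Return (start, end) in seconds; cue settings after the timings are ignored."""
    match = _TIMINGS.match(line)
    if match is None:
        raise ValueError(f"not a cue timings line: {line!r}")
    groups = match.groups()
    return _to_seconds(*groups[:4]), _to_seconds(*groups[4:])


print(parse_timings("00:11.000 --> 00:13.000"))
print(parse_timings("00:00:03.000 --> 00:00:06.500 position:90% align:end size:35%"))
```

A regular expression keeps the sketch short; a conforming parser would follow the parsing algorithm of the specification instead.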
A WebVTT timestamp can be either a WebVTT timestamp representing hours, minutes, seconds and thousandths of a second or a WebVTT timestamp representing a time in seconds and fractions of a second.
A WebVTT timestamp is always interpreted relative to the current playback position of the media data that the WebVTT file is to be synchronized with, which always starts at 0.
A WebVTT timestamp representing hours (hours), minutes (minutes), seconds (seconds), and thousandths of a second (seconds-frac) consists of the following components, in the given order:
A WebVTT timestamp representing a time in seconds and fractions of a second is a WebVTT timestamp representing hours (hours), minutes (minutes), seconds (seconds), and thousandths of a second (seconds-frac), calculated as follows:
Let seconds be the integer part of the time.
Let seconds-frac be the fractional component of the time, expressed as the digits of the decimal fraction given to three decimal digits.
If seconds is greater than 59, then let minutes be the integer part of seconds divided by sixty, and then let seconds be the remainder of dividing seconds by sixty. Otherwise, let minutes be zero.
If minutes is greater than 59, then let hours be the integer part of minutes divided by sixty, and then let minutes be the remainder of dividing minutes by sixty. Otherwise, let hours be zero.
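The calculation above can be sketched in Python. Two details are assumptions of this sketch rather than statements of the specification: the time is rounded to the nearest millisecond before splitting, and the result is rendered as `hh:mm:ss.ttt` with the hours always included. The name `format_timestamp` is hypothetical.

```python
def format_timestamp(time_in_seconds: float) -> str:
    # Work in whole milliseconds to avoid floating point drift (assumption).
    total_ms = round(time_in_seconds * 1000)
    seconds, seconds_frac = divmod(total_ms, 1000)
    if seconds > 59:
        minutes, seconds = divmod(seconds, 60)
    else:
        minutes = 0
    if minutes > 59:
        hours, minutes = divmod(minutes, 60)
    else:
        hours = 0
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{seconds_frac:03d}"


print(format_timestamp(12.5))    # 00:00:12.500
print(format_timestamp(3725.5))  # 01:02:05.500
```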
A WebVTT cue settings list consists of a sequence of zero or more WebVTT cue settings in any order, separated from each other by one or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters. Each setting consists of the following components, in the order given:
A WebVTT cue setting name and a WebVTT cue setting value each consist of any sequence of zero or more characters other than U+000A LINE FEED (LF) characters and U+000D CARRIAGE RETURN (CR) characters, except that the entire resulting string must not contain the substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
A WebVTT percentage consists of the following components:
A WebVTT comment consists of the following components, in the given order:
The string "NOTE".
"-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
A WebVTT comment is ignored by the parser.
WebVTT metadata text consists of any sequence of zero or more characters other than U+000A LINE FEED (LF) characters and U+000D CARRIAGE RETURN (CR) characters, each optionally separated from the next by a WebVTT line terminator. (In other words, any text that does not have two consecutive WebVTT line terminators and does not start or end with a WebVTT line terminator.)
WebVTT metadata text cues are only useful for scripted applications (using the metadata text track kind).
WebVTT cue text is cue payload that consists of zero or more WebVTT cue components, in any order, each optionally separated from the next by a WebVTT line terminator.
The WebVTT cue components are:
WebVTT cue internal text consists of an optional WebVTT line terminator, followed by zero or more WebVTT cue components, in any order, each optionally followed by a WebVTT line terminator.
A WebVTT cue class span consists of a WebVTT cue span start tag "c" that disallows an annotation, WebVTT cue internal text representing cue text, and a WebVTT cue span end tag "c".
A WebVTT cue italics span consists of a WebVTT cue span start tag "i" that disallows an annotation, WebVTT cue internal text representing the italicized text, and a WebVTT cue span end tag "i".
A WebVTT cue bold span consists of a WebVTT cue span start tag "b" that disallows an annotation, WebVTT cue internal text representing the boldened text, and a WebVTT cue span end tag "b".
A WebVTT cue underline span consists of a WebVTT cue span start tag "u" that disallows an annotation, WebVTT cue internal text representing the underlined text, and a WebVTT cue span end tag "u".
A WebVTT cue ruby span consists of the following components, in the order given:
A WebVTT cue span start tag "ruby" that disallows an annotation.
A WebVTT cue span start tag "rt" that disallows an annotation.
A WebVTT cue span end tag "rt". If this is the last occurrence of this group of components in the WebVTT cue ruby span, then this last end tag string may be omitted.
A WebVTT cue span end tag "ruby".
A WebVTT cue voice span consists of the following components, in the order given:
A WebVTT cue span start tag "v" that requires an annotation; the annotation represents the name of the voice.
A WebVTT cue span end tag "v". If this WebVTT cue voice span is the only component of its WebVTT cue text sequence, then the end tag may be omitted for brevity.
A WebVTT cue language span consists of the following components, in the order given:
A WebVTT cue span start tag "lang" that requires an annotation; the annotation represents the language of the following component, and must be a valid BCP 47 language tag. [BCP47]
A WebVTT cue span end tag "lang".
A WebVTT cue span start tag has a tag name and either requires or disallows an annotation, and consists of the following components, in the order given:
If the start tag requires an annotation: a U+0020 SPACE character or a U+0009 CHARACTER TABULATION (tab) character, followed by one or more of the following components, the concatenation of their representations having a value that contains at least one character other than U+0020 SPACE and U+0009 CHARACTER TABULATION (tab) characters:
A WebVTT cue span end tag has a tag name and consists of the following components, in the order given:
A WebVTT cue timestamp consists of a U+003C LESS-THAN SIGN character (<), followed by a WebVTT timestamp representing the time that the given point in the cue becomes active, followed by a U+003E GREATER-THAN SIGN character (>). The time represented by the WebVTT timestamp must be greater than the times represented by any previous WebVTT cue timestamps in the cue, as well as greater than the cue's start time offset, and less than the cue's end time offset.
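The ordering constraint on cue timestamps can be checked with a few lines of Python. This is a sketch only: times are given in seconds, and `cue_timestamps_valid` is a hypothetical helper name.

```python
from typing import Iterable


def cue_timestamps_valid(start: float, end: float, inner: Iterable[float]) -> bool:
    """Check that each inner timestamp is later than the previous one,
    later than the cue's start time, and earlier than its end time."""
    previous = start
    for timestamp in inner:
        if not (previous < timestamp < end):
            return False
        previous = timestamp
    return True


print(cue_timestamps_valid(0.0, 4.0, [1.0, 2.5]))  # True
print(cue_timestamps_valid(0.0, 4.0, [2.0, 2.0]))  # False, not strictly increasing
```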
A WebVTT cue text span consists of one or more characters other than U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0026 AMPERSAND characters (&), and U+003C LESS-THAN SIGN characters (<).
WebVTT cue span start tag annotation text consists of one or more characters other than U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0026 AMPERSAND characters (&), and U+003E GREATER-THAN SIGN characters (>).
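Because a cue text span cannot contain a raw ampersand or less-than sign, such characters are written with the escapes defined in this section. The following Python sketch shows one way to escape and unescape cue text; the helper names are hypothetical, and the mapping assumes the six escapes `&amp;`, `&lt;`, `&gt;`, `&lrm;`, `&rlm;`, and `&nbsp;`.

```python
import re

# Escape string -> character it represents.
_ESCAPES = {
    "&amp;": "&",
    "&lt;": "<",
    "&gt;": ">",
    "&lrm;": "\u200e",   # LEFT-TO-RIGHT MARK
    "&rlm;": "\u200f",   # RIGHT-TO-LEFT MARK
    "&nbsp;": "\u00a0",  # NO-BREAK SPACE
}
_ESCAPE_RE = re.compile("|".join(re.escape(name) for name in _ESCAPES))


def unescape_cue_text(text: str) -> str:
    # A single pass, so "&amp;lt;" becomes "&lt;" and is not unescaped twice.
    return _ESCAPE_RE.sub(lambda match: _ESCAPES[match.group(0)], text)


def escape_cue_text(text: str) -> str:
    # Only "&" and "<" are forbidden in a cue text span; "&" must be replaced first.
    return text.replace("&", "&amp;").replace("<", "&lt;")


print(escape_cue_text("R&D <now>"))       # R&amp;D &lt;now>
print(unescape_cue_text("a &lt; b &amp;&amp; c"))  # a < b && c
```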
A WebVTT cue amp escape is the five character string "&amp;".
A WebVTT cue lt escape is the four character string "&lt;".
A WebVTT cue gt escape is the four character string "&gt;".
A WebVTT cue lrm escape is the five character string "&lrm;".
A WebVTT cue rlm escape is the five character string "&rlm;".
A WebVTT cue nbsp escape is the six character string "&nbsp;".
A WebVTT cue settings list may contain a reference to a text track region. To define a region, a WebVTT region metadata header is specified.
A WebVTT region metadata header is a special kind of WebVTT metadata header where both of the following apply:
The WebVTT metadata header name is "Region".
A WebVTT region represents its WebVTT region settings.
The WebVTT region setting list of a WebVTT region metadata header consists of zero or more of the following components, in any order, separated from each other by one or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters. Each component must not be included more than once per WebVTT region setting list string.
The WebVTT region setting list gives configuration options regarding the dimensions, positioning and anchoring of the region. For example, it allows a group of cues within a region to be anchored in the center of the region and the center of the video viewport. In this example, when the font size grows, the region grows uniformly in all directions from the center.
A WebVTT region identifier setting consists of the following components, in the order given:
The string "id".
A U+003D EQUALS SIGN character (=).
An arbitrary string of one or more characters other than U+0020 SPACE or U+0009 CHARACTER TABULATION character. The string must not contain the substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN). The string is called the WebVTT region identifier.
A WebVTT region identifier must be unique amongst all the WebVTT region identifiers of all WebVTT regions of a WebVTT file.
The WebVTT region identifier gives a name to the region so it can be referenced by the cues that belong to the region.
A WebVTT region width setting consists of the following components, in the order given:
The string "width".
A U+003D EQUALS SIGN character (=).
The WebVTT region width setting provides a fixed width as a percentage of the video width for the region into which cues are rendered and based on which alignment is calculated.
A WebVTT region lines setting consists of the following components, in the order given:
The string "lines".
A U+003D EQUALS SIGN character (=).
One or more ASCII digits.
The WebVTT region lines setting provides a fixed height as a number of lines for the region into which cues are rendered. As such, it defines the height of the roll-up region if it is a scroll region.
A WebVTT region anchor setting consists of the following components, in the order given:
The string "regionanchor".
A U+003D EQUALS SIGN character (=).
A U+002C COMMA character (,).
The WebVTT region anchor setting provides a tuple of two percentages that specify the point within the region box that is fixed in location. The first percentage measures the x-dimension and the second percentage y-dimension from the top left corner of the region box. If no WebVTT region anchor setting is given, the anchor defaults to 0%, 100% (i.e. the bottom left corner).
A WebVTT region viewport anchor setting consists of the following components, in the order given:
The string "viewportanchor".
A U+003D EQUALS SIGN character (=).
A U+002C COMMA character (,).
The WebVTT region viewport anchor setting provides a tuple of two percentages that specify the point within the video viewport that the region anchor point is anchored to. The first percentage measures the x-dimension and the second percentage measures the y-dimension from the top left corner of the video viewport box. If no viewport anchor is given, it defaults to 0%, 100% (i.e. the bottom left corner).
For browsers, the region maps to an absolute positioned CSS box relative to the video viewport, i.e. there is a relative positioned box that represents the video viewport relative to which the regions are absolutely positioned. Overflow is hidden.
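The relationship between the two anchors can be illustrated with a small calculation. This sketch assumes pixel units, a region height supplied by the caller (for instance the number of lines multiplied by a line height), and a hypothetical function name; it places the region so that its anchor point coincides with the viewport anchor point.

```python
def region_top_left(video_w, video_h, region_w_pct, region_h_px,
                    region_anchor, viewport_anchor):
    """Return the (left, top) pixel offset of the region box in the viewport.

    region_anchor and viewport_anchor are (x%, y%) tuples measured from the
    top left corner of the region box and of the viewport respectively.
    """
    region_w = video_w * region_w_pct / 100
    # Point in the viewport that the region anchor is pinned to.
    anchor_x = video_w * viewport_anchor[0] / 100
    anchor_y = video_h * viewport_anchor[1] / 100
    # Shift back by the anchor's offset inside the region box.
    left = anchor_x - region_w * region_anchor[0] / 100
    top = anchor_y - region_h_px * region_anchor[1] / 100
    return left, top


# The "fred" and "bill" regions from the introduction on a 1000x500 viewport,
# assuming three 30 pixel lines per region.
print(region_top_left(1000, 500, 40, 90, (0, 100), (10, 90)))    # (100.0, 360.0)
print(region_top_left(1000, 500, 40, 90, (100, 100), (90, 90)))  # (500.0, 360.0)
```

Because the anchor is at the bottom of each region, a larger font makes the region grow upwards while its bottom edge stays at 90% of the viewport height.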
A WebVTT region scroll setting consists of the following components, in the order given:
The string "scroll".
A U+003D EQUALS SIGN character (=).
The string "up".
The WebVTT region scroll setting specifies whether cues rendered into the region are allowed to move out of their initial rendering place and roll up, i.e. move towards the top of the video viewport. If the scroll setting is omitted, cues do not move from their rendered position.
Cues are added to a region one line at a time below existing cue lines. When an existing rendered cue line is removed, and it was above another already rendered cue line, that cue line moves into its space, thus scrolling in the given direction. If there is not enough space for a new cue line to be added to a region, the top-most cue line is pushed off the visible region (thus slowly becoming invisible as it moves into overflow:hidden). This eventually makes space for the new cue line and allows it to be added.
When there is no scroll direction, cue lines are added in the empty line closest to the line at the bottom of the region. If no empty line is available, the oldest line is replaced.
A WebVTT cue settings list consists of zero or more of the following settings. Each setting must not be included more than once per WebVTT cue settings list.
A WebVTT cue settings list gives configuration options regarding the position and alignment of the cue box and the cue text within. For example, it allows a cue box to be aligned to the left or positioned at the top right with the cue text within middle aligned.
A WebVTT vertical text cue setting is a WebVTT cue setting that consists of the following components, in the order given:
The string "vertical" as the WebVTT cue setting name.
A U+003A COLON character (:).
One of the following strings as the WebVTT cue setting value: "rl", "lr".
A WebVTT vertical text cue setting configures the cue to use vertical text layout rather than horizontal text layout. Vertical text layout is sometimes used in Japanese, for example. The default is horizontal layout.
A WebVTT line position cue setting consists of the following components, in the order given:
The string "line" as the WebVTT cue setting name.
A U+003A COLON character (:).
As the WebVTT cue setting value: a position, given either as a WebVTT percentage or as a line number, optionally followed by a U+002C COMMA character (,) and one of the following strings: "start", "middle", "end".
A WebVTT line position cue setting configures the position of the cue box in the direction opposite to the writing direction. For horizontal cues, this is the vertical position. The positioning is calculated relative to the start, middle, or end of the cue box, depending on the text track cue line alignment value, which is start by default. The position can be given either as a percentage of the video dimension or as a line number. Line numbers are based on the size of the first line of the cue. Positive line numbers count from the start of the video frame (the first line is numbered 0), negative line numbers from the end of the frame (the last line is numbered −1).
A WebVTT text position cue setting consists of the following components, in the order given:
The string "position" as the WebVTT cue setting name.
A U+003A COLON character (:).
As the WebVTT cue setting value: a WebVTT percentage, optionally followed by a U+002C COMMA character (,) and one of the following strings: "start", "middle", "end".
A WebVTT text position cue setting configures the position of the cue box in the direction orthogonal to the WebVTT line position cue setting. For horizontal cues, this is the horizontal position. The text position is given as a percentage of the video frame. The positioning is calculated relative to the start, middle, or end of the cue box, depending on the text track cue computed text position alignment value, which is overridden by the WebVTT text position cue setting alignment value.
A WebVTT size cue setting consists of the following components, in the order given:
The string "size" as the WebVTT cue setting name.
A U+003A COLON character (:).
As the WebVTT cue setting value: a WebVTT percentage.
A WebVTT size cue setting configures the size of the cue box in the same direction as the WebVTT text position cue setting. For horizontal cues, this is the width of the cue box. It is given as a percentage of the width of the frame.
A WebVTT alignment cue setting consists of the following components, in the order given:
The string "align" as the WebVTT cue setting name.
A U+003A COLON character (:).
One of the following strings as the WebVTT cue setting value: "start", "middle", "end", "left", "right".
A WebVTT alignment cue setting configures the alignment of the text within the cue. The keywords are relative to the text direction; for left-to-right English text, "start" means left-aligned.
A WebVTT region cue setting consists of the following components, in the order given:
The string "region" as the WebVTT cue setting name.
A U+003A COLON character (:).
As the WebVTT cue setting value: an arbitrary string of one or more characters other than U+0020 SPACE or U+0009 CHARACTER TABULATION character. The string must not contain the substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
A WebVTT region cue setting configures a cue to become part of a region by referencing the region's identifier unless the cue has a "vertical", "line" or "size" cue setting. If a cue is part of a region, its cue settings for "position" and "align" are applied to the line boxes in the cue relative to the region box.
A WebVTT file whose cues all follow the following rules is said to be a WebVTT file using only nested cues:
given any two cues cue1 and cue2 with start and end time offsets (x1, y1) and (x2, y2) respectively,
The following example matches this definition:
WEBVTT

00:00.000 --> 01:24.000
Introduction

00:00.000 --> 00:44.000
Topics

00:44.000 --> 01:19.000
Presenters

01:24.000 --> 05:00.000
Scrolling Effects

01:35.000 --> 03:00.000
Achim's Demo

03:00.000 --> 05:00.000
Timeline Panel
Notice how you can express the cues in this WebVTT file as a tree structure:
Introduction
    Topics
    Presenters
Scrolling Effects
    Achim's Demo
    Timeline Panel
If the file has cues that can't be expressed in this fashion, then they don't match the definition of a WebVTT file using only nested cues. For example:
WEBVTT

00:00.000 --> 01:00.000
The First Minute

00:30.000 --> 01:30.000
The Final Minute
In this ninety-second example, the two cues partly overlap, with the first ending before the second ends and the second starting before the first ends. This therefore is not a WebVTT file using only nested cues.
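The distinction between the two examples can be checked mechanically. The sketch below is an illustration only: it assumes that "nested" means any two cues are either disjoint in time or one lies fully within the other, which is consistent with both examples but is an interpretation rather than the normative rule. The function name `uses_only_nested_cues` is hypothetical, and cues are given as (start, end) pairs in seconds.

```python
from itertools import combinations


def uses_only_nested_cues(cues):
    """Return True if no two cues partly overlap.

    `cues` is a list of (start, end) time offsets in seconds.
    Assumption: touching cues (one ends when the other starts) count as disjoint.
    """
    for (x1, y1), (x2, y2) in combinations(cues, 2):
        disjoint = y1 <= x2 or y2 <= x1
        first_within_second = x1 >= x2 and y1 <= y2
        second_within_first = x2 >= x1 and y2 <= y1
        if not (disjoint or first_within_second or second_within_first):
            return False
    return True


# Timings of the chapter example, converted to seconds.
chapters = [(0, 84), (0, 44), (44, 79), (84, 300), (95, 180), (180, 300)]
# Timings of the ninety-second example.
overlapping = [(0, 60), (30, 90)]

print(uses_only_nested_cues(chapters))
print(uses_only_nested_cues(overlapping))
```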
The syntax definition of WebVTT files allows authoring of a wide variety of WebVTT files with a mix of cues. However, only a small subset of WebVTT file types are typically authored.
Conformance checkers, when validating WebVTT files, may offer to restrict syntax checking for validating these types.
A WebVTT file whose cues all have a cue payload that is WebVTT metadata text is said to be a WebVTT file using metadata content.
WebVTT chapter title text is WebVTT cue text that makes use only of zero or more of the following components, each optionally separated from the next by a WebVTT line terminator:
A WebVTT file using chapter title text is a WebVTT file using only nested cues whose cues all have a cue payload that is WebVTT chapter title text.
A WebVTT file whose cues all have a cue payload that is WebVTT cue text is said to be a WebVTT file using cue text.
A WebVTT parser, given an input byte stream and a text track list of cues output, must decode the byte stream using the UTF-8 decode algorithm, and then must parse the resulting string according to the WebVTT parser algorithm below. This results in text track cues being added to output. [RFC3629]
A WebVTT parser, specifically its conversion and parsing steps, is typically run asynchronously, with the input byte stream being updated incrementally as the resource is downloaded; this is called an incremental WebVTT parser.
A WebVTT parser verifies a file signature before parsing the provided byte stream. If the stream lacks this WebVTT file signature, then the parser aborts.
The WebVTT parser algorithm is as follows:
Let input be the string being parsed, after conversion to Unicode, and with the following transformations applied:
Replace all U+0000 NULL characters by U+FFFD REPLACEMENT CHARACTERs.
Replace each U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pair by a single U+000A LINE FEED (LF) character.
Replace all remaining U+000D CARRIAGE RETURN characters by U+000A LINE FEED (LF) characters.
Let position be a pointer into input, initially pointing at the start of the string. In an incremental WebVTT parser, when this algorithm (or further algorithms that it uses) moves the position pointer, the user agent must wait until appropriate further characters from the byte stream have been added to input before moving the pointer, so that the algorithm never reads past the end of the input string. Once the byte stream has ended, and all characters have been added to input, then the position pointer may, when so instructed by the algorithms, be moved past the end of input.
Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.
If line is less than six characters long, then abort these steps. The file does not start with the correct WebVTT file signature and was therefore not successfully processed.
If line is exactly six characters long but does not exactly equal "WEBVTT", then abort these steps. The file does not start with the correct WebVTT file signature and was therefore not successfully processed.
If line is more than six characters long but the first six characters do not exactly equal "WEBVTT", or the seventh character is neither a U+0020 SPACE character nor a U+0009 CHARACTER TABULATION (tab) character, then abort these steps. The file does not start with the correct WebVTT file signature and was therefore not successfully processed.
If position is past the end of input, then abort these steps. The file was successfully processed, but it contains no useful data and so no text track cues were added to output.
The character indicated by position is a U+000A LINE FEED (LF) character. Advance position to the next character in input.
Header: Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.
Let regions be a text track list of regions.
Metadata header creation: Let metadata be a new WebVTT metadata header.
Let metadata's name be the empty string.
Let metadata's value be the empty string.
If line contains the character ":" (A U+003A COLON), then set metadata's name to the substring of line before the first ":" character and metadata's value to the substring after this character.
If metadata's name equals "Region":
If position is past the end of input, then jump to the step labeled end.
The character indicated by position is a U+000A LINE FEED (LF) character. Advance position to the next character in input.
If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then set the already collected line flag and jump to the step labeled cue loop.
If line is not the empty string, then jump back to the step labeled header.
Cue loop: If the already collected line flag is set, then jump to the step labeled cue creation.
Collect a sequence of characters that are U+000A LINE FEED (LF) characters.
Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.
If line is the empty string, then jump to the step labeled end. (In such a case, position is also necessarily past the end of input.)
Cue creation: Let cue be a new text track cue and initialize it as follows:
Let cue's text track cue identifier be the empty string.
Let cue's text track cue pause-on-exit flag be false.
Let cue's text track cue region be null.
Let cue's text track cue writing direction be horizontal.
Let cue's text track cue snap-to-lines flag be true.
Let cue's text track cue line position be auto.
Let cue's text track cue line alignment be start alignment.
Let cue's text track cue text position be auto.
Let cue's text track cue text position alignment be auto.
Let cue's text track cue size be 100.
Let cue's text track cue text alignment be middle alignment.
Let cue's text track cue text be the empty string.
If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then jump to the step labeled timings below.
Let cue's text track cue identifier be line.
If position is past the end of input, then discard cue and jump to the step labeled end.
If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.
Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.
If line is the empty string, then discard cue and jump to the step labeled cue loop.
Timings: Unset the already collected line flag.
Collect WebVTT cue timings and settings from line using regions for cue. If that fails, jump to the step labeled bad cue.
Let cue text be the empty string.
Cue text loop: If position is past the end of input, then jump to the step labeled cue text processing.
If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.
Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.
If line is the empty string, then jump to the step labeled cue text processing.
If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then set the already collected line flag and jump to the step labeled cue text processing.
If cue text is not empty, append a U+000A LINE FEED (LF) character to cue text.
Let cue text be the concatenation of cue text and line.
Return to the step labeled cue text loop.
Cue text processing: Let the text track cue text of cue be cue text, and let the rules for rendering the cue in isolation be the rules for interpreting WebVTT cue text.
Add cue to the text track list of cues output.
Jump to the step labeled cue loop.
Bad cue: Discard cue.
Bad cue loop: If position is past the end of input, then jump to the step labeled end.
If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.
Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.
If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then set the already collected line flag and jump to the step labeled cue loop.
If line is the empty string, then jump to the step labeled cue loop.
Otherwise, jump to the step labeled bad cue loop.
End: The file has ended. Abort these steps. The WebVTT parser has finished. The file was successfully processed.
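The first steps of the parser algorithm above (input normalization and the signature check) can be sketched as follows. This is a partial illustration, not a full parser: it assumes the byte stream has already been decoded to a Python string, and the function names `preprocess` and `has_webvtt_signature` are hypothetical.

```python
def preprocess(text: str) -> str:
    """Apply the three input transformations, in the order given."""
    text = text.replace("\u0000", "\ufffd")  # NULL -> REPLACEMENT CHARACTER
    text = text.replace("\r\n", "\n")        # CRLF pair -> single LF
    return text.replace("\r", "\n")          # remaining CR -> LF


def has_webvtt_signature(text: str) -> bool:
    """Check the first line of already preprocessed input for the file signature."""
    line = text.split("\n", 1)[0]
    if len(line) < 6:
        return False
    if len(line) == 6:
        return line == "WEBVTT"
    # Longer first line: "WEBVTT" must be followed by a space or a tab.
    return line[:6] == "WEBVTT" and line[6] in (" ", "\t")


source = preprocess("WEBVTT - demo\r\n\r\n00:00.000 --> 00:01.000\rHi")
print(has_webvtt_signature(source))
```

Doing the CRLF replacement before the lone CR replacement matters: reversing the order would turn each CRLF pair into two line feeds.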
When the WebVTT parser requires that the user agent collect WebVTT region settings from a string input for a text track, the user agent must run the following algorithm.
A WebVTT region object is a conceptual construct to represent a WebVTT region that is used as a root node for lists of WebVTT node objects. This algorithm returns a list of WebVTT Region Objects.
Let settings be the result of splitting input on spaces.
If setting does not contain a U+003D EQUALS SIGN character (=), or if the first U+003D EQUALS SIGN character (=) in setting is either the first or last character of setting, then jump to the step labeled next setting.
Let name be the leading substring of setting up to and excluding the first U+003D EQUALS SIGN character (=) in that string.
Let value be the trailing substring of setting starting from the character immediately after the first U+003D EQUALS SIGN character (=) in that string.
Run the appropriate substeps that apply for the value of name, as follows:
If name is a case-sensitive match for "id"
Let region's identifier be value.
Otherwise if name is a case-sensitive match for "width"
If parse a percentage string from value returns a percentage, let region's text track region width be percentage.
Otherwise if name is a case-sensitive match for "lines"
If value contains any characters other than ASCII digits, then jump to the step labeled next setting.
Interpret value as an integer, and let number be that number.
Let region's text track region lines be number.
Otherwise if name is a case-sensitive match for "regionanchor"
If value does not contain a U+002C COMMA character (,), then jump to the step labeled next setting.
Let anchorX be the leading substring of value up to and excluding the first U+002C COMMA character (,) in that string.
Let anchorY be the trailing substring of value starting from the character immediately after the first U+002C COMMA character (,) in that string.
If parse a percentage string from anchorX or parse a percentage string from anchorY does not return a percentage, then jump to the step labeled next setting.
Let region's text track region anchor point be the tuple of the percentage values calculated from anchorX and anchorY.
Otherwise if name is a case-sensitive match for "viewportanchor"
If value does not contain a U+002C COMMA character (,), then jump to the step labeled next setting.
Let viewportanchorX be the leading substring of value up to and excluding the first U+002C COMMA character (,) in that string.
Let viewportanchorY be the trailing substring of value starting from the character immediately after the first U+002C COMMA character (,) in that string.
If parse a percentage string from viewportanchorX or parse a percentage string from viewportanchorY does not return a percentage, then jump to the step labeled next setting.
Let region's text track region viewport anchor point be the tuple of the percentage values calculated from viewportanchorX and viewportanchorY.
Otherwise if name is a case-sensitive match for "scroll"
If value is a case-sensitive match for the string "up", then let region's scroll value be "scroll up".
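The region settings steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: a region is modeled as a plain dictionary, "splitting on spaces" is approximated with Python's whitespace `split()`, the anchor defaults are the 0%, 100% values given earlier, and width and lines start as `None` because their defaults are not stated in this section. The names `collect_region_settings`, `parse_anchor` and `parse_percentage` are hypothetical.

```python
DIGITS = "0123456789"


def parse_percentage(s):
    """Return a float in 0..100, or None if the string is not a valid percentage."""
    if any(c not in DIGITS + ".%" for c in s):
        return None
    if not any(c in DIGITS for c in s):
        return None
    if s.count(".") > 1 or "%" in s[:-1] or not s.endswith("%"):
        return None
    number = float(s[:-1])
    return number if 0 <= number <= 100 else None


def parse_anchor(value):
    """Parse "x%,y%" into a tuple of two percentages, or return None."""
    if "," not in value:
        return None
    x_text, _, y_text = value.partition(",")
    x, y = parse_percentage(x_text), parse_percentage(y_text)
    if x is None or y is None:
        return None
    return (x, y)


def collect_region_settings(text):
    region = {"id": "", "width": None, "lines": None,
              "regionanchor": (0.0, 100.0), "viewportanchor": (0.0, 100.0),
              "scroll": None}
    for setting in text.split():
        name, sep, value = setting.partition("=")
        # Skip tokens without "=", or with "=" as the first or last character.
        if not sep or not name or not value:
            continue
        if name == "id":
            region["id"] = value
        elif name == "width":
            width = parse_percentage(value)
            if width is not None:
                region["width"] = width
        elif name == "lines":
            if all(c in DIGITS for c in value):
                region["lines"] = int(value)
        elif name in ("regionanchor", "viewportanchor"):
            anchor = parse_anchor(value)
            if anchor is not None:
                region[name] = anchor
        elif name == "scroll" and value == "up":
            region["scroll"] = "up"
    return region


region = collect_region_settings(
    "id=fred width=40% lines=3 regionanchor=0%,100% viewportanchor=10%,90% scroll=up")
print(region)
```

Invalid settings are skipped rather than treated as errors, so a region with one bad setting keeps its other values.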
The rules to parse a percentage string are as follows. This will return either a number in the range 0..100, or nothing. If at any point the algorithm says that it "fails", this means that it is aborted at that point and returns nothing.
Let input be the string being parsed.
If input contains any characters other than U+0025 PERCENT SIGN characters (%), U+002E DOT characters (.) and ASCII digits, then fail.
If input does not contain at least one ASCII digit, then fail.
If input contains more than one U+002E DOT character (.), then fail.
If any character in input other than the last character is a U+0025 PERCENT SIGN character (%), then fail.
If the last character in input is not a U+0025 PERCENT SIGN character (%), then fail.
Ignoring the trailing percent sign, interpret input as a real number. Let that number be the percentage.
If percentage is outside the range 0..100, then fail.
Return percentage.
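The rules above translate fairly directly into code. The following is a minimal sketch in which "fails" is represented by returning `None`; the function name `parse_percentage` is hypothetical.

```python
DIGITS = "0123456789"


def parse_percentage(s):
    """Return a number in the range 0..100, or None if parsing fails."""
    # Only percent signs, dots and ASCII digits are allowed.
    if any(c not in DIGITS + ".%" for c in s):
        return None
    # At least one ASCII digit is required.
    if not any(c in DIGITS for c in s):
        return None
    # At most one dot.
    if s.count(".") > 1:
        return None
    # A percent sign may appear only as the last character...
    if "%" in s[:-1]:
        return None
    # ...and the last character must be one.
    if not s.endswith("%"):
        return None
    percentage = float(s[:-1])
    if percentage < 0 or percentage > 100:
        return None
    return percentage


for text in ("50%", "33.3%", "101%", "50", "1.2.3%", "%"):
    print(repr(text), parse_percentage(text))
```

Explicit character checks are used instead of `str.isdigit()`, which would also accept non-ASCII digits.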
When the algorithm above requires that the user agent collect WebVTT cue timings and settings from a string input using a text track list of regions regions for a text track cue cue, the user agent must run the following algorithm.
Let input be the string being parsed.
Let position be a pointer into input, initially pointing at the start of the string.
Collect a WebVTT timestamp. If that algorithm fails, then abort these steps and return failure. Otherwise, let cue's text track cue start time be the collected time.
If the character at position is not a U+002D HYPHEN-MINUS character (-) then abort these steps and return failure. Otherwise, move position forwards one character.
If the character at position is not a U+002D HYPHEN-MINUS character (-) then abort these steps and return failure. Otherwise, move position forwards one character.
If the character at position is not a U+003E GREATER-THAN SIGN character (>) then abort these steps and return failure. Otherwise, move position forwards one character.
Collect a WebVTT timestamp. If that algorithm fails, then abort these steps and return failure. Otherwise, let cue's text track cue end time be the collected time.
Let remainder be the trailing substring of input starting at position.
Parse the WebVTT cue settings from remainder using regions for cue.
When the user agent is to parse the WebVTT cue settings from a string input using a text track list of regions regions for a text track cue cue, the user agent must run the following steps:
Let settings be the result of splitting input on spaces.
For each token setting in the list settings, run the following substeps:
If setting does not contain a U+003A COLON character (:), or if the first U+003A COLON character (:) in setting is either the first or last character of setting, then jump to the step labeled next setting.
Let name be the leading substring of setting up to and excluding the first U+003A COLON character (:) in that string.
Let value be the trailing substring of setting starting from the character immediately after the first U+003A COLON character (:) in that string.
Run the appropriate substeps that apply for the value of name, as follows:
If name is a case-sensitive match for "region"
Let cue's text track cue region be the last text track region in regions whose text track region identifier is value, if any, or null otherwise.
If name is a case-sensitive match for "vertical"
If value is a case-sensitive match for the string "rl", then let cue's text track cue writing direction be vertical growing left.
Otherwise, if value is a case-sensitive match for the string "lr", then let cue's text track cue writing direction be vertical growing right.
If name is a case-sensitive match for "line"
If value contains a U+002C COMMA character (,), then let linepos be the leading substring of value up to and excluding the first U+002C COMMA character (,) in that string and let linealign be the trailing substring of value starting from the character immediately after the first U+002C COMMA character (,) in that string.
Otherwise let linepos be the full value string and linealign be the empty string.
If linepos does not contain at least one ASCII digit, then jump to the step labeled next setting.
If the last character in linepos is a U+0025 PERCENT SIGN character (%)
If parse a percentage string from linepos doesn't fail, let number be the returned percentage, otherwise jump to the step labeled next setting.
Otherwise
If linepos contains any characters other than U+002D HYPHEN-MINUS characters (-) and ASCII digits, then jump to the step labeled next setting.
If any character in linepos other than the first character is a U+002D HYPHEN-MINUS character (-), then jump to the step labeled next setting.
Interpret linepos as a (potentially signed) integer, and let number be that number.
Let cue's text track cue line position be number.
If the last character in linepos is a U+0025 PERCENT SIGN character (%), then let cue's text track cue snap-to-lines flag be false. Otherwise, let it be true.
If linealign is a case-sensitive match for the string "start", then let cue's text track cue line alignment be start alignment.
If linealign is a case-sensitive match for the string "middle", then let cue's text track cue line alignment be middle alignment.
If linealign is a case-sensitive match for the string "end", then let cue's text track cue line alignment be end alignment.
If name is a case-sensitive match for "position"
If value contains a U+002C COMMA character (,), then let colpos be the leading substring of value up to and excluding the first U+002C COMMA character (,) in that string and let colalign be the trailing substring of value starting from the character immediately after the first U+002C COMMA character (,) in that string.
Otherwise let colpos be the full value string and colalign be the empty string.
If parse a percentage string from colpos doesn't fail, let number be the returned percentage, otherwise jump to the step labeled next setting (text track cue text position's value remains the special value auto).
Let cue's text track cue text position be number.
If colalign is a case-sensitive match for the string "start", then let cue's text track cue text position alignment be start alignment.
If colalign is a case-sensitive match for the string "middle", then let cue's text track cue text position alignment be middle alignment.
If colalign is a case-sensitive match for the string "end", then let cue's text track cue text position alignment be end alignment.
If name is a case-sensitive match for "size"
If parse a percentage string from value doesn't fail, let number be the returned percentage, otherwise jump to the step labeled next setting.
Let cue's text track cue size be number.
If name is a case-sensitive match for "align"
If value is a case-sensitive match for the string "start", then let cue's text track cue text alignment be start alignment.
If value is a case-sensitive match for the string "middle", then let cue's text track cue text alignment be middle alignment.
If value is a case-sensitive match for the string "end", then let cue's text track cue text alignment be end alignment.
If value is a case-sensitive match for the string "left", then let cue's text track cue text alignment be left alignment.
If value is a case-sensitive match for the string "right", then let cue's text track cue text alignment be right alignment.
Next setting: Continue to the next token, if any.
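The cue settings steps above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: a cue and a region are modeled as plain dictionaries, the initial values mirror the cue creation step of the parser algorithm, "splitting on spaces" is approximated with Python's whitespace `split()`, and the names `parse_cue_settings` and `parse_percentage` are hypothetical.

```python
DIGITS = "0123456789"
ALIGNMENTS = ("start", "middle", "end")


def parse_percentage(s):
    """Return a float in 0..100, or None if the string is not a valid percentage."""
    if any(c not in DIGITS + ".%" for c in s) or not any(c in DIGITS for c in s):
        return None
    if s.count(".") > 1 or "%" in s[:-1] or not s.endswith("%"):
        return None
    number = float(s[:-1])
    return number if 0 <= number <= 100 else None


def parse_cue_settings(text, regions=()):
    cue = {"region": None, "vertical": "horizontal", "snap_to_lines": True,
           "line": "auto", "line_align": "start", "position": "auto",
           "position_align": "auto", "size": 100.0, "align": "middle"}
    for setting in text.split():
        name, sep, value = setting.partition(":")
        # Skip tokens without ":", or with ":" as the first or last character.
        if not sep or not name or not value:
            continue
        if name == "region":
            matches = [r for r in regions if r.get("id") == value]
            cue["region"] = matches[-1] if matches else None
        elif name == "vertical":
            if value in ("rl", "lr"):
                cue["vertical"] = value
        elif name == "line":
            linepos, _, linealign = value.partition(",")
            if not any(c in DIGITS for c in linepos):
                continue
            if linepos.endswith("%"):
                number = parse_percentage(linepos)
                if number is None:
                    continue
            else:
                if any(c not in DIGITS + "-" for c in linepos) or "-" in linepos[1:]:
                    continue
                number = int(linepos)
            cue["line"] = number
            cue["snap_to_lines"] = not linepos.endswith("%")
            if linealign in ALIGNMENTS:
                cue["line_align"] = linealign
        elif name == "position":
            colpos, _, colalign = value.partition(",")
            number = parse_percentage(colpos)
            if number is None:
                continue
            cue["position"] = number
            if colalign in ALIGNMENTS:
                cue["position_align"] = colalign
        elif name == "size":
            number = parse_percentage(value)
            if number is not None:
                cue["size"] = number
        elif name == "align":
            if value in ALIGNMENTS + ("left", "right"):
                cue["align"] = value
    return cue


cue = parse_cue_settings("line:-1 position:10%,start size:35% align:left")
print(cue["line"], cue["position"], cue["size"], cue["align"])
```

As in the prose algorithm, an unrecognized or malformed setting leaves the corresponding value untouched and parsing continues with the next token.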
When this specification says that a user agent is to collect a WebVTT timestamp, the user agent must run the following steps:
Let input and position be the same variables as those of the same name in the algorithm that invoked these steps.
Let most significant units be minutes.
If position is past the end of input, return an error and abort these steps.
If the character indicated by position is not an ASCII digit, then return an error and abort these steps.
Collect a sequence of characters that are ASCII digits, and let string be the collected substring.
Interpret string as a base-ten integer. Let value1 be that integer.
If string is not exactly two characters in length, or if value1 is greater than 59, let most significant units be hours.
If position is beyond the end of input or if the character at position is not a U+003A COLON character (:), then return an error and abort these steps. Otherwise, move position forwards one character.
Collect a sequence of characters that are ASCII digits, and let string be the collected substring.
If string is not exactly two characters in length, return an error and abort these steps.
Interpret string as a base-ten integer. Let value2 be that integer.
If most significant units is hours, or if position is not beyond the end of input and the character at position is a U+003A COLON character (:), run these substeps:
If position is beyond the end of input or if the character at position is not a U+003A COLON character (:), then return an error and abort these steps. Otherwise, move position forwards one character.
Collect a sequence of characters that are ASCII digits, and let string be the collected substring.
If string is not exactly two characters in length, return an error and abort these steps.
Interpret string as a base-ten integer. Let value3 be that integer.
Otherwise (if most significant units is not hours, and either position is beyond the end of input, or the character at position is not a U+003A COLON character (:)), let value3 have the value of value2, then value2 have the value of value1, then let value1 equal zero.
If position is beyond the end of input or if the character at position is not a U+002E FULL STOP character (.), then return an error and abort these steps. Otherwise, move position forwards one character.
Collect a sequence of characters that are ASCII digits, and let string be the collected substring.
If string is not exactly three characters in length, return an error and abort these steps.
Interpret string as a base-ten integer. Let value4 be that integer.
If value2 is greater than 59 or if value3 is greater than 59, return an error and abort these steps.
Let result be value1×60×60 + value2×60 + value3 + value4∕1000.
Return result.
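The timestamp steps above can be sketched as a function that takes the input string and a start position, and returns the collected time together with the new position. This is a minimal illustration: returning `None` stands in for "return an error", and the names `collect_timestamp` and `_collect_digits` are hypothetical.

```python
DIGITS = "0123456789"


def _collect_digits(s, pos):
    """Collect a run of ASCII digits starting at pos; return (digits, new_pos)."""
    end = pos
    while end < len(s) and s[end] in DIGITS:
        end += 1
    return s[pos:end], end


def collect_timestamp(s, pos=0):
    """Return (seconds, new_pos), or None if no valid timestamp starts at pos."""
    units = "minutes"
    if pos >= len(s) or s[pos] not in DIGITS:
        return None
    string, pos = _collect_digits(s, pos)
    value1 = int(string)
    if len(string) != 2 or value1 > 59:
        units = "hours"
    if pos >= len(s) or s[pos] != ":":
        return None
    pos += 1
    string, pos = _collect_digits(s, pos)
    if len(string) != 2:
        return None
    value2 = int(string)
    if units == "hours" or (pos < len(s) and s[pos] == ":"):
        if pos >= len(s) or s[pos] != ":":
            return None
        pos += 1
        string, pos = _collect_digits(s, pos)
        if len(string) != 2:
            return None
        value3 = int(string)
    else:
        # Only minutes and seconds were given: shift them down and use zero hours.
        value1, value2, value3 = 0, value1, value2
    if pos >= len(s) or s[pos] != ".":
        return None
    pos += 1
    string, pos = _collect_digits(s, pos)
    if len(string) != 3:
        return None
    value4 = int(string)
    if value2 > 59 or value3 > 59:
        return None
    return value1 * 60 * 60 + value2 * 60 + value3 + value4 / 1000, pos


print(collect_timestamp("01:24.000 --> 05:00.000"))
print(collect_timestamp("1:02:03.500"))
```

Returning the new position lets a caller continue from just after the timestamp, for example to check for the "-->" separator.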
A WebVTT Node Object is a conceptual construct used to represent components of WebVTT cue text so that its processing can be described without reference to the underlying syntax.
There are two broad classes of WebVTT Node Objects: WebVTT Internal Node Objects and WebVTT Leaf Node Objects.
WebVTT Internal Node Objects are those that can contain further WebVTT Node Objects. They are conceptually similar to elements in HTML or the DOM. WebVTT Internal Node Objects have an ordered list of child WebVTT Node Objects. The WebVTT Internal Node Object is said to be the parent of the children. Cycles do not occur; the parent-child relationships so constructed form a tree structure. WebVTT Internal Node Objects also have an ordered list of class names, known as their applicable classes, and a language, known as their applicable language, which is to be interpreted as a BCP 47 language code. [BCP47]
There are several concrete classes of WebVTT Internal Node Objects:
Lists of WebVTT Node Objects: These are used as root nodes for trees of WebVTT Node Objects.
WebVTT Class Objects: These represent spans of text (a WebVTT cue class span) in WebVTT cue text, and are used to annotate parts of the cue with applicable classes without implying further meaning (such as italics or bold).
WebVTT Italic Objects: These represent spans of italic text (a WebVTT cue italics span) in WebVTT cue text.
WebVTT Bold Objects: These represent spans of bold text (a WebVTT cue bold span) in WebVTT cue text.
WebVTT Underline Objects: These represent spans of underlined text (a WebVTT cue underline span) in WebVTT cue text.
WebVTT Ruby Objects: These represent spans of ruby (a WebVTT cue ruby span) in WebVTT cue text.
WebVTT Ruby Text Objects: These represent spans of ruby text (a WebVTT cue ruby text span) in WebVTT cue text.
WebVTT Voice Objects: These represent spans of text associated with a specific voice (a WebVTT cue voice span) in WebVTT cue text. A WebVTT Voice Object has a value, which is the name of the voice.
WebVTT Language Objects: These represent spans of text (a WebVTT cue language span) in WebVTT cue text, and are used to annotate parts of the cue where the applicable language might be different from the surrounding text's, without implying further meaning (such as italics or bold).
WebVTT Leaf Node Objects are those that contain data, such as text, and cannot contain child WebVTT Node Objects.
There are two concrete classes of WebVTT Leaf Node Objects:
A fragment of text. A WebVTT Text Object has a value, which is the text it represents.
A timestamp. A WebVTT Timestamp Object has a value, in seconds and fractions of a second, which is the time represented by the timestamp.
To parse a string input supposedly containing WebVTT cue text, user agents must use the following algorithm. This algorithm returns a list of WebVTT Node Objects.
Let input be the string being parsed.
Let position be a pointer into input, initially pointing at the start of the string.
Let result be a list of WebVTT Node Objects, initially empty.
Let current be the WebVTT Internal Node Object result.
Let language stack be a stack of language codes, initially empty.
Loop: If position is past the end of input, return result and abort these steps.
Let token be the result of invoking the WebVTT cue text tokenizer.
Run the appropriate steps given the type of token:
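If token is a string
Create a WebVTT Text Object whose value is the value of the string token token.
Append the newly created WebVTT Text Object to current.
If token is a start tag
How the start tag token token is processed depends on its tag name, as follows:
If the tag name is "c"
Attach a WebVTT Class Object.
If the tag name is "i"
Attach a WebVTT Italic Object.
If the tag name is "b"
Attach a WebVTT Bold Object.
If the tag name is "u"
Attach a WebVTT Underline Object.
If the tag name is "ruby"
Attach a WebVTT Ruby Object.
If the tag name is "rt"
If current is a WebVTT Ruby Object, then attach a WebVTT Ruby Text Object.
If the tag name is "v"
Attach a WebVTT Voice Object, and set its value to the token's annotation string, or the empty string if there is no annotation string.
If the tag name is "lang"
Push the value of the token's annotation string, or the empty string if there is no annotation string, onto the language stack; then attach a WebVTT Language Object.
Otherwise
Ignore the token.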
When the steps above say to attach a WebVTT Internal Node Object of a particular concrete class, the user agent must run the following steps:
Create a new WebVTT Internal Node Object of the specified concrete class.
Set the new object's list of applicable classes to the list of classes in the token, excluding any classes that are the empty string.
Set the new object's applicable language to the top entry on the language stack, if the stack is not empty.
Append the newly created node object to current.
Let current be the newly created node object.
If token is an end tag
If any of the following conditions is true, then let current be the parent node of current.
The tag name of the end tag token token is "c" and current is a WebVTT Class Object.
The tag name of the end tag token token is "i" and current is a WebVTT Italic Object.
The tag name of the end tag token token is "b" and current is a WebVTT Bold Object.
The tag name of the end tag token token is "u" and current is a WebVTT Underline Object.
The tag name of the end tag token token is "ruby" and current is a WebVTT Ruby Object.
The tag name of the end tag token token is "rt" and current is a WebVTT Ruby Text Object.
The tag name of the end tag token token is "v" and current is a WebVTT Voice Object.
Otherwise, if the tag name of the end tag token token is "lang" and current is a WebVTT Language Object, then let current be the parent node of current, and pop the top value from the language stack.
Otherwise, if the tag name of the end tag token token is "ruby" and current is a WebVTT Ruby Text Object, then let current be the parent node of the parent node of current.
Otherwise, ignore the token.
If token is a timestamp tag
Let input be the tag value.
Let position be a pointer into input, initially pointing at the start of the string.
Collect a WebVTT timestamp.
If that algorithm does not fail, and if position now points at the end of input (i.e. there are no trailing characters after the timestamp), then create a WebVTT Timestamp Object whose value is the collected time, then append it to current.
Otherwise, ignore the token.
Jump to the step labeled loop.
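The timestamp tag handling above accepts a tag value only if the whole value is a timestamp, with no trailing characters. The following non-normative sketch illustrates that check. It is a simplified stand-in that assumes the shape hh:mm:ss.ttt with optional hours; it does not reproduce the timestamp collection algorithm itself, and `parse_timestamp_tag` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed shape: optional hours (two or more digits), then mm:ss.ttt.
_TIMESTAMP = re.compile(r"(?:([0-9]{2,}):)?([0-5][0-9]):([0-5][0-9])\.([0-9]{3})")


def parse_timestamp_tag(tag_value: str) -> Optional[float]:
    """Return the time in seconds, or None if the token should be ignored."""
    match = _TIMESTAMP.fullmatch(tag_value)  # fullmatch rejects trailing characters
    if match is None:
        return None
    hours = int(match.group(1) or 0)
    minutes, seconds, millis = (int(match.group(i)) for i in (2, 3, 4))
    return hours * 3600 + minutes * 60 + seconds + millis / 1000
```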
The WebVTT cue text tokenizer is as follows. It emits a token, which is either a string (whose value is a sequence of characters), a start tag (with a tag name, a list of classes, and optionally an annotation), an end tag (with a tag name), or a timestamp tag (with a tag value).
Let input and position be the same variables as those of the same name in the algorithm that invoked these steps.
Let tokenizer state be WebVTT data state.
Let result be the empty string.
Let buffer be the empty string.
Let classes be an empty list.
Loop: If position is past the end of input, let c be an end-of-file marker. Otherwise, let c be the character in input pointed to by position.
An end-of-file marker is not a Unicode character; it is used to end the tokenizer.
Jump to the state given by tokenizer state:
WebVTT data state: Jump to the entry that matches the value of c:
U+0026 AMPERSAND character (&): Set buffer to c, set tokenizer state to the WebVTT escape state, and jump to the step labeled next.
U+003C LESS-THAN SIGN character (<): If result is the empty string, then set tokenizer state to the WebVTT tag state and jump to the step labeled next. Otherwise, return a string token whose value is result and abort these steps.
End-of-file marker: Return a string token whose value is result and abort these steps.
Anything else: Append c to result and jump to the step labeled next.
WebVTT escape state: Jump to the entry that matches the value of c:
U+0026 AMPERSAND character (&): Append buffer to result, set buffer to c, and jump to the step labeled next.
ASCII alphanumeric character: Append c to buffer and jump to the step labeled next.
U+003B SEMICOLON character (;): First, examine the value of buffer:
If buffer is the string "&amp", then append a U+0026 AMPERSAND character (&) to result.
If buffer is the string "&lt", then append a U+003C LESS-THAN SIGN character (<) to result.
If buffer is the string "&gt", then append a U+003E GREATER-THAN SIGN character (>) to result.
If buffer is the string "&lrm", then append a U+200E LEFT-TO-RIGHT MARK character to result.
If buffer is the string "&rlm", then append a U+200F RIGHT-TO-LEFT MARK character to result.
If buffer is the string "&nbsp", then append a U+00A0 NO-BREAK SPACE character to result.
Otherwise, append buffer followed by a U+003B SEMICOLON character (;) to result.
Then, in any case, set tokenizer state to the WebVTT data state, and jump to the step labeled next.
U+003C LESS-THAN SIGN character (<) or end-of-file marker: Append buffer to result, return a string token whose value is result, and abort these steps.
Anything else: Append buffer to result, append c to result, set tokenizer state to the WebVTT data state, and jump to the step labeled next.
WebVTT tag state: Jump to the entry that matches the value of c:
U+0009 CHARACTER TABULATION (tab) character, U+000A LINE FEED (LF) character, U+000C FORM FEED (FF) character, or U+0020 SPACE character: Set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.
U+002E FULL STOP character (.): Set tokenizer state to the WebVTT start tag class state, and jump to the step labeled next.
U+002F SOLIDUS character (/): Set tokenizer state to the WebVTT end tag state, and jump to the step labeled next.
ASCII digit: Set result to c, set tokenizer state to the WebVTT timestamp tag state, and jump to the step labeled next.
U+003E GREATER-THAN SIGN character (>): Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.
End-of-file marker: Return a start tag whose tag name is the empty string, with no classes and no annotation, and abort these steps.
Anything else: Set result to c, set tokenizer state to the WebVTT start tag state, and jump to the step labeled next.
WebVTT start tag state: Jump to the entry that matches the value of c:
U+0009 CHARACTER TABULATION (tab) character, U+000C FORM FEED (FF) character, or U+0020 SPACE character: Set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.
U+000A LINE FEED (LF) character: Set buffer to c, set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.
U+002E FULL STOP character (.): Set tokenizer state to the WebVTT start tag class state, and jump to the step labeled next.
U+003E GREATER-THAN SIGN character (>): Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.
End-of-file marker: Return a start tag whose tag name is result, with no classes and no annotation, and abort these steps.
Anything else: Append c to result and jump to the step labeled next.
WebVTT start tag class state: Jump to the entry that matches the value of c:
U+0009 CHARACTER TABULATION (tab) character, U+000C FORM FEED (FF) character, or U+0020 SPACE character: Append to classes an entry whose value is buffer, set buffer to the empty string, set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.
U+000A LINE FEED (LF) character: Append to classes an entry whose value is buffer, set buffer to c, set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.
U+002E FULL STOP character (.): Append to classes an entry whose value is buffer, set buffer to the empty string, and jump to the step labeled next.
U+003E GREATER-THAN SIGN character (>): Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.
End-of-file marker: Append to classes an entry whose value is buffer, then return a start tag whose tag name is result, with the classes given in classes but no annotation, and abort these steps.
Anything else: Append c to buffer and jump to the step labeled next.
WebVTT start tag annotation state: Jump to the entry that matches the value of c:
U+003E GREATER-THAN SIGN character (>): Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.
End-of-file marker: Remove any leading or trailing space characters from buffer, and replace any sequence of one or more consecutive space characters in buffer with a single U+0020 SPACE character; then, return a start tag whose tag name is result, with the classes given in classes, and with buffer as the annotation, and abort these steps.
Anything else: Append c to buffer and jump to the step labeled next.
WebVTT end tag state: Jump to the entry that matches the value of c:
U+003E GREATER-THAN SIGN character (>): Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.
End-of-file marker: Return an end tag whose tag name is result and abort these steps.
Anything else: Append c to result and jump to the step labeled next.
WebVTT timestamp tag state: Jump to the entry that matches the value of c:
U+003E GREATER-THAN SIGN character (>): Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.
End-of-file marker: Return a timestamp tag whose tag value is result and abort these steps.
Anything else: Append c to result and jump to the step labeled next.
Next: Advance position to the next character in input.
Jump to the step labeled loop.
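The tokenizer steps above can be followed with a small, non-normative sketch. It assumes tokens are represented as plain tuples tagged "string", "start", "end", or "timestamp"; the names `next_token` and `tokenize`, and the short state names, are hypothetical and exist only for illustration.

```python
import re
import string

EOF = None  # end-of-file marker; deliberately not a character
SPACES = frozenset("\t\f ")  # tab, form feed, space (LF is handled separately)
DIGITS = frozenset(string.digits)
ALNUM = frozenset(string.ascii_letters + string.digits)
ESCAPES = {"&amp": "&", "&lt": "<", "&gt": ">",
           "&lrm": "\u200e", "&rlm": "\u200f", "&nbsp": "\u00a0"}


def next_token(text, pos):
    """Return (token, new_pos) for the token starting at pos."""
    state, result, buffer, classes = "data", "", "", []
    while True:
        c = text[pos] if pos < len(text) else EOF
        closing = c == ">" or c is EOF
        after = pos + 1 if c == ">" else pos  # ">" is consumed, EOF is not
        if state == "data":
            if c == "&":
                buffer, state = c, "escape"
            elif c == "<" and result == "":
                state = "tag"
            elif c == "<" or c is EOF:
                return ("string", result), pos
            else:
                result += c
        elif state == "escape":
            if c == "&":
                result, buffer = result + buffer, c
            elif c in ALNUM:
                buffer += c
            elif c == ";":
                result += ESCAPES.get(buffer, buffer + ";")
                state = "data"
            elif c == "<" or c is EOF:
                return ("string", result + buffer), pos
            else:
                result, state = result + buffer + c, "data"
        elif state == "tag":
            if c in SPACES or c == "\n":
                state = "annotation"
            elif c == ".":
                state = "class"
            elif c == "/":
                state = "end"
            elif c in DIGITS:
                result, state = c, "timestamp"
            elif closing:
                return ("start", "", [], None), after
            else:
                result, state = c, "start"
        elif state == "start":
            if c in SPACES:
                state = "annotation"
            elif c == "\n":
                buffer, state = c, "annotation"
            elif c == ".":
                state = "class"
            elif closing:
                return ("start", result, [], None), after
            else:
                result += c
        elif state == "class":
            if closing:
                return ("start", result, classes + [buffer], None), after
            if c in SPACES or c == "\n":
                classes.append(buffer)
                buffer, state = ("\n" if c == "\n" else ""), "annotation"
            elif c == ".":
                classes.append(buffer)
                buffer = ""
            else:
                buffer += c
        elif state == "annotation":
            if closing:
                annotation = re.sub(r"[ \t\n\f\r]+", " ", buffer).strip(" ")
                return ("start", result, classes, annotation), after
            buffer += c
        else:  # the "end" and "timestamp" states share the same shape
            if closing:
                return (state, result), after
            result += c
        pos += 1  # the step labeled "next"


def tokenize(text):
    """Collect every token of a cue text string (illustrative helper)."""
    tokens, pos = [], 0
    while pos < len(text):
        token, pos = next_token(text, pos)
        tokens.append(token)
    return tokens
```

For instance, under these assumptions `tokenize("<c.a.b>x")` yields a start tag named "c" with classes "a" and "b", followed by the string "x". An unrecognized escape such as "&foo;" is passed through unchanged.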
To convert a list of WebVTT Node Objects to a DOM tree for Document owner, user agents must create a tree of DOM nodes that is isomorphous to the tree of WebVTT Node Objects, with the following mapping of WebVTT Node Objects to DOM nodes:
HTMLElement nodes created as part of the mapping described above must have their namespaceURI set to the HTML namespace, and, if the corresponding WebVTT Internal Node Object has any applicable classes, must have a class attribute set to the string obtained by concatenating all those classes, each separated from the next by a single U+0020 SPACE character.
The ownerDocument attribute of all nodes in the DOM tree must be set to the given document owner.
All characteristics of the DOM nodes that are not described above or dependent on characteristics defined above must be left at their initial values.
The rules for interpreting WebVTT cue text (e.g. for use as chapter titles) are as follows:
Let nodes be the list of WebVTT Node Objects obtained by applying the WebVTT cue text parsing rules to the cue's text track cue text.
...
The rules for updating the display of WebVTT text tracks render the text tracks of a media element (specifically, a video element), or of another playback mechanism, by applying the steps below. All the text tracks that use these rules for a given media element, or other playback mechanism, are rendered together, to avoid overlapping subtitles from multiple tracks.
The output of the steps below is a set of CSS boxes that covers the rendering area of the media element or other playback mechanism, which user agents are expected to render in a manner suiting the user.
The rules are as follows:
If the media element is an audio element, or is another playback mechanism with no rendering area, abort these steps. There is nothing to render.
Let video be the media element or other playback mechanism.
Let output be an empty list of absolutely positioned CSS block boxes.
If the user agent is exposing a user interface for video, add to output one or more completely transparent positioned CSS block boxes that cover the same region as the user interface.
If the last time these rules were run, the user agent was not exposing a user interface for video, but now it is, optionally let reset be true. Otherwise, let reset be false.
Let tracks be the subset of video's list of text tracks that have as their rules for updating the text track rendering these rules for updating the display of WebVTT text tracks, and whose text track mode is showing.
Let cues be an empty list of text track cues.
For each track track in tracks, append to cues all the cues from track's list of cues that have their text track cue active flag set.
Let regions be an empty list of text track regions.
For each track track in tracks, append to regions all the regions from track's list of regions.
If reset is false, then, for each text track region region in regions let regionNode be a WebVTT region object.
Apply the following steps for each regionNode:
Prepare some variables for the application of CSS properties to regionNode as follows:
Let regionWidth be the text track region width. Let width be 'regionWidth vw' ('vw' is a CSS unit). [CSSVALUES]
Let lineHeight be '5.33vh' ('vh' is a CSS unit) [CSSVALUES] and regionHeight be the text track region lines. Let lines be lineHeight multiplied by regionHeight.
Let viewportAnchorX be the x dimension of the text track region viewport anchor and regionAnchorX be the x dimension of the text track region anchor. Let leftOffset be regionAnchorX multiplied by width divided by 100.0. Let left be leftOffset subtracted from 'viewportAnchorX vw'.
Let viewportAnchorY be the y dimension of the text track region viewport anchor and regionAnchorY be the y dimension of the text track region anchor. Let topOffset be regionAnchorY multiplied by lines divided by 100.0. Let top be topOffset subtracted from 'viewportAnchorY vh'.
Apply the terms of the CSS specifications to regionNode within the following constraints, thus obtaining a CSS box box positioned relative to an initial containing block:
No style sheets are associated with regionNode. (The regionNodes are subsequently restyled using style sheets after their boxes are generated, as described below.)
Properties on regionNode have their values set as defined in the next section. (That section uses some of the variables whose values were calculated earlier in this algorithm.)
The viewport (and initial containing block) is video's rendering area.
Add the CSS box box to output.
If reset is false, then, for each text track cue cue in cues: if cue's text track cue display state has a set of CSS boxes, then:
If cue's text track cue region is not null, add those boxes to that region's box and remove cue from cues.
Otherwise, add those boxes to output and remove cue from cues.
For each text track cue cue in cues that has not yet had corresponding CSS boxes added to output, in text track cue order, run the following substeps:
Let nodes be the list of WebVTT Node Objects obtained by applying the WebVTT cue text parsing rules to the cue's text track cue text.
If cue's text track cue region is null, run the following substeps:
Apply WebVTT cue settings to obtain CSS boxes boxes from nodes.
Let cue's text track cue display state have the CSS boxes in boxes.
Add the CSS boxes in boxes to output.
Otherwise, run the following substeps:
Let region be cue's text track cue region.
If region's text track region scroll setting is 'up' and region already has one child, set region's 'transition-property' to 'top' and 'transition-duration' to '0.433s'.
Apply the Unicode Bidirectional Algorithm's Paragraph Level steps to the concatenation of the values of each WebVTT Text Object in nodes, in a pre-order, depth-first traversal, excluding WebVTT Ruby Text Objects and their descendants, to determine the paragraph embedding level of the first Unicode paragraph of the cue. [BIDI]
Within a cue, paragraph boundaries are only denoted by Type B characters, such as U+000A LINE FEED (LF), U+0085 NEXT LINE (NEL), and U+2029 PARAGRAPH SEPARATOR. (This means each line of the cue is reordered as if it was a separate paragraph.)
If the paragraph embedding level determined in the previous step is even (the paragraph direction is left-to-right), let direction be 'ltr', otherwise, let it be 'rtl'.
Let offset be the text track cue computed text position multiplied by region's text track region width and divided by 100 (i.e. interpret it as a percentage of the region width).
Adjust offset using the text track cue computed text position alignment as follows:
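If the text track cue computed text position alignment is middle: Subtract half of region's text track region width from offset.
If the text track cue computed text position alignment is end: Subtract region's text track region width from offset.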
Let left be 'offset %'. ('%' is a CSS unit.) [CSSVALUES]
Apply the terms of the CSS specifications to nodes with the same constraints that are used when they are applied to nodes of a cue that is not part of a region.
Let boxes be the boxes generated as descendants of the initial containing block, along with their positions.
If there are no line boxes in boxes, skip the remainder of these substeps for cue. The cue is ignored.
Let cue's text track cue display state have the CSS boxes in boxes.
Add the CSS boxes in boxes to region.
If the CSS boxes boxes together have a height less than the height of the region box, let diff be the absolute difference between the two height values. Increase top by diff and re-apply it to regionNode.
Return output.
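The region variables prepared in the algorithm above (width, lines, left, and top) are simple arithmetic over the region's settings. A minimal, non-normative sketch follows; the function name `region_box`, its parameter names, and the dictionary keys are hypothetical, and anchors are assumed to be (x, y) pairs of percentages.

```python
LINE_HEIGHT_VH = 5.33  # height of one region line in 'vh', as given in the steps


def region_box(region_width, region_lines, region_anchor, viewport_anchor):
    """Compute the region box geometry in viewport units."""
    anchor_x, anchor_y = region_anchor
    viewport_x, viewport_y = viewport_anchor
    width = region_width                     # 'regionWidth vw'
    lines = LINE_HEIGHT_VH * region_lines    # total region height in 'vh'
    left = viewport_x - anchor_x * width / 100.0
    top = viewport_y - anchor_y * lines / 100.0
    return {"width_vw": width, "height_vh": lines, "left_vw": left, "top_vh": top}
```

With a region anchor of (0, 100) and a viewport anchor of (10, 90), the bottom-left corner of the region is pinned 10% from the left edge and 90% from the top edge of the rendering area.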
User agents may allow the user to override the above algorithm's positioning of cues, e.g. by dragging them to another location on the video, or even off the video entirely.
When the algorithm above requires that the user agent apply WebVTT cue settings to obtain CSS boxes from a list of WebVTT Node Objects nodes, the user agent must run the following algorithm.
Apply the Unicode Bidirectional Algorithm's Paragraph Level steps to the concatenation of the values of each WebVTT Text Object in nodes, in a pre-order, depth-first traversal, excluding WebVTT Ruby Text Objects and their descendants, to determine the paragraph embedding level of the first Unicode paragraph of the cue. [BIDI]
Within a cue, paragraph boundaries are only denoted by Type B characters, such as U+000A LINE FEED (LF), U+0085 NEXT LINE (NEL), and U+2029 PARAGRAPH SEPARATOR. (This means each line of the cue is reordered as if it was a separate paragraph.)
If the paragraph embedding level determined in the previous step is even (the paragraph direction is left-to-right), let direction be 'ltr', otherwise, let it be 'rtl'.
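The direction step above depends only on the first strong character of the first paragraph of the cue. The following non-normative sketch approximates it with Python's `unicodedata` module. It is a simplification that ignores directional isolates, and the function name `base_direction` is hypothetical.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return 'ltr' or 'rtl' for the first paragraph of text."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "B":   # paragraph separator: the first paragraph ends here
            break
        if bidi_class == "L":   # strong left-to-right: even embedding level
            return "ltr"
        if bidi_class in ("R", "AL"):  # strong right-to-left: odd embedding level
            return "rtl"
    return "ltr"  # no strong character found: the level defaults to 0
```

Digits and punctuation are not strong characters, so a cue that starts with a number followed by Arabic text is still treated as right-to-left.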
If the text track cue writing direction is horizontal, then let writing-mode be 'horizontal-tb'. Otherwise, if the text track cue writing direction is vertical growing left, then let writing-mode be 'vertical-rl'. Otherwise, the text track cue writing direction is vertical growing right; let writing-mode be 'vertical-lr'.
Determine the value of maximum size for cue as per the appropriate rules from the following list:
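If the text track cue computed text position alignment is start: Let maximum size be the text track cue computed text position subtracted from 100.
If the text track cue computed text position alignment is end: Let maximum size be the text track cue computed text position.
If the text track cue computed text position alignment is middle, and the text track cue computed text position is less than or equal to 50: Let maximum size be the text track cue computed text position multiplied by two.
If the text track cue computed text position alignment is middle, and the text track cue computed text position is greater than 50: Let maximum size be the result of subtracting text track cue computed text position from 100 and then multiplying the result by two.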
If the text track cue size is less than maximum size, then let size be text track cue size. Otherwise, let size be maximum size.
If the text track cue writing direction is horizontal, then let width be 'size vw' and height be 'auto'. Otherwise, let width be 'auto' and height be 'size vh'. (These are CSS values used by the next section to set CSS properties for the rendering; 'vw' and 'vh' are CSS units.) [CSSVALUES]
Determine the value of x-position or y-position for cue as per the appropriate rules from the following list:
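If the text track cue writing direction is horizontal and the text track cue computed text position alignment is start: Let x-position be the text track cue computed text position.
If the text track cue writing direction is horizontal and the text track cue computed text position alignment is middle: Let x-position be the text track cue computed text position minus half of size.
If the text track cue writing direction is horizontal and the text track cue computed text position alignment is end: Let x-position be the text track cue computed text position minus size.
If the text track cue writing direction is vertical growing left or vertical growing right and the text track cue computed text position alignment is start: Let y-position be the text track cue computed text position.
If the text track cue writing direction is vertical growing left or vertical growing right and the text track cue computed text position alignment is middle: Let y-position be the text track cue computed text position minus half of size.
If the text track cue writing direction is vertical growing left or vertical growing right and the text track cue computed text position alignment is end: Let y-position be the text track cue computed text position minus size.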
Determine the value of whichever of x-position or y-position is not yet calculated for cue as per the appropriate rules from the following list:
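If the text track cue writing direction is horizontal and the text track cue snap-to-lines flag is not set: Let y-position be the text track cue computed line position.
If the text track cue writing direction is vertical growing left or vertical growing right and the text track cue snap-to-lines flag is not set: Let x-position be the text track cue computed line position.
If the text track cue writing direction is horizontal and the text track cue snap-to-lines flag is set: Let y-position be 0.
If the text track cue writing direction is vertical growing left or vertical growing right and the text track cue snap-to-lines flag is set: Let x-position be 0.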
These are not final positions; they are merely temporary positions used to calculate box dimensions below.
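The size and temporary position calculations above can be sketched as follows. This is non-normative. The mapping of each rule to an alignment value ("start", "middle", "end") is an assumption of this example, and the function name `cue_box` and its parameters are hypothetical.

```python
def cue_box(position, alignment, cue_size, horizontal=True,
            line=0.0, snap_to_lines=True):
    """Return (size, x_position, y_position) as percentages."""
    # Maximum size: how far the box can extend from its anchor point.
    if alignment == "start":
        maximum = 100 - position
    elif alignment == "end":
        maximum = position
    elif alignment == "middle":
        maximum = position * 2 if position <= 50 else (100 - position) * 2
    else:
        raise ValueError(f"unknown alignment: {alignment!r}")
    size = min(cue_size, maximum)

    # Coordinate along the writing direction.
    if alignment == "start":
        along = position
    elif alignment == "middle":
        along = position - size / 2
    else:
        along = position - size

    # Coordinate across the writing direction; 0 when snapping to lines.
    across = 0 if snap_to_lines else line
    x, y = (along, across) if horizontal else (across, along)
    return size, x, y
```

For a middle-aligned cue at position 80 with a requested size of 60, the size is clamped to 40 so that the box stays centered on the 80% mark without leaving the rendering area.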
If the text track cue snap-to-lines flag is set, then run the appropriate steps from the following list:
If the text track cue writing direction is horizontal:
Let edge margin be a user-agent-defined horizontal length, expressed as a percentage of the width of the video's rendering area, which will be used to define a margin at the left and right edges of the video into which this cue will not be placed. In situations with overscan, this margin should be sufficient to place the cue within the title-safe area. In the absence of overscan, this value should be picked for aesthetics (to avoid text being aligned precisely on the left or right edge of the video, which can be ugly).
If x-position is less than edge margin and the sum of x-position and size is more than edge margin, then increase x-position by edge margin and decrease size by the same amount.
Let right margin edge be 100 minus edge margin.
If x-position is less than right margin edge, and the sum of x-position and size is more than right margin edge, then decrease size by edge margin.
If the text track cue writing direction is vertical growing left or vertical growing right:
Let edge margin be a user-agent-defined vertical length, expressed as a percentage of the height of the video's rendering area, which will be used to define a margin at the top and bottom edges of the video into which this cue will not be placed. In situations with overscan, this margin should be sufficient to place the cue within the title-safe area. In the absence of overscan, this value should be picked for aesthetics (to avoid text being aligned precisely on the top or bottom edge of the video, which can be ugly).
If y-position is less than edge margin and the sum of y-position and size is more than edge margin, then increase y-position by edge margin and decrease size by the same amount.
Let bottom margin edge be 100 minus edge margin.
If y-position is less than bottom margin edge, and the sum of y-position and size is more than bottom margin edge, then decrease size by edge margin.
Let left be 'x-position vw' and top be 'y-position vh'. (These are CSS values used by the next section to set CSS properties for the rendering; 'vw' and 'vh' are CSS units.) [CSSVALUES]
Apply the terms of the CSS specifications to nodes within the following constraints, thus obtaining a set of CSS boxes positioned relative to an initial containing block: [CSS]
The document tree is the tree of WebVTT Node Objects rooted at nodes.
For the purposes of processing by the CSS specification, WebVTT Internal Node Objects are equivalent to elements with the same contents.
For the purposes of processing by the CSS specification, WebVTT Text Objects are equivalent to Text nodes.
Let boxes be the boxes generated as descendants of the initial containing block, along with their positions.
If there are no line boxes in boxes, skip the remainder of these substeps for cue. The cue is ignored.
Adjust the positions of boxes according to the appropriate steps from the following list:
If the text track cue snap-to-lines flag is set:
Many of the steps in this algorithm vary according to the text track cue writing direction. Steps labeled "Horizontal" must be followed only when the text track cue writing direction is horizontal, steps labeled "Vertical" must be followed when the text track cue writing direction is either vertical growing left or vertical growing right, steps labeled "Vertical Growing Left" must be followed only when the text track cue writing direction is vertical growing left, and steps labeled "Vertical Growing Right" must be followed only when the text track cue writing direction is vertical growing right.
Horizontal: Let margin be a user-agent-defined vertical length which will be used to define a margin at the top and bottom edges of the video into which cues will not be placed. In situations with overscan, this margin should be sufficient to place all cues within the title-safe area. In the absence of overscan, this value should be picked for aesthetics (to avoid text being aligned precisely on the bottom edge of the video, which can be ugly).
Vertical: Let margin be a user-agent-defined horizontal length which will be used to define a margin at the left and right edges of the video into which cues will not be placed. In situations with overscan, this margin should be sufficient to place all cues within the title-safe area. In the absence of overscan, this value should be picked for aesthetics (to avoid text being aligned precisely on the left or right edges of the video, which can be ugly).
Horizontal: Let full dimension be the height of video's rendering area.
Vertical: Let full dimension be the width of video's rendering area.
These dimensions must not be adjusted for overscan. (The algorithm does that separately.)
Let max dimension be full dimension - (2 × margin).
Horizontal: Let step be the height of the first line box in boxes.
Vertical: Let step be the width of the first line box in boxes.
If step is zero, then jump to the step labeled done positioning below.
Let line position be the text track cue computed line position.
Round line position to an integer by adding 0.5 and then flooring it.
Vertical Growing Left: Add one to line position then negate it.
Let position be the result of multiplying step and line position.
Vertical Growing Left: Decrease position by the width of the bounding box of the boxes in boxes, then increase position by step.
If line position is less than zero then increase position by max dimension, and negate step.
Otherwise, increase position by margin.
Horizontal: Move all the boxes in boxes down by the distance given by position.
Vertical: Move all the boxes in boxes right by the distance given by position.
Remember the position of all the boxes in boxes as their specified position.
Let best position be null. It will hold a position for boxes, much like specified position in the previous step.
Let best position score be null.
Let switched be false.
Horizontal: Let title area be a box that covers all of the video's rendering area except for a height of margin at the top of the rendering area and a height of margin at the bottom of the rendering area.
Vertical: Let title area be a box that covers all of the video's rendering area except for a width of margin at the left of the rendering area and a width of margin at the right of the rendering area.
Step loop: If none of the boxes in boxes would overlap any of the boxes in output, and all of the boxes in output are entirely within the title area box, then jump to the step labeled done positioning below.
Let current position score be the percentage of the area of the bounding box of the boxes in boxes that is outside the title area box.
If best position is null (i.e. this is the first run through this loop, switched is still false, the boxes in boxes are at their specified position, and best position score is still null), or if current position score is a lower percentage than that in best position score, then remember the position of all the boxes in boxes as their best position, and set best position score to current position score.
Horizontal: If step is negative and the top of the first line box in boxes is now above the top of the title area, or if step is positive and the bottom of the first line box in boxes is now below the bottom of the title area, jump to the step labeled switch direction.
Vertical: If step is negative and the left edge of the first line box in boxes is now to the left of the left edge of the title area, or if step is positive and the right edge of the first line box in boxes is now to the right of the right edge of the title area, jump to the step labeled switch direction.
Horizontal: Move all the boxes in boxes down by the distance given by step. (If step is negative, then this will actually result in an upwards movement of the boxes in absolute terms.)
Vertical: Move all the boxes in boxes right by the distance given by step. (If step is negative, then this will actually result in a leftwards movement of the boxes in absolute terms.)
Jump back to the step labeled step loop.
Switch direction: If switched is true, then move all the boxes in boxes back to their best position, and jump to the step labeled done positioning below.
Otherwise, move all the boxes in boxes back to their specified position as determined in the earlier step.
Negate step.
Set switched to true.
Jump back to the step labeled step loop.
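If the text track cue snap-to-lines flag is not set:
Let bounding box be the bounding box of the boxes in boxes.
Run the appropriate steps from the following list:
If the text track cue writing direction is horizontal and the text track cue line alignment is middle alignment: Move all the boxes in boxes up by half of the height of bounding box.
If the text track cue writing direction is horizontal and the text track cue line alignment is end alignment: Move all the boxes in boxes up by the height of bounding box.
If the text track cue writing direction is vertical growing left or vertical growing right and the text track cue line alignment is middle alignment: Move all the boxes in boxes left by half of the width of bounding box.
If the text track cue writing direction is vertical growing left or vertical growing right and the text track cue line alignment is end alignment: Move all the boxes in boxes left by the width of bounding box.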
If none of the boxes in boxes would overlap any of the boxes in output, and all the boxes in output are within the video's rendering area, then jump to the step labeled done positioning below.
If there is a position to which the boxes in boxes can be moved while maintaining the relative positions of the boxes in boxes to each other such that none of the boxes in boxes would overlap any of the boxes in output, and all the boxes in output would be within the video's rendering area, then move the boxes in boxes to the closest such position to their current position, and then jump to the step labeled done positioning below. If there are multiple such positions that are equidistant from their current position, use the highest one amongst them; if there are several at that height, then use the leftmost one amongst them.
Otherwise, jump to the step labeled done positioning below. (The boxes will unfortunately overlap.)
Done positioning: If there are any line boxes in the (possibly now repositioned) boxes that do not completely fit inside video's rendering area, remove those offending line boxes from boxes.
Return boxes.
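The line-based starting offset used in the algorithm above (the steps from "Let max dimension be" through "Otherwise, increase position by margin") reduces to a few lines of arithmetic. The following non-normative sketch assumes all lengths are in the same unit; the function name `initial_offset` and its parameters are hypothetical.

```python
import math


def initial_offset(line, step, full_dimension, margin,
                   growing_left=False, bounding_width=0.0):
    """Return (position, step) before collision avoidance starts."""
    max_dimension = full_dimension - 2 * margin
    line_position = math.floor(line + 0.5)  # round to an integer
    if growing_left:
        line_position = -(line_position + 1)
    position = step * line_position
    if growing_left:
        position = position - bounding_width + step
    if line_position < 0:
        # Negative lines count from the far edge, and the search runs backwards.
        position += max_dimension
        step = -step
    else:
        position += margin
    return position, step
```

With a 400-unit rendering area, a margin of 10, and line boxes 20 units tall, line -1 starts at offset 360 and searches upwards, while line 2 starts at offset 50 and searches downwards.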
When following the rules for updating the display of WebVTT text tracks, user agents must set properties of WebVTT Node Objects at the CSS user agent cascade layer as defined in this section. [CSS]
Initialize the (root) list of WebVTT Node Objects with the following CSS settings:
The variables direction, writing-mode, top, left, width, and height are the values with those names determined by the rules for updating the display of WebVTT text tracks for the text track cue from whose text the list of WebVTT Node Objects was constructed.
The 'text-align' property on the (root) list of WebVTT Node Objects must be set to the value in the second cell of the row of the table below whose first cell is the value of the corresponding cue's text track cue text alignment:
Text track cue text alignment | 'text-align' value |
---|---|
Start alignment | 'start' |
Middle alignment | 'center' |
End alignment | 'end' |
Left alignment | 'left' |
Right alignment | 'right' |
The 'font' shorthand property on the (root) list of WebVTT Node Objects must be set to '5vh sans-serif'. [CSSRUBY] [CSSVALUES]
The 'color' property on the (root) list of WebVTT Node Objects must be set to 'rgba(255,255,255,1)'. [CSSCOLOR]
The 'background' shorthand property on the WebVTT cue background box must be set to 'rgba(0,0,0,0.8)'. [CSSCOLOR]
The 'white-space' property on the (root) list of WebVTT Node Objects must be set to 'pre-line'. [CSS]
The 'font-style' property on WebVTT Italic Objects must be set to 'italic'.
The 'font-weight' property on WebVTT Bold Objects must be set to 'bold'.
The 'text-decoration' property on WebVTT Underline Objects must be set to 'underline'.
The 'display' property on WebVTT Ruby Objects must be set to 'ruby'. [CSSRUBY]
The 'display' property on WebVTT Ruby Text Objects must be set to 'ruby-text'. [CSSRUBY]
Every WebVTT region object is initialized with the following CSS settings:
The variables width, height, top, and left are the values with those names determined by the rules for updating the display of WebVTT text tracks for the text track region from which the WebVTT region object was constructed.
The children of every WebVTT region object are further initialized with these CSS settings:
All other non-inherited properties must be set to their initial values; inherited properties on the root list of WebVTT Node Objects must inherit their values from the media element for which the text track cue is being rendered, if any. If there is no media element (i.e. if the text track is being rendered for another media playback mechanism), then inherited properties on the root list of WebVTT Node Objects and the WebVTT region objects must take their initial values.
If there are style sheets that apply to the media element or other playback mechanism, then they must be interpreted as defined in the next section.
When a user agent is rendering one or more text track cues according to the rules for updating the display of WebVTT text tracks, WebVTT Node Objects in the list of WebVTT Node Objects used in the rendering can be matched by certain pseudo-selectors as defined below. These selectors can begin or stop matching individual WebVTT Node Objects while a cue is being rendered, even in between applications of the rules for updating the display of WebVTT text tracks (which are only run when the set of active cues changes). User agents that support the pseudo-element described below must dynamically update renderings accordingly. When either 'white-space' or one of the properties corresponding to the 'font' shorthand (including 'line-height') changes value, then the text track cue's text track cue display state must be emptied and the text track's rules for updating the text track rendering must be immediately rerun.
Pseudo-elements apply to elements that are matched by selectors. For the purpose of this section, that element is the matched element. The pseudo-elements defined in the following sections affect the styling of parts of text track cues that are being rendered for the matched element.
If the matched element is not a video element, the pseudo-elements defined below won't have any effect according to this specification.
A CSS user agent that implements the text tracks model must implement the '::cue' and '::cue(selector)' pseudo-elements, and the ':past' and ':future' pseudo-classes.
The '::cue' pseudo-element (with no argument) matches any list of WebVTT Node Objects constructed for the matched element, with the exception that the properties corresponding to the 'background' shorthand must be applied to the WebVTT cue background box rather than the list of WebVTT Node Objects.
The following properties apply to the '::cue' pseudo-element with no argument; other properties set on the pseudo-element must be ignored:
The '::cue(selector)' pseudo-element with an argument must have an argument that consists of a group of selectors. It matches any WebVTT Internal Node Object constructed for the matched element that also matches the given group of selectors, with the nodes being treated as follows:
The document tree against which the selectors are matched is the tree of WebVTT Node Objects rooted at the list of WebVTT Node Objects for the cue.
WebVTT Internal Node Objects are elements in the tree.
For the purposes of element type selectors, the names of WebVTT Internal Node Objects are as given by the following table, where objects having the concrete class given in a cell in the first column have the name given by the second column of the same row:
Concrete class | Name |
---|---|
WebVTT Class Objects | c |
WebVTT Italic Objects | i |
WebVTT Bold Objects | b |
WebVTT Underline Objects | u |
WebVTT Ruby Objects | ruby |
WebVTT Ruby Text Objects | rt |
WebVTT Voice Objects | v |
WebVTT Language Objects | lang |
Other elements (specifically, lists of WebVTT Node Objects) | No explicit name. |
For the purposes of element type and universal selectors, WebVTT Internal Node Objects are considered as being in the namespace expressed as the empty string.
For the purposes of attribute selector matching, WebVTT Internal Node Objects have no attributes, except for WebVTT Voice Objects, which have a single attribute named "voice" whose value is the value of the WebVTT Voice Object, and WebVTT Language Objects, which have a single attribute named "lang" whose value is the object's applicable language.
For the purposes of class selector matching, WebVTT Internal Node Objects have the classes described as the WebVTT Node Object's applicable classes.
For the purposes of the :lang() pseudo-class, WebVTT Internal Node Objects have the language described as the WebVTT Node Object's applicable language.
For the purposes of ID selector matching, lists of WebVTT Node Objects have the ID given by the cue's text track cue identifier, if any.
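As a rough illustration of the rules above, the following Python sketch shows what a selector engine might see for a single WebVTT Internal Node Object. The `InternalNode` class, its `kind` labels, and the `selector_view` helper are hypothetical names introduced only for this example; they are not part of any API defined by this specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Element type names from the table above, keyed by a hypothetical "kind" label.
TYPE_NAMES = {
    "class": "c", "italic": "i", "bold": "b", "underline": "u",
    "ruby": "ruby", "ruby_text": "rt", "voice": "v", "language": "lang",
}


@dataclass
class InternalNode:
    kind: str                                          # one of the TYPE_NAMES keys
    classes: List[str] = field(default_factory=list)   # applicable classes
    language: str = ""                                 # applicable language
    value: str = ""                                    # voice name (voice objects only)


def selector_view(node: InternalNode) -> Dict[str, object]:
    """Return the name, namespace, attributes, classes and language used for matching."""
    attributes: Dict[str, str] = {}
    if node.kind == "voice":
        attributes["voice"] = node.value
    elif node.kind == "language":
        attributes["lang"] = node.language
    return {
        "name": TYPE_NAMES[node.kind],
        "namespace": "",            # the namespace expressed as the empty string
        "attributes": attributes,
        "classes": list(node.classes),
        "lang": node.language,      # consulted by the :lang() pseudo-class
    }
```

Under these assumptions, a voice object with the value "Esme" and the class "loud" would be matched by selectors such as `v`, `.loud` and `[voice="Esme"]`.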
The following properties apply to the '::cue()' pseudo-element with an argument:
In addition, the following properties apply to the '::cue()' pseudo-element with an argument when the selector does not contain the ':past' and ':future' pseudo-classes:
Properties that do not apply must be ignored.
As a special exception, the properties corresponding to the 'background' shorthand, when they would have been applied to the list of WebVTT Node Objects, must instead be applied to the WebVTT cue background box.
The ':past' and ':future' pseudo-classes match WebVTT Node Objects only under the conditions defined below. [SELECTORS]
The ':past' pseudo-class only matches WebVTT Node Objects that are in the past.
A WebVTT Node Object c is in the past if, in a pre-order, depth-first traversal of the text track cue's list of WebVTT Node Objects, there exists a WebVTT Timestamp Object whose value is less than the current playback position of the media element that is the matched element, entirely after the WebVTT Node Object c.
The ':future' pseudo-class only matches WebVTT Node Objects that are in the future.
A WebVTT Node Object c is in the future if, in a pre-order, depth-first traversal of the text track cue's list of WebVTT Node Objects, there exists a WebVTT Timestamp Object whose value is greater than the current playback position of the media element that is the matched element, entirely before the WebVTT Node Object c.
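The two definitions above can be sketched in Python as follows. This is a minimal model, not an implementation of any user agent: the `Node` class, the `flatten` helper and the function names are assumptions made for the example, and a WebVTT Timestamp Object is modelled simply as a node whose `timestamp` field is set (in seconds).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    label: str
    timestamp: Optional[float] = None          # set only for timestamp objects
    children: List["Node"] = field(default_factory=list)


def flatten(nodes: List[Node]) -> List[Node]:
    """Return the nodes in pre-order, depth-first order."""
    order: List[Node] = []
    for node in nodes:
        order.append(node)
        order.extend(flatten(node.children))
    return order


def is_past(cue_nodes: List[Node], c: Node, now: float) -> bool:
    """True if a timestamp entirely after c has a value less than now."""
    order = flatten(cue_nodes)
    last = order.index(c) + len(flatten(c.children))  # last node of c's subtree
    return any(n.timestamp is not None and n.timestamp < now
               for n in order[last + 1:])


def is_future(cue_nodes: List[Node], c: Node, now: float) -> bool:
    """True if a timestamp entirely before c has a value greater than now."""
    order = flatten(cue_nodes)
    return any(n.timestamp is not None and n.timestamp > now
               for n in order[:order.index(c)])
```

For a cue whose nodes are text "A", a timestamp at 1 second, text "B", a timestamp at 2 seconds, and text "C", a playback position of 1.5 seconds puts "A" in the past and "C" in the future, while "B" is neither.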
Pseudo-elements apply to elements that are matched by selectors. For the purpose of this section, that element is the matched element. The pseudo-element defined below affects the styling of text track regions that are being rendered for the matched element.
If the matched element is not a video element, the pseudo-element defined below won't have any effect according to this specification.
The '::cue-region' pseudo-element (with no argument) matches any list of WebVTT region objects constructed for the matched element.
The same properties that apply to '::cue' apply to the '::cue-region' pseudo-element with no argument; other properties set on the pseudo-element must be ignored.
When a user agent is rendering one or more text track regions according to the rules for updating the display of WebVTT text tracks, WebVTT region objects used in the rendering can be matched by the above pseudo-element. User agents that support the pseudo-element must dynamically update renderings accordingly. When either 'white-space' or one of the properties corresponding to the 'font' shorthand (including 'line-height') changes value, then the text track cue display state of all the text track cues in the region must be emptied and the text track's rules for updating the text track rendering must be immediately rerun.
A CSS user agent that implements the text tracks model must implement the '::cue-region' pseudo-element.
VTTCue interface
The following interface is used to expose WebVTT cues in the DOM API:
enum AutoKeyword { "auto" };
enum DirectionSetting { "" /* horizontal */, "rl", "lr" };
enum LineAlignSetting { "start", "middle", "end" };
enum PositionAlignSetting { "start", "middle", "end", "auto" };
enum AlignSetting { "start", "middle", "end", "left", "right" };
[Constructor(double startTime, double endTime, DOMString text)]
interface VTTCue : TextTrackCue {
  attribute VTTRegion? region;
  attribute DirectionSetting vertical;
  attribute boolean snapToLines;
  attribute (double or AutoKeyword) line;
  attribute LineAlignSetting lineAlign;
  attribute (double or AutoKeyword) position;
  attribute PositionAlignSetting positionAlign;
  attribute double size;
  attribute AlignSetting align;
  attribute DOMString text;
  DocumentFragment getCueAsHTML();
};
VTTCue( startTime, endTime, text )
Returns a new VTTCue object, for use with the addCue() method.
The startTime argument sets the text track cue start time.
The endTime argument sets the text track cue end time.
The text argument sets the text track cue text.
cue.region
Returns the VTTRegion object to which this cue belongs, if any, or null otherwise.
Can be set.
cue.vertical
Returns a string representing the text track cue writing direction, as follows:
If it is horizontal: the empty string.
If it is vertical growing left: the string "rl".
If it is vertical growing right: the string "lr".
Can be set.
cue.snapToLines
Returns true if the text track cue snap-to-lines flag is set, false otherwise.
Can be set.
cue.line
Returns the text track cue line position. In the case of the value being auto, the string "auto" is returned.
Can be set.
cue.lineAlign
Returns a string representing the text track cue line alignment, as follows:
If it is start alignment: the string "start".
If it is middle alignment: the string "middle".
If it is end alignment: the string "end".
Can be set.
cue.position
Returns the text track cue text position. In the case of the value being auto, the string "auto" is returned.
Can be set.
cue.positionAlign
Returns a string representing the text track cue text position alignment, as follows:
If it is start alignment: the string "start".
If it is middle alignment: the string "middle".
If it is end alignment: the string "end".
If it is automatic alignment: the string "auto".
Can be set.
cue.size
Returns the text track cue size.
Can be set.
cue.align
Returns a string representing the text track cue text alignment, as follows:
If it is start alignment: the string "start".
If it is middle alignment: the string "middle".
If it is end alignment: the string "end".
If it is left alignment: the string "left".
If it is right alignment: the string "right".
Can be set.
cue.text
Returns the text track cue text in raw unparsed form.
Can be set.
cue.getCueAsHTML()
Returns the text track cue text as a DocumentFragment of HTML elements and other DOM nodes.
The VTTCue(startTime, endTime, text) constructor, when invoked, must run the following steps:
Create a new text track cue. Let cue be that text track cue.
Let cue's text track cue start time be the value of the startTime argument, interpreted as a time in seconds.
Let cue's text track cue end time be the value of the endTime argument, interpreted as a time in seconds.
Let cue's text track cue text be the value of the text argument, and let the rules for rendering the cue in isolation be the rules for interpreting WebVTT cue text.
Let cue's text track cue identifier be the empty string.
Let cue's text track cue pause-on-exit flag be false.
Let cue's text track cue region be null.
Let cue's text track cue writing direction be horizontal.
Let cue's text track cue snap-to-lines flag be true.
Let cue's text track cue line position be auto.
Let cue's text track cue line alignment be start alignment.
Let cue's text track cue text position be auto.
Let cue's text track cue text position alignment be auto.
Let cue's text track cue size be 100.
Let cue's text track cue text alignment be middle alignment.
Return the VTTCue object representing cue.
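The initial state produced by the constructor steps above can be summarised with a small Python model. This is only a sketch: the `CueModel` class and its field names are hypothetical stand-ins for the text track cue concepts, and the string values used for directions and alignments are labels chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class CueModel:
    # Values supplied by the caller, in seconds and raw cue text.
    start_time: float
    end_time: float
    text: str
    # Defaults assigned by the constructor steps.
    identifier: str = ""
    pause_on_exit: bool = False
    region: Optional[object] = None
    writing_direction: str = "horizontal"
    snap_to_lines: bool = True
    line_position: Union[float, str] = "auto"
    line_alignment: str = "start"
    text_position: Union[float, str] = "auto"
    text_position_alignment: str = "auto"
    size: float = 100.0
    text_alignment: str = "middle"


# A cue running from 0 to 2.5 seconds, with every other setting at its default.
cue = CueModel(0.0, 2.5, "Hello, world.")
```

Only the three constructor arguments vary between newly created cues; every other setting starts from the same defaults and is changed afterwards through the attributes described below.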
The region attribute, on getting, must return the VTTRegion object representing the text track cue region of the text track cue that the VTTCue object represents, if any; or null otherwise. On setting, the text track cue region must be set to the new value.
The vertical attribute, on getting, must return the string from the second cell of the row in the table below whose first cell is the text track cue writing direction of the text track cue that the VTTCue object represents:
Text track cue writing direction | vertical value |
---|---|
Horizontal | "" (the empty string) |
Vertical growing left | "rl" |
Vertical growing right | "lr" |
On setting, the text track cue writing direction must be set to the value given in the first cell of the row in the table above whose second cell is a case-sensitive match for the new value.
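The table lookup used by the vertical attribute (and, in the same way, by the other enumerated attributes of this interface) can be sketched in Python as a pair of dictionaries. The function names are hypothetical. The sketch also assumes that a new value with no case-sensitive match leaves the writing direction unchanged, which is how enumeration-typed attributes generally behave but is not stated in the text above.

```python
# First cell -> second cell of the table above.
DIRECTION_TO_VERTICAL = {
    "horizontal": "",
    "vertical growing left": "rl",
    "vertical growing right": "lr",
}
# Second cell -> first cell, used on setting.
VERTICAL_TO_DIRECTION = {value: key for key, value in DIRECTION_TO_VERTICAL.items()}


def get_vertical(writing_direction: str) -> str:
    """Return the attribute value for a given writing direction."""
    return DIRECTION_TO_VERTICAL[writing_direction]


def set_vertical(current_direction: str, new_value: str) -> str:
    """Return the writing direction after assigning new_value to the attribute.

    Assumption of this sketch: values without a case-sensitive match are ignored.
    """
    return VERTICAL_TO_DIRECTION.get(new_value, current_direction)
```

Because the match is case-sensitive, assigning "RL" in this model does not select the vertical growing left direction.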
The snapToLines attribute, on getting, must return true if the text track cue snap-to-lines flag of the text track cue that the VTTCue object represents is set; or false otherwise. On setting, the text track cue snap-to-lines flag must be set if the new value is true, and must be unset otherwise.
The line attribute, on getting, must return the text track cue line position of the text track cue that the VTTCue object represents. The special value auto must be represented as the string "auto". On setting, the text track cue line position must be set to the new value; if the new value is the string "auto", then it must be interpreted as the special value auto.
The lineAlign attribute, on getting, must return the string from the second cell of the row in the table below whose first cell is the text track cue line alignment of the text track cue that the VTTCue object represents:
Text track cue line alignment | lineAlign value |
---|---|
Start alignment | "start" |
Middle alignment | "middle" |
End alignment | "end" |
On setting, the text track cue line alignment must be set to the value given in the first cell of the row in the table above whose second cell is a case-sensitive match for the new value.
The position attribute, on getting, must return the text track cue text position of the text track cue that the VTTCue object represents. The special value auto must be represented as the string "auto". On setting, if the new value is negative or greater than 100, then an IndexSizeError exception must be thrown. Otherwise, the text track cue text position must be set to the new value; if the new value is the string "auto", then it must be interpreted as the special value auto.
The positionAlign attribute, on getting, must return the string from the second cell of the row in the table below whose first cell is the text track cue text position alignment of the text track cue that the VTTCue object represents:
Text track cue text position alignment | positionAlign value |
---|---|
Start alignment | "start" |
Middle alignment | "middle" |
End alignment | "end" |
Automatic alignment | "auto" |
On setting, the text track cue text position alignment must be set to the value given in the first cell of the row in the table above whose second cell is a case-sensitive match for the new value.
The size attribute, on getting, must return the text track cue size of the text track cue that the VTTCue object represents. On setting, if the new value is negative or greater than 100, then an IndexSizeError exception must be thrown. Otherwise, the text track cue size must be set to the new value.
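The range checks that the position and size setters perform can be sketched in Python as below. The `IndexSizeError` class here is a local stand-in for the DOM exception of the same name, and the function names are hypothetical. Rejecting strings other than "auto" with a `TypeError` is an assumption of this sketch rather than something stated above.

```python
from typing import Union


class IndexSizeError(Exception):
    """Local stand-in for the DOM IndexSizeError exception."""


def set_size(new_value: float) -> float:
    """Validate a new cue size and return the value to store."""
    if new_value < 0 or new_value > 100:
        raise IndexSizeError("size must be in the range 0..100")
    return float(new_value)


def set_position(new_value: Union[float, str]) -> Union[float, str]:
    """Validate a new text position; the string "auto" selects the special value."""
    if isinstance(new_value, str):
        if new_value == "auto":
            return "auto"
        # Assumption of this sketch: other strings are not accepted.
        raise TypeError("position must be a number or the string 'auto'")
    if new_value < 0 or new_value > 100:
        raise IndexSizeError("position must be in the range 0..100")
    return float(new_value)
```

The "auto" case is tested first in this sketch only because a string cannot be compared against the numeric bounds; the accepted values are the same as in the prose.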
The align attribute, on getting, must return the string from the second cell of the row in the table below whose first cell is the text track cue text alignment of the text track cue that the VTTCue object represents:
Text track cue text alignment | align value |
---|---|
Start alignment | "start" |
Middle alignment | "middle" |
End alignment | "end" |
Left alignment | "left" |
Right alignment | "right" |
On setting, the text track cue text alignment must be set to the value given in the first cell of the row in the table above whose second cell is a case-sensitive match for the new value.
The text attribute, on getting, must return the raw text track cue text of the text track cue that the VTTCue object represents. On setting, the text track cue text must be set to the new value.
The getCueAsHTML() method must convert the text track cue text to a DocumentFragment for the responsible document specified by the entry settings object by applying the WebVTT cue text DOM construction rules to the result of applying the WebVTT cue text parsing rules to the text track cue text.
VTTRegion interface
The following interface is used to expose WebVTT regions in the DOM API:
enum ScrollSetting { "" /* none */, "up" };
[Constructor]
interface VTTRegion {
  attribute double width;
  attribute long lines;
  attribute double regionAnchorX;
  attribute double regionAnchorY;
  attribute double viewportAnchorX;
  attribute double viewportAnchorY;
  attribute ScrollSetting scroll;
};
VTTRegion()
Returns a new VTTRegion object.
region.width
Returns the text track region width as a percentage of the video width. Can be set. Throws an IndexSizeError if the new value is not in the range 0..100.
region.lines
Returns the text track region height as a number of lines. Can be set.
region.regionAnchorX
Returns the text track region anchor X offset as a percentage of the region width. Can be set. Throws an IndexSizeError if the new value is not in the range 0..100.
region.regionAnchorY
Returns the text track region anchor Y offset as a percentage of the region height. Can be set. Throws an IndexSizeError if the new value is not in the range 0..100.
region.viewportAnchorX
Returns the text track region viewport anchor X offset as a percentage of the video width. Can be set. Throws an IndexSizeError if the new value is not in the range 0..100.
region.viewportAnchorY
Returns the text track region viewport anchor Y offset as a percentage of the video height. Can be set. Throws an IndexSizeError if the new value is not in the range 0..100.
region.scroll
Returns a string representing the text track region scroll as follows:
If it is none: the empty string.
If it is up: the string "up".
Can be set.
The VTTRegion() constructor, when invoked, must run the following steps:
Create a new text track region. Let region be that text track region.
Let region's text track region identifier be the empty string.
Let region's text track region width be 100.
Let region's text track region lines be 3.
Let region's text track region regionAnchorX be 0.
Let region's text track region regionAnchorY be 100.
Let region's text track region viewportAnchorX be 0.
Let region's text track region viewportAnchorY be 100.
Let region's text track region scroll be the empty string.
Return the VTTRegion object representing region.
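As with cues, the state that the region constructor steps establish can be modelled with a small Python class. The `RegionModel` class, its field names, the `checked_percentage` helper and the local `IndexSizeError` class are all hypothetical; the helper mirrors the 0..100 range described for the percentage-valued attributes.

```python
from dataclasses import dataclass


class IndexSizeError(Exception):
    """Local stand-in for the DOM IndexSizeError exception."""


def checked_percentage(value: float) -> float:
    """Return value as a float, or raise if it is outside the range 0..100."""
    if value < 0 or value > 100:
        raise IndexSizeError("value must be in the range 0..100")
    return float(value)


@dataclass
class RegionModel:
    # Defaults assigned by the constructor steps.
    identifier: str = ""
    width: float = 100.0
    lines: int = 3
    region_anchor_x: float = 0.0
    region_anchor_y: float = 100.0
    viewport_anchor_x: float = 0.0
    viewport_anchor_y: float = 100.0
    scroll: str = ""


# A new region, narrowed to half of the video width.
region = RegionModel()
region.width = checked_percentage(50)
```

With these defaults, a new region spans the full video width, is three lines tall, and has both its region anchor and its viewport anchor at (0, 100).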
The width attribute, on getting, must return the text track region width of the text track region that the VTTRegion object represents, in percent of video width. On setting, the text track region width must be set to the new value, interpreted as a percentage.
The lines attribute, on getting, must return the text track region lines of the text track region that the VTTRegion object represents, as number of lines. On setting, the text track region lines must be set to the new value, interpreted as a number of lines.
The regionAnchorX attribute, on getting, must return the text track region anchor X offset of the text track region that the VTTRegion object represents, in percent of region width. On setting, the text track region anchor X distance must be set to the new value, interpreted as a percentage.
The regionAnchorY attribute, on getting, must return the text track region anchor Y offset of the text track region that the VTTRegion object represents, in percent of region height. On setting, the text track region anchor Y distance must be set to the new value, interpreted as a percentage.
The viewportAnchorX attribute, on getting, must return the text track region viewport anchor X offset of the text track region that the VTTRegion object represents, in percent of video width. On setting, the text track region viewport anchor X distance must be set to the new value, interpreted as a percentage.
The viewportAnchorY attribute, on getting, must return the text track region viewport anchor Y offset of the text track region that the VTTRegion object represents, in percent of video height. On setting, the text track region viewport anchor Y distance must be set to the new value, interpreted as a percentage.
The scroll attribute, on getting, must return the string from the second cell of the row in the table below whose first cell is the text track region scroll setting of the text track region that the VTTRegion object represents:
Text track region scroll | scroll value |
---|---|
None | "" (the empty string) |
Up | "up" |
On setting, the text track region scroll must be set to the value given in the first cell of the row in the table above whose second cell is a case-sensitive match for the new value.
text/vtt
This registration is for community review and will be submitted to the IESG for review, approval, and registration with IANA.
Text track files themselves pose no immediate risk unless sensitive information is included within the data. Implementations, however, are required to follow specific rules when processing text tracks, to ensure that certain origin-based restrictions are honored. Failure to correctly implement these rules can result in information leakage, cross-site scripting attacks, and the like.
Rules for processing both conforming and non-conforming content are defined in this specification.
WebVTT files all begin with one of the following byte sequences (where "EOF" means the end of the file):
(An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, line break, or the end of the file.)
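The parenthetical description above translates directly into a byte-level check. The following Python sketch is one way to write it; the function name `looks_like_webvtt` is hypothetical, and treating "line break" as either a line feed or a carriage return byte is an assumption of the sketch.

```python
UTF8_BOM = b"\xef\xbb\xbf"
SIGNATURE = b"WEBVTT"
# Space, tab, line feed, carriage return (the last two assumed to be "line break").
TERMINATORS = (0x20, 0x09, 0x0A, 0x0D)


def looks_like_webvtt(data: bytes) -> bool:
    """Return True if data starts with one of the WebVTT byte sequences."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]        # the BOM is optional
    if not data.startswith(SIGNATURE):
        return False
    rest = data[len(SIGNATURE):]
    # The signature must be followed by the end of the file or a terminator byte.
    return len(rest) == 0 or rest[0] in TERMINATORS
```

A check like this only identifies the resource type; it says nothing about whether the rest of the file is conforming.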
"vtt"
Fragment identifiers have no meaning with text/vtt resources.
Thanks to the SubRip community, including in particular Zuggy and ai4spam, for their work on the SubRip software program whose SRT file format was used as the basis for the WebVTT text track file format.
Thanks to the many contributors to the HTML standard, where WebVTT was originally specified. [HTML]
Thanks to the following active contributors to this spec: Glenn Adams, Victor Cărbune, Eric Carlson, Anna Cavender, Cyril Concolato, Rick Eyre, fantasai, John Foliot, Lawrence Forooghian, Ralph Giles, Loretta Guarino Reid, Kyle Huey, Anne van Kesteren, Glenn Maynard, Ronny Mennerich, Ms2ger, Frank Olivier, Giuseppe Pascale, Simon Pieters, Caitlin Potter, David Singer.
1.2 Comments
This section is non-normative.
Comments can be included in WebVTT files.
Comments are just blocks that are preceded by a blank line, start with the word "NOTE" (followed by a space or newline), and end at the first blank line.
Here, a one-line comment is used to note a possible problem with a cue.
In this example, the author has written many comments.
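The informal description of comments given in this section can be sketched as a small Python function that collects comment blocks from the text of a file. The `find_comments` function and the `SAMPLE` file contents are hypothetical and exist only for illustration. The sketch follows the wording of this section and is not a substitute for the parser defined elsewhere in this specification.

```python
from typing import List

# Hypothetical file contents with a one-line comment between two cues.
SAMPLE = """WEBVTT

00:01.000 --> 00:04.000
First cue.

NOTE Check the timing of the next cue.

00:05.000 --> 00:09.000
Second cue.
"""


def find_comments(text: str) -> List[List[str]]:
    """Return each comment block as a list of its lines."""
    lines = text.splitlines()
    comments: List[List[str]] = []
    i = 0
    while i < len(lines):
        after_blank = i > 0 and lines[i - 1] == ""
        # "NOTE" followed by a newline (the whole line) or by a space.
        is_note = lines[i] == "NOTE" or lines[i].startswith("NOTE ")
        if after_blank and is_note:
            block: List[str] = []
            while i < len(lines) and lines[i] != "":
                block.append(lines[i])    # the comment ends at the first blank line
                i += 1
            comments.append(block)
        else:
            i += 1
    return comments
```

Calling `find_comments(SAMPLE)` returns a single block containing the one NOTE line; a comment that spans several lines is returned as one block with one entry per line.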