Revisiting in-band text tracks in MediaSource Extensions

Alicia Boya García
(Igalia, W3C MEIG)

Participation policies

Assumptions

I will assume you have some familiarity with MSE (MediaSource Extensions).

Knowledge of specific text track formats is not assumed.

Agenda

Informative

Discussion

Introduction to WebVTT

WebVTT

Web Video Text Tracks Format

Simplest WebVTT

Basic syntax inspired by SRT

WEBVTT

00:00:01.000 --> 00:00:02.430
Good evening!

00:00:02.800 --> 00:00:05.000
Is anyone there?
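
To make the syntax concrete, here is a minimal parsing sketch for only the subset shown above (header, timing line, payload). The names `Cue`, `parse_timestamp` and `parse_simple_vtt` are hypothetical, and this is not a conforming WebVTT parser.

```python
import re
from typing import List, NamedTuple

# The hours field is optional in WebVTT timestamps; mm:ss.ttt is mandatory.
TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})")


class Cue(NamedTuple):
    start_ms: int
    end_ms: int
    text: str


def parse_timestamp(ts: str) -> int:
    """Convert a WebVTT timestamp to integer milliseconds."""
    m = TIMESTAMP.fullmatch(ts)
    if m is None:
        raise ValueError("bad timestamp: %r" % ts)
    hours, minutes, seconds, millis = m.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_simple_vtt(source: str) -> List[Cue]:
    """Parse the header plus blocks made of a timing line and a payload."""
    blocks = re.split(r"\n{2,}", source.replace("\r\n", "\n").strip())
    if not blocks[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT header")
    cues = []
    for block in blocks[1:]:
        lines = block.split("\n")
        # Anything after the end timestamp (cue settings) is ignored here.
        start, arrow, end = lines[0].split()[:3]
        if arrow != "-->":
            raise ValueError("expected a timing line: %r" % lines[0])
        cues.append(Cue(parse_timestamp(start), parse_timestamp(end),
                        "\n".join(lines[1:])))
    return cues


SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:02.430
Good evening!

00:00:02.800 --> 00:00:05.000
Is anyone there?
"""

cues = parse_simple_vtt(SAMPLE)
```

Times are kept as integer milliseconds to avoid floating-point comparisons later on.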

Cue settings

WEBVTT

00:00:01.000 --> 00:00:02.430 position:10% align:left
Good evening!

00:00:02.800 --> 00:00:05.000 position:90% align:right
Is anyone there?
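
Cue settings are space-separated `name:value` pairs after the end timestamp. A small sketch of splitting them out, with `split_timing_line` as a hypothetical helper; values are left as raw strings and not validated.

```python
from typing import Dict, Tuple


def split_timing_line(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (start, end, settings) for a WebVTT cue timing line."""
    start, arrow, end, *raw_settings = line.split()
    if arrow != "-->":
        raise ValueError("expected a timing line: %r" % line)
    settings = {}
    for raw in raw_settings:
        name, sep, value = raw.partition(":")
        # Skip settings with no colon, or with the colon first or last.
        if not sep or not name or not value:
            continue
        settings[name] = value
    return start, end, settings


start, end, settings = split_timing_line(
    "00:00:01.000 --> 00:00:02.430 position:10% align:left"
)
```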

Cue IDs

IDs are available to scripting and stylesheets

WEBVTT

An ID for an important cue
00:00:01.000 --> 00:00:02.430
Good evening!

Comment blocks

WEBVTT

00:00:01.000 --> 00:00:02.430
Good evening!

NOTE Is it late enough to use "good evening"? I'm not sure.

00:00:02.800 --> 00:00:05.000
Is anyone there?

Stylesheets

WEBVTT

STYLE
::cue {
  background-color: lightgray;
  color: black;
}

00:00:01.000 --> 00:00:02.430
Good evening!

Regions

WEBVTT

REGION
id:fred
width:40%
lines:3
regionanchor:0%,100%
viewportanchor:10%,90%
scroll:up

Cues overlapping in time

The start timestamps of cues must be in non-decreasing order; the cues themselves may overlap in time

WEBVTT

00:00:01.000 --> 00:00:02.430
Good evening!

00:00:01.400 --> 00:00:06.120 region:sfx_top
(bells chime)

00:00:02.800 --> 00:00:05.000
Is anyone there?
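
The ordering rule and the overlap can be checked separately. A sketch over `(start_ms, end_ms)` pairs, assuming times were already converted to milliseconds; both function names are hypothetical.

```python
from typing import List, Tuple

Timing = Tuple[int, int]  # (start_ms, end_ms)


def starts_in_order(timings: List[Timing]) -> bool:
    """True if no cue starts earlier than the cue before it."""
    starts = [start for start, _ in timings]
    return all(a <= b for a, b in zip(starts, starts[1:]))


def overlapping_pairs(timings: List[Timing]) -> List[Tuple[int, int]]:
    """Index pairs (i, j), i < j, of cues that are active at the same time."""
    pairs = []
    for i, (start_i, end_i) in enumerate(timings):
        for j in range(i + 1, len(timings)):
            start_j, end_j = timings[j]
            if start_i < end_j and start_j < end_i:
                pairs.append((i, j))
    return pairs


# The three cues from the example above.
example = [(1000, 2430), (1400, 6120), (2800, 5000)]
```

Here the file is well ordered even though the second cue overlaps both of its neighbours.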

Delayed parts

WEBVTT

00:00:01.000 --> 00:00:05.000
Good evening...! <00:00:02.800>Is anyone there?

00:00:01.400 --> 00:00:06.120 region:sfx_top
(bells chime)
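
A sketch of how a payload with inline timestamps could be split into parts, each labelled with the time at which it appears. `split_delayed_parts` is a hypothetical helper, and other cue markup (such as `<b>` or `<v>`) is left untouched.

```python
import re
from typing import List, Tuple

# Matches inline timestamp tags such as <00:00:02.800>.
TS_TAG = re.compile(r"<((?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3})>")


def split_delayed_parts(payload: str, cue_start: str) -> List[Tuple[str, str]]:
    """Return (timestamp, text) pairs; the first part uses the cue start."""
    parts = []
    current = cue_start
    pos = 0
    for match in TS_TAG.finditer(payload):
        parts.append((current, payload[pos:match.start()]))
        current = match.group(1)
        pos = match.end()
    parts.append((current, payload[pos:]))
    return parts


parts = split_delayed_parts(
    "Good evening...! <00:00:02.800>Is anyone there?", "00:00:01.000"
)
```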

A look at in-band WebVTT

In-band WebVTT

WebVTT is placed inside a container format:

WebVTT in ISO BMFF (MP4)

WebVTT in ISO BMFF (MP4)

Initialization segment (moov)

Text track with a WebVTT-specific sample entry (codec configuration):

wvtt WVTTSampleEntry
  • vttC WebVTTConfigurationBox
    • 1 String: WebVTT file header
  • vlab WebVTTSourceLabelBox (optional)
    • 0..1 String: opaque URI.
      Used to tell apart any two cues from two different movies
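
A sketch of how this sample entry could be serialized. It assumes the usual ISO BMFF layout: a box is a 32-bit big-endian size, a four-character type, then the payload, and a sample entry starts with 6 reserved bytes and a 16-bit data reference index. It also assumes the strings are stored as UTF-8 running to the end of the box. `box` and `wvtt_sample_entry` are hypothetical helpers.

```python
import struct
from typing import Optional


def box(fourcc: bytes, payload: bytes = b"") -> bytes:
    """Serialize a plain ISO BMFF box: size, type, payload."""
    if len(fourcc) != 4:
        raise ValueError("box type must be 4 bytes")
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload


def wvtt_sample_entry(header: str, source_label: Optional[str] = None,
                      data_reference_index: int = 1) -> bytes:
    # SampleEntry fields: 6 reserved zero bytes + data_reference_index.
    body = bytes(6) + struct.pack(">H", data_reference_index)
    # Mandatory WebVTTConfigurationBox carrying the WebVTT file header.
    body += box(b"vttC", header.encode("utf-8"))
    # Optional WebVTTSourceLabelBox carrying the opaque URI.
    if source_label is not None:
        body += box(b"vlab", source_label.encode("utf-8"))
    return box(b"wvtt", body)


entry = wvtt_sample_entry("WEBVTT")
```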

WebVTT in ISO BMFF (MP4)

Media segment (mdat)

Timing is handled by the container.

Cues are split into continuous non-overlapping frames (samples).

The frame contents are ISO BMFF boxes.
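
One way to picture the split: cut the timeline at every cue start and end, then list the cues active in each resulting interval. The sketch below assumes a cue spanning several intervals is simply repeated in each of them, and that intervals with no active cue become empty frames; `split_into_samples` is a hypothetical name.

```python
from typing import List, Tuple

Cue = Tuple[int, int, str]           # (start_ms, end_ms, text)
Sample = Tuple[int, int, List[str]]  # (start_ms, end_ms, active cue texts)


def split_into_samples(cues: List[Cue], timeline_start_ms: int = 0) -> List[Sample]:
    """Cut the timeline at every cue boundary; no two samples overlap."""
    boundaries = {timeline_start_ms}
    for start, end, _ in cues:
        boundaries.update((start, end))
    edges = sorted(b for b in boundaries if b >= timeline_start_ms)
    samples = []
    for t0, t1 in zip(edges, edges[1:]):
        # A cue is active if it covers the whole interval [t0, t1).
        active = [text for start, end, text in cues if start < t1 and end > t0]
        samples.append((t0, t1, active))
    return samples


# The overlapping cues from the earlier example.
cues = [
    (1000, 2430, "Good evening!"),
    (1400, 6120, "(bells chime)"),
    (2800, 5000, "Is anyone there?"),
]
samples = split_into_samples(cues)
```

With these inputs, three cues become six frames: one empty frame before the first cue and five frames carrying one or two cues each.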

WebVTT in ISO BMFF (MP4)

Media segment (mdat)

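Two types of frames: vttc (VTTCueBox, detailed below) and vtte (VTTEmptyCueBox, an empty frame covering a span with no cues).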

vttc VTTCueBox
  • 0..1 vsid CueSourceIDBox
    • int32: along with source label, uniquely identifies this cue
  • 0..1 iden CueIDBox
    • string: WebVTT cue identifier (e.g. for scripts and CSS)
  • 0..1 ctim CueTimeBox
    • string: Original cue timestamp (used for cues with delayed parts)
  • 0..1 sttg CueSettingsBox
  • 1 payl CuePayloadBox
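
A sketch of building such a frame, assuming plain ISO BMFF boxes (32-bit big-endian size, four-character type, payload) and UTF-8 strings running to the end of each box. The `vtte` type used for the empty frame is, as far as I recall, the one ISO/IEC 14496-30 assigns to VTTEmptyCueBox. All helper names are hypothetical.

```python
import struct
from typing import Optional


def box(fourcc: bytes, payload: bytes = b"") -> bytes:
    """Serialize a plain ISO BMFF box: size, type, payload."""
    if len(fourcc) != 4:
        raise ValueError("box type must be 4 bytes")
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload


def vttc(payload_text: str, source_id: Optional[int] = None,
         cue_id: Optional[str] = None, original_time: Optional[str] = None,
         settings: Optional[str] = None) -> bytes:
    """Build a VTTCueBox; only the payload child is mandatory."""
    children = b""
    if source_id is not None:
        children += box(b"vsid", struct.pack(">i", source_id))
    if cue_id is not None:
        children += box(b"iden", cue_id.encode("utf-8"))
    if original_time is not None:
        children += box(b"ctim", original_time.encode("utf-8"))
    if settings is not None:
        children += box(b"sttg", settings.encode("utf-8"))
    children += box(b"payl", payload_text.encode("utf-8"))
    return box(b"vttc", children)


def vtte() -> bytes:
    """Build an empty-cue frame (assumed to have no children)."""
    return box(b"vtte")


frame = vttc("Hi")
styled = vttc("Good evening!", source_id=1, settings="position:10% align:left")
```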

WebVTT in WebM

Two competing representations:

WebVTT in WebM

Common to D_WEBVTT/kind and S_TEXT/WEBVTT

That's enough background...

Let's talk about MSE

Cues vs MSE coded frames

Coded frames in the MSE spec roughly correspond to frames in a container.

How many coded frames is a WebVTT cue?

Gaps and sparse streams

Consider WebVTT inside MP4

Is a VTTEmptyCueBox frame an MSE coded frame?
... or should it be something new, e.g. coded gap?
... or should it be ignored per spec?

Gaps and sparse streams

Consider other formats

Gaps and sparse streams

Consider generalization to non-text streams

Use cases

Assuming audio and video in separate SourceBuffers...

SourceBuffer with only a text track

Currently de facto unsupported

Cues across segment boundaries

WebVTT in MP4

WebVTT in WebM MSE bytestream

Potential problems identified

Embedded text tracks

CTA/CEA/EIA-608/708

Embedded text tracks

ID3 Timed Text

This is the end of the slides

Discussion time