Revisiting in-band text tracks in MediaSource Extensions
Alicia Boya García
(Igalia, W3C MEIG)
Assumptions
I will assume you have some familiarity with MSE (MediaSource Extensions).
Knowledge of specific text track formats is not assumed.
- Out-of-band text track: provided as a separate file: .srt, .vtt
- In-band text track: part of a container file: .mp4, .webm, .mkv (see the sketch below)
- The same file can contain video and audio
... but it doesn't necessarily have to: .mks
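To make the distinction concrete, here is a rough sketch of how each kind of track reaches a web page (file names are placeholders):

// Out-of-band: the page attaches a separate .vtt file with a <track> element.
const video = document.querySelector('video');
const track = document.createElement('track');
track.kind = 'subtitles';
track.srclang = 'en';
track.src = 'subtitles.vtt'; // placeholder URL
video.appendChild(track);

// In-band: tracks demuxed from the container (e.g. an .mp4 or .mkv carrying a
// text track) are exposed through video.textTracks with no extra markup.
video.textTracks.addEventListener('addtrack', (event) => {
  console.log('text track:', event.track.kind, event.track.language);
});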
Agenda
Informative
- Introduction to WebVTT
- WebVTT representations in MP4 and WebM
Discussion
- Challenges with text tracks in MSE
WebVTT
- Reasonable first target for in-band support in MSE
- W3C Candidate Recommendation
- Widely available in browsers for out-of-band text tracks
- Supported by some non-browser players as well
Simplest WebVTT
Basic syntax inspired by SRT
WEBVTT
00:00:01.000 --> 00:00:02.430
Good evening!
00:00:02.800 --> 00:00:05.000
Is anyone there?
Cue settings
WEBVTT
00:00:01.000 --> 00:00:02.430 position:10% align:left
Good evening!
00:00:02.800 --> 00:00:05.000 position:90% align:right
Is anyone there?
Cue IDs
IDs are available to scripting and stylesheets (see the lookup sketch below)
WEBVTT
An ID for an important cue
00:00:01.000 --> 00:00:02.430
Good evening!
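For example (a sketch, assuming the file above is loaded into the first text track of a video element), a script can retrieve the cue through its ID:

const video = document.querySelector('video');
const track = video.textTracks[0];
track.mode = 'hidden'; // ensure cues are loaded even if not rendered

// TextTrackCueList.getCueById() returns the first cue with a matching identifier.
const cue = track.cues?.getCueById('An ID for an important cue');
if (cue) {
  console.log(cue.id, cue.startTime, cue.endTime);
}

On the stylesheet side, the same identifier can be matched with the ::cue(#id) selector.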
Comment blocks
WEBVTT
00:00:01.000 --> 00:00:02.430
Good evening!
NOTE Is it late enough to use "good evening"? I'm not sure.
00:00:02.800 --> 00:00:05.000
Is anyone there?
Stylesheets
WEBVTT
STYLE
::cue {
background-color: lightgray;
color: black;
}
00:00:01.000 --> 00:00:02.430
Good evening!
Regions
WEBVTT
REGION
id:fred
width:40%
lines:3
regionanchor:0%,100%
viewportanchor:10%,90%
scroll:up
Cues overlapping in time
The start timestamps of cues must be in increasing order
WEBVTT
00:00:01.000 --> 00:00:02.430
Good evening!
00:00:01.400 --> 00:00:06.120 region:sfx_top
(bells chime)
00:00:02.800 --> 00:00:05.000
Is anyone there?
Delayed parts
WEBVTT
00:00:01.000 --> 00:00:05.000
Good evening...! <00:00:02.800>Is anyone there?
00:00:01.400 --> 00:00:06.120 region:sfx_top
(bells chime)
In-band WebVTT
WebVTT is placed inside a container format:
- ISO BMFF/MP4
- WebM/Matroska
- I'm not aware of any MPEG2-TS representation
WebVTT in ISO BMFF (MP4)
- Base specification: ISO/IEC 14496 Part 12:
ISO base media file format
- WebVTT representation specified in ISO/IEC 14496 Part 30:
Timed text and other visual overlays in ISO base media file format
- 'codecs' should contain 'wvtt'
WebVTT in ISO BMFF (MP4)
Initialization segment (moov)
Text track with a WebVTT-specific sample entry
(codec configuration):
wvtt
WVTTSampleEntry
- vttC WebVTTConfigurationBox
- 1 String: WebVTT file header
- vlab WebVTTSourceLabelBox (optional)
- 0..1 String: opaque URI, used to tell apart cues coming from two different movies
WebVTT in ISO BMFF (MP4)
Media segment (mdat)
Timing is handled by the container.
Cues are split into continuous, non-overlapping frames (samples).
The frame contents are ISO BMFF boxes.
WebVTT in ISO BMFF (MP4)
Media segment (mdat)
Two types of frames:
- Gap: No cues for a certain period (a vtte VTTEmptyCueBox)
- Non-gap:
- 1..* vttc VTTCueBox
- 0..* vtta VTTAdditionalBox
vttc
VTTCueBox
- 0..1 vsid CueSourceIDBox
- int32: along with the source label, uniquely identifies this cue
- 0..1 iden CueIDBox
- string: WebVTT cue identifier (e.g. for scripts and CSS)
- 0..1 ctim CueTimeBox
- string: Original cue timestamp (used for cues with delayed parts)
- 0..1 sttg CueSettingsBox
- 1 payl CuePayloadBox
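Putting the last two slides together, the per-sample content can be pictured as nested records; this is only an illustration of the box layout above, not normative syntax:

// One non-gap WebVTT sample inside mdat. Start time and duration come from
// the container's sample timing, not from these boxes.
interface VTTCueBox {        // 'vttc'
  cueSourceId?: number;      // 'vsid': int32, combined with the vlab source label
  cueId?: string;            // 'iden': WebVTT cue identifier for scripts and CSS
  cueTime?: string;          // 'ctim': original cue timestamp (delayed parts)
  cueSettings?: string;      // 'sttg': e.g. "position:10% align:left"
  cuePayload: string;        // 'payl': cue text, the only mandatory box
}

interface WebVTTSample {
  cues: VTTCueBox[];         // 1..* 'vttc'; a gap sample carries a 'vtte' VTTEmptyCueBox instead
  additionalText?: string[]; // 0..* 'vtta'
}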
WebVTT in WebM
Two competing representations: D_WEBVTT/kind and S_TEXT/WEBVTT
WebVTT in WebM
Common to D_WEBVTT/kind and S_TEXT/WEBVTT
- ✅ Timing is handled by the container.
- 🤔 One cue = one frame
- Overlapping cues are encoded as overlapping frames
- ❌ Gaps are not explicitly encoded
- ❌ No provision for how to join cues split at segment boundaries
That's enough background...
Let's talk about MSE
Cues vs MSE coded frames
Coded frames in the MSE spec roughly correspond to frames in a container.
How many coded frames does a WebVTT cue correspond to?
- one MSE coded frame = one WebVTT cue?
- have it be dependent on the bytestream format (MP4 vs WebM)?
- have it be an implementation detail?
- have it be consistent, but something else (maybe similar to MP4)?
Gaps and sparse streams
Consider WebVTT inside MP4
Is a VTTEmptyCueBox frame an MSE coded frame?
... or should it be something new, e.g. a coded gap?
... or should it be ignored per spec?
Gaps and sparse streams
Consider other formats
Gaps and sparse streams
Consider generalization to non-text streams
- Audio gap: silent section.
- Video gap: continuation of the last frame or replacement image.
Assuming audio and video in separate SourceBuffers...
- Live playback (e.g. sports)
- Continuing playback even if chunks of audio and/or video are missing
- Splicing a silent ad into a video with audio
SourceBuffer with only a text track
Currently de facto unsupported
- application/mp4; codecs="wvtt" is missing from the MSE bytestream spec (see the sketch below)
- MSE spec currently assumes text streams are discontinuous
- Buffered ranges are computed only from video and audio
- As a result, the SourceBuffer buffered ranges are empty, stalling playback
- Should a SourceBuffer with only a text track work?
- ... only when using representations with explicit gaps?
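A minimal sketch of what a text-only SourceBuffer would look like from script, assuming the MIME string below were accepted; today isTypeSupported() typically returns false for it, and the appended segment here is hypothetical:

declare const initSegment: ArrayBuffer; // hypothetical: moov with the wvtt sample entry

const video = document.querySelector('video');
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

const type = 'application/mp4; codecs="wvtt"'; // not in the MSE byte stream registry today

mediaSource.addEventListener('sourceopen', () => {
  if (!MediaSource.isTypeSupported(type)) {
    // Current reality: fall back to out-of-band tracks, or parse cues in the
    // application and feed them through video.addTextTrack().
    return;
  }
  const sourceBuffer = mediaSource.addSourceBuffer(type);
  sourceBuffer.appendBuffer(initSegment); // moof/mdat fragments with vttc samples would follow
});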
Cues across segment boundaries
WebVTT in MP4
- One container frame ≠ one cue
- The demuxer can tell that a new frame extends an earlier cue
- Can MSE tell that a cue spanning two appends is being extended?
- Requirement or quality of implementation issue?
- If it can tell, how should it present it to the user?
- Update the cue and emit "oncuechange"? (see the sketch below)
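For illustration, this is what a page could observe under one possible choice (an assumption, not something the current spec mandates): the user agent keeps a single cue object and pushes its endTime forward as later appends arrive.

const video = document.querySelector('video');
const track = video.textTracks[0];
track.mode = 'showing';

track.addEventListener('cuechange', () => {
  // If the UA extends the cue in place, the same cue object keeps showing up
  // here with a growing endTime; a new object would mean a separate cue.
  const cues = track.activeCues;
  for (let i = 0; cues && i < cues.length; i++) {
    console.log(cues[i].id, cues[i].startTime, cues[i].endTime);
  }
});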
WebVTT in WebM MSE bytestream
Potential problems identified
- S_TEXT/WEBVTT vs D_WEBVTT/kind
- As it stands, only S_TEXT/WEBVTT can fully support WebVTT
- No explicit gaps
- No way to split cues across segments
- Are the existing representations viable for MSE?
- If not, what would we need?
- Should we pick one for the bytestream spec?
Embedded text tracks
CTA/CEA/EIA-608/708
- Widely used format for closed captions, especially in broadcast
- Originally encoded in analog broadcast
- Often carried inside H.264/H.265 using SEI messages
- One of the very few ways to stream captions through MPEG2-TS
- Currently recommended by DASH-IF for interoperability
- Can't be detected without some external signalling
Embedded text tracks
ID3 Timed Text
- ID3 tags interleaved with an MPEG2-TS stream (usually HLS)
... or emsg boxes between MP4 fragments
- Normally used for application-specific use cases, not captions
- Ad insertion
- Time-specific metadata
- Can't be detected without some external signalling
This is the end of the slides
Discussion time