Revisiting in-band text tracks in MediaSource Extensions
Alicia Boya García
(Igalia, W3C MEIG)
Assumptions
I will assume you have some familiarity with MSE (MediaSource Extensions).
Knowledge of specific text track formats is not assumed.
- Out-of-band text track: provided as a separate file: .srt, .vtt
- In-band text track: part of a container file: .mp4, .webm, .mkv (see the sketch below)
- The same file can contain video and audio
... but it doesn't necessarily have to: .mks
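To make the distinction concrete, here is a rough sketch of how each kind of track reaches a web page (file names are placeholders):

// Out-of-band: the page attaches a separate .vtt file with a <track> element.
const video = document.querySelector('video');
const track = document.createElement('track');
track.kind = 'subtitles';
track.srclang = 'en';
track.src = 'subtitles.vtt'; // placeholder URL
video.appendChild(track);

// In-band: tracks demuxed from the container (e.g. an .mp4 or .mkv carrying a
// text track) are exposed through video.textTracks with no extra markup.
video.textTracks.addEventListener('addtrack', (event) => {
  console.log('text track:', event.track.kind, event.track.language);
});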
Agenda
Informative
- Introduction to WebVTT
- WebVTT representations in MP4 and WebM
Discussion
- Challenges with text tracks in MSE
WebVTT
- Reasonable first target for in-band support in MSE
- W3C Candidate Recommendation
- Widely available in browsers for out-of-band text tracks
- Supported by some non-browser players as well
Simplest WebVTT
Basic syntax inspired by SRT
WEBVTT
00:00:01.000 --> 00:00:02.430
Good evening!
00:00:02.800 --> 00:00:05.000
Is anyone there?
Cue settings
WEBVTT
00:00:01.000 --> 00:00:02.430 position:10% align:left
Good evening!
00:00:02.800 --> 00:00:05.000 position:90% align:right
Is anyone there?
Cue IDs
IDs are available to scripting and stylesheets (see the lookup sketch below)
WEBVTT
An ID for an important cue
00:00:01.000 --> 00:00:02.430
Good evening!
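For example (a sketch, assuming the file above is loaded into the first text track of a video element), a script can retrieve the cue through its ID:

const video = document.querySelector('video');
const track = video.textTracks[0];
track.mode = 'hidden'; // ensure cues are loaded even if not rendered

// TextTrackCueList.getCueById() returns the first cue with a matching identifier.
const cue = track.cues?.getCueById('An ID for an important cue');
if (cue) {
  console.log(cue.id, cue.startTime, cue.endTime);
}

On the stylesheet side, the same identifier can be matched with the ::cue(#id) selector.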
Comment blocks
WEBVTT
00:00:01.000 --> 00:00:02.430
Good evening!
NOTE Is it late enough to use "good evening"? I'm not sure.
00:00:02.800 --> 00:00:05.000
Is anyone there?
Stylesheets
WEBVTT
STYLE
::cue {
background-color: lightgray;
color: black;
}
00:00:01.000 --> 00:00:02.430
Good evening!
Regions
WEBVTT
REGION
id:fred
width:40%
lines:3
regionanchor:0%,100%
viewportanchor:10%,90%
scroll:up
Cues overlapping in time
The start timestamps of cues must be in increasing order
WEBVTT
00:00:01.000 --> 00:00:02.430
Good evening!
00:00:01.400 --> 00:00:06.120 region:sfx_top
(bells chime)
00:00:02.800 --> 00:00:05.000
Is anyone there?
Delayed parts
WEBVTT
00:00:01.000 --> 00:00:05.000
Good evening...! <00:00:02.800>Is anyone there?
00:00:01.400 --> 00:00:06.120 region:sfx_top
(bells chime)
In-band WebVTT
WebVTT is placed inside a container format:
- ISO BMFF/MP4
- WebM/Matroska
- I'm not aware of any MPEG2-TS representation
WebVTT in ISO BMFF (MP4)
- Base specification: ISO/IEC 14496 Part 12:
ISO base media file format
- WebVTT representation specified in ISO/IEC 14496 Part 30:
Timed text and other visual overlays in ISO base media file format
- 'codecs' should contain 'wvtt'
WebVTT in ISO BMFF (MP4)
Initialization segment (moov)
Text track with a WebVTT-specific sample entry
(codec configuration):
wvtt
WVTTSampleEntry
- vttC WebVTTConfigurationBox
- 1 String: WebVTT file header
- vlab WebVTTSourceLabelBox (optional)
- 0..1 String: opaque URI, used to tell apart cues coming from two different movies
WebVTT in ISO BMFF (MP4)
Media segment (mdat)
Timing is handled by the container.
Cues are split into continuous, non-overlapping frames (samples).
The frame contents are ISO BMFF boxes.
WebVTT in ISO BMFF (MP4)
Media segment (mdat)
Two types of frames:
- Gap: No cues for a certain period (a vtte VTTEmptyCueBox)
- Non-gap:
- 1..* vttc VTTCueBox
- 0..* vtta VTTAdditionalBox
vttc
VTTCueBox
- 0..1 vsid CueSourceIDBox
- int32: along with the source label, uniquely identifies this cue
- 0..1 iden CueIDBox
- string: WebVTT cue identifier (e.g. for scripts and CSS)
- 0..1 ctim CueTimeBox
- string: Original cue timestamp (used for cues with delayed parts)
- 0..1 sttg CueSettingsBox
- 1 payl CuePayloadBox
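Putting the last two slides together, the per-sample content can be pictured as nested records; this is only an illustration of the box layout above, not normative syntax:

// One non-gap WebVTT sample inside mdat. Start time and duration come from
// the container's sample timing, not from these boxes.
interface VTTCueBox {        // 'vttc'
  cueSourceId?: number;      // 'vsid': int32, combined with the vlab source label
  cueId?: string;            // 'iden': WebVTT cue identifier for scripts and CSS
  cueTime?: string;          // 'ctim': original cue timestamp (delayed parts)
  cueSettings?: string;      // 'sttg': e.g. "position:10% align:left"
  cuePayload: string;        // 'payl': cue text, the only mandatory box
}

interface WebVTTSample {
  cues: VTTCueBox[];         // 1..* 'vttc'; a gap sample carries a 'vtte' VTTEmptyCueBox instead
  additionalText?: string[]; // 0..* 'vtta'
}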
WebVTT in WebM
Two competing representations: D_WEBVTT/kind and S_TEXT/WEBVTT
WebVTT in WebM
Common to D_WEBVTT/kind and S_TEXT/WEBVTT
- ✅ Timing is handled by the container.
- 🤔 One cue = one frame
- Overlapping cues are encoded as overlapping frames
- ❌ Gaps are not explicitly encoded
- ❌ No provision for how to join cues split at segment boundaries
That's enough background...
Let's talk about MSE
Cues vs MSE coded frames
Coded frames in the MSE spec roughly correspond to frames in a container.
How many coded frames does a WebVTT cue correspond to?
- one MSE coded frame = one WebVTT cue?
- have it be dependent on the bytestream format (MP4 vs WebM)?
- have it be an implementation detail?
- have it be consistent, but something else (maybe similar to MP4)?
Gaps and sparse streams
Consider WebVTT inside MP4
Is a VTTEmptyCueBox frame an MSE coded frame?
... or should it be something new, e.g. a coded gap?
... or should it be ignored per spec?
Gaps and sparse streams
Consider other formats
Gaps and sparse streams
Consider generalization to non-text streams
- Audio gap: silent section.
- Video gap: continuation of the last frame or replacement image.
Assuming audio and video in separate SourceBuffers...
- Live playback (e.g. sports)
- Continuing playback even if chunks of audio and/or video are missing
- Splicing a silent ad into a video with audio
SourceBuffer with only a text track
Currently de facto unsupported
- application/mp4; codecs="wvtt" is missing from the MSE bytestream spec (see the sketch below)
- MSE spec currently assumes text streams are discontinuous
- Buffered ranges are computed only from video and audio
- As a result, the SourceBuffer buffered ranges are empty, stalling playback
- Should a SourceBuffer with only a text track work?
- ... only when using representations with explicit gaps?
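A minimal sketch of what a text-only SourceBuffer would look like from script, assuming the MIME string below were accepted; today isTypeSupported() typically returns false for it, and the appended segment here is hypothetical:

declare const initSegment: ArrayBuffer; // hypothetical: moov with the wvtt sample entry

const video = document.querySelector('video');
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

const type = 'application/mp4; codecs="wvtt"'; // not in the MSE byte stream registry today

mediaSource.addEventListener('sourceopen', () => {
  if (!MediaSource.isTypeSupported(type)) {
    // Current reality: fall back to out-of-band tracks, or parse cues in the
    // application and feed them through video.addTextTrack().
    return;
  }
  const sourceBuffer = mediaSource.addSourceBuffer(type);
  sourceBuffer.appendBuffer(initSegment); // moof/mdat fragments with vttc samples would follow
});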
Cues across segment boundaries
WebVTT in MP4
- One container frame ≠ one cue
- The demuxer can tell that a new frame extends an earlier cue
- Can MSE tell that a cue spanning two appends is being extended?
- Requirement or quality of implementation issue?
- If it can tell, how should it present it to the user?
- Update the cue and emit "oncuechange"? (see the sketch below)
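For illustration, this is what a page could observe under one possible choice (an assumption, not something the current spec mandates): the user agent keeps a single cue object and pushes its endTime forward as later appends arrive.

const video = document.querySelector('video');
const track = video.textTracks[0];
track.mode = 'showing';

track.addEventListener('cuechange', () => {
  // If the UA extends the cue in place, the same cue object keeps showing up
  // here with a growing endTime; a new object would mean a separate cue.
  const cues = track.activeCues;
  for (let i = 0; cues && i < cues.length; i++) {
    console.log(cues[i].id, cues[i].startTime, cues[i].endTime);
  }
});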
WebVTT in WebM MSE bytestream
Potential problems identified
- S_TEXT/WEBVTT vs D_WEBVTT/kind
- As it stands, only S_TEXT/WEBVTT can fully support WebVTT
- No explicit gaps
- No way to split cues across segments
- Are the existing representations viable for MSE?
- If not, what would we need?
- Should we pick one for the bytestream spec?
Embedded text tracks
CTA/CEA/EIA-608/708
- Widely used format for closed captions, especially in broadcast
- Originally encoded in analog broadcast
- Often carried inside H.264/H.265 using SEI messages
- One of the very few ways to stream captions through MPEG2-TS
- Currently recommended by DASH-IF for interoperability
- Can't be detected without some external signalling
Embedded text tracks
ID3 Timed Text
- ID3 tags interleaved with an MPEG2-TS stream (usually HLS)
... or emsg boxes between MP4 fragments
- Normally used for application-specific use cases, not captions
- Ad insertion
- Time-specific metadata
- Can't be detected without some external signalling
This is the end of the slides
Discussion time