Omnidirectional Media Format (OMAF)
Ye-Kui Wang
Director, Technical Standards
December 10, 2017
A tutorial @ VCIP2017
Tutorial material available here: https://goo.gl/zFqqsC
2
Outline
Concepts: OMAF, 360° video, VR
OMAF: when, who, what, and architecture
OMAF coordinate system and key processing steps
Fisheye 360° video/image
Media storage and metadata signalling in ISOBMFF
Media encapsulation and metadata signalling in DASH
Media profiles and presentation profiles
HEVC omnidirectional video SEI messages
Viewport-dependent omnidirectional video processing
Acknowledgements
References
3
What is OMAF?
It is a systems standard
developed by MPEG
that defines a media format
that enables omnidirectional media applications,
focusing on 360° video, images, and audio, as well as
associated timed text.
4
What is 360° video?
[Figure: the unit sphere with yaw (α), pitch, and roll (γ) rotation axes; only rotational (3DOF) movement is supported]
The user's viewing perspective is from the center of the sphere looking outward towards the inside surface of the sphere.
Purely translational movement of the user would not result in different omnidirectional media being rendered to the user.
5
What is VR?
6
Achieving immersive VR is challenging
• Intuitive, minimal-latency interactions
• Precise, accurate on-device motion tracking
• Minimized system latency to remove perceptible lag
8
OMAF – when
• MPEG started looking at VR standardization and started the OMAF project in Oct. 2015
• Draft international standard (DIS): Apr. 2017 MPEG meeting output (N16824)
• Final draft international standard (FDIS): Oct. 2017 MPEG meeting output (N17235)
9
MPEG-I (ISO/IEC 23090): Coded Representation of Immersive Media
10
OMAF – who
11
OMAF – what
• Scope: 360° video, images, audio, and associated timed text, 3 DOF only
• Specifies
• A coordinate system
• that consists of a unit sphere and three coordinate axes, namely the x (back-to-front) axis, the y (lateral, side-to-side) axis, and the z (vertical, up) axis
12
OMAF architecture – projected omnidirectional video
[Figure: content production flows from acquisition (A/Ba/Bi) through image stitching, rotation, projection, and region-wise packing (D), to audio encoding (Ea), video encoding (Ev), and image encoding (Ei), then file/segment encapsulation (F/Fs) together with metadata, and delivery; the OMAF player reverses the chain via file/segment decapsulation (F’/F’s), audio/video/image decoding (E’a/E’v/E’i), audio rendering to loudspeakers/headphones, and image rendering to a display (A’a/A’i), with head/eye tracking feeding orientation/viewport metadata back into decapsulation and rendering]

Legend:
A: Real-world scene
B: Multiple-sensors-captured video or audio
D/D’: Projected/packed video
E/E’: Coded video or audio bitstream
F/F’: ISOBMFF file/segment
13
OMAF conformance points
[Figure: the OMAF conformance point is the output of the file decoder — an OMAF file parser extracts the video/image track(s) or image item from the file, a video or image decoder produces decoded picture(s), the decoded picture(s) are mapped onto the sphere, and, using orientation/viewport metadata and other media, the rendered viewport is presented in a synchronized and spatially-aligned manner]
14
OMAF coordinate system and key processing steps
15
OMAF coordinate system and key processing steps
The coordinate system
Projection and region-wise packing
− Concept
− Equirectangular projection
− Cubemap projection
− Region-wise packing and guard band
Basic OMAF video processing steps and order
Frame packing for stereoscopic 360o video/image
Rendering
16
The coordinate system

X: back-to-front
Y: lateral, side-to-side
Z: vertical, up

[Figure: the unit sphere with yaw rotation about Z, pitch about Y, and roll about X; a sphere location is identified by azimuth ϕ and elevation θ]

For a sphere location with sphere coordinates (ϕ, θ) on the local coordinate axes:

1. Find the 3D Cartesian coordinates (x1, y1, z1):

   x1 = cos θ cos ϕ,  y1 = cos θ sin ϕ,  z1 = sin θ

2. Apply the 3D Cartesian rotation to obtain the 3D Cartesian coordinates (x2, y2, z2) on the global coordinate axes:

   | x2 |   | cos β cos γ                       −cos β sin γ                       sin β         |   | x1 |
   | y2 | = | cos α sin γ + sin α sin β cos γ   cos α cos γ − sin α sin β sin γ    −sin α cos β  | × | y1 |
   | z2 |   | sin α sin γ − cos α sin β cos γ   sin α cos γ + cos α sin β sin γ    cos α cos β   |   | z1 |

   where α, β, and γ are the yaw, pitch, and roll rotation angles, respectively.

3. Convert to the sphere coordinates (ϕ′, θ′) on the global coordinate axes:

   ϕ′ = atan2(y2, x2) × 180° / π
   θ′ = sin⁻¹(z2) × 180° / π
18
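The three steps above can be sketched in Python (a minimal sketch; the function names are illustrative, input angles are in radians, and the rotation matrix is transcribed term-by-term from the slide):

```python
import math

def sphere_to_cartesian(azimuth, elevation):
    # Step 1: sphere coordinates (radians) -> 3D Cartesian unit vector.
    x1 = math.cos(elevation) * math.cos(azimuth)
    y1 = math.cos(elevation) * math.sin(azimuth)
    z1 = math.sin(elevation)
    return (x1, y1, z1)

def rotate(v, yaw, pitch, roll):
    # Step 2: apply the 3D Cartesian rotation (alpha = yaw, beta = pitch,
    # gamma = roll) to move from the local to the global coordinate axes.
    a, b, g = yaw, pitch, roll
    x1, y1, z1 = v
    x2 = math.cos(b)*math.cos(g)*x1 - math.cos(b)*math.sin(g)*y1 + math.sin(b)*z1
    y2 = ((math.cos(a)*math.sin(g) + math.sin(a)*math.sin(b)*math.cos(g))*x1
          + (math.cos(a)*math.cos(g) - math.sin(a)*math.sin(b)*math.sin(g))*y1
          - math.sin(a)*math.cos(b)*z1)
    z2 = ((math.sin(a)*math.sin(g) - math.cos(a)*math.sin(b)*math.cos(g))*x1
          + (math.sin(a)*math.cos(g) + math.cos(a)*math.sin(b)*math.sin(g))*y1
          + math.cos(a)*math.cos(b)*z1)
    return (x2, y2, z2)

def cartesian_to_sphere(v):
    # Step 3: back to sphere coordinates, in degrees.
    x2, y2, z2 = v
    return (math.degrees(math.atan2(y2, x2)), math.degrees(math.asin(z2)))
```

With zero rotation the round trip returns the original sphere location, and any rotation leaves the point on the unit sphere.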
Projection and region-wise packing: concept
• Projection and region-wise packing are the geometric operational processes used at the
content production side to generate 2D video pictures from the sphere signal
• And their inverse operations used in rendering
[Figure: projection and region-wise packing examples for monoscopic and stereoscopic content]
19
Projection
• Projection is a fundamental processing step in 360° video
• OMAF supports two projection types: equirectangular and cubemap
• Descriptions of more projection types can be found in JVET-H1004
20
Equirectangular projection (ERP)
[Figure: the sphere surface is mapped onto a single rectangular picture; azimuth ϕ and elevation θ vary linearly across the picture, with (ϕ, θ) = (0, 0) at the picture center]

Cubemap projection (CMP)
[Figure: the sphere is projected onto six square faces — PX Front, NX Back, PY Left, NY Right, PZ Top, NZ Bottom — arranged in a 3x2 layout, with some faces rotated to maximize face edge continuity]
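The ERP sample-position mapping can be sketched as follows (a sketch under the convention that azimuth decreases left-to-right and elevation decreases top-to-bottom across the picture, with the sphere point (0, 0) at the picture center; the function name is illustrative):

```python
def erp_sphere_to_sample(azimuth_deg, elevation_deg, pic_width, pic_height):
    # Map a sphere location (degrees) to a fractional (column, row) sample
    # position on the equirectangular picture.
    u = 0.5 - azimuth_deg / 360.0    # azimuth spans the full 360 degrees
    v = 0.5 - elevation_deg / 180.0  # elevation spans 180 degrees
    return (u * pic_width, v * pic_height)
```

For a 3840x1920 ERP picture, the sphere point (0, 0) lands at the picture center (1920, 960).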
22
Region-wise packing
Region-wise packing is an optional step after projection (content production side).
It enables manipulations (resize, reposition, rotation, and mirroring) of any rectangular region of the packed picture before encoding.
[Figure: a simple example]
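A toy sketch of the resize-and-reposition part of region-wise packing (nearest-neighbour resampling on a picture stored as a list of rows; transform_type 0, i.e., no rotation or mirroring, is assumed, and the dictionary keys mirror the RectRegionPacking field names):

```python
def pack_rect_region(proj, packed, r):
    # Copy the projected region into the packed region, resizing by
    # nearest-neighbour sampling.
    for py in range(r["packed_reg_height"]):
        for px in range(r["packed_reg_width"]):
            sx = r["proj_reg_left"] + px * r["proj_reg_width"] // r["packed_reg_width"]
            sy = r["proj_reg_top"] + py * r["proj_reg_height"] // r["packed_reg_height"]
            packed[r["packed_reg_top"] + py][r["packed_reg_left"] + px] = proj[sy][sx]
```

Downscaling a 4x4 projected region into a 2x2 packed region keeps every other sample.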
23
Guard band
Basically allows adding some additional pixels at geometric boundaries when generating the 2D pictures for encoding

Basic OMAF video processing steps and order – content production side
Stitching → Rotation → Projection → Frame packing → Region-wise packing → Encoding → Encapsulation
25
Basic OMAF video processing steps and order
- OMAF player side
Inverse of projection → Rotation → Rendering
26
Frame packing for stereoscopic 360° video/image
• OMAF supports the following three types of frame packing arrangement:
• Side-by-side
• Top-bottom
• Temporal interleaving
27
Side-by-side frame packing arrangement
[Figure: the interleaved colour component plane of a side-by-side packed decoded frame is rearranged into the sample planes of constituent frame 0 (left half) and constituent frame 1 (right half), each of which is then upconverted to the full resolution; source: H.265v2(14)_FD.4]
28
Top-bottom frame packing arrangement
[Figure: the interleaved colour component plane of a top-bottom packed decoded frame is rearranged into the sample planes of constituent frame 0 (top half) and constituent frame 1 (bottom half), each of which is then upconverted to the full resolution; source: H.265v2(14)_FD.7]
29
Temporal interleaving frame packing arrangement
[Figure: sequentially decoded frames 2N, 2N+1, 2N+2, 2N+3 carrying a temporal interleaving frame arrangement are decomposed in time into the colour component planes of constituent frames 0 (even positions) and constituent frames 1 (odd positions); source: H.265v2(14)_FD.9]
30
Rendering
• The rendering process typically involves generation of a viewport
• Using the rectilinear projection
[Figure: rectilinear projection of a sphere region onto a 2D viewport plane with local axes u and v, seen from the sphere center O]
• In implementations, the viewport can also be directly generated from the decoded picture
• Where the geometric processing steps like de-packing, inverse of projection, etc. are combined in an optimized manner
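A minimal sketch of the rectilinear step (assumptions: the viewport looks down the +X axis of the OMAF coordinate system, u and v are normalized viewport coordinates in [0, 1], and the function names and field-of-view parameters are illustrative):

```python
import math

def viewport_ray(u, v, hfov_deg, vfov_deg):
    # Unit-length ray direction for a viewport pixel under a pinhole
    # (rectilinear) camera model centred on the +X axis.
    y = (0.5 - u) * 2.0 * math.tan(math.radians(hfov_deg) / 2.0)  # +Y is left
    z = (0.5 - v) * 2.0 * math.tan(math.radians(vfov_deg) / 2.0)  # +Z is up
    n = math.sqrt(1.0 + y * y + z * z)
    return (1.0 / n, y / n, z / n)

def ray_to_sphere(ray):
    # Convert the ray to sphere coordinates (degrees) so the corresponding
    # sample of the projected picture can be fetched.
    x, y, z = ray
    return (math.degrees(math.atan2(y, x)), math.degrees(math.asin(z)))
```

The viewport center maps to the viewing direction itself; the left edge of a 90° horizontal field of view maps to azimuth 45°.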
31
Fisheye 360° video/image
32
Fisheye 360° video/image support in OMAF
• Fisheye is a special feature supported in OMAF
• It does not use projection or region-wise packing
33
OMAF architecture – fisheye omnidirectional video
[Figure: as in the projected-video architecture, but the stitching/rotation/projection/region-wise packing step is replaced by circular image mapping (D) on the content production side, and the OMAF player performs image stitching and rendering after decoding; here D/D’ denote fisheye video pictures, E/E’ coded video or audio bitstreams, and F/F’ ISOBMFF files/segments, with head/eye tracking feeding orientation/viewport metadata back into decapsulation and rendering]
34
Media storage and metadata signalling in ISOBMFF
35
File format basics
Why file formats?
ISOBMFF basics
Typical ISOBMFF box hierarchy
An example ISOBMFF file
36
Media file format basics
A video application always needs more than just the video bitstream.
37
Protocol stack of an HTTP adaptive streaming system
38
Why file formats?
A video application always needs more than just the video bitstream.
− Metadata, including timing information etc., to ease content exchange, editing, streaming, playback
operations like seeking, …
Lots of today’s video applications, e.g., all video streaming systems, are based on a file format.
One of the most widely used standard file formats is the ISO base media file format (ISOBMFF)
− ISO/IEC 14496-12
Each media codec typically has a codec-specific file format based on ISOBMFF, for carriage of
media coded using that codec in ISOBMFF
ISO/IEC 14496-15 includes
− AVC file format
− SVC file format
− MVC file format
− HEVC file format
− Layered HEVC file format
− File format for HEVC and layered HEVC tiled video
39
ISO base media file format (ISOBMFF) basics
Object-oriented files
Based on the data structure called box
− Box type, flags, version, length, box data
Separate media data and metadata
− Media data are the coded media (video, audio, …) data in access units or samples
− Metadata includes media type, codec, timestamps, sample size and location, random access
indications, …
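The box structure can be illustrated with a minimal top-level box walker (a sketch; it only reads the size/type headers and handles the 64-bit largesize and size-zero conventions):

```python
import struct

def walk_boxes(buf):
    """Yield (box_type, box_size, payload_offset) for each top-level box."""
    off = 0
    while off + 8 <= len(buf):
        size, box_type = struct.unpack_from(">I4s", buf, off)
        header = 8
        if size == 1:            # actual size is a 64-bit field after the type
            size = struct.unpack_from(">Q", buf, off + 8)[0]
            header = 16
        elif size == 0:          # box extends to the end of the file
            size = len(buf) - off
        yield box_type.decode("ascii"), size, off + header
        off += size
```

Running it over a buffer containing an 'ftyp' box followed by an 'mdat' box yields the two box types with their sizes and payload offsets.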
40
ISOBMFF – typical box hierarchy
41
ISOBMFF – some boxes
42
ISOBMFF – an example file
43
OMAF signalling in ISOBMFF
General rules for signalling of important information
Overall omnidirectional video indication
Signalling of projection format
Signalling of region-wise packing and guard bands
Signalling of rotation
Signalling of frame packing
Signalling of content coverage
Region-wise quality ranking
Signalling of fisheye video parameters
Storage and signalling of omnidirectional images
Storage and signalling of timed text
OMAF timed metadata
44
General rules for signalling of important information
Important video information: information that may be used for content selection,
e.g., selection of a video track or a part thereof for consumption
Important video information should be signalled in a manner that is easily
accessible, including a location that is easily found (e.g., in sample entry) and
easily parsed (e.g., using fixed length coding instead of entropy coding)
All pieces of important video information should be easily aggregated to be exposed to higher-level systems, e.g., to be aggregated and included in the MIME type ‘codecs’ parameter, for easy access and inclusion in a DASH media presentation description (MPD)
− MPD is also referred to as manifest
45
Overall omnidirectional video indication
To indicate whether the video in a file track is 360° video or traditional video
− This is a piece of important information
− By using a transformed sample entry type (‘resv’) and the restricted scheme type in the sample entry
− Advantage of this approach
− Legacy file parsers would just ignore a 360° video track instead of trying to request, download, or play it, which would result in a bad user experience
If 360° video, whether projected video or fisheye video, by the restricted scheme type
− ‘podv’: projected
− If yes, what type of projected video?
− ‘erpv’: Equirectangular projected video, with essentially no region-wise packing, and other constraints
− ‘ercm’: Packed equirectangular or cubemap projected video, with no restriction on rectangular region-wise packing, and other constraints
− ‘fodv’: fisheye
46
Signalling of projection format
To indicate which projection format is used
− This is also a piece of important information
By using the projection format box (‘prfr’) in the sample entry
The box contains the following structure:
aligned(8) class ProjectionFormatStruct() {
bit(3) reserved = 0;
unsigned int(5) projection_type;
}
projection_type | Omnidirectional projection
0 | Equirectangular projection
1 | Cubemap projection
47
Signalling of region-wise packing and guard bands
To indicate whether region-wise packing and/or guard bands are used
− Another piece of important information
By using the region-wise packing box (‘rwpk’) in the sample entry
The box contains the following structure:
− RegionWisePackingStruct()
48
Signalling of region-wise packing and guard bands
aligned(8) class RegionWisePackingStruct() {
    unsigned int(1) constituent_picture_matching_flag;
    bit(7) reserved = 0;
    unsigned int(8) num_regions;
    unsigned int(32) proj_picture_width;
    unsigned int(32) proj_picture_height;
    unsigned int(16) packed_picture_width;
    unsigned int(16) packed_picture_height;
    for (i = 0; i < num_regions; i++) {
        bit(3) reserved = 0;
        unsigned int(1) guard_band_flag[i];
        unsigned int(4) packing_type[i];
        if (packing_type[i] == 0) {
            RectRegionPacking(i);
            if (guard_band_flag[i])
                GuardBand(i);
        }
    }
}

aligned(8) class RectRegionPacking(i) {
    unsigned int(32) proj_reg_width[i];
    unsigned int(32) proj_reg_height[i];
    unsigned int(32) proj_reg_top[i];
    unsigned int(32) proj_reg_left[i];
    unsigned int(3) transform_type[i];
    bit(5) reserved = 0;
    unsigned int(16) packed_reg_width[i];
    unsigned int(16) packed_reg_height[i];
    unsigned int(16) packed_reg_top[i];
    unsigned int(16) packed_reg_left[i];
}

aligned(8) class GuardBand(i) {
    unsigned int(8) left_gb_width[i];
    unsigned int(8) right_gb_width[i];
    unsigned int(8) top_gb_height[i];
    unsigned int(8) bottom_gb_height[i];
    unsigned int(1) gb_not_used_for_pred_flag[i];
    for (j = 0; j < 4; j++)
        unsigned int(3) gb_type[i][j];
    bit(3) reserved = 0;
}
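For illustration, the fixed-length head of RegionWisePackingStruct can be decoded with a few struct reads (a sketch; the function name is illustrative and per-region RectRegionPacking/GuardBand parsing is omitted):

```python
import struct

def parse_rwpk_head(buf, off=0):
    # First byte: 1-bit constituent_picture_matching_flag + 7 reserved bits.
    flags, num_regions = struct.unpack_from(">BB", buf, off)
    proj_w, proj_h = struct.unpack_from(">II", buf, off + 2)
    packed_w, packed_h = struct.unpack_from(">HH", buf, off + 10)
    return {
        "constituent_picture_matching_flag": flags >> 7,
        "num_regions": num_regions,
        "proj_picture_width": proj_w,
        "proj_picture_height": proj_h,
        "packed_picture_width": packed_w,
        "packed_picture_height": packed_h,
    }
```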
49
Signalling of rotation
To indicate whether and how much rotation is used
− Another piece of important information
By using the rotation box (‘rotn’) in the sample entry
The box contains the following structure:
50
Signalling of frame packing
To indicate whether frame packing is in use, and if yes, what types of frame packing arrangement
− Another piece of important information
By using the stereo video box (‘stvi’) in the sample entry (existing, defined in ISOBMFF, with new
amendment)
The box has the following syntax:
51
Signalling of content coverage
To indicate which area(s) on the sphere are covered by the content
− Another piece of important information, which can be used, e.g., by the OMAF player to choose the right
track(s) that cover the viewport the user is looking at.
By using the coverage information box (‘covi’) in the sample entry
The box contains the following structure:
aligned(8) class ContentCoverageStruct() {
    unsigned int(8) coverage_shape_type;
    unsigned int(8) num_regions;
    unsigned int(1) view_idc_presence_flag;
    if (view_idc_presence_flag == 0) {
        unsigned int(2) default_view_idc;
        bit(5) reserved = 0;
    } else
        bit(7) reserved = 0;
    for (i = 0; i < num_regions; i++) {
        if (view_idc_presence_flag == 1) {
            unsigned int(2) view_idc[i];
            bit(6) reserved = 0;
        }
        SphereRegionStruct(1);
    }
}

aligned(8) SphereRegionStruct(range_included_flag) {
    signed int(32) center_azimuth;
    signed int(32) center_elevation;
    signed int(32) center_tilt;
    if (range_included_flag) {
        unsigned int(32) azimuth_range;
        unsigned int(32) elevation_range;
    }
    unsigned int(1) interpolate;
    bit(7) reserved = 0;
}
52
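For example, an OMAF player could use the signalled coverage to test whether the current viewing direction falls inside a region (a simplified sketch for the shape bounded by two azimuth circles and two elevation circles, with center_tilt assumed 0; all angles in degrees):

```python
def in_sphere_region(az, el, center_azimuth, center_elevation,
                     azimuth_range, elevation_range):
    # Wrap the azimuth difference into (-180, 180] before comparing.
    daz = (az - center_azimuth + 180.0) % 360.0 - 180.0
    return (abs(daz) <= azimuth_range / 2.0
            and abs(el - center_elevation) <= elevation_range / 2.0)
```

The wrap-around handling matters near azimuth ±180°, where a region centred at −170° also covers +170°.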
Sphere region shape types
53
Signalling of region-wise quality ranking
To indicate relative quality of regions on the sphere or on the 2D picture domain
− Another piece of important information
− Can be used by the OMAF player to choose the track(s)
− That cover the viewport the user is currently looking at, and
− That have the highest relative quality among viewports of the same projected picture
− This is particularly useful in viewport-dependent 360° video streaming based on multiple alternative representations
− Multiple streams are encoded; within each, one particular viewport is of high quality and all other areas are of lower quality
− In streaming, when the user turns their head to a different viewport, stream switching occurs
− The goal is to minimize bandwidth and maximize the quality of the viewport being viewed
By using the sphere region quality ranking box (‘srqr’) or the 2D region quality ranking box
(‘2dqr’) in the sample entry
54
Sphere region quality ranking box
aligned(8) class SphereRegionQualityRankingBox extends FullBox('srqr', 0, 0) {
unsigned int(8) region_definition_type;
unsigned int(8) num_regions;
unsigned int(1) remaining_area_flag;
unsigned int(1) view_idc_presence_flag;
unsigned int(1) quality_ranking_local_flag;
unsigned int(4) quality_type;
bit(1) reserved = 0;
if (view_idc_presence_flag == 0) {
unsigned int(2) default_view_idc;
bit(6) reserved = 0;
}
for (i = 0; i < num_regions; i++) {
unsigned int(8) quality_ranking;
if (view_idc_presence_flag == 1) {
unsigned int(2) view_idc;
bit(6) reserved = 0;
}
if (quality_type == 1) {
unsigned int(16) orig_width;
unsigned int(16) orig_height;
}
if ((i < (num_regions − 1)) || (remaining_area_flag == 0))
SphereRegionStruct(1);
}
}
55
2D region quality ranking box
aligned(8) class 2DRegionQualityRankingBox extends FullBox('2dqr', 0, 0) {
unsigned int(8) num_regions;
unsigned int(1) remaining_area_flag;
unsigned int(1) view_idc_presence_flag;
unsigned int(1) quality_ranking_local_flag;
unsigned int(4) quality_type;
bit(1) reserved = 0;
if (view_idc_presence_flag == 0) {
unsigned int(2) default_view_idc;
bit(6) reserved = 0;
}
for (i = 0; i < num_regions; i++) {
unsigned int(8) quality_ranking;
if (view_idc_presence_flag == 1) {
unsigned int(2) view_idc;
bit(6) reserved = 0;
}
if (quality_type == 1) {
unsigned int(16) orig_width;
unsigned int(16) orig_height;
}
if ((i < (num_regions - 1)) || (remaining_area_flag == 0)) {
unsigned int(16) left_offset;
unsigned int(16) top_offset;
unsigned int(16) region_width;
unsigned int(16) region_height;
}
}
}
56
Signalling of fisheye video parameters
To signal fisheye video parameters that can be used to select the desired track(s) and for
rendering
− Another piece of important information
By using the fisheye omnidirectional video box (‘fodv’) in the sample entry
The box contains the following structures:
− Mandatory: FisheyeVideoEssentialInfoStruct(), essential parameters for enabling stitching and rendering at
the OMAF player
− Optional: FisheyeVideoSupplementalInfoStruct(), supplemental parameters for enhanced rendering and
delivery
− They signal
− View dimension information
− Region information of circular images in the coded picture
− Field of view and camera parameters of fisheye lens
− Lens distortion correction (LDC) parameters with local variation of FOV
− Lens shading compensation (LSC) parameters with RGB gains
− Deadzone information
57
Storage and signalling of omnidirectional images
Omnidirectional images are stored in a file as image items per ISO/IEC 23008-12 (High Efficiency Image File Format, HEIF)
Similar information as for omnidirectional video is stored in item properties
− Projection format item property (‘prfr’)
− Region-wise packing item property (‘rwpk’)
− Rotation item property (‘rotn’)
− Frame packing item property (‘stvi’)
− Essential fisheye image item property (‘fovi’)
− Supplemental fisheye image item property (‘fvsi’)
− Coverage information item property (‘covi’)
− Initial viewing orientation item property (‘iivo’)
58
Storage and signalling of timed text
Timed text is used for providing subtitles and closed captions for omnidirectional
video.
In OMAF, the timed text may be either
− Fixed-positioned: not moving as the user’s viewing orientation moves, or
− Always-visible: always visible to the user irrespective of the user’s viewing orientation
Fixed-positioned timed text should be used for text that is specific to a particular object; when that object is not visible, the timed text is not rendered
Always-visible timed text should be used for text that is global to the entire omnidirectional video
59
Storage and signalling of timed text
Two timed text format options
− TTML profiles for Internet media subtitles and captions 1.0
(IMSC1)
− WebVTT
A timed text configuration box (‘otcf’) was designed for both
formats. It contains
− The mode: fixed-positioned or always-visible
− Information for determining the position of the rendering plane in
the 3D space
− A sphere location, the line segment between which and the sphere
center is orthogonal to the rendering plane
− Depth of the rendering plane relative to the sphere center
The size of the rectangle on the rendering plane for the
timed text is signalled as part of the IMSC1/WebVTT track
60
OMAF timed metadata
Timed metadata are contained in their own tracks (separate from the media
tracks)
A timed metadata track is linked to media tracks by a 'cdsc' track reference
OMAF includes the designs of three types of timed metadata tracks
− Initial viewing orientation
− Recommended viewport
− Timed text sphere location metadata
They are about sphere regions or a sphere location (a point on the sphere)
They all use the same sample entry syntax and the same base sample syntax
− The syntaxes were designed in a manner that they can be used to efficiently represent both
sphere regions and sphere locations
61
OMAF timed metadata sample entry and sample syntaxes
64
Recommended viewport
Identified by the sample entry type ‘rcvp’
Sphere region shape type 0
Two specified recommended viewport types
Recommended viewport type | Description
0 | A recommended viewport per the director's cut, i.e., a viewport suggested according to the creative intent of the content author or content provider
1 | A recommended viewport selected based on measurements of viewing statistics
2..239 | Reserved (for use by future extensions of ISO/IEC 23090-2)
65
Timed text sphere location metadata
Signals the following information (which is also
signalled in the timed text configuration box) in
a timely dynamic fashion:
− Information for determining the position of the
rendering plane in the 3D space
− A sphere location, the line segment between which and
the sphere center is orthogonal to the rendering plane
− Depth of the rendering plane relative to the sphere
center
66
Media encapsulation and metadata
signalling in DASH
DASH basics
OMAF DASH delivery architecture and procedure
OMAF signalling in DASH
67
DASH basics
A simple example DASH streaming procedure
Why adaptive streaming over HTTP?
Scalability and cost: leveraging HTTP caches
DASH data model
Example DASH Representation and Segments for ISOBMFF
68
A simple example DASH streaming procedure
1) The client gets the MPD.
2) The client requests the desired representation(s), one segment (or a part thereof) at a time,
• based on information in the MPD and the client's local information, e.g., network bandwidth, decoding/display capabilities, and/or user preference.
69
Why adaptive streaming over HTTP?
Basic Approach: Adapt Video to Web rather than Changing the Web
70
Scalability and cost: leveraging HTTP caches
71
DASH Data Model
• MPD: provides information to a client on where and when to find the data that composes the A/V experience
• HTTP-URLs and MIME types: provide the ability to offer a service on the cloud and HTTP-CDNs
• Periods: provide a service provider the ability to combine/splice content with different properties into a single media presentation
• Adaptation Sets: enable the client/user selection of media content components based on user preferences, user interaction, device profiles and capabilities, using conditions or other metadata
• Representations: provide the ability to offer the same content with different encodings (bitrate, resolution, codecs)
• Descriptors: provide extensible syntax and semantics for describing Representation and Adaptation Set properties
• Segments and Subsegments: provide the ability to access content in small pieces and do proper scheduling of access
• Playlist, Templates, Segment Index: provide the ability for efficient signaling and deployment-optimized addressing
72
DASH Data Model
[Figure: an example MPD with three Periods (id 1 starting at 0 s, id 2 at 100 s, id 3 at 300 s); Period 2 contains Adaptation Set 0 (Turkish subtitles) and Adaptation Set 1 (video, BaseURL=http://abr.rocks.com/) with Representation 1 at 500 Kbps, Representation 2 at 1 Mbps, and Representation 3 at 2 Mbps/720p; for Representation 3, segment access uses the initialization segment http://abr.rocks.com/3/0.mp4 and 10 s media segments (start = 0 s, 10 s, …) addressed via the template 3/$Number$.mp4, e.g., http://abr.rocks.com/3/1.mp4 and http://abr.rocks.com/3/2.mp4]

Media delivery:
• Splicing of arbitrary content, e.g., ad insertion
• Selection of components/tracks based on properties
• Selecting/switching of Representation based on bandwidth, etc.
• Well-defined media format, i.e., ISO BMFF or MPEG-2 TS
• Chunks with unique addresses and associated timing
73
Example DASH Representation and Segments for ISOBMFF
[Figure: a Representation consists of an Initialization Segment (ftyp, moov) followed by Media Segments, each containing one or more moof/mdat pairs]
74
OMAF DASH delivery architecture and procedure
[Figure: on the content production side, DASH MPD generation (G) produces the MPD alongside the initialization and media segments (Fs) placed on a DASH server for DASH delivery; the OMAF player performs DASH MPD and segment reception (F’s), driven by head/eye tracking that supplies orientation/viewport metadata]

Fs/F’s: Initialization and media segments
G: MPD
− It additionally includes OMAF-specific metadata, such as information on projection and region-wise packing

Basic OMAF DASH streaming procedure
1) The client gets the MPD.
2) The client obtains the current viewing orientation and gets the estimated bandwidth.
3) The client chooses the Adaptation Set(s) and the Representation(s), and requests the (Sub)Segments to match the client’s capabilities, incl. OMAF-specific capabilities, and to maximize the quality, under the network bandwidth constraints, for the current viewing orientation.
4) Repeat steps 2 and 3.
75
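Step 3 of the procedure can be sketched as a pure selection function (hypothetical representation metadata: each Representation is a dict carrying its bitrate and the sphere region where its quality is highest, as could be derived from an SRQR descriptor; all keys are illustrative):

```python
def choose_representation(representations, view_az, view_el, bandwidth_bps):
    def covers(r):
        # Does the representation's high-quality region contain the viewport center?
        daz = (view_az - r["hq_center_az"] + 180.0) % 360.0 - 180.0
        return (abs(daz) <= r["hq_az_range"] / 2.0
                and abs(view_el - r["hq_center_el"]) <= r["hq_el_range"] / 2.0)

    affordable = [r for r in representations if r["bitrate"] <= bandwidth_bps]
    matching = [r for r in affordable if covers(r)]
    if matching:                  # best quality for the current viewport
        return max(matching, key=lambda r: r["bitrate"])
    if affordable:                # fall back to any stream that fits
        return max(affordable, key=lambda r: r["bitrate"])
    return min(representations, key=lambda r: r["bitrate"])
```

When the bandwidth drops, the function falls back to a lower-bitrate Representation covering the same viewport.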
OMAF signalling in DASH
OMAF-specific information carried on file format level and needed for content selection
is also carried in the MPD
− Thus either the same or a subset of that in file format
By using the newly defined OMAF-specific DASH MPD descriptors
− All under the URN "urn:mpeg:mpegI:omaf:2017"
− Projection format (PF) descriptor
− Region-wise packing (RWPK) descriptor
− Content coverage (CC) descriptor
− Spherical region-wise quality ranking (SRQR) descriptor
− 2D region-wise quality ranking (2DQR) descriptor
− Fisheye omnidirectional video (FOMV) descriptor
The frame packing information is signalled using the existing DASH FramePacking
element.
76
Media profiles and presentation profiles
77
Media profiles
A media profile for timed media is defined as requirements and constraints for a set of
one or more ISOBMFF tracks of a single media type.
The conformance of a set of one or more ISOBMFF tracks to a media profile is specified
as a combination of:
− Specification of which sample entry type(s) are allowed, and which constraints and extensions are
required in addition to those imposed by the sample entry type(s).
− Constraints on the samples of the tracks, typically expressed as constraints on the elementary
stream contained within the samples of the tracks.
A media profile for static media is defined as requirements and constraints for a set of
one or more ISOBMFF items of a single media type.
The conformance of a set of one or more ISOBMFF items to a media profile is specified
as a combination of:
− Specification of which item type(s) are allowed, and which constraints and extensions are required
in addition to those imposed by the item type(s).
− Constraints on the content of the items, typically expressed as constraints on the elementary
stream contained within the items.
78
Presentation profiles
A presentation profile is defined as requirements and constraints for an
ISOBMFF file containing tracks or items of any number of media types.
A specification of a presentation profile should refer to the specified media
profiles and may include additional requirements or constraints.
A file conforming to a presentation profile typically provides an omnidirectional
audio-visual experience.
79
OMAF specifies 9 media profiles
3 video profiles
− HEVC-based viewport-independent OMAF video profile
− HEVC-based viewport-dependent OMAF video profile
− AVC-based viewport-dependent OMAF video profile
2 audio profiles
− OMAF 3D audio baseline profile
− OMAF 2D audio legacy profile
2 image profiles
− OMAF HEVC image profile
− OMAF legacy image profile
2 timed text profiles
− OMAF IMSC1 timed text profile
− OMAF WebVTT timed text profile
80
OMAF video media profiles
Media Profile | Codec | Profile | Level | Required Scheme Types | Brand
HEVC-based viewport-independent OMAF video profile | HEVC | Main 10 | 5.1 | podv and erpv | hevi
HEVC-based viewport-dependent OMAF video profile | HEVC | Main 10 | 5.1 | podv and at least one of erpv and ercm | hevd
AVC-based viewport-dependent OMAF video profile | AVC | Progressive High | 5.1 | podv and at least one of erpv and ercm | avde

Note that HEVC Level 5.1 supports up to 3840x2160 @ 64 fps and 4096x2160 @ 60 fps, and the Main 10 profile does not exclude support of HDR/WCG video.

Key differences between the two HEVC-based OMAF video media profiles:
1) The viewport-dependent profile supports unconstrained region-wise packing while the other does not.
2) The viewport-dependent profile supports file format extractors to get a conforming HEVC bitstream when tile-based streaming is used (while the other does not).
81
OMAF audio media profiles
Media Profile | Codec | Profile | Level | Max Sampling Rate | 3D Metadata | Brand
OMAF 3D audio baseline profile | MPEG-H Audio | Low Complexity | 1, 2 or 3 | 48 kHz | included in codec | oabl
OMAF 2D audio legacy profile | AAC | HE-AACv2 | 4 | 48 kHz | no 3D metadata | oa2d
82
OMAF image media profiles
83
OMAF timed text media profiles
84
OMAF specifies 2 presentation profiles
OMAF viewport-independent baseline presentation profile
− File brand: ‘ovdp’
− Video: At least one video track shall conform to the HEVC-based viewport-independent OMAF video profile
− Audio: At least one audio track shall conform to the OMAF 3D audio baseline profile
OMAF viewport-dependent baseline presentation profile
− File brand: ‘ompp’
− Video: At least one video track shall conform to the HEVC-based viewport-dependent OMAF video profile
− Audio: At least one audio track shall conform to the OMAF 3D audio baseline profile
85
HEVC omnidirectional video SEI messages
86
HEVC omnidirectional video SEI messages
Both of the OMAF HEVC-based video media profiles mandate the presence of SEI
messages for signalling of projection, region-wise packing, etc.
This is to enable OMAF player implementations that rely on elementary-stream level
signalling for rendering of omnidirectional video.
The following omnidirectional video SEI messages have been recently specified for
HEVC (see JCTVC-AC1005):
− Equirectangular projection SEI message
− Cubemap projection SEI message
− Sphere rotation SEI message
− Region-wise packing SEI message
− Omnidirectional viewport SEI message
These SEI messages and the corresponding OMAF signalling are basically aligned with
each other.
These SEI messages are expected to be ported to the AVC/H.264 specification soon.
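Since these profiles rely on elementary-stream level signalling, a player first has to identify which SEI message it is looking at. HEVC codes an SEI payloadType as a run of 0xFF bytes plus one final byte. Below is a minimal sketch of that decoding; the value 155 for the region-wise packing SEI is an assumption about the payloadType assignment in JCTVC-AC1005, not something stated on this slide:

```python
def parse_sei_payload_type(rbsp: bytes) -> int:
    """Decode the payloadType of the first SEI message in an SEI RBSP.
    HEVC codes it as zero or more 0xFF bytes (each adding 255)
    followed by one final byte."""
    pt, i = 0, 0
    while rbsp[i] == 0xFF:
        pt += 255
        i += 1
    return pt + rbsp[i]

# payloadType 155 (assumed value for the region-wise packing SEI) is
# below 255, so it is coded as the single byte 0x9B:
print(parse_sei_payload_type(bytes([155])))        # 155
# A type >= 255 uses the 0xFF-prefix coding, e.g. 300 = 255 + 45:
print(parse_sei_payload_type(bytes([0xFF, 45])))   # 300
```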
87
Equirectangular projection SEI message
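For intuition about what this SEI message signals, a sketch of the equirectangular sample-to-sphere mapping used by OMAF (sample centres at (i+0.5, j+0.5), azimuth decreasing left to right, elevation decreasing top to bottom) can be written as follows; treat it as an illustration of the mapping, not the normative derivation:

```python
def erp_sample_to_sphere(i, j, width, height):
    """Map the centre of ERP luma sample (i, j) in a width x height
    picture to sphere coordinates (azimuth, elevation) in degrees."""
    u = (i + 0.5) / width
    v = (j + 0.5) / height
    azimuth = (0.5 - u) * 360.0     # range (-180, 180)
    elevation = (0.5 - v) * 180.0   # range (-90, 90)
    return azimuth, elevation

# Top-left sample of a tiny 4x2 ERP picture:
print(erp_sample_to_sphere(0, 0, 4, 2))   # (135.0, 45.0)
```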
88
Cubemap projection SEI message
89
Sphere rotation SEI message
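The sphere rotation SEI message carries yaw, pitch, and roll angles: yaw about the z axis, pitch about the y axis, and roll about the x axis. A sketch of building the corresponding rotation matrix is below; the composition order R = Rz(yaw) · Ry(pitch) · Rx(roll) is one common convention and is an assumption here, so check the normative text before relying on it:

```python
import math

def rotation_matrix(yaw, pitch, roll):
    """Return the 3x3 matrix Rz(yaw) @ Ry(pitch) @ Rx(roll),
    angles in degrees."""
    a, b, c = (math.radians(v) for v in (yaw, pitch, roll))
    rz = [[math.cos(a), -math.sin(a), 0],
          [math.sin(a),  math.cos(a), 0],
          [0, 0, 1]]
    ry = [[math.cos(b), 0, math.sin(b)],
          [0, 1, 0],
          [-math.sin(b), 0, math.cos(b)]]
    rx = [[1, 0, 0],
          [0, math.cos(c), -math.sin(c)],
          [0, math.sin(c),  math.cos(c)]]

    def matmul(m, n):
        return [[sum(m[i][k] * n[k][j] for k in range(3))
                 for j in range(3)] for i in range(3)]

    return matmul(rz, matmul(ry, rx))

# A 90-degree yaw maps the x axis (front) onto the y axis:
R = rotation_matrix(90, 0, 0)
```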
90
Region-wise packing SEI message
regionwise_packing( payloadSize ) {                       Descriptor
  rwp_cancel_flag                                         u(1)
  if( !rwp_cancel_flag ) {
    rwp_persistence_flag                                  u(1)
    constituent_picture_matching_flag                     u(1)
    rwp_reserved_zero_5bits                               u(5)
    num_packed_regions                                    u(8)
    proj_picture_width                                    u(32)
    proj_picture_height                                   u(32)
    packed_picture_width                                  u(16)
    packed_picture_height                                 u(16)
    for( i = 0; i < num_packed_regions; i++ ) {
      rwp_reserved_zero_4bits[ i ]                        u(4)
      transform_type[ i ]                                 u(3)
      guard_band_flag[ i ]                                u(1)
      proj_region_width[ i ]                              u(32)
      proj_region_height[ i ]                             u(32)
      proj_region_top[ i ]                                u(32)
      proj_region_left[ i ]                               u(32)
      packed_region_width[ i ]                            u(16)
      packed_region_height[ i ]                           u(16)
      packed_region_top[ i ]                              u(16)
      packed_region_left[ i ]                             u(16)
      if( guard_band_flag[ i ] ) {
        left_gb_width[ i ]                                u(8)
        right_gb_width[ i ]                               u(8)
        top_gb_height[ i ]                                u(8)
        bottom_gb_height[ i ]                             u(8)
        gb_not_used_for_pred_flag[ i ]                    u(1)
        for( j = 0; j < 4; j++ )
          gb_type[ i ][ j ]                               u(3)
        rwp_gb_reserved_zero_3bits[ i ]                   u(3)
      }
    }
  }
}
91
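For transform_type 0 (no rotation or mirroring), the region fields above define a simple linear resampling between the packed and projected pictures. A sketch of mapping a packed-region sample back to the projected picture is below; the dict-based region record and its values are illustrative, not from the specification, and other transform_type values would apply rotation/mirroring first:

```python
def packed_to_projected(x, y, r):
    """Map sample (x, y) of a packed region back to the projected
    picture, assuming transform_type 0 (no rotation/mirroring).
    `r` holds the region fields from the regionwise_packing syntax."""
    dx = (x - r["packed_region_left"]) * r["proj_region_width"] / r["packed_region_width"]
    dy = (y - r["packed_region_top"]) * r["proj_region_height"] / r["packed_region_height"]
    return r["proj_region_left"] + dx, r["proj_region_top"] + dy

# Hypothetical region: a 960x960 packed area downscaled from a
# 1920x1920 area of the projected picture.
region = {
    "packed_region_left": 0, "packed_region_top": 0,
    "packed_region_width": 960, "packed_region_height": 960,
    "proj_region_left": 1920, "proj_region_top": 0,
    "proj_region_width": 1920, "proj_region_height": 1920,
}
print(packed_to_projected(480, 480, region))   # (2880.0, 960.0)
```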
Omnidirectional viewport SEI message
omni_viewport( payloadSize ) {                            Descriptor
  omni_viewport_id                                        u(10)
  omni_viewport_cancel_flag                               u(1)
  if( !omni_viewport_cancel_flag ) {
    omni_viewport_persistence_flag                        u(1)
    omni_viewport_cnt_minus1                              u(4)
    for( i = 0; i <= omni_viewport_cnt_minus1; i++ ) {
      omni_viewport_azimuth_centre[ i ]                   i(32)
      omni_viewport_elevation_centre[ i ]                 i(32)
      omni_viewport_tilt_centre[ i ]                      i(32)
      omni_viewport_hor_range[ i ]                        u(32)
      omni_viewport_ver_range[ i ]                        u(32)
    }
  }
}
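The i(32) centre fields above are fixed-point values; to my understanding they are coded in units of 2^-16 degrees (an assumption worth verifying against JCTVC-AC1005). Under that assumption, converting them to degrees is a one-line scaling:

```python
def viewport_degrees(azimuth_q16, elevation_q16, tilt_q16):
    """Convert the i(32) viewport centre fields to degrees, assuming
    they are coded in units of 2^-16 degrees."""
    scale = 1.0 / (1 << 16)
    return (azimuth_q16 * scale, elevation_q16 * scale, tilt_q16 * scale)

# e.g. an azimuth of 90 degrees would be coded as 90 * 65536 = 5898240
print(viewport_degrees(5898240, 0, 0))   # (90.0, 0.0, 0.0)
```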
92
Viewport-dependent omnidirectional video processing
93
Viewport-dependent omnidirectional video processing
Multiple approaches documented in the informative Annex D of OMAF
− To tackle the bandwidth and processing complexity challenges
− To utilize the fact that only a part of entire encoded sphere video is rendered at any moment
− The approaches are enabled by the video media profiles
Viewport-dependent omnidirectional video processing approaches
(Some of these approaches are not documented in the final version of the OMAF specification, and some approaches documented in the final version are not included here)
− The conventional approach
− Region-wise quality ranked encoding of omnidirectional content
− Merging of HEVC MCTS-based sub-picture tracks of the same resolution
− Merging of HEVC MCTS-based sub-picture tracks of different resolutions with multiple decoders
− Merging of HEVC MCTS-based sub-picture tracks of different resolutions with one decoder
− SHVC with MCTS-based enhancement layer
− Simulcast with MCTS-based HEVC high-resolution representation
94
360° video encoding and decoding – conventional
95
Region-wise quality ranked encoding of omnidirectional content
Multiple coded single-layer bitstreams are stored at a server in different tracks.
Each bitstream contains the whole omnidirectional video.
Each bitstream has a different high quality encoded region.
[Figure: five encodings of the same content, each with a different high-quality (HQ) region]
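The player then streams whichever bitstream's HQ region best matches the current viewing orientation. A minimal sketch of that selection logic follows; the track names and the layout of five HQ regions spaced 72° apart in azimuth are hypothetical, not from the OMAF specification:

```python
def pick_track(viewport_azimuth, tracks):
    """Choose the track whose high-quality region centre is closest
    (in azimuth, wrapping at +/-180 degrees) to the viewport centre.
    `tracks` maps a track id to its HQ-region centre azimuth."""
    def wrap_dist(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    return min(tracks, key=lambda t: wrap_dist(viewport_azimuth, tracks[t]))

# Five hypothetical encodings with HQ regions every 72 degrees:
tracks = {f"track_{k}": -180.0 + 72.0 * k + 36.0 for k in range(5)}
print(pick_track(10.0, tracks))   # track_2 (HQ region centred at 0)
```

A real player would also consider elevation and hysteresis to avoid switching tracks on every small head movement.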
96
Merging of HEVC MCTS-based sub-picture tracks of the same resolution
97
Merging of HEVC MCTS-based sub-picture tracks of different resolutions with multiple decoders
98
Merging of HEVC MCTS-based sub-picture tracks of different resolutions with one decoder
99
SHVC with MCTS-based enhancement layer
101
Acknowledgements
Thanks to the MPEG OMAF ad-hoc group members and others who contributed
to the development of the OMAF standard.
Thanks to Miska Hannuksela and Sachin Deshpande for helping me chair the review of,
and decision-making on, a number of OMAF proposals when needed (particularly those
from myself or my company).
Thanks to Miska Hannuksela, Thomas Stockhammer, Byeongdoo Choi, Sachin
Deshpande, Yago Sanchez, Robert Skupin, Alexandre Gabriel, Imed Bouazizi,
Cyril Concolato, et al., who drafted some of the figures and/or slides that I used
as is or with modifications in this slide deck.
102
References
ISO/IEC 23090-2: Information technology — Coded representation of immersive media
(MPEG-I) — Part 2: Omnidirectional media format
− The finalized FDIS text will be included in MPEG output document N17235.
− The latest work-in-progress draft versions of the FDIS text are included in MPEG input
document m41922.
ISO/IEC 14496-12, Information technology — Coding of audio-visual objects — Part 12:
ISO base media file format
ISO/IEC 14496-15, Information technology — Coding of audio-visual objects — Part 15:
Carriage of network abstraction layer (NAL) unit structured video in the ISO base media
file format
ISO/IEC 23009-1, Information technology — Dynamic adaptive streaming over HTTP
(DASH) — Part 1: Media presentation description and segment formats
JVET-H1004, Algorithm descriptions of projection format conversion and video quality
metrics in 360Lib version 5.
JCTVC-AC1005, HEVC additional supplemental enhancement information (draft 4)
103
Thank you
Follow us on:
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
105