Discussion:
[xiph-rtp] Chaining
Luca Barbato
2005-08-26 19:28:18 UTC
Probably not everybody will like this proposal, but for simplicity's sake
it should be evaluated alongside the others.

1 Each vorbis-rtp stream will map to just ONE vorbis stream; that means
that chaining in RTP isn't allowed at all.

2 Once the first chained vorbis stream ends, the server will either just reset
the session parameters, if it uses that model, or completely renegotiate the
connection.

How you choose to use vorbis-rtp is, again, implementation dependent (with
SIP and RTSP it won't be a large overhead), but this way is far simpler to
handle since there is no additional logic, and it would support everything
in a quite scalable way.

If possible I'd like to have a list of planned uses for vorbis-rtp (and
theora-rtp), so we can find out where this simple approach won't work and a
more complex solution is required.

lu
--
Luca Barbato

Gentoo/linux Developer Gentoo/PPC Operational Leader
http://dev.gentoo.org/~lu_zero
Ralph Giles
2005-08-27 17:15:43 UTC
Post by Luca Barbato
Probably not everybody would like that proposal, but for simplicity sake
it should be valued as the others.
1 Each vorbis-rtp stream will map just to ONE vorbis stream, that mean
that chaining in rtp isn't allowed at all.
2 Once the first chained vorbis ends the server will just reset the
session parameters, if uses that model, or completely renegotiate the
connection.
The server can also just transcode. We expect RTP transmission from a
chained Ogg stream to be something of an edge case. A lot of stations
will be encoding either directly from a live production feed (only
one stream in any case) or from a batch encode, for which ensuring
codebook uniformity isn't such a big issue. That just leaves casual
users with a heterogeneous Ogg collection on disk.

Can you explain a bit more about how the server would send the new
session parameters? Is it possible to have that work and keep gapless
playback?
Post by Luca Barbato
If possible I'd like to have a list of planned uses for vorbis-rtp (and
theora-rtp), so it could be possible to find out where that simple
approach won't work and a complex solution is required.
Not supporting chaining was in fact the original suggestion, made
initially by Jack about a year ago. If we'd done that we could have
been all finished six months ago. :)

To recap:

As I said above, I don't find support of transmission of chained Ogg
streams per se all that compelling. The whole playlist-based icecast
pseudo-stream isn't worth the pain; it's better to solve the issue
by making the source simplify the stream than to have the server
and decoders support all this complexity.

For me the two persuasive arguments were:

1. Adaptive bitrate switching. In a unicast RTP setting, the server
can use packet loss statistics to dynamically adjust the bitrate
sent to individual clients. In the case of configurable codecs like
Vorbis and Theora, this means being able to change the codebooks,
and even things like samplerate/framerate and image size (though
the player should rescale to avoid popping in the latter case.)

Aaron essentially told us this was a requirement for Real, since
it's already a feature with their native codecs (though they
have fixed codebooks, so chaining isn't painful). That's why
we went with chaining support.

2. Video resampling is much more expensive and artefact-prone
than audio resampling, so at least in the medium term, it is
attractive to be able to use chaining in Theora to support
interleave of e.g. film and video without having to do format
conversion. This isn't compelling on its own, but makes reason
1 less lonely. :)

Anyway, that was the reasoning behind the decision. I don't
see any particular reason to revisit it unless you have either
a much simpler method to achieve equivalent results, or a
good argument why Real's requirements aren't worth addressing.

-r
Phil Kerr
2005-08-27 20:06:43 UTC
Post by Ralph Giles
Not supporting chaining was in fact the original suggestion, made
initially by Jack about a year ago. If we'd done that we could have
been all finished six months ago. :)
Ralph,

I must protest in the strongest terms about your continuing line that
the development work carried out by me went against Xiph.

You have previously made snide remarks about such issues as the CRC32
field, insinuating the inclusion of this feature in submitted I-D's went
"off the rails" and was unsanctioned by Xiph. I have pointed you to the
discussions carried out on this list where we reached a consensus,
everything was done in public and even a "last-call" was made.

When we discussed the issue of chaining on this list on 20th October
last year you were probably the only one who was against making the best
possible effort to support chaining.

Your comment above gives the impression that there was a
constant discussion against chaining by Jack. I cannot find any
objection posted by him on the Xiph-RTP list, in fact he posted here
only a few times (but I'm sure he read everything).

Please Ralph, stop what is essentially back-biting. You only started to
raise objections in February *after* the last I-D was submitted to the
IETF.

Best regards

Phil
Ralph Giles
2005-08-27 22:06:00 UTC
Post by Phil Kerr
I must protest in the strongest terms about your continuing line that
the development work carried out by me went against Xiph.
Whoa. I didn't mean to imply anything of the sort. I very much
appreciate the work that you did to update the draft and submit
it to the IETF, and on the reference implementation.

Please don't misconstrue what I said. I was explaining my
reasoning for the decision for chaining support, from which
most of the discussion on this list has arisen, because it's
difficult.

-r
Luca Barbato
2005-08-27 23:49:57 UTC
Post by Ralph Giles
The server can also just transcode. We expect RTP transmission from a
chained Ogg stream to be something of an edge case. A lot of stations
will be encoding either directly from a live production feed (only
one stream in any case) or from a batch encode, for which ensuring
codebook uniformity isn't such a big issue. That just leaves casual
users with a heterogeneous Ogg collection on disk.
In the case of live/pseudo-live feeds there is just one stream, so the problem
doesn't apply, regardless of whether RTSP or something else is used for control.
Post by Ralph Giles
Can you explain a bit more about how the server would send the new
session parameters? Is it possible to have that work and keep gapless
playback?
If you are using RTSP I'm pretty confident you can have gapless playback
of chained or unchained Vorbis in RTP (the server can push to the client the
new metadata associated with the next stream and/or signal that the client
should prepare to switch to the other stream). From the plain RTP point of
view (e.g. with only in-band metadata marking where the next file follows the
previous one), it would require a large buffer and some sort of lookahead
logic to do some sort of crossfade, or else you'll always incur the time to
reinit the decoder with the new information.

That said, in the RTP draft I'd keep one Vorbis stream per RTP stream
and move the discussion of chaining to the session/control protocol (RTSP,
SIP, or whatever) your application wants to use.
Post by Ralph Giles
Not supporting chaining was in fact the original suggestion, made
initially by Jack about a year ago. If we'd done that we could have
been all finished six months ago. :)
The question is where to support chaining. I'd prefer to support it using
the control/session protocol, since that would make some features simpler to
implement.
Post by Ralph Giles
1. Adaptive bitrate switching. In a unicast RTP setting, the server
can use packet loss statistics to dynamically adjust the bitrate
sent to individual clients. In the case of configurable codecs like
Vorbis and Theora, this means being able to change the codebooks,
and even things like samplerate/framerate and image size (though
the player should rescale to avoid popping in the later case.)
That would require RTCP for QoS and RTSP to push the right configuration
and do the switch, or other equivalent protocols.
Post by Ralph Giles
Aaron essentially told us this was a requirement for Real, since
it's already a feature with their native codecs (though they
have fixed codebooks, so chaining isn't painful). That's why
we went with chaining support.
If the "chained" stream is just a transcoded pseudo live stream won't be
a problem since would be just one codebook
Post by Ralph Giles
2. Video resampling is much more expensive and artefact-prone
than audio resampling, so at least in the medium term, it is
attractive to be able to use chaining in Theora to support
interleave of e.g. film and video without having to do format
conversion. This isn't compelling on its own, but makes reason
1 less lonely. :)
The same problems and issues would apply and the same solutions should work.
Post by Ralph Giles
Anyway, that was the reasoning behind the decision. I don't
see any particular reason to revisit it unless you have either
a much simpler method to achieve equivalent results, or a
good argument why Real's requirements aren't worth addressing.
Everything depends on the application and on the protocols the application
has available; that's why I requested a list of planned applications
and scenarios.


So far I could think about:

1 Netradio using rtp/rtsp:
- Chaining is required to make the playlist look like a flat stream; you
can achieve the same result with dynamic stream switching and having the
client support crossfading.

- It would probably be multicast, so the RTCP information would be
meaningful only if you have a load-balancing setup with "twin" servers or
provide content using "repeaters" as a particular case.


2 Conference/VoIP using rtp/rtsp or rtp and sip:
- You won't have the problem of chaining, but you'll have the problem of
syncing many different streams.

- You may want to dynamically switch bitrate based on QoS feedback. If the
stream could be optimized for bitrate peeling, that would be quite
interesting and quite inexpensive to implement.

In both cases, adding video, subtitle, or other RTP streams to the audio
scenario just adds the problem of synchronization.

I hope that clarifies my point of view.

lu
--
Luca Barbato

Gentoo/linux Developer Gentoo/PPC Operational Leader
http://dev.gentoo.org/~lu_zero
Aaron Colwell
2005-08-29 00:37:09 UTC
Hi Luca,

Welcome to the discussion and thank you for taking on this work. I would have
liked to do this myself, but I couldn't guarantee getting it done in a timely
manner. Thank you for stepping up to the challenge.

I've spent the weekend thinking over the need for chaining in the RTP spec. I
may be leaning towards dropping it as well. First, I'll outline the use cases
that Real has typically used RTP for. Then I'll describe the Helix rate
adaptation since Ralph mentioned it in another email. Finally I'll outline
why we may not want to do chaining after all.

Use cases:
- On-demand playback of static files.

* Files may be chained. The solution should provide a way to play back any
valid .ogg file.

* Since many clients can access the same file at different times, transcoding
to a common set of parameters kills server scalability.

- Live broadcast from a camera and/or mic.
* Usually the encoding parameters are static so chaining support is not
needed.

- Simulated Live broadcast. This is where you take a playlist of static files
and broadcast them as a "live" stream.

* On the output side you don't necessarily need to support chaining if
you transcode on the input side. Transcoding might be an acceptable option
since you only have to do it once independent of the number of clients
connected to the stream.

- Forward channel only broadcast. This is basically a live or simulated live
feed that has no backchannel. This means that HTTP requests for codebooks
are not allowed. The main example of this would be satellite distribution.


Comments on rate adaptation:

I didn't read over all my historical comments about this topic before writing
this so please don't flame me if I contradict myself. This represents my
current thinking.

Here is how the Real/Helix system currently does rate adaptation. When encoding
content you select a set of bitrates for the audio and video. The encoder then
creates independent streams for each of the bitrates. A multi-rate A/V file
has 2 logical streams, one audio and one video. Each of those logical streams
has a set of physical streams, one for each bitrate. Each of the physical
streams is assigned a rule number. (Technically each physical stream has 2
rules, 1 keyframe and 1 non-keyframe, but that isn't overly important for this
discussion). Each logical stream has a "rule book" that tells the media engine
what rules to select when different connection bitrates are detected. The
rule book contains a set of expressions for each rule. These expressions are
evaluated periodically during playback and control which physical stream is
sent. The Real/Helix adaptation mechanism was originally designed for our
proprietary RDT transport protocol, but we adapted it to RTP when we started
doing multicast transmission. Since we can switch physical streams at any point
during playback we needed a way to identify what physical stream the packet
data is associated with. In RDT we have a ruleID field for each packet. When
we stream over RTP we add an RTP header extension that contains this ruleID.
On the client side we use this ruleID to demultiplex the different bitstreams
and handle the stream switches. Historically most of our rate adaptation has
been client driven. The client would monitor the connection throughput and
send subscription change messages to change the physical streams. This allowed
the client to know when the proper time was to cross-fade between physical streams.

Whew.. did you get all that? Here are a few things to note about the physical
streams.

- All codec configuration data for each physical stream is sent in the SDP.
This data only takes up a few hundred bytes max. This data usually contains
data equivalent to the ident headers for the Xiph codecs. The codebooks
are fixed.

- In the case of audio different codecs may be used. For low bitrates a
voice codec may be used and a music codec could be used for higher bitrates.

- In the case of video the codec isn't different. Frame size is constant, but
frame rate isn't necessarily.

- Codecs for all physical streams are initialized when the client receives the
SDP, so they are always ready when data arrives.

Even though we have a system that works, we aren't doing a few things the
"RTP way". This was done way before I was involved with the Helix code so
please don't flame the messenger. :) I'm mentioning these because I think
they are examples of things we shouldn't do for the Xiph RTP specs.

- We interleave different codecs into the same RTP session. This is a fuzzy
area of the RTP spec. Supposedly you can have multiple payload types in a
single session, but I've never got a clear answer out of the IETF about
what that is supposed to mean.

- We don't actually use the sample rate of the media data for the RTP
timestamps. All our RA/RV streams over RTP use a 1000Hz clock. The only
magic about this rate is that the core keeps track of time in milliseconds.
Technically we should be using the audio sample rate for audio. For video we
should probably be using 90000Hz like all the other video payload formats.

- We use an RTP header extension to transport our ruleID. Sure the extension
is part of the spec, but I don't know of any other payload that uses it. We
should have made it part of the payload.

- We use a non-standard SET_PARAMETER RTSP request to control physical stream
selection.

Why we might want to revisit the chaining question:

I've been thinking about this quite a bit this weekend. I've also been thinking
about all the complexity we've talked about just to support chained files.
Is it really necessary? I'm not so sure anymore.

One of my main arguments in the past is that I wanted an on-demand server to
be able to deliver any valid .ogg file over RTP. I focused mainly on trying to
cram the chained streams into RTP sessions. I think all the complexity that
I/we created was a sign that we were trying to fit a round peg into a square
hole. What if a request for a chained file returned a playlist of URLs that
represented the various chain segments? The player could then take this
playlist and request the URLs for each segment. Most players out there I'm
pretty sure can handle playlists properly. It also fixes several problems
associated with chained files. It makes it MUCH easier for the player to deal
with files that have a different number of streams in each chain. My Helix
plugins handle this case, but it means having to figure out the max number of
audio and video streams across the whole file, create RTP sessions for that
worst case, and then dynamically map streams to RTP sessions during playback.
It's a pain. If you could just expose the chains as separate URLs, then you
can have a relatively simple implementation on the client and server and
leverage the player's existing playlist functionality.

Another argument I had for chaining was for supporting the simulated live
case. Since you could have chained files or files with different codebooks,
I believed that this required chained-file support. I'm starting to believe that
perhaps transcoding is a better solution here. You don't have to worry about
scalability as much since you only have to do 1 transcode for each simulated
live stream, not 1 per listener. If you were going to have a ton of simulated
live streams then perhaps it makes more sense to unify your content to use a
single codebook. It doesn't seem fair that the transport layer should have to
shoulder the burden of content author laziness.

I think at some point I had a rate adaptation argument too. I'm a little more
familiar with Theora so I'll start with that. With the current codebooks that
ship with the encoder you could effectively do rate adaptation without the
need to change codebooks. You can basically do one of 2 things. You can drop
frames so that you have a lower frame rate. The client would be able to figure
out what is happening by seeing that there aren't any lost packets and the
timestamps are farther apart. You can also just lower the Q being used. This
increases the quantization, which will lower the bitrate. I believe derf_
mentioned these facts before, but I don't remember. I don't really know
how the Vorbis code works so I don't know if a similar mechanism can be
exploited. I admit that these mechanisms may not be able to provide the most
optimal rate control, but I think it provides something reasonable for now.

If we allow chaining we also will have to solve the problem where the
sample-rate / frame-rate of one of the chains is not an even multiple of the
RTP timestamp sample rate. This can lead to all sorts of rounding headaches.
We have tons of code in Helix dealing with this. It isn't fun.


I realize this is almost a complete 180 for me. I'm sorry if I was the sole
cause of this delay. I also apologize to anyone who pointed this stuff out and
I didn't see its truth at the time. I would be fine with pursuing a
non-chaining RTP spec. Here are the only things that I would suggest we
ensure are in the spec.

- Allow inline transmission of the info header. This is to allow TAC changes
in a live/simulated live scenario.

- Allow inline transmission of the ident and codebook headers. This is mainly
to support forward link only scenarios.

- Allow for a "chainID" field. Basically I'd like a bit that signals the
presence of the field. If the bit is set a chainID field will be present.
I'm fine with 16 or 24 bits for this field. The main idea here is to allow
for chaining support to be added later. If you don't want to have the field
in there then just make sure that there is at least 1 bit that is reserved
for this purpose. If you do decide to have the field then there should be
text that says: "If the chainID field is present then it must
always be 0 to comply with this spec."
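
To make that concrete, here is a rough sketch in C of what parsing such an
optional chainID field could look like. The bit position of the presence flag
and the 16-bit width are purely illustrative, not taken from any draft:

#include <stdint.h>
#include <stddef.h>

/* Illustrative layout only: one "chainID present" flag bit in the first
 * payload byte, optionally followed by a 16-bit chainID. */
struct xiph_pl_hdr {
    unsigned chain_present;
    uint16_t chain_id;     /* must stay 0 to comply with a non-chaining spec */
    size_t   data_offset;  /* where the codec packet data starts */
};

static int parse_hdr(const uint8_t *p, size_t len, struct xiph_pl_hdr *h)
{
    if (len < 1)
        return -1;
    h->chain_present = (p[0] >> 5) & 1;   /* hypothetical presence bit */
    h->chain_id = 0;
    h->data_offset = 1;
    if (h->chain_present) {
        if (len < 3)
            return -1;
        h->chain_id = (uint16_t)((p[1] << 8) | p[2]);  /* network byte order */
        h->data_offset = 3;
        if (h->chain_id != 0)
            return -1;     /* "must always be 0 to comply with this spec" */
    }
    return 0;
}

int main(void)
{
    uint8_t pkt[] = { 0x20, 0x00, 0x00 };   /* presence bit set, chainID = 0 */
    struct xiph_pl_hdr h;
    return parse_hdr(pkt, sizeof pkt, &h) == 0 ? 0 : 1;
}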

Sorry for the marathon email. I just wanted to get all my current thinking out there.

Aaron
Ralph Giles
2005-08-29 17:17:30 UTC
Post by Aaron Colwell
I've spent the weekend thinking over the need for chaining in the RTP spec. I
may be leaning towards dropping it as well. First, I'll outline the use cases
that Real has typically used RTP for. Then I'll describe the Helix rate
adaptation since Ralph mentioned it in another email. Finally I'll outline
why we may not want to do chaining after all.
Oh my. :-) Well, it is nice that you've come around to my way of
thinking on the complexity tradeoff.

In light of this, I'm in favor of reversal. If you still think chaining
support should be in the spec, time to argue for it again. Please
address the use cases Aaron nicely outlined.

Following discussion on IRC, Aaron and I have the following revised
proposal. (Please correct me if I got something wrong or leave something
out.)

A given RTP session has only one corresponding info and setup header
pair. There's no longer a need for a chain id, so we're back to a
single-byte RTP payload header:

0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|C|F|R| # pkts. | for vorbis.
+-+-+-+-+-+-+-+-+

0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|C|F|RR |# pkts.| for theora.
+-+-+-+-+-+-+-+-+

After which the payload is the normal packet data as expected by the
respective decoders.

The reserved bits must be zero to comply with this specification,
and the C,F flags use the improved semantics as we agreed earlier
on the list.
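
To make the byte layout concrete, here is a minimal sketch in C of how a
receiver might unpack that single payload byte. The struct and function names
are just illustrative; the bit meanings are only what the diagrams above show:

#include <stdint.h>
#include <stdio.h>

struct pl_hdr {
    unsigned c;    /* C flag (leftmost bit) */
    unsigned f;    /* F flag */
    unsigned pkts; /* number of codec packets in this payload */
};

/* Vorbis: one reserved bit, 5-bit packet count.  Returns -1 if the
 * reserved bit is not zero, as required above. */
static int parse_vorbis_hdr(uint8_t b, struct pl_hdr *h)
{
    if (b & 0x20)
        return -1;
    h->c = (b >> 7) & 1;
    h->f = (b >> 6) & 1;
    h->pkts = b & 0x1f;
    return 0;
}

/* Theora: two reserved bits, 4-bit packet count. */
static int parse_theora_hdr(uint8_t b, struct pl_hdr *h)
{
    if (b & 0x30)
        return -1;
    h->c = (b >> 7) & 1;
    h->f = (b >> 6) & 1;
    h->pkts = b & 0x0f;
    return 0;
}

int main(void)
{
    struct pl_hdr h;
    if (parse_vorbis_hdr(0x83, &h) == 0)     /* C=1, F=0, 3 packets */
        printf("C=%u F=%u pkts=%u\n", h.c, h.f, h.pkts);
    return 0;
}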

A note should be added explaining that we considered a decoder
configuration id in the payload header to allow support of
the "chaining" feature of Ogg bitstreams, but it was rejected
because of complexity.

(We could obviously use the reserved bits to add the chain id
back in as an optional field if we ever change our minds).

And that's that, really. Luca's job just got easier. :)

Luca, I'd like to see a revised draft based on the above ASAP.

If you want, it would also be reasonable to do one based on the old
32-bit payload header if we get lots of objections, or if you're
almost done with it anyway. :-)

-r
David Barrett
2005-08-29 17:50:12 UTC
Post by Ralph Giles
In light of this, I'm in favor of reversal. If you still think chaining
support should be in the spec, time to argue for it again. Please
address the use cases Aaron nicely outlined.
Can you give a 20-second review of what all this means to those of us
(ie, me) who haven't followed the issue closely? Specifically:

1) What is the difference between "chaining" and "inline codebook
transmission"? It sounds like you're de-supporting the former while
retaining the latter.

2) What's the latest on the "codebook ID" member of the Theora RTP
header? It sounds like you're suggesting the whole RTP header is just a
single byte -- wasn't there a field that associated each packet with a
specific codebook (so the SDP could define a codebook library using HTTP
downloads, and the session could switch back and forth on a per-packet
basis)?

3) You said "A given RTP session has only one corresponding info and
setup header pair": what's a "setup header"? I thought there were
"info", "command", and "tables" headers, which together comprise the
"codebook". Is this changing, or am I just misunderstanding something?
If there's only one per session, does this mean codebook changing has
been de-supported as well?


I'm sorry if I'm out of the loop; been focused on other things and got
behind. Any clarification you can offer will be appreciated.

-david
Ralph Giles
2005-08-29 18:22:56 UTC
Post by David Barrett
Can you give a 20-second review of what all this means to those of us
1) What is the difference between "chaining" and "inline codebook
transmission"? It sounds like you're de-supporting the former while
retaining the latter.
Chaining means switching between multiple decoder configs within a
single RTP session. Inline codebook transmission just means that
the decoder may receive its configuration data as packets sent at
the beginning of the RTP stream as well as through some out-of-band
method. It's the simplest option, and the only one in some applications.

Chaining might still be possible with inline codebook transmission;
the problem is RTP packets are usually sent without delivery guarantees,
so if the SDP only defines one set, you can't transition reliably,
and the decoder may suddenly start producing garbage in the middle
of the stream.

In a multicast setting, there's also no way to synchronize the change.
(Maybe you could use different SSRC id's to mark the distinction?)
Post by David Barrett
2) What's the latest on the "codebook ID" member of the Theora RTP
header? It sounds like you're suggesting the whole RTP header is just a
single byte -- wasn't there a field that associated each packet with a
specific codebook (so the SDP could define a codebook library using HTTP
downloads, and the session could switch back and forth on a per-packet
basis)?
That's the main change; we've done away with that field because there's
only one codebook option--nothing to choose between.
Post by David Barrett
3) You said "A given RTP session has only one corresponding info and
setup header pair": what's a "setup header"? I thought there were
"info", "command", and "tables" headers, which together comprise the
"codebook". Is this changing, or am I just misunderstanding something?
The vorbis and theora codecs have three required headers that configure
their decoders. Their specifications call them the 'info, comment (or
metadata), and setup (or codebook)' header packets. Collectively they're
often referred to as "the codebooks". The 'info' and 'setup' headers are
required for meaningful decode of data packets. The comment header is
required by the spec, but not necessary on technical grounds.
Post by David Barrett
If there's only one per session, does this mean codebook changing has
been de-supported as well?
That's the same as "chaining", so yes, it would no longer be supported.

If you want to change the codebooks, you must get the player a new
SDP and/or otherwise open a new RTP connection. A stack of rtsp:
playlist entries would work fine, for example.

Hope that clarification helps,

-r
David Barrett
2005-08-29 21:23:37 UTC
Oh, that's unfortunate. I'd come to depend on the chaining feature to
support adaptive encoding (a rather important feature in my product).

My reason is as follows: in a P2P scenario, setting up connections is
incredibly time consuming and unreliable.* Once a connection works, I'm
loath to kill and re-create it. Furthermore, "session establishment"
and "session send/recv" are actually quite far apart in the code. It
was nice negotiating an entire set of codebooks exactly once, during
session initialization. I'm not eager to trigger a SIP/RTP
re-negotiation each time I want to change my broadcast settings (a
reasonably frequent activity in my app).

* Granted, if the new connection uses the same endpoints it's less painful,
but it's still not fun.

However, I could just be misunderstanding this. Here's what I'm
currently doing -- could you please summarize for me how this latest
change affects my approach?

1) Broadcaster sends SDP in SIP INVITE describing the codebooks I want
to broadcast
2) Receiver sends ACK and obtains codebooks (HTTP, cache, generate, etc)
3) Receiver prepares to receive RTP using listed codebooks
4) Broadcaster picks initial codebook and begins broadcasting
5) Broadcaster wishes to change settings (resolution, encoding quality,
framerate)
6) Broadcaster resets encoder using new codebook, broadcasts using new
chainID
7) Receiver notices the new chainID, resets decoder with new codebook
8) Receiver has seamlessly begun decoding with new settings, within the
span of a single frame.

One option would be to pre-establish several RTP sessions and just stop
broadcasting on one and start broadcasting on the other. Another option
is to re-INVITE the existing SIP session with the new parameters. But
neither is as seamless and elegant as the chaining solution.

Perhaps I just don't see the complexity introduced by chaining. I do
see the complexity of inline and HTTP codebook transmission, but these
aren't what are being removed/redesigned. I guess the SDP syntax is a
bit complicated to set up the chains (chainID->codebook mapping), but is
that it? Where is the complexity that forces the dropping of this feature?

-david

(Again, sorry for falling behind if I'm missing this obvious point.)
Aaron Colwell
2005-08-30 14:33:24 UTC
David,

You present some good arguments. I do believe that in your particular use
case it makes sense to allow multiple encoding settings for a session. I'm
calling it different encoding settings instead of "chaining" because I think
the intents of the two uses are a little different. In your case you just want
to provide a set of alternate ways to encode something. Chaining is a generic
mechanism for stringing clips together. In some cases I'll admit that the
distinction is subtle.

You haven't mentioned what codecs you're using right now. If you are using
Theora and/or Speex you don't need to change the codebooks to change the rate.
All you need to do is reinit the decoder with different parameters. In the
case of Theora, you can just reinit the current encoder with a different
bitrate or quality and it will output bits at a different rate. If you rely
on the RTP timestamps instead of what the decoder tells you, then you can also
change the frame rate without needing to update the codebook. Vorbis may not
have this sort of flexibility. I'm not an expert, but I believe that this is
what MikeS said on IRC.

Since you may not be able to rate adapt Vorbis streams we may want to allow
multiple encodings to be specified at session establishment time. This makes
things a little more flexible so that we can accommodate your use case, but
doesn't open the floodgates of all sorts of wacky complexity that full-fledged
chaining support would require. If we decide to allow this I'd suggest we
put the following requirements on the encodings that are specified in a
session.

- All encodings must have a sample rate of which the RTP timestamp sample rate
is an integer multiple. For example an RTP session that has a 44100 RTP
timestamp sample rate can only have 44100, 22050, or 11025 Hz encodings. This
is mainly to avoid round-off problems. I think it also may make things easier
for a resampler that has to handle all these streams (see the sketch after
this list).

- Video should use a sample rate that can accommodate various frame rates.
Most other video payloads use 90000Hz which allows NTSC and PAL frame rates
to be represented. Frame rates for all the encodings in the session must
not produce timestamps that need to be rounded. This is basically the same
requirement as above, just for video.

- I'd like to say no switching codecs, but I'm not as rigid about this one. As
long as the client has enough info when it gets the SDP to know what codecs
are going to be used, it can detect whether it will be able to play back the
stream or not.
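
A minimal sketch, using the 44100 Hz example above, of why the divisibility
rule matters: if the RTP clock is an integer multiple of the encoding's sample
rate, a whole number of samples always maps to a whole number of timestamp
ticks; otherwise the ticks-per-frame value is fractional and has to be rounded.

#include <stdio.h>

int main(void)
{
    long clock = 44100;                          /* RTP timestamp rate */
    long rates[] = { 44100, 22050, 11025, 16000 };
    for (int i = 0; i < 4; i++) {
        /* ticks occupied by a hypothetical 1024-sample frame */
        double ticks = 1024.0 * clock / rates[i];
        printf("%5ld Hz: %s, 1024 samples = %.1f ticks\n",
               rates[i],
               clock % rates[i] == 0 ? "divides evenly" : "does NOT divide",
               ticks);
    }
    return 0;
}

The 16000 Hz case comes out at 2822.4 ticks per frame, which is exactly the
kind of round-off problem the rule is meant to avoid.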


I think keeping this sort of functionality is reasonable. We would just have to
add an encoding ID to the currently proposed SDP format. We'd have to add back
the chainingID field, but I think we should call it something like
"encoding ID" and make it 8-bits or less. I don't think anyone will need more
than 255 encodings.

Aaron
Ralph Giles
2005-08-30 15:49:34 UTC
Post by Aaron Colwell
You haven't mentioned what codecs your using right now. If you are using
Theora and/or Speex you don't need to change the codebooks to change the rate.
[...]
Vorbis may not
have this sort of flexibility. I'm not an expert, but I believe that this is
what MikeS said on IRC.
That was my understanding as well. One cannot vary the blocksize or
sample rate, so while it's probably possible to get a factor of two
or so in bitrate with quality loss at the edges, one can't scale
Vorbis all that much without changing the codebooks.

-r
David Barrett
2005-08-30 16:01:44 UTC
Post by Aaron Colwell
You haven't mentioned what codecs your using right now. If you are using
Theora and/or Speex you don't need to change the codebooks to change the rate.
All you need to do is reinit the decoder with different parameters. In the
case of Theora, you can just reinit the current encoder with a different
bitrate or quality and it will output bits at a different rate.
I'm using Theora and Speex, as you guessed. But I'm changing all the
video settings on the fly -- framerate, frame size, encoding quality,
etc. (Audio too, but to a lesser degree.)

Post by Aaron Colwell
If you rely on the RTP timestamps instead of what the decoder tells you, then you can also
change the frame rate without needing to update the codebook. Vorbis may not
have this sort of flexibility. I'm not an expert, but I believe that this is
what MikeS said on IRC.
RTP timestamps are another discussion that I haven't followed closely.
I actually didn't realize there were any restrictions on them (ie, that
they need to be a multiple of anything). I just have my decoder accept
arbitrary timestamps and sync up with the audio, even if the video
framerate is irregular. I ignore the decoder timestamps.
Post by Aaron Colwell
Since you may not be able to rate adapt Vorbis streams we may want to allow
multiple encodings to be specified at session establishment time. This makes
things a little more flexible so that we can accomadate your use case, but
doesn't open the flood gates of all sorts of wacky complexity that full fledged
chaining support would require.
I guess I still don't see what this flood gate o' wackiness is that
chaining opens. (I'm sorry if this was discussed in detail and I fell
behind.) I see that the SDP gets horribly complicated if you need to
download a bunch of codebooks via HTTP (effectively solved by inline
codebook ack/retransmit). And perhaps it's harder to write a player
that accepts irregular framerates, framesizes, and so forth. But this
doesn't seem as bad as has been implied. What am I overlooking?
Post by Aaron Colwell
- All encodings must have a sample rate that is an integer multiple of the
RTP timestamp sample rate. For example an RTP session that has 44100 RTP
timestamp sample rate can only have 44100, 22050, 11025. This is mainly to
avoid round off problems. I think it also may make things easier for a
resampler that has to handle all these streams.
- Video should use a sample rate that can accommodate various frame rates.
Most other video payloads use 90000Hz which allows NTSC and PAL frame rates
to be represented. Frame rates for all the encodings in the session must
not produce timestamps that need to be rounded. This is basically the same
requirement as above, just for video.
All this talk on sample rates is scaring me. I didn't realize it was a
requirement to be *absolutely* regular with framerate. I thought we
were talking about *average* framerates. I assumed anything stating
framerate is purely advisory, and the decoder should be prepared to
handle frames that come in on any sample frequency (ie, not puke if it
gets a frame before or after what it's expecting).

After all, at the end of the day, these samples are coming from a live
source which *itself* isn't always producing a perfectly regular stream.
How could I possibly enforce absolute regularity if my camera actually
generates only ~30FPS instead of a mathematically perfect 30FPS?
Post by Aaron Colwell
I think keeping this sort of functionality is reasonable. We would just have to
add an encoding ID to the currently proposed SDP format. We'd have to add back
the chainingID field, but I think we should call it something like
"encoding ID" and make it 8-bits or less. I don't think anyone will need more
than 255 encodings.
I agree, 255 is probably enough (and 256 is even better). You might
make the statement that an inline codebook transmission for an "encoding
ID" that is already in use overrides the old codebook. Thus the encoder
can manage the "encoder-space" effectively.

-david
Ralph Giles
2005-08-30 16:46:55 UTC
Post by David Barrett
All this talk on sample rates is scaring me. I didn't realize it was a
requirement to be *absolutely* regular with framerate. I thought we
were talking about *average* framerates. I assumed anything stating
framerate is purely advisory, and the decoder should be prepared to
handle frames that come in on any sample frequency (ie, not puke if it
gets a frame before or after what it's expecting).
Well, that works, but. All the Xiph codecs are notionally fixed frame
rate, and if your actual source material isn't, the encoder is expected
to resample it so that it is.

That said, the decoders always produce N samples for N samples of
input, so you can ignore the native framerate, or use an inflated
rate and implicitly drop a lot of samples to emulate variable
frame rates if you have an out-of-band channel like the RTP header
timestamps. With Vorbis and Speex, changing the local sample rate
will cause some quality loss because the encoder's model assumes
a fixed rate. Current theora encoders don't really do anything
relative to absolute time, so there shouldn't be an effect
(so far).

What Aaron is talking about with the 90000Hz thing is (I believe)
the choice of timebase for the timestamps in the RTP header. This
timebase (what you multiply the 32 bit timestamp by to get wall
clock time) is defined by the payload spec, i.e. what we're
discussing.

The recommendation for audio is that this be the same as, or
some multiple of the actual sample rate. e.g. 44100 or 48000 Hz.
For video, RFC 1889 says that just using the native video
frame rate doesn't provide enough resolution for measuring
packet arrival jitter (presumably in the same timebase). My
guess is that 90000 Hz provides plenty of such resolution,
as well as dividing evenly by standard frame rates: 24, 25,
29.97, 30, 50, 60. RFC 1890 doesn't mention this explicitly
but seems to assume it will be the time base for video codecs.

That's my understanding of what we're talking about anyway.

Variable framerates are really only made possible with software's
flexibility and only motivated by crappy clocks in consumer capture
devices. Professional A/V hardware tends to assume a fixed framerate,
and based on that so do a lot of digital standards.

-r
Tor-Einar Jarnbjo
2005-09-01 20:47:09 UTC
Your "client" is a simple receiver that should be perfectly happy with
the current baseline discussed, isn't it?
Probably not.
If I'm wrong please tell me your constraints.
I already mentioned several in my responses to your and David Barrett's
mails on Monday and Tuesday, but I can of course summarize:

- The server does not know how fast it can stream the codebook. Limiting
the transmission speed to the audio stream bandwidth may cause an
unacceptable delay when starting playback and will in most cases cause a
longer delay than necessary compared to e.g. HTTP based retrieval.

- The server does not know when to start audio streaming if it doesn't
get any feedback from the client that the codebook has been completely
received.

- Repeating the codebook transmission at specific intervals to ensure it
is received by the client may waste unavailable bandwidth and may
prevent the client from playing from the beginning of the audio stream
if it is not able to cache the audio streamed before the complete
codebook header is received.

- The "baseline" is not practiable for multicast transmissions.

- I doubt that the IETF will approve an RFC for content over RTP depending
on reliable transport and with no multicast support.

Tor
Luca Barbato
2005-09-01 22:03:49 UTC
Oops, I didn't reply-all; resending to the list (this is slightly
modified, please reread)

First:

There aren't any streamcast (webradio) clients that use RTP only;
all I know of use RTSP/RTP. If you have RTSP and RTCP you have plenty of
ways to get feedback from clients and push data asynchronously, and to deal
with the shortcomings.
Post by Tor-Einar Jarnbjo
I already mentioned several in my responses to your and David Barrett's
- The server does not know how fast it can stream the codebook. Limiting
the transmission speed to the audio stream bandwidth may cause an
inacceptable delay when starting playback and will in most cases cause a
longer delay than necessary compared to e.g. HTTP based retrieval.
Why is it supposed not to know? BTW, if it is a unicast transmission it will
just start streaming the codebooks and then the rest...
Post by Tor-Einar Jarnbjo
- The server does not know when to start audio streaming if it doesn't
get any feedback from the client that the codebook has been completely
received.
And it should not be a server concern; the client would have to skip until
it can get them. Check the MPEG-4 case for a similar situation.
The only problem is the retransmission frequency.
Post by Tor-Einar Jarnbjo
- Repeating the codebook transmission at specific intervals to assure it
is received by the client may waste unavailable bandwidth and may
prevent the client from playing from the beginning of the audio stream
if it is not able to cache the audio streamed before the complete
codebook header is received.
To put in some random values: you have 100 packets in 2 seconds, and you choose
to have something like 5 retransmissions (one every 20 packets). If you lose
that many packets you'd have to skip in any case.
Post by Tor-Einar Jarnbjo
- The "baseline" is not practiable for multicast transmissions.
It is; the scenario is quite simple: you join the group and you
have to wait until you can get the codebooks, and that time is again
dependent on the retransmission frequency (worst case 1 period).
Post by Tor-Einar Jarnbjo
- I doubt that IETF will approve an RFC for content over RTP depending
on reliable transport and with no multicast support.
The baseline solution uses JUST RTP, supports multicast the same way it
supports unicast, and may not be as bad as we are all thinking.

If you have better ideas I'm always open to suggestions.

If you'd like to have an HTTP-based out-of-band solution, please prepare a
separate I-D; if I have time I'll start an RTSP I-D proposal, or I'll let
other interested groups come up with something.

lu
--
Luca Barbato

Gentoo/linux Developer Gentoo/PPC Operational Leader
http://dev.gentoo.org/~lu_zero
Tor-Einar Jarnbjo
2005-09-01 22:11:42 UTC
Post by Luca Barbato
There aren't any streamcast (webradio) client that are using rtp only,
all I know use rtsp/rtp, if you have rtsp and rtcp you have plenty of
way to have feedbacks from client and push data asyncronously and deal
with the shortcomings.
So you intend to require RTSP as the only possible way to set up the RTP
stream? Using RTCP for client feedback is of course an option, but is
not necessarily very easy to implement or integrate in existing RTP
servers/clients.
Post by Luca Barbato
Why is supposed to not know? btw if is an unicast transmission it will
just start streaming the codebooks and then the rest...
I already explained this one. Using RTP, the server is pushing the
codebook data and has to know or assume the bandwidth
available to the client. If it sends the codebook packets too fast, they
will be lost; if it sends them too slowly, the startup delay will be
higher than strictly necessary. If codebook packets are lost and the
client must wait for a retransmission of parts of the codebook, it might
not have memory to buffer enough audio to play the audio stream from the
beginning.
Post by Luca Barbato
Post by Tor-Einar Jarnbjo
The server does not know when to start audio streaming
And should not be a server concern, the client would have to skip till
it can get them, check the mpeg4 case for a similar situation.
The only problem is the retransmission frequency.
I don't have time now to check any MPEG specification, but I think it
should be obvious that it is not acceptable that several seconds of the
streamed audio will likely have to be skipped.
Post by Luca Barbato
to put some random values, you have 100packets in 2seconds, you chose to
have something like 5 retransmissions (one each 20packets). If you lose
that many packets in any case you'd have to skip.
Random or not random. If you choose to resend the codebook every 400ms,
you will need >100kbps just for codebook retransmissions, assuming that
the codebook is 5 kilobytes. To use realistic values: If your audio
stream is 100kbps and you try to send as large RTP packets as possible
(1000-1500 bytes) to minimise framing overhead, you will have merely
8-12 RTP packets/second.
Post by Luca Barbato
It is, the scenario is quite simple: you join the group and you should
have to wait till you can get the codebooks, and that time is again
dependent on the retransmission frequency (worst case 1 period)
Using random values, the period length may be acceptable. If we continue
with the realistic values of 100kbps audio and a 5kB codebook header, it
looks very different. We then have ~12000 bytes audio data per second
and if we accept an additional 5% bandwidth for codebook retransmissions
(which is IMHO still too much), it will take >8s to transmit the
codebook, which is not acceptable. Decrease the audio bandwidth or
increase the codebook size as you like and the delay before having a
complete codebook will make the implementation completely useless.
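
A quick back-of-the-envelope check of the figures above (assuming "5 kB" means
5120 bytes and 100 kbps means 100000 bit/s; the exact numbers shift slightly
with other assumptions):

#include <stdio.h>

int main(void)
{
    double codebook_bytes = 5 * 1024;   /* ~5 kB setup header */
    double audio_bps      = 100000.0;   /* 100 kbps audio stream */

    /* Resending the whole codebook every 400 ms: */
    printf("every 400 ms: %.0f kbps of retransmission overhead\n",
           codebook_bytes * 8 / 0.4 / 1000);

    /* Allowing only 5 percent extra bandwidth for codebook retransmission: */
    double budget_bytes_per_s = audio_bps * 0.05 / 8;
    printf("5%% budget: %.1f s to deliver the codebook once\n",
           codebook_bytes / budget_bytes_per_s);
    return 0;
}

This prints roughly 102 kbps for the 400 ms case and about 8.2 s for the 5%
budget, matching the ">100kbps" and ">8s" figures quoted above.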
Post by Luca Barbato
If you like to have an HTTP based offband solution, please prepare a
separate I-D, if I have time I'd start an RTSP I-D proposal or I'll let
other interested group come with something.
Wasn't the HTTP delivery method already part of Phil's last draft (-05)?

Tor

Aaron Colwell
2005-08-30 18:50:34 UTC
Post by David Barrett
I'm using Theora and Speex, as you guessed. But I'm changing all the
video settings on the fly -- framerate, frame size, encoding quality,
etc. (Audio too, but to a lesser degree.)
Ok well then like I said, other than the fact that you are changing frame
size, you don't actually need an encoding ID since you could use the same
codebook.
Post by David Barrett
RTP timestamps are another discussion that I haven't followed closely.
I actually didn't realize there were any restrictions on them (ie, that
they need to be a multiple of anything). I just have my decoder accept
arbitrary timestamps and sync up with the audio, even if the video
framerate is irregular. I ignore the decoder timestamps.
The RTP timestamps must always be in the units specified by the rtpmap SDP
field. In general most payload formats require the packets to represent an
integer number of RTP timestamp units. This mainly avoids rounding
problems when doing codec frame to timestamp conversions and vice versa. It
can also affect loss handling and A/V sync if the RTP timestamp sample rate is
not high enough to represent the highest sample rate in the stream.
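
A minimal sketch of the frame-to-timestamp conversion being described; the
function and parameter names are mine, and the 1024-sample frame size is just
an example:

#include <stdint.h>
#include <stdio.h>

/* Map the n-th codec frame to an RTP timestamp in rtpmap clock units.
 * When frame_samples * rtp_clock is not a multiple of codec_rate, the
 * integer division truncates - the rounding problem mentioned above. */
static uint32_t frame_to_rtp_ts(uint32_t base_ts, uint64_t frame_index,
                                uint32_t frame_samples,
                                uint32_t codec_rate, uint32_t rtp_clock)
{
    uint64_t ticks = frame_index * frame_samples * (uint64_t)rtp_clock
                     / codec_rate;
    return base_ts + (uint32_t)ticks;   /* RTP timestamps wrap modulo 2^32 */
}

int main(void)
{
    /* 44100 Hz audio on a 44100 Hz RTP clock: exact. */
    printf("%u\n", (unsigned)frame_to_rtp_ts(0, 100, 1024, 44100, 44100));
    /* 44100 Hz audio on a 16000 Hz RTP clock: not a whole number of ticks,
     * so the result is truncated. */
    printf("%u\n", (unsigned)frame_to_rtp_ts(0, 100, 1024, 44100, 16000));
    return 0;
}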
Post by David Barrett
I guess I still don't see what this flood gate o' wackiness is that
chaining opens. (I'm sorry if this was discussed in detail and I fell
behind.) I see that the SDP gets horribly complicated if you need to
download a bunch of codebooks via HTTP (effectively solved by inline
codebook ack/retransmit). And perhaps it's harder to write a player
that accepts irregular framerates, framesizes, and so forth. But this
doesn't seem as bad as has been implied. What am I overlooking?
There are several problems with allowing chaining to be supported in a general
sense.

- The client has no way to determine if it can actually play the stream
properly. A chain might show up for a codec that it doesn't support. That
is a bad user experience for the client. A slight variation on this is the
case where you have an embedded netradio device. It may be able to handle
certain codec parameters, but not others because of resource constraints. It
will have to blindly start playback of a stream and then potentially have to
quit just because it encountered a stream it couldn't play back. This would
be a very bad experience, and the user would not be able to
figure out what was happening.

- The server might guess a bad value for the RTP timestamp sample rate. Say it
guesses 16000Hz and then a 44100Hz sample rate chain comes along. There may
not be enough information for the client to handle loss gracefully or ensure
proper A/V sync. You could try to use a sample rate that is an integer
multiple of both the 44100 and 48000 sample rates, but if I remember correctly
that sample rate is very high and doesn't provide a reasonable amount of time
before the timestamp loops (see the sketch after this list). That in itself
has a bunch of complicated issues associated with it when loss occurs.

- Managing coordination of codebook downloads in the middle of playback is
non-trivial. Making sure the client is notified of the new codebooks
"in time" is not straight forward. How is notification done? How much time
to we have to let the client know before hand? That question directly impacts
the amount of delay the server must introduce when switching between live
feeds that use different codebooks.
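
For the common-clock idea above, the arithmetic is easy to check (this assumes
the standard 32-bit RTP timestamp field):

#include <stdio.h>
#include <stdint.h>

static uint64_t gcd(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

int main(void)
{
    uint64_t a = 44100, b = 48000;
    uint64_t lcm = a / gcd(a, b) * b;              /* 7,056,000 Hz */
    double wrap_s = 4294967296.0 / (double)lcm;    /* 2^32 ticks until wrap */
    printf("common clock %llu Hz wraps every %.0f s (~%.1f min)\n",
           (unsigned long long)lcm, wrap_s, wrap_s / 60.0);
    return 0;
}

The smallest clock that both rates divide evenly is 7.056 MHz, and a 32-bit
timestamp at that rate wraps roughly every ten minutes, so there really isn't
a reasonable amount of time before the timestamp loops.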
Post by David Barrett
All this talk on sample rates is scaring me. I didn't realize it was a
requirement to be *absolutely* regular with framerate. I thought we
were talking about *average* framerates. I assumed anything stating
framerate is purely advisory, and the decoder should be prepared to
handle frames that come in on any sample frequency (ie, not puke if it
gets a frame before or after what it's expecting).
It is not an advisement. You are telling it the rate at which you are sampling
the video.
Post by David Barrett
After all, at the end of the day, these samples are coming from a live
source which *itself* isn't always producing a perfectly regular stream.
How could I possibly enforce absolute regularity if my camera actually
generates only ~30FPS instead of a mathematically perfect 30FPS?
When you specify that the frame rate is 30fps you are effectively realigning
your images to that grid, even if that isn't the rate that you are sampling at.
If you are specifying 30 fps as your frame rate then your capture code should
make sure that it doesn't generate more than 30 samples for each second. If
it does then you'll get sync problems because your capture app is not honoring
the 30fps contract that you set up with the encoder. This is true for most
video codecs that I'm aware of, H.263, MPEG4, RV.
Post by David Barrett
Post by Aaron Colwell
I think keeping this sort of functionality is reasonable. We would just have to
add an encoding ID to the currently proposed SDP format. We'd have to add back
the chainingID field, but I think we should call it something like
"encoding ID" and make it 8-bits or less. I don't think anyone will need more
than 255 encodings.
I agree, 255 is probably enough (and 256 is even better). You might
make the statement that an inline codebook transmission for an "encoding
ID" that is already in use overrides the old codebook. Thus the encoder
can manage the "encoder-space" effectively.
I don't think overriding is a good idea. It assumes that state was actually
properly transmitted to the client. You don't have any guarantee that the
client gets the change request. Say for some reason, the change information
packets get completely lost. The client will have no clue that anything changed
and will suddenly have problems decoding packets for that encoding ID.

I hope this clears things up a little.

Aaron
David Barrett
2005-08-30 18:33:17 UTC
Post by Ralph Giles
Well, that works, but. All the Xiph codecs are notionally fixed frame
rate, and if your actual source material isn't, the encoder is expected
to resample it so that it is.
Ah, ok.
Post by Ralph Giles
What Aaron is talking about with the 90000Hz thing is (I believe)
the choice of timebase for the timestamps in the RTP header. This
timebase (what you multiply the 32 bit timestamp by to get wall
clock time) is defined by the payload spec, i.e. what we're
discussing.
Ohh, ok. I'm sorry, I was confused. So you'd need to decide this even
if you had a totally irregular framerate; it's an orthogonal issue.
I'll read up on the spec to understand this better.
Post by Ralph Giles
Variable framerates are really only made possible with software's
flexibility and only motivated by crappy clocks in consumer capture
devices. Professional A/V hardware tends to assume a fixed framerate,
and based on that so do a lot of digital standards.
Thanks, I hadn't considered the AV hardware issue.

Another motivation, however, is downsampling to a non-factor framerate.
For example, assume a 30fps camera -- you can only "regularly" broadcast
at 1, 3, 5, 15, and 30 fps. But 15-30fps is a big range. Assuming my
target is 22fps (such as picked by an adaptive rate adjuster based on
current conditions). 15 is unnecessarily slow, and 30 is too fast --
both by significant margins. Plus, switching from 15 to 30 instantly
(or worse, toggling back and forth if you're right on the border) is
jarring. However, I can get an average framerate of 22fps through
irregular frame dropping. It won't be as visually pleasing as a
"regular" framerate, but for some video (especially live video from
crappy webcams) it might be acceptable.

-david
Aaron Colwell
2005-08-30 18:56:01 UTC
Post by David Barrett
Another motivation, however, is downsampling to a non-factor framerate.
For example, assume a 30fps camera -- you can only "regularly" broadcast
at 1, 3, 5, 15, and 30 fps. But 15-30fps is a big range. Assuming my
target is 22fps (such as picked by an adaptive rate adjuster based on
current conditons). 15 is unnecessarily slow, and 30 is too fast --
both by significant margins. Plus, switching from 15 to 30 instantly
(or worse, toggling back and forth if you're right on the border) is
jarring. However, I can get an average framerate of 22fps through
irregular frame dropping. It won't be as visually pleasing as a
"regular" framerate, but for some video (especially live video from
crappy webcams) it might be acceptable.
This is one of the reasons why I called out video as having slightly different
requirements than audio. If you use a 90kHz RTP timestamp sample rate you have
a lot more flexibility with the frame rates that you can use. Remember that
we are talking about timestamp sample rates, not fps. If you write (1 / fps)
as a fraction whose denominator is the RTP timestamp sample rate, the
numerator will be the timestamp delta between frames. If you use 90000 as the
denominator,
you'll notice that there is a large fps range to play with.
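
A quick sketch of those per-frame timestamp deltas on a 90000 Hz clock (the
frame rates are just the common ones mentioned earlier in the thread; 29.97
fps is really 30000/1001, which still gives an integer delta):

#include <stdio.h>

int main(void)
{
    double fps[] = { 24, 25, 30000.0 / 1001.0, 30, 50, 60 };
    for (int i = 0; i < 6; i++)
        printf("%6.3f fps -> delta = %.0f ticks\n", fps[i], 90000.0 / fps[i]);
    return 0;
}

This prints 3750, 3600, 3003, 3000, 1800 and 1500 ticks respectively, all
integers, which is why 90 kHz leaves such a large fps range to play with.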

Aaron