Discussion:
[xiph-rtp] Theora RTP payload format
Steve Kann
2005-04-18 13:29:57 UTC
Hi, List,

I've been working on building an implementation of a
video-conferencing endpoint using Theora, and have been working with the
draft-kerr-avt-theora-rtp-00 spec.

I've also read the archives of this list, about some of the proposed
changes. I'd like to describe here what I'm planning on doing, and see
how this might fit into your design.

Basically, what I'm working with is a project called "iaxclient".
iaxclient is a library for a VoIP softphone, which presently supports
only audio, but I am extending to support video as well. It uses the
IAX2 protocol, which is a lightweight VoIP protocol that does _not_ use
RTP. However, the payload format for IAX2 is generally compatible with
the payload format for RTP. Asterisk (the open-source PBX) includes
support for RTP-based VoIP protocols (SIP, H.323, etc), as well as
non-RTP-based VoIP protocols (IAX2, others).

There are basically two use cases for users making videoconferencing
calls using the application:

1) Point-to-Point calls: This case seems to be pretty easy to
handle, and fits into most of the designs I've seen so far:

2) Multi-party conferences: This is where some of the designs I've
seen so far seem to work well, and some of them do not.

The basic idea for multi-party conferences is that each user
maintains a virtual connection to a "conference engine" (this is
already in place for audio conferences). The conference engine
intelligently receives audio from the clients and sends audio to the
clients, so each client can hear the audio of any other speaking
participants.

The idea for video is that the clients each send their video to the
conference engine, and the conference engine will send zero or one video
stream to each participant, in one of two "modes":
a) Automatic mode: The conference engine will use some
heuristics to decide whose video should be shown to the participant --
Generally, this will be the only participant who is presently speaking
(in the case of multiple active speakers, or zero active speakers, there
will be some secondary criteria).

b) Request mode: The client itself will notify the conference
engine (perhaps out-of-band) and request to see a particular speaker's
video.

What this means for the video stream (and this works just fine for
any other video format, e.g. H.26x) is that we would like to be
able to change the video source at any time (or, at least, at any
keyframe).

The whole setup headers business, of course, makes this design
particularly difficult. With the present draft-kerr-avt-theora-rtp-00
format, though, I think I could probably (with a great deal of
unnecessary overhead) send the setup headers occasionally, and then
switch at any time. The clients could then use "header caching":
if they've seen these headers before (matching CRC32), they could use
their cached copy, and if not, they'd just have to wait a few seconds to
get them before they could start decoding.
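
A minimal sketch of the header-caching idea above, in Python. The class and method names are hypothetical, and zlib.crc32 stands in for whatever CRC32 the draft actually specifies for the setup header ident, so treat the checksum details as an assumption.

```python
import zlib


class SetupHeaderCache:
    """Cache of setup headers keyed by the CRC32 of their bytes.

    Hypothetical helper: the draft may define the ident CRC differently
    from zlib.crc32, which is used here only for illustration.
    """

    def __init__(self):
        self._headers = {}  # ident (CRC32) -> raw setup header bytes

    def store(self, setup_header):
        """Remember a setup header and return its 32-bit ident."""
        ident = zlib.crc32(setup_header) & 0xFFFFFFFF
        self._headers[ident] = setup_header
        return ident

    def lookup(self, ident):
        """Return the cached setup header for this ident, or None."""
        return self._headers.get(ident)


def can_decode(cache, frame_ident):
    """A frame is decodable only once its setup header has been seen."""
    return cache.lookup(frame_ident) is not None
```

On a cache miss the client would simply drop frames until the matching setup header arrives inline, which is the "wait a few seconds" case.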

*Note: I also suspect, but I haven't researched, that if all the
clients are using the same version of the Theora encoder and the same
settings, their setup headers would likely be the same. If this is
the case, then their CRC32s would be the same, and they could start
decoding at any keyframe.

The latest idea I've read, though, makes this process much more
inconvenient, because _each_ client would have its own 16-bit "chain
ID", and these chain IDs would be duplicated across the streams sent by
the different clients. The server would therefore need to deeply
understand and parse each of the streams in order to put them together.

I think that my use case isn't all that unusual though; it's somewhat
like the properties you might have in multicasting, I think.

1) It would be ideal if the RTP payload format could be made independent
of SDP.
[The present Theora RTP format exhibits this property if you use
periodic inline setup header transmission.]
2) It would be ideal if the RTP payload format continued to allow inline
setup header transmission.


It would be most convenient if there were a "fixed setup" mode for
Theora, where you could ask the Theora encoder to use a fixed set of
setup headers, and have it act like other codecs in this respect. I
understand the flexibility that the setup headers give you in encoder
design, but it would be nice if there were a way to configure it otherwise.



-SteveK
Aaron Colwell
2005-04-18 13:54:00 UTC
Post by Steve Kann
With the latest idea I've read, though, it makes this process much more
inconvenient, because _each_ client would have their own 16bit "chain
ID", and these chain ID's would be duplicated in the streams sent by
each client, and therefore the server would need to deeply understand
and parse each of the streams in order to put them together, etc.
What did you do in the case where the CRC32 from each of the clients was
different? This is basically the same scenario, isn't it?

I know nothing about IAX2, but I would assume that it has some sort of
offer/answer model to negotiate codec parameters and such. You could easily
put the chain ID in this negotiation so that all users in the conference use
the same codebook.
Post by Steve Kann
I think that my use case isn't all that unusual though; it's somewhat
like the properties you might have in multicasting, I think.
1) It would be ideal if the RTP payload format could be made independent
of SDP.
It is currently independent of SDP if you use inline codebook transmission. The
info in the SDP just allows you to know ahead of time what the info and setup
headers are going to be for each chain. It also provides a mechanism to grab
the codebooks ahead of time. You also can save bits if you don't want
to periodically transmit codebooks.
Post by Steve Kann
[the present theora rtp format exhibits this property; if you use
periodic inline
setup header transmission]
2) It would be ideal if the RTP payload format continued to allow inline
setup header
transmission.
To my knowledge we weren't going to get rid of inline transmission. I had
always intended to keep it.
Post by Steve Kann
It would be most convenient, if there were a "fixed setup" mode for
theora, where you could ask the theora-encoder to use fixed setup header
set, and have it act like other codecs in this respect. I understand
the flexibility that the setup headers give you in encoder design, but
it would be nice if there were a way to configure it otherwise..
If the encoder allowed you to specify a codebook on initialization, you could
effectively do this. Basically your app could just always specify the same
codebook to the encoder and then send the hash to the other participants.
They would then verify that your hash matches the hash of their codebook and
then you're done. This is basically the codebook cache hit scenario. If you get
a miss then you just make the connection to the conference fail.
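
The join-time check described here might look roughly like the following Python sketch. The function names are made up for illustration, and SHA-256 is an arbitrary stand-in because no particular hash is specified in this thread.

```python
import hashlib


def codebook_hash(codebook):
    """Hash of a codebook blob (SHA-256 is an assumption, not from any draft)."""
    return hashlib.sha256(codebook).hexdigest()


def admit_participant(conference_codebook, announced_hash):
    """Cache hit: the joiner's hash matches the conference codebook.

    On a miss the caller is expected to fail the connection.
    """
    return codebook_hash(conference_codebook) == announced_hash
```

With a single agreed codebook, no codebook ever needs to be sent inline during the conference.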


Aaron
_______________________________________________
xiph-rtp mailing list
http://lists.xiph.org/mailman/listinfo/xiph-rtp
Steve Kann
2005-04-18 14:19:25 UTC
Post by Aaron Colwell
Post by Steve Kann
With the latest idea I've read, though, it makes this process much more
inconvenient, because _each_ client would have their own 16bit "chain
ID", and these chain ID's would be duplicated in the streams sent by
each client, and therefore the server would need to deeply understand
and parse each of the streams in order to put them together, etc.
What did you do in the case where the CRC32 was different from each of the
clients? This is basically the same scenario isn't it?
In that case, the "setup header ident" of the payloads would be
different, and the receiver would know that it needs to wait until it
receives setup headers matching the CRC of these frames before decoding.
So, the only time we could accidentally try to decode frames using an
incorrect set of setup headers would be in the case of a CRC collision
(P ~= 0).

If the payload format only includes a "chain ID", then the chances of
two streams having the same "chain ID" when coming from different
sources is pretty much P==1. So, the server that's doing the switching
would need to actually muck with the payload in order to give each
sender a different Chain ID, and then keep track of which was which,
etc. It makes it impossible to just switch senders in the server without
the server understanding the internals of the codec payload.
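
To illustrate the bookkeeping that a chain-ID-only format would push onto the switch, here is a rough Python sketch. Everything in it (the class name, the 16-bit default, the allocation order) is an assumption for illustration and is not taken from any draft.

```python
class ChainIdRemapper:
    """Give every (sender, incoming chain ID) pair a unique outgoing chain ID."""

    def __init__(self, id_bits=16):
        self._max_ids = 1 << id_bits
        self._map = {}  # (sender, incoming chain ID) -> outgoing chain ID

    def outgoing_id(self, sender, chain_id):
        key = (sender, chain_id)
        if key not in self._map:
            if len(self._map) >= self._max_ids:
                raise RuntimeError("out of chain IDs")
            # Allocate the next unused outgoing ID for this new pair.
            self._map[key] = len(self._map)
        return self._map[key]
```

The switch would also have to rewrite the chain ID field inside every forwarded payload, which is exactly the payload parsing I would like to avoid.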
Post by Aaron Colwell
I know nothing about IAX2, but I would assume that it has some sort of
offer/answer model to negotiate codec parameters and such. You could easily
put the chain ID in this negotiation so that all users in the conference use
the same codebook.
Presently, it's pretty simple: it allows negotiation of the codec,
but not of codec parameters. In practice, it hasn't been necessary to do
that. In the future, it might need to be extended to do so.

But, consider that users will join and leave the conference at arbitrary
times, so the conference engine can't know in advance all the codebooks
that might be used.

Also, as you allude to below, there's no way to seed an encoder with a
particular codebook (AFAIK).
Post by Aaron Colwell
Post by Steve Kann
I think that my use case isn't all that unusual though; it's somewhat
like the properties you might have in multicasting, I think.
1) It would be ideal if the RTP payload format could be made independent
of SDP.
It is currently independent of SDP if you use inline codebook transmission. The
info in the SDP just allows you to know ahead of time what the info and setup
headers are going to be for each chain. It also provides a mechanism to grab
the codebooks ahead of time. You also can save bits if you don't want
to periodically transmit codebooks.
I don't think so, I thought the latest proposal called for replacing the
"setup ident" field (32 bits) with a "chain ID" field (16 bits or so),
where the "chain ID" field would refer to a "chain-info" item in the SDP.

This would mean that even for an RTP- and SDP-based conference
application like mine, if a client joined the conference with a
different codebook, then all the clients would need to re-fetch the SDP
in order to identify the codebook that's needed.

But, I guess, I haven't seen (maybe it hasn't been written yet), how the
inline codebook transfer would work.

In general, I don't think that my issues are unique to IAX2, nor do I
think that they are things that can't be made to work with whatever
format you have. But the questions are 1) how complex will these
implementations need to be, and 2) how will they perform.

Most video codecs have the property that a "switch" (which is basically
what my conferencing application is) can switch between streams from
different sources at any keyframe, as long as the width, height, and
frame rates are the same (and in some cases even if they're not), without
needing to negotiate with the receiver at all. This can make a switch
fairly simple: it only needs to know, for each frame, whether it's a
keyframe or not, and can then treat the whole thing as opaque data.
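
A minimal Python sketch of such a switch, with hypothetical names. It assumes the transport tells the switch which source a frame came from and whether it is a keyframe, and it never looks inside the payload.

```python
class KeyframeSwitch:
    """Forward one source's frames; change source only at a keyframe."""

    def __init__(self, source):
        self.current = source  # source currently being forwarded
        self.pending = None    # source we want to switch to, if any

    def request(self, source):
        """Ask to show a different source at its next keyframe."""
        self.pending = source if source != self.current else None

    def forward(self, source, is_keyframe, payload):
        """Return the payload to send on, or None to drop the frame."""
        if source == self.pending and is_keyframe:
            self.current = source
            self.pending = None
        return payload if source == self.current else None
```

Until the requested source produces a keyframe, the old source keeps being forwarded, so the receiver never sees a delta frame without its reference.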

Theora has already moved away from this goal a bunch with the whole
codebook thing, but it would be nice to at least minimize the
inconvenience of dealing with the codebooks as much as possible.
Post by Aaron Colwell
Post by Steve Kann
[the present theora rtp format exhibits this property; if you use
periodic inline
setup header transmission]
2) It would be ideal if the RTP payload format continued to allow inline
setup header
transmission.
To my knowledge we weren't going to get rid of inline transmission. I had
always intended to keep it.
Would the format be the same as it is now (+- the setup header ident
field)? Would there be some way outside of SDP to indicate which
codebooks belonged to which "chain ID"?
Post by Aaron Colwell
Post by Steve Kann
It would be most convenient, if there were a "fixed setup" mode for
theora, where you could ask the theora-encoder to use fixed setup header
set, and have it act like other codecs in this respect. I understand
the flexibility that the setup headers give you in encoder design, but
it would be nice if there were a way to configure it otherwise..
If the encoder allowed you to specify a codebook on initialization, you could
effectively do this. Basically your app could just always specify the same
codebook to the encoder and then send the hash to the other participants.
They would then verify that your hash matches the hash of their codebook and
then you're done. This is basically the codebook cache hit scenario. If you get
a miss then you just make the connection to the conference fail.
Right. Something like this would allow for the most bit-efficient
method, because we would rarely, if ever, need to retransmit codebooks if
we can control all the clients and force them to use the same codebook.

One of the other things I'll need to do eventually is to "record" these
conferences into some container, and make that format
forward-compatible. If all the clients use the same codebooks, that also
makes things much simpler, because we could write this all out as one
"chain".


-SteveK


Aaron Colwell
2005-04-18 15:05:27 UTC
Post by Steve Kann
With the latest idea I've read, though, it makes this process much more
inconvenient, because _each_ client would have their own 16bit "chain
ID", and these chain ID's would be duplicated in the streams sent by
each client, and therefore the server would need to deeply understand
and parse each of the streams in order to put them together, etc.
What did you do in the case where the CRC32 was different from each of the
clients? This is basically the same scenario isn't it?
In that case, the "setup header ident" of the payloads would be different,
and the receiver would know that it needs to wait until it receives setup
headers matching the CRC of these frames before decoding. So, the only
time we could accidentally try to decode frames using an incorrect set of
setup headers would be in the case of a CRC collision (P ~= 0).
If the payload format only includes a "chain ID", then the chances of two
streams having the same "chain ID" when coming from different sources is
pretty much P==1. So, the server that's doing the switching would need to
actually muck with the payload in order to give each sender a different
Chain ID, and then keep track of which was which, etc. It makes it
impossible to just switch senders in the server without the server
understanding the internals of the codec payload.
OK, now I see.
Post by Steve Kann
I know nothing about IAX2, but I would assume that it has some sort of
offer/answer model to negotiate codec parameters and such. You could easily
put the chain ID in this negotiation so that all users in the conference use
the same codebook.
Presently, it's pretty simple, where it allows negotiation of the codec,
but not codec parameters. In practice, it hasn't been necessary to do
that. In the future, it might need to be extended to do so.
I see. I suppose you could append the codebook hash to the codec name. Instead
of just "Theora" you could have "Theora-2982394872479842". I don't know if
there are any limitations on the codec name.
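
As a rough Python sketch of that naming idea: the helper names below are hypothetical, and the 8-digit hex suffix is just one possible encoding of a 32-bit hash, not something IAX2 or any draft defines.

```python
def codec_name(codebook_hash):
    """Build a codec name such as "Theora-deadbeef" from a 32-bit hash."""
    return "Theora-%08x" % (codebook_hash & 0xFFFFFFFF)


def parse_codec_name(name):
    """Return the hash carried in the name, or None if there is none."""
    base, sep, suffix = name.partition("-")
    if base != "Theora" or not sep:
        return None
    try:
        return int(suffix, 16)
    except ValueError:
        return None
```

Two endpoints would then only agree on the codec when their names, and therefore their codebook hashes, match exactly.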
Post by Steve Kann
But, consider that users will join and leave the conference at arbitrary
times, so the conference engine can't know in advance all the codebooks
that might be used.
This isn't necessarily a problem. It would just need to know the codebooks that
it allows to be used in the conference. That's basically what happens with all
other codecs. It's just implicit in their case instead of explicit like it
would have to be for Theora.
Post by Steve Kann
Also, as you allude to below, there's no way to seed an encoder with a
particular codebook (AFAIK).
I think that my use case isn't all that unusual though; it's somewhat
like the properties you might have in multicasting, I think.
1) It would be ideal if the RTP payload format could be made independent
of SDP.
It is currently independent of SDP if you use inline codebook transmission. The
info in the SDP just allows you to know ahead of time what the info and setup
headers are going to be for each chain. It also provides a mechanism to grab
the codebooks ahead of time. You also can save bits if you don't want
to periodically transmit codebooks.
I don't think so, I thought the latest proposal called for replacing the
"setup ident" field (32 bits) with a "chain ID" field (16 bits or so),
where the "chain ID" field would refer to a "chain-info" item in the SDP.
The chain ID really just represents a group of packets that use a particular
ident & codebook pair. You don't have to refer to the SDP. If the client
knows that there is going to be inline header and codebook transmission it
can just wait for the ident and codebook for that chain to arrive. In most
situations those would arrive before any data packets, but in your switching
situation, that might not happen.
Post by Steve Kann
This would mean, that even for a RTP and SDP based conference application
like mine, if a client joined the conference with a different codebook,
then all the clients would need to re-fetch the SDP in order to identify
the codebook that's needed.
No, it wouldn't. It can just wait for the inline headers to arrive just like it
would in the CRC32 case.
Post by Steve Kann
But, I guess, I haven't seen (maybe it hasn't been written yet), how the
inline codebook transfer would work.
The way I envision it is that ident and codebook packets would be transmitted
just like data packets. They would have a chain ID associated with them as
well. This allows you to determine what headers go with which stream.
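
A rough Python sketch of what a receiver could do under this scheme. The packet kinds and class name are illustrative assumptions only, since the inline format has not been written down yet.

```python
class ChainReceiver:
    """Hold back data packets until a chain's ident and codebook are known."""

    def __init__(self):
        self._headers = {}  # chain ID -> {"ident": bytes, "codebook": bytes}

    def on_packet(self, chain_id, kind, payload):
        """Return a decodable data payload, or None if nothing can be decoded."""
        if kind in ("ident", "codebook"):
            # Header packets carry the chain ID too, so file them by chain.
            self._headers.setdefault(chain_id, {})[kind] = payload
            return None
        headers = self._headers.get(chain_id, {})
        if "ident" in headers and "codebook" in headers:
            return payload
        return None  # headers for this chain have not arrived yet
```

In the switching case the data packets may arrive first, so they are dropped until both headers for that chain have been received.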
Post by Steve Kann
In general, I don't think that my issues are unique to IAX2; Nor do I
think that they are things that can't be made to work with whatever format
you have. But the questions are 1) how complex will these
implementations need to be, and 2) how will they perform.
Most video codecs have the property where a "switch" (which is basically
what my conferencing application is), can "switch" between streams from
different sources, at any keyframe, as long as the width, height, and
framerates are the same (and in some cases even if they're not), without
needing to negotiate with the receiver at all. This can make a switch
fairly simple; It only needs to know, for each frame, whether it's a
keyframe or not, and then treat the whole thing as opaque data.
In these cases the switch needs to know datatype-specific info about the codec.
I'm assuming it cracks open the payload to determine the frame size and frame
rate. All Theora does is add the codebook to the switch criteria. How are you
doing switching right now for Theora? The server would still need to keep state
for each client since the frame size and frame rate are not in the frame data.
Are you not enforcing those criteria for Theora? Is this something that is
determined at the time the client connects to the server?
Post by Steve Kann
Theora has already moved away from this goal a bunch with the whole
codebook thing, but it would be nice to at least minimize the
inconvenience of dealing with the codebooks as much as possible.
[the present theora rtp format exhibits this property; if you use
periodic inline
setup header transmission]
2) It would be ideal if the RTP payload format continued to allow inline
setup header
transmission.
To my knowledge we weren't going to get rid of inline transmission. I had
always intended to keep it.
Would the format be the same as it is now (+- the setup header ident
field)? Would there be some way outside of SDP to indicate which
codebooks belonged to which "chain id?"
The ident and codebook packets would have a chain ID in them.
Post by Steve Kann
It would be most convenient, if there were a "fixed setup" mode for
theora, where you could ask the theora-encoder to use fixed setup header
set, and have it act like other codecs in this respect. I understand
the flexibility that the setup headers give you in encoder design, but
it would be nice if there were a way to configure it otherwise..
If the encoder allowed you to specify a codebook on initialization, you could
effectively do this. Basically your app could just always specify the same
codebook to the encoder and then send the hash to the other participants.
They would then verify that your hash matches the hash of their codebook and
then you're done. This is basically the codebook cache hit scenario. If you get
a miss then you just make the connection to the conference fail.
Right. Something like this would allow for the most bit-efficient method,
because we could (rarely, if ever) retransmit codebooks if we can control
all the clients, and force them to use the same codebook.
One of the other things I'll need to do eventually is to "record" these
conferences, into some container, and make that format
forward-compatible; If all the clients use the same codebooks, that also
makes things much simpler, because we could write this all out as one
"chain".
Thanks for bringing this up. It's nice to have input from a different use case.
I'm fine with making changes to the current thinking, but I just want to make
sure that we have a good understanding of the problems.

The whole reason that we went from the CRC32 -> chain ID thinking was because
there was concern about collisions in this value. Unique chain IDs fix that
problem, but cause a problem for you because you want a system that doesn't
have to worry about the chainID -> codebook mapping.
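
For a sense of scale on the collision concern, here is a back-of-the-envelope Python sketch using the standard birthday approximation. It assumes the 32-bit values are uniformly distributed across distinct codebooks, which is an idealisation, and the function name is made up.

```python
import math


def collision_probability(n, bits=32):
    """Approximate chance that any two of n distinct codebooks share an ident.

    Birthday approximation: 1 - exp(-n * (n - 1) / (2 * 2**bits)).
    """
    x = n * (n - 1) / (2.0 * 2 ** bits)
    return -math.expm1(-x)  # equals 1 - exp(-x), accurate for tiny x
```

Under that assumption, a conference with a handful of distinct codebooks sits very close to zero, while the figure grows roughly with the square of the number of codebooks in play.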

What isn't clear to me is whether your problem space actually needs to allow
arbitrary codebooks. It seems to me that allowing this causes more headaches
than it's worth, since you could potentially waste a ton of bits on codebook
transmission if every client in the conference uses a slightly different
codebook.

Aaron
Steve Kann
2005-04-18 17:12:54 UTC
Post by Aaron Colwell
Post by Steve Kann
Hi, List,
I've been working on building an implementation of a
video-conferencing endpoint using Theora, and have been working with the
draft-kerr-avt-theora-rtp-00 spec.
I've also read the archives of this list, about some of the proposed
changes. I'd like to describe here what I'm planning on doing, and see
how this might fit into your design.
Basically, what I'm working with is a project called "iaxclient".
iaxclient is a library for a VoIP softphone, which presently supports
only audio, but I am extending to support video as well. It uses the
IAX2 protocol, which is a lightweight VoIP protocol that does _not_ use
RTP. However, the payload format for IAX2 is generally compatible with
the payload format for RTP. Asterisk (the open-source PBX) includes
support for RTP-based VoIP protocols (SIP, H.323, etc), as well as
non-RTP-based VoIP protocols (IAX2, others).
There are basically two use cases for users making videoconferencing
1) Point-to-Point calls: This case seems to be pretty easy to
2) Multi-party conferences: This is where some of the designs I've
seen so far seem to work well, and some of them do not.
The basic idea for multi-party conferences is that each user
maintains a virtual connection to a "conference engine" (this is
already in place for audio conferences). The conference engine
intelligently receives audio from the clients and sends audio to the
clients, so each client can hear the audio of any other speaking
participants.
The idea for video is that the clients each send their video to the
conference engine, and the conference engine will send zero or one video
stream to each participant, in one of two "modes"
a) Automatic mode: The conference engine will use some
heuristics to decide whose video should be shown to the participant --
Generally, this will be the only participant who is presently speaking
(in the case of multiple active speakers, or zero active speakers, there
will be some secondary criteria).
b) Request mode: The client itself will notify the conference
engine (perhaps out-of-band) and request to see a particular speaker's
video.
What this means for the video stream (and this works just fine for
any other video format, e.g. h.26x) is that we would like to be
able to change the video source at any time (or, at any keyframe at
least).
The whole setup headers business, of course, makes this design
particularly difficult. With the present draft-kerr-avt-theora-rtp-00
format, though, I think I could probably (with a great deal of
unnecessary overhead), send the setup headers occasionally, and then
switch at any time. The clients could then use "header caching", and,
if they've seen these headers before (matching CRC32), they could use
their cached copy, and if not, they'd just have to wait a few seconds to
get them before they could start decoding.
*Note: I also suspect, but I haven't researched, that if all the
clients are using the same version of the theora encoder, and the same
settings, that their setup headers would likely be the same; If this is
the case, then their CRC32's would be the same, and they could start
decoding at any keyframe..
With the latest idea I've read, though, it makes this process much more
inconvenient, because _each_ client would have their own 16bit "chain
ID", and these chain ID's would be duplicated in the streams sent by
each client, and therefore the server would need to deeply understand
and parse each of the streams in order to put them together, etc.
What did you do in the case where the CRC32 was different from each of the
clients? This is basically the same scenario isn't it?
In that case, the "setup header ident" of the payloads would be different,
and the receiver would know that it needs to wait until it receives setup
headers matching the CRC of these frames before decoding. So, the only
time we could accidentally try to decode frames using an incorrect set of
setup headers would be in the case of a CRC collision (P ~= 0).
If the payload format only includes a "chain ID", then the chances of two
streams having the same "chain ID" when coming from different sources is
pretty much P==1. So, the server that's doing the switching would need to
actually muck with the payload in order to give each sender a different
Chain ID, and then keep track of which was which, etc. It makes it
impossible to just switch senders in the server without the server
understanding the internals of the codec payload.
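As a rough illustration of the CRC32-keyed header caching described above (a sketch only, not part of the draft): the receiver keeps every setup header it has seen, keyed by CRC32, and decodes a frame only when the frame's "setup header ident" matches a cached entry. The class and function names here are hypothetical, and the draft's exact CRC input may differ from what `zlib.crc32` is fed in this example.

```python
import zlib


class SetupHeaderCache:
    """Receiver-side cache of setup headers, keyed by their CRC32 (hypothetical helper)."""

    def __init__(self):
        self._by_crc = {}

    def store(self, setup_header: bytes) -> int:
        # Compute the ident the same way for every header we cache.
        crc = zlib.crc32(setup_header) & 0xFFFFFFFF
        self._by_crc[crc] = setup_header
        return crc

    def lookup(self, setup_ident: int):
        # Returns the cached header bytes, or None on a cache miss.
        return self._by_crc.get(setup_ident)


def can_decode(cache: SetupHeaderCache, setup_ident: int) -> bool:
    # A frame is decodable only once matching setup headers have arrived.
    return cache.lookup(setup_ident) is not None
```

On a miss the receiver simply waits for the next inline setup header transmission; the switch in the middle never has to look inside the payload.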
Ok now I see.
Post by Steve Kann
I know nothing about IAX2, but I would assume that it has some sort of
offer/answer model to negotiate codec parameters and such. You could easily
put the chain ID in this negotiation so that all users in the conference use
the same codebook.
Presently, it's pretty simple, where it allows negotiation of the codec,
but not codec parameters. In practice, it hasn't been necessary to do
that. In the future, it might need to be extended to do so.
I see. I suppose you could append the codebook hash to the codec name. Instead
of just "Theora" you could have "Theora-2982394872479842". I don't know if
there are any limitations on codebook name.
See below..
Post by Aaron Colwell
Post by Steve Kann
But, consider that users will join and leave the conference at arbitrary
times, so the conference engine can't know in advance all the codebooks
that might be used.
This isn't necessarily a problem. It would just need to know the codebooks that
it allows to be used in the conference. That's basically what happens with all
other codecs. It's just implicit in their case instead of explicit like it
would have to be for Theora.
Right, but presently, there's no way to force the encoder to use a
particular codebook; it seems like the encoder uses the same codebook
all the time, depending on compile-time rather than run-time settings
(not sure about this, though).
Post by Aaron Colwell
Post by Steve Kann
Also, as you allude to below, there's no way to seed an encoder with a
particular codebook (AFAIK).
I think that my use case isn't all that unusual though; it's somewhat
like the properties you might have in multicasting, I think.
1) It would be ideal if the RTP payload format could be made independent
of SDP.
It is currently independent of SDP if you use inline codebook transmission. The
info in the SDP just allows you to know ahead of time what the info and setup
headers are going to be for each chain. It also provides a mechanism to grab
the codebooks ahead of time. You also can save bits if you don't want
to periodically transmit codebooks.
I don't think so, I thought the latest proposal called for replacing the
"setup ident" field (32 bits) with a "chain ID" field (16 bits or so),
where the "chain ID" field would refer to a "chain-info" item in the SDP.
The chain ID really just represents a group of packets that use a particular
ident & codebook pair. You don't have to refer to the SDP. If the client
knows that there is going to be inline header and codebook transmission it
can just wait for the ident and codebook for that chain to arrive. In most
situations those would arrive before any data packets, but in your switching
situation, that might not happen.
Post by Steve Kann
This would mean, that even for a RTP and SDP based conference application
like mine, if a client joined the conference with a different codebook,
then all the clients would need to re-fetch the SDP in order to identify
the codebook that's needed.
No it wouldn't. It can just wait for the inline headers to arrive just like it
would in the CRC32 case.
Post by Steve Kann
But, I guess, I haven't seen (maybe it hasn't been written yet), how the
inline codebook transfer would work.
The way I envision it is that ident and codebook packets would be transmitted
just like data packets. They would have a chain ID associated with them as
well. This allows you to determine what headers go with which stream.
OK, if this is the case, switching can happen without needing to
reference SDP, but then the server _still_ needs to understand, and
modify the streams that it sends out to each client. In particular, it
would need to:

1) Parse all the packets looking for in-line setup headers.
2) Keep a mapping between a "conference" chain ID, and the
<sender><chain-ID> for all the codebooks it has seen.
3) For each frame that comes in, it would need to re-write the chain-IDs
for each video frame, as well as each setup-header, translating the
sender's chain-ID to a conference chain-ID.

This seems like it will be quite some amount of work.. If instead, the
CRC-32 of the codebook set was used (i.e., like it is now, except using
both the codebooks and "info" headers), none of this would be necessary..
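The bookkeeping in steps 2 and 3 above might look roughly like the following. This is a minimal sketch under stated assumptions: the class name, the use of the raw setup header bytes as the de-duplication key, and the integer conference chain IDs are all invented for illustration, and step 1 (finding the inline setup headers in the packet stream) is assumed to have happened already.

```python
class ChainIdRemapper:
    """Maps each (sender, sender chain ID) pair to a conference-wide chain ID (hypothetical)."""

    def __init__(self):
        self._by_sender = {}  # (sender, sender_chain_id) -> conference chain ID
        self._by_setup = {}   # setup header bytes -> conference chain ID
        self._next_id = 0

    def learn_setup(self, sender, sender_chain_id, setup_header: bytes) -> int:
        # Senders with identical setup headers share one conference chain ID.
        if setup_header not in self._by_setup:
            self._by_setup[setup_header] = self._next_id
            self._next_id += 1
        conf_id = self._by_setup[setup_header]
        self._by_sender[(sender, sender_chain_id)] = conf_id
        return conf_id

    def rewrite(self, sender, sender_chain_id):
        # Returns the conference chain ID to write into an outgoing packet,
        # or None if no setup header has been seen for this chain yet.
        return self._by_sender.get((sender, sender_chain_id))
```

Every forwarded frame and setup header packet would need its chain ID field replaced with the value `rewrite` returns, which is exactly the per-packet payload rewriting the CRC32 scheme avoids.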
Post by Aaron Colwell
Post by Steve Kann
In general, I don't think that my issues are unique to IAX2; Nor do I
think that they are things that can't be made to work with whatever format
you have. But the questions are 1) how complex will these
implementations need to be, and 2) how will they perform.
Most video codecs have the property where a "switch" (which is basically
what my conferencing application is), can "switch" between streams from
different sources, at any keyframe, as long as the width, height, and
framerates are the same (and in some cases even if they're not), without
needing to negotiate with the receiver at all. This can make a switch
fairly simple; It only needs to know, for each frame, whether it's a
keyframe or not, and then treat the whole thing as opaque data.
In these cases the switch needs to know datatype specific info about the codec.
I'm assuming it cracks open the payload to determine the frame size and frame
rate. All Theora does is add codebook to the switch criteria. How are you
doing switching right now for Theora?
I'm not doing this switching at all yet; At the moment, app_conference
(the switch) handles audio only, and not video. At the moment, the
released version of my client supports audio only; In my development
code, I have video capture, encoding, packetization, depacketization,
decoding and display working [with plenty of shortcuts still present,
like implicitly only supporting YUV420P format, etc].

The "switch" is a module that goes into asterisk. Asterisk does know a
bit about some audio codecs (it includes translators [encoders/decoders]
for some), but for other audio formats, and for video data, it just
treats them as opaque, and will pass them through if both sides agree
that they support them.
Post by Aaron Colwell
The server would still need to keep state
for each client since the frame size & frame rate is not in the frame data.
I think that for other video codecs, it is (I'm not sure about this,
though) [frame rate and size]. I'm not actually sure that frame rate is
needed at all, though, since frames are all timestamped with a
timestamps synchronized to audio.

Apparently (although I haven't played with this stuff myself), asterisk
is able to (a) connect party-to-party calls, and (b) store and play back
"video voice mail", for video codecs H.261, H.263, H.263+, without
knowing anything at all about the video stream other than the timestamps
on individual packets, and the format for the stream.
Post by Aaron Colwell
Are you not enforcing that criteria for Theora? Is this something that is
determined at the time the client connects to the server?
Post by Steve Kann
Theora has already moved away from this goal a bunch with the whole
codebook thing, but it would be nice to at least minimize the
inconvenience of dealing with the codebooks as much as possible.
[the present theora rtp format exhibits this property; if you use
periodic inline
setup header transmission]
2) It would be ideal if the RTP payload format continued to allow inline
setup header
transmission.
To my knowledge we weren't going to get rid of inline transmission. I had
always intended to keep it.
Would the format be the same as it is now (+- the setup header ident
field)? Would there be some way outside of SDP to indicate which
codebooks belonged to which "chain id?"
The ident and codebook packets would have a chain ID in them.
Post by Steve Kann
It would be most convenient, if there were a "fixed setup" mode for
theora, where you could ask the theora-encoder to use fixed setup header
set, and have it act like other codecs in this respect. I understand
the flexibility that the setup headers give you in encoder design, but
it would be nice if there were a way to configure it otherwise..
If the encoder allowed you to specify a codebook on initialization, you could
effectively do this. Basically your app could just always specify the same
codebook to the encoder and then send the hash to the other participants.
They would then verify that your hash matches the hash of their codebook and
then you're done. This is basically the codebook cache hit scenario. If you get
a miss then you just make the connection to the conference fail.
Right. Something like this would allow for the most bit-efficient method,
because we could (rarely, if ever) retransmit codebooks if we can control
all the clients, and force them to use the same codebook.
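A sketch of that "cache hit or fail" admission check, assuming every client can produce the bytes of its setup header before joining. The hash function is an assumption: the thread only says "hash", and SHA-256 is used here purely for illustration. The function names are hypothetical.

```python
import hashlib


def codebook_hash(setup_header: bytes) -> str:
    # Any stable hash would do; SHA-256 is an arbitrary choice for this sketch.
    return hashlib.sha256(setup_header).hexdigest()


def admit_client(conference_hash: str, client_setup_header: bytes) -> bool:
    # Cache hit: the client's codebook matches the conference's, so no
    # codebook ever needs to be transmitted. Cache miss: refuse the join.
    return codebook_hash(client_setup_header) == conference_hash
```

With this policy the conference engine never forwards setup headers at all; a client with a different codebook is rejected at connect time instead of stalling mid-call.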
One of the other things I'll need to do eventually is to "record" these
conferences, into some container, and make that format
forward-compatible; If all the clients use the same codebooks, that also
makes things much simpler, because we could write this all out as one
"chain".
Thanks for bringing this up. It's nice to have input from a different use case.
I'm fine with making changes to the current thinking, but I just want to make
sure that we have a good understanding of the problems.
Thanks for discussing it with me as well. I'm not necessarily set in my
thinking about things, and the idea that I had in mind might not be the
best. Basically, I think that the whole setup-header business is going
to make the implementation of Theora into programs a lot more
complicated than it is to drop in another codec which doesn't require
all this extra stuff to happen.

In one particular use case, (off-line encoding to .ogg files), all this
isn't much of a headache. But for use-cases like this, and perhaps for
many others, this is quite a headache. For example, If I had all this
working with h.263 (or h.264), and I wanted to switch to theora, it
would be quite a job, because compared to the design of most video
codecs, theora is a square peg when you might have a round hole..

Of course, the upside is, patent licensing headaches are probably bigger
headaches than codebook transmission stuff :)
Post by Aaron Colwell
The whole reason that we went from the CRC32 -> chain ID thinking was because
there was concern about collisions in this value. Unique chain IDs fix that
problem, but cause a problem for you because you want a system that doesn't
have to worry about the chainID -> codebook mapping.
What isn't clear to me is whether your problem space actually needs to allow
arbitrary codebooks. It seems to me that allowing this causes more headaches
than it's worth since you could potentially waste a ton of bits on codebook
transmission if every client in the conference uses a slightly different
codebook.
Absolutely, it would be much easier to do, if I could just use the
theora implementation with fixed codebooks, and not have to worry about
any of this stuff. If VP3 codebooks were an option, that would be
excellent.

I suspect that if all the clients are using the same theora
implementation, and the same settings (framerate, frame size, etc),
then, even as theora improves, they'll end up with the same codebooks.
In that case, with the CRC32 method, they would be able to avoid getting
codebooks altogether (the codebooks they'd generate themselves would
have the same CRC32 as the codebook they get from the very first packet,
and they'd be able to feed themselves).

-SteveK

Ralph Giles
2005-04-18 18:49:44 UTC
Post by Steve Kann
In one particular use case, (off-line encoding to .ogg files), all this
isn't much of a headache. But for use-cases like this, and perhaps for
many others, this is quite a headache. For example, If I had all this
working with h.263 (or h.264), and I wanted to switch to theora, it
would be quite a job, because compared to the design of most video
codecs, theora is a square peg when you might have a round hole..
Yes, this is all about the configuration header which is different from
the way most other codecs are designed. (Or, as Aaron points
out, the configuration being more than frame size and rate.) The Vorbis
audio codec has all the same issues.

Our motivation here was the longevity of the baseline jpeg image format,
still an excellent choice 15 years after it was first developed. We
didn't think we could be equally well tuned in our first release, so
we designed the format so encoders could have maximum flexibility
without having to upgrade the installed base of decoders.

It may be that we've bet wrong. Much of the world hasn't seemed to mind
upgrading repeatedly for each new incompatible iteration of the 'Windows
Media' format, or "AAC" or even "MPEG-4 video", perhaps because OS
vendors are shipping the upgrades as a normal part of their systems. So
longevity (of the codec, not the brand) hasn't been an issue in the last
couple of years.
Post by Steve Kann
Absolutely, it would be much easier to do, if I could just use the
theora implementation with fixed codebooks, and not have to worry about
any of this stuff. If VP3 codebooks were an option, that would be
excellent.
So we could use one of the 8 reserved bits in my 32-bit aligned payload
header proposal to mark something like this. I remain unconvinced of the
value though.

In regards to instructing the encoder on what decoder setup to use,
Derf's experimental encoder already supports this, and it's on the list
for the revised reference encoder api. So while you can't do this now
without some hacking, you should be able to in the future, including
configuring the encoder from a set of codebooks pulled from another
stream.

-r
Steve Kann
2005-04-18 19:26:04 UTC
Post by Ralph Giles
Post by Steve Kann
In one particular use case, (off-line encoding to .ogg files), all this
isn't much of a headache. But for use-cases like this, and perhaps for
many others, this is quite a headache. For example, If I had all this
working with h.263 (or h.264), and I wanted to switch to theora, it
would be quite a job, because compared to the design of most video
codecs, theora is a square peg when you might have a round hole..
Yes, this is all about the configuration header which is different from
the way most other codecs are designed. (Or, as Aaron points
out, the configuration being more than frame size and rate.) The Vorbis
audio codec has all the same issues.
Our motivation here was the longevity of the baseline jpeg image format,
still an excellent choice 15 years after it was first developed. We
didn't think we could be equally well tuned in our first release, so
we designed the format so encoders could have maximum flexibility
without having to upgrade the installed base of decoders.
It may be that we've bet wrong. Much of the world hasn't seemed to mind
upgrading repeatedly for each new incompatible iteration of the 'Windows
Media' format, or "AAC" or even "MPEG-4 video", perhaps because OS
vendors are shipping the upgrades as a normal part of their systems. So
longevity (of the codec, not the brand) hasn't been an issue in the last
couple of years.
I buy it, but I think that there might be some compromise that's possible:

1) Include in Theora I a set of predefined codebooks. (This set might be
1 right now, I'm not sure..).

2) Include these codebooks in the specifications for decoders;

3) ALSO include in the specifications for decoders the ability to load
new codebooks, etc, dynamically, as they do now.

Then, when, encoding, you have choices, you can choose to use a
predefined codebook, and then operate more or less like other codecs do,
or you can have the ability to use new and optimized codebooks.

It doesn't limit any uses at all to offer this, but it does give you the
ability to operate the encoder in the predefined codebook mode, without
needing to go through the whole codebook transfer mode.

Then, if newer/better codebooks are developed (say, for Theora 1.2), you
have a choice about whether you want the encoder to use these as
predefined codebooks or not. If you choose not to, then you can still
use them as "dynamic codebooks", and have lost nothing in compatibility.
Post by Ralph Giles
Post by Steve Kann
Absolutely, it would be much easier to do, if I could just use the
theora implementation with fixed codebooks, and not have to worry about
any of this stuff. If VP3 codebooks were an option, that would be
excellent.
So we could use one of the 8 reserved bits in my 32-bit aligned payload
header proposal to mark something like this. I remain unconvinced of the
value though.
In regards to instructing the encoder on what decoder setup to use,
Derf's experimental encoder already supports this, and it's on the list
for the revised reference encoder api. So while you can't do this now
without some hacking, you should be able to in the future, including
configuring the encoder from a set of codebooks pulled from another
stream.
That sounds interesting.. I do plan to experiment with some of the
experimental work out there; The theora-mmx branch looks interesting,
because presently performance is really just barely adequate for
real-time conferencing. Of the other open-source (but not patent-free)
encoders out there, it seems that ffmpeg (for any of its mpeg/h263
variants) and x264 really blow theora away at the moment.

-SteveK
Ralph Giles
2005-04-18 19:56:31 UTC
Post by Steve Kann
That sounds interesting.. I do plan to experiment with some of the
experimental work out there; The theora-mmx branch looks interesting,
because presently performance is really just barely adequate for
real-time conferencing. Of the other open-source (but not patent-free)
encoders out there, it seems that ffmpeg (for any of its mpeg/h263
variants) and x264 really blow theora away at the moment.
Yeah, you really need the -mmx branch for conferencing. There are some
decoder asm patches available as well, and I've been working on a
parallel version recently for the Elphel people.

The ffmpeg and xvid implementations have seen a lot more optimization
work than theora has. We're getting there, but working on encumbered
codecs seems to be sexier...

-r
Steve Kann
2005-04-18 20:03:32 UTC
Post by Ralph Giles
Post by Steve Kann
That sounds interesting.. I do plan to experiment with some of the
experimental work out there; The theora-mmx branch looks interesting,
because presently performance is really just barely adequate for
real-time conferencing. Of the other open-source (but not patent-free)
encoders out there, it seems that ffmpeg (for any of its mpeg/h263
variants) and x264 really blow theora away at the moment.
Yeah, you really need the -mmx branch for conferencing. There are some
decoder asm patches available as well, and I've been working on a
parallel version recently for the Elphel people.
The ffmpeg and xvid implementations have seen a lot more optimization
work than theora has. We're getting there, but working on encumbered
codecs seems to be sexier...
Maybe. I've been working with Speex for some time, so I'll help where I
can -- but in the asterisk community, people are enamored with iLBC
(free, but not open-source), and G.72x (patents and licenses and all).

I'll also need to do optimization work for PPC, because my application
needs to be able to perform well on PPC as well as x86.

I'm sure there's headroom there to make things work, so it's not my
primary concern.. The quality I see so far with this seems acceptable,
but x264 seems to be still better..

-SteveK

Ralph Giles
2005-04-18 19:51:46 UTC
Post by Ralph Giles
Post by Steve Kann
In one particular use case, (off-line encoding to .ogg files), all this
isn't much of a headache. But for use-cases like this, and perhaps for
many others, this is quite a headache. For example, If I had all this
working with h.263 (or h.264), and I wanted to switch to theora, it
would be quite a job, because compared to the design of most video
codecs, theora is a square peg when you might have a round hole..
Yes, this is all about the configuration header which is different from
the way most other codecs are designed.
Just to be clear, the flexibility of the vorbis setup headers has
served us very well. The irony of that statement is that linux
distributions are the only significant os vendors shipping our codecs as
a matter of course. The fact that a beta3 decoder release can play
files from aoTuVb4 with better quality at half the bitrate is a
significant achievement.

So yes, the flexibility means more work at the front end, and yes the
CRC32-as-ident proposal would have traded the explicit chainid mapping
table for an implicit one. We've generally found dealing with the setup
overhead isn't as complex as you're expecting. The idea is that doing a
little more work up front is easier than having to mass-upgrade your
installed base in two years.

It's nice when it's easy to get things 'just working' quickly, but it's
also nice to do things right. You were already talking about negotiating
a common frame size and rate, and the rtp server mixing the streams
together, which I understand affects the SSRC and CSRC RTP header
fields, only switching on keyframes and so on, all of which requires at
least a little bit of codec knowledge. And theora, at least, is designed
so things like header and keyframe packet detection can be done easily
without a full decode. (Just by looking at the first byte for those
cases.)
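To make the "first byte" remark concrete, here is a small sketch of classifying a raw Theora packet without decoding it. It relies on the Theora bitstream layout as I understand it: header packets have the top bit of the first byte set (0x80 identification, 0x81 comment, 0x82 setup), and data packets have it clear, with the next bit clear for a keyframe (intra frame) and set for an inter frame. It applies to codec packets as they come out of the encoder, not to any RTP payload header wrapped around them; the function name is hypothetical.

```python
def classify_theora_packet(packet: bytes) -> str:
    """Classify a raw Theora packet by inspecting only its first byte."""
    if not packet:
        # Zero-length packets carry no first byte to inspect.
        return "empty"
    first = packet[0]
    if first & 0x80:
        # Header packets: the low bits of the first byte give the header type.
        headers = {0x80: "ident-header", 0x81: "comment-header", 0x82: "setup-header"}
        return headers.get(first, "unknown-header")
    if first & 0x40:
        return "inter-frame"
    return "keyframe"
```

A switch that only changes video source when `classify_theora_packet` returns `"keyframe"` needs no more codec knowledge than this one-byte test, plus whatever it does about setup headers.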

Our concern with defining profiles, like the 'VP3' bit I suggested has
always been encouraging inoperable implementations that only support
that profile. "profiles are useless" has been a common lesson of many
specification designs. They make committee decisions easier, but in the
end you either implement the de facto standard or you don't. Those are
the main reasons I remain unconvinced.

Note also that while the chain id lets you multiplex streams from
encoders using a different setup, you don't have to do it that way.
Your application might be better served by mandating that everyone
use the same profile and then not worry about chaining at all. That's
more like the situation you have with fixed setup codecs.

I hope that explains the design reasoning a bit better, and why we've
been resistant to things like static codebook sets. We do very much
appreciate your opinion and contribution to the design discussion and
are very willing to help you figure out what needs to be done to make
your implementation work well.

-r
Steve Kann
2005-04-18 20:26:21 UTC
Post by Ralph Giles
Our concern with defining profiles, like the 'VP3' bit I suggested has
always been encouraging inoperable implementations that only support
that profile. "profiles are useless" has been a common lesson of many
specification designs. They make committee decisions easier, but in the
end you either implement the de facto standard or you don't. Those are
the main reasons I remain unconvinced.
Maybe my case is an outlier, and maybe I'm just lazy :), but for this
situation, a fixed codebook would make life a lot easier. Since the
application is videoconferencing, the fact that the encoding format
would be (at the moment) limited to a fixed codebook would not affect
future decoders, and I would be keeping the on-disk recordings in a
format which does include the codebooks (*).

At some point in the future, when better codebooks are developed, we can
improve the conferencing engine and such to support them, without
changing anything about the decoders.
Post by Ralph Giles
Note also that while the chain id lets you multiplex streams from
encoders using a different setup, you don't have to do it that way.
Your application might be better served by mandating that everyone
use the same profile and then not worry about chaining at all. That's
more like the situation you have with fixed setup codecs.
And this would work for me, given an API to force the decoders to do
this. (presently, I think it will happen automatically, given all the
same parameters).
Post by Ralph Giles
I hope that explains the design reasoning a bit better, and why we've
been resistant to things like static codebook sets. We do very much
appreciate your opinion and contribution to the design discussion and
are very willing to help you figure out what needs to be done to make
your implementation work well.
Thanks so much for your thoughtful responses.

(*) Speaking of that format, one of the things that I've been pondering
is whether to use ogg as that format at all; The problem that ogg does
not solve that I'd like to solve in my on-disk format is I'd like the
format to be able to store a "lost frame" as a lost frame. In other
words, there is a VoIP client that is sending Speex via some unreliable
protocol. At the receive end, we see 10% packet loss. I'd like the
file format to store the speex frames like this: 1 2 3 4 5 . 7 8 . 10
(with frames 6, 9 noted as "missing"), such that at playback time, I can
interpolate for those lost frames just like we would do if they were
played in real-time. I guess this is another thread to discuss somewhere
else, though :)
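A minimal sketch of the "store a lost frame as a lost frame" idea, independent of any container format: frames are laid out by sequence number, and every gap is recorded explicitly so a player can run its loss concealment at playback time. The function names and the use of `None` as the "missing" marker are assumptions made for illustration.

```python
def lay_out_with_gaps(received: dict) -> list:
    """Turn {sequence number: payload} into a dense timeline with explicit gaps."""
    if not received:
        return []
    first, last = min(received), max(received)
    # Missing sequence numbers are kept as (seq, None) rather than dropped.
    return [(seq, received.get(seq)) for seq in range(first, last + 1)]


def render(timeline: list) -> str:
    # Human-readable view: a "." marks each frame noted as missing.
    return " ".join("." if payload is None else str(seq) for seq, payload in timeline)
```

For the example in the message, frames 1 through 10 with 6 and 9 lost, `render` produces `1 2 3 4 5 . 7 8 . 10`.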
Phil Kerr
2005-04-19 05:02:43 UTC
Hi Steve,

Thanks for the comments and feedback.
Post by Steve Kann
Post by Ralph Giles
Our concern with defining profiles, like the 'VP3' bit I suggested
has always been encouraging inoperable implementations that only
support that profile. "profiles are useless" has been a common lesson
of many specification designs. They make committee decisions easier,
but in the end you either implement the de facto standard or you don't.
Those are the main reasons I remain unconvinced.
Maybe my case is an outlier, and maybe I'm just lazy :), but for this
situation, a fixed codebook would make life a lot easier. Since the
application is videoconferencing, the fact that the encoding format
would be (at the moment) limited to a fixed codebook would not affect
future decoders, and I would be keeping the on-disk recordings in a
format which does include the codebooks (*).
It certainly would make life easier, and for Theora it makes more sense
than Vorbis. But as soon as the protocols are re-worked to make them
fixed someone will come along with a compelling use scenario that
requires dynamic codebooks.

Trying to cater for such a wide range of uses isn't easy.
Post by Steve Kann
At some point in the future, when better codebooks are developed, we
can improve the conferencing engine and such to support them, without
changing anything about the decoders.
The only problem is that if the protocols are fixed in one particular manner
(fixed codebooks), this will hinder developments for dynamic codebooks later.

The best option is to have a mechanism which allows the use of one
codebook to be made easy, and the scaling up to be possible.

Cheers

Phil

Michael Smith
2005-04-18 20:32:41 UTC
Post by Ralph Giles
Our concern with defining profiles, like the 'VP3' bit I suggested has
always been encouraging inoperable implementations that only support
that profile. "profiles are useless" has been a common lesson of many
specification designs. They make committee decisions easier, but in the
end you either implement the de facto standard or you don't. Those are
the main reasons I remain unconvinced.
A side note here:

Though the profile-less codebook based mechanism has served vorbis
well so far, we're now seriously considering introducing some sort of
profile mechanism. The reasoning behind this is that the existing
embedded implementations _already_ don't support the full
specification (they can't really - full vorbis can require a lot of
memory in pathological cases, and even normal streams can commonly
exceed what the more limited flash-based devices can manage), but
there's some impetus to standardise the subset that they do support.
Without that, the "de-facto standard" is actually different between
different manufacturers, and we have no way to label 'vorbis
compliant'.

Mike
Ralph Giles
2005-04-18 21:40:51 UTC
Permalink
Post by Michael Smith
Though the profile-less codebook based mechanism has served vorbis
well so far, we're now seriously considering introducing some sort of
profile mechanism.
(Now thoroughly off topic)

This is a little different, isn't it? The things we'd talked about
putting in the profile are limits on block length, codebook size,
required buffers and so on, not actually fixing the setup header
itself. Right?

-r
Ralph Giles
2005-04-18 14:28:18 UTC
Permalink
Post by Steve Kann
I've also read the archives of this list, about some of the proposed
changes. I'd like to describe here what I'm planning on doing, and see
how this might fit into your design.
Yay feedback! :-)
Post by Steve Kann
*Note: I also suspect, but I haven't researched, that if all the
clients are using the same version of the theora encoder, and the same
settings, their setup headers would likely be the same; if this is
the case, then their CRC32s would be the same, and they could start
decoding at any keyframe.
This is correct. Future encoders may well adapt to their input, so your
conference engine should just check if the headers are the same. Note
that you can also do things like reencode the stream up to the next
keyframe if you want to switch someone outside a keyframe boundary, but
that reduces quality and uses a lot more resources than packet
switching.
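
As a rough illustration of the header check described above, here is a
minimal Python sketch. The function names are hypothetical and not part
of libtheora or the draft; it assumes the conference engine already
holds each participant's raw setup header as bytes. CRC32 is used only
as a cheap first test, with a full byte comparison to rule out
collisions.

```python
import zlib


def setup_fingerprint(setup_header: bytes) -> int:
    """CRC32 of the raw setup header, as an unsigned 32-bit value."""
    return zlib.crc32(setup_header) & 0xFFFFFFFF


def can_packet_switch(current_setup: bytes, new_setup: bytes) -> bool:
    """True if a viewer decoding with current_setup can also decode a
    stream that was encoded with new_setup (hypothetical helper)."""
    # Cheap check first: different CRCs mean different headers.
    if setup_fingerprint(current_setup) != setup_fingerprint(new_setup):
        return False
    # Equal CRCs do not prove equality, so compare the bytes as well.
    return current_setup == new_setup
```

If the check fails, the engine would have to deliver the new setup
header to the viewer (or reencode) before switching streams.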
Post by Steve Kann
The latest idea I've read, though, makes this process much more
inconvenient, because _each_ client would have their own 16-bit "chain
ID", and these chain IDs would be duplicated in the streams sent by
each client, and therefore the server would need to deeply understand
and parse each of the streams in order to put them together, etc.
Aaron addressed this, but to clarify: the idea with the chain id is that
it's a simple mark on each packet telling the client what decoder setup
to use to decode it. Making the mapping between the chain id and the
decoder is an out-of-band process. If you're not using SDP, it's your
protocol's responsibility to set the mapping you want.
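
To make the chain id idea concrete, here is a small Python sketch of the
receiver-side table. The class and method names are hypothetical; how
the mapping gets populated (SDP or some other signalling) is
deliberately left out, since that is the out-of-band part. The 16-bit
range is taken from the discussion above.

```python
from typing import Dict


class ChainTable:
    """Maps a 16-bit chain id to the setup header it stands for."""

    def __init__(self) -> None:
        self._setups: Dict[int, bytes] = {}

    def register(self, chain_id: int, setup_header: bytes) -> None:
        """Record a mapping learned out-of-band."""
        if not 0 <= chain_id <= 0xFFFF:
            raise ValueError("chain id must fit in 16 bits")
        self._setups[chain_id] = setup_header

    def setup_for(self, chain_id: int) -> bytes:
        """Look up the setup header for the id carried in a packet.

        Raises KeyError if the mapping has not been signalled yet.
        """
        return self._setups[chain_id]
```

A decoder would call setup_for() with the id from each incoming packet
and reinitialise itself only when the returned header changes.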
Post by Steve Kann
It would be most convenient if there were a "fixed setup" mode for
theora, where you could ask the theora encoder to use a fixed setup
header set, and have it act like other codecs in this respect. I
understand the flexibility that the setup headers give you in encoder
design, but it would be nice if there were a way to configure it
otherwise.
Well, if we did this officially, it would limit future encoder
improvements. Better, we think, to leave such profiles up to
particular applications.

That said, the fixed config used by the VP3 codec theora is based on
is one reasonable baseline.

-r
Steve Kann
2005-04-18 14:47:23 UTC
Permalink
Post by Ralph Giles
Post by Steve Kann
I've also read the archives of this list, about some of the proposed
changes. I'd like to describe here what I'm planning on doing, and see
how this might fit into your design.
Yay feedback! :-)
Post by Steve Kann
*Note: I also suspect, but I haven't researched, that if all the
clients are using the same version of the theora encoder, and the same
settings, their setup headers would likely be the same; if this is
the case, then their CRC32s would be the same, and they could start
decoding at any keyframe.
This is correct. Future encoders may well adapt to their input, so your
conference engine should just check if the headers are the same. Note
that you can also do things like reencode the stream up to the next
keyframe if you want to switch someone outside a keyframe boundary, but
that reduces quality and uses a lot more resources than packet
switching.
Yes. I was thinking that a good compromise might be to cache the last
keyframe from each participant, and then, if we want to switch
in-between keyframes, I can send the previous keyframe from a
participant out, then send nothing until the next keyframe comes, and
then send everything.

In that case, viewers would see the switch instantly, but motion
wouldn't begin until the next keyframe appeared (which would be in a
second or two, at most).
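
The switching behaviour described above could be sketched roughly like
this in Python. All names are hypothetical, and the sketch assumes the
engine is told by the caller whether each incoming frame is a keyframe;
it handles a single viewer for brevity.

```python
from typing import Dict, List, Optional


class VideoSwitcher:
    """Caches each participant's last keyframe and switches one viewer."""

    def __init__(self) -> None:
        self._last_keyframe: Dict[str, bytes] = {}
        self._selected: Optional[str] = None
        self._waiting_for_keyframe = False

    def select(self, participant: str) -> List[bytes]:
        """Switch to participant; return frames to send right away."""
        self._selected = participant
        # Hold back delta frames until a fresh keyframe arrives.
        self._waiting_for_keyframe = True
        cached = self._last_keyframe.get(participant)
        return [cached] if cached is not None else []

    def on_frame(self, participant: str, frame: bytes,
                 is_keyframe: bool) -> List[bytes]:
        """Handle an incoming frame; return frames to forward."""
        if is_keyframe:
            self._last_keyframe[participant] = frame
        if participant != self._selected:
            return []
        if self._waiting_for_keyframe:
            if not is_keyframe:
                return []  # viewer keeps showing the cached still image
            self._waiting_for_keyframe = False
        return [frame]
```

The viewer gets a still image immediately on select(), and motion
resumes from the first keyframe that arrives afterwards.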
Post by Ralph Giles
Post by Steve Kann
The latest idea I've read, though, makes this process much more
inconvenient, because _each_ client would have their own 16-bit "chain
ID", and these chain IDs would be duplicated in the streams sent by
each client, and therefore the server would need to deeply understand
and parse each of the streams in order to put them together, etc.
Aaron addressed this, but to clarify: the idea with the chain id is that
it's a simple mark on each packet telling the client what decoder setup
to use to decode it. Making the mapping between the chain id and the
decoder is an out-of-band process. If you're not using SDP, it's your
protocol's responsibility to set the mapping you want.
But, even using SDP, this is pretty inconvenient, because you
essentially need to have all your clients re-fetch the SDP each time
someone joins the session (or, if encoders change, any time an encoder
wants to use a new codebook). For SIP, for example, that will be pretty
disruptive, needing to do a REINVITE and all.
Post by Ralph Giles
Post by Steve Kann
It would be most convenient if there were a "fixed setup" mode for
theora, where you could ask the theora encoder to use a fixed setup
header set, and have it act like other codecs in this respect. I
understand the flexibility that the setup headers give you in encoder
design, but it would be nice if there were a way to configure it
otherwise.
Well, if we did this officially, it would limit future encoder
improvements. Better, we think, to leave such profiles up to
particular applications.
It wouldn't necessarily need to do that. You could define a set of
standard codebooks (maybe even just one), and then offer a greatly
compressed way of signifying that you're going to use this standard
codebook. Decoders would be required to accept either these "standard"
codebooks, or "dynamic" codebooks like they do now.

When transmitting the codebooks, you could have a small sequence at the
beginning saying "standard codebook N", or "dynamic codebook", so you
could transmit the "standard codebook N" stuff with just a few bytes,
instead of the 2 kilobytes or so it seems like they take now.

Then, the encoder has the choice to use a fixed codebook or a dynamic
codebook, and the only limitation on future improvements would be that
you can't introduce additional "standard codebooks" without introducing
compatibility problems.
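
A minimal Python sketch of what such a prefix could look like follows.
The tag values, the field widths, and the placeholder table of standard
codebooks are all invented for illustration; nothing here comes from the
draft or from libtheora.

```python
import struct

TAG_STANDARD = 0x00  # hypothetical: "standard codebook N" follows
TAG_DYNAMIC = 0x01   # hypothetical: a full setup header follows

# Placeholder contents; real standard codebooks would be fixed by a spec.
STANDARD_CODEBOOKS = {0: b"<standard codebook 0>"}


def pack_standard(n: int) -> bytes:
    """Two bytes in total: the tag and the standard codebook number."""
    return struct.pack("!BB", TAG_STANDARD, n)


def pack_dynamic(setup: bytes) -> bytes:
    """Tag, 16-bit length, then the full setup header."""
    return struct.pack("!BH", TAG_DYNAMIC, len(setup)) + setup


def unpack(data: bytes) -> bytes:
    """Return the setup header that the message refers to."""
    tag = data[0]
    if tag == TAG_STANDARD:
        return STANDARD_CODEBOOKS[data[1]]
    if tag == TAG_DYNAMIC:
        (length,) = struct.unpack_from("!H", data, 1)
        body = data[3:3 + length]
        if len(body) != length:
            raise ValueError("truncated setup header")
        return body
    raise ValueError("unknown codebook tag")
```

With this layout a standard codebook costs two bytes on the wire, while
a dynamic one costs its full size plus a three-byte prefix.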
Post by Ralph Giles
That said, the fixed config used by the VP3 codec theora is based on
is one reasonable baseline.
Is there any way to force that mode now?

I've basically just gotten to the point where I've got encoding and
decoding working in my end-point (after figuring out that in order to do
this in real-time, even for 320x240x15fps, I need to set quick_p=1 and
noise_sensitivity=0), and I haven't really dug into the Theora
codebase yet (other than to figure out what those parameters do) -- I
just saw this conversation happening, and figured I'd offer my use-case
out there.

-SteveK

