[xiph-rtp] about theora-over-rtp draft

Post by Simon Morlat
http://svn.xiph.org/trunk/theora/doc/draft-barbato-avt-rtp-theora-01.txt
(the most recent I've found).

Use the XML one; the txt MAY be slightly outdated from time to time.

Post by Simon Morlat
I'm the author and maintainer of linphone, a free software SIP video phone
(http://www.linphone.org) . I've been the first to implement speex over RTP
and I've contributed a little to the speex-over-rtp draft with Jean Marc
Valin and Greg Herlein (especially concerning SDP usage specification)

Post by Simon Morlat
While implementing theora support in linphone, I encountered several major
1/ about packed configuration header. This packed configuration header is
supposed to be theora header followed immediately by theora tables.
Unfortunately the current theora decoder is unable to decode such packed
configuration (it stops after the header and ignores the table) and as far as
I understand there's no way to retrieve where theora tables start when
receiving such a packet.

there is: the first packet is fixed in size. Check the example code in:

http://svn.xiph.org/trunk/xiph-rtp/

Post by Simon Morlat
-> as a consequence I've implemented differently: theora header and tables are
sent in different packets.

Wrong, nonstandard*, bad!

Post by Simon Morlat
2/ about fragment type. The draft defines 3 types: begin of packet,
continuation of packet, and end of packet. I think this is really very
redundant information: the receiver only needs to know the frontier between
video frames, nothing more. Setting the marker bit of the rtp header to 1 for
the last packet of a video frame is enough and much simpler.

I think it isn't.

Post by Simon Morlat
RTP (RFC3550)
tells it's up to payload specifications to indicate the meaning of this
markbit. There's no problem in using it. RFC2429-bis (payload spec for
H263-1998) does that.

We could look at it again, but if it was discarded even before I
joined the development of this rfc, there was probably a reason.

Post by Simon Morlat
Furthermore, for the fragmentation algorithm, it is painful to know whether a
fragment is an end-of-packet or a continuation packet.

Why? you just have to check the Fragment type field.

Post by Simon Morlat
And what if a
packet isn't fragmented at all, i.e. it is both the start and the end of a video
frame?

Hmm, is this part that unclear?

| This field is set according to the following list
| </t>
| <vspace blankLines="1" />
| <list style="empty">
| <t> 0 = Not Fragmented</t>

One or more theora frames or a full configuration in a single rtp datagram.

| <t> 1 = Start Fragment</t>

If I get this one I have to store it somewhere and expect type 2 fragments

| <t> 2 = Continuation Fragment</t>

keep on storing type 2

| <t> 3 = End Fragment</t>

end packet, build a full frame/packed configuration out of it.

| </list>
|
| <t>This field must be zero if the number of packets field is
| non-zero.</t>
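
The receiver behaviour described in the comments above (store on type 1, keep
storing on type 2, build the full unit on type 3) could be sketched roughly as
follows. This is only an illustration, not code from the draft or from the
xiph-rtp example: the names reasm_push and struct reasm and the buffer size are
assumptions, and sequence-number checks are left out, so a lost continuation
fragment would go unnoticed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fragment type values as listed in the draft excerpt above. */
enum frag_type { FRAG_NONE = 0, FRAG_START = 1, FRAG_CONT = 2, FRAG_END = 3 };

/* Hypothetical reassembly buffer; the size is an arbitrary choice. */
#define REASM_MAX 65536

struct reasm {
    unsigned char buf[REASM_MAX];
    size_t len;
    int in_chain;   /* 1 once a start fragment has been stored */
};

/*
 * Feed one payload. Returns 1 when r->buf holds a complete unit (frame or
 * packed configuration) of r->len bytes, 0 when more fragments are needed,
 * and -1 when the fragment is out of place or too large (pending data is
 * dropped in that case).
 */
int reasm_push(struct reasm *r, enum frag_type type,
               const unsigned char *data, size_t n)
{
    assert(r != NULL && (data != NULL || n == 0));

    if (type == FRAG_NONE || type == FRAG_START) {
        /* Both begin a new unit; any unfinished chain is discarded. */
        r->len = 0;
        r->in_chain = 0;
    } else if (!r->in_chain) {
        /* Continuation or end without a start: the start was lost. */
        r->len = 0;
        return -1;
    }

    if (n > REASM_MAX - r->len) {
        r->len = 0;
        r->in_chain = 0;
        return -1;
    }
    if (n > 0)
        memcpy(r->buf + r->len, data, n);
    r->len += n;

    if (type == FRAG_NONE)
        return 1;
    if (type == FRAG_END) {
        r->in_chain = 0;
        return 1;
    }
    r->in_chain = 1;
    return 0;
}
```

With this shape, an end fragment that arrives without its start is rejected
immediately instead of being handed to the decoder.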

Post by Simon Morlat
Note that the sequence number of the rtp header lets the application detect
incomplete frames.

I know.

Post by Simon Morlat
-> I used the marker bit to indicate end-of-frames packet.

If you use the marker bit you have a problem telling whether it is a full
packet or not; with the drafted way it is immediate.

Post by Simon Morlat
3/ I used inband sending of configuration headers. The inline SDP method has a
big problem for me: it forces the SDP offerer to configure its theora encoder
before even knowing about the bandwidth constraints of the remote side
(expressed using the b=<AS>: field of SDP).

No, you should use offer-answer [rfc3264] to reach an agreement between
client and server; see 6.2.

Post by Simon Morlat
Thus by taking into account all those preferences, each theora encoder can be
configured efficiently to fit the bandwidth requirements and the display
constraints of the remote side. The theora packed configuration packets can
then be sent inband (the method that I prefer), or through an alternate
method: (http, RTCP packet?) , but ONLY AFTER the SDP messages have been
exchanged.

I think Offer-Answer is a solution, maybe I could relax a bit the
constraint, but even as is it fits already your requirement:

- the encoder side knows already the maximum resolution, its side
maximum bitrate and could set a lower bound between resolution and
bandwidth using heuristics.

- the decoder side will receive an offer with a sufficiently wide set of
possibilities and it will pick the ones it supports
bandwidth/hw/protocol wise.

Post by Simon Morlat
For me it is very important to efficiently use bandwidth indications because
for example with usual DSL connections the bandwidth is sometimes limited to
128kbit/s (and very often in upload case). Doing CIF at 30 fps with high
quality coding is not possible in this situation. I found theora codec is
really efficient (CIF at 7 fps works with such DSL modems). But the
prerequisite for this to work is that the phone be able to configure its theora
encoder after receiving the SDP message from the remote side.
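
To make the point concrete, here is a rough sketch of an encoder being
configured only after the remote SDP has been read. The function names and the
thresholds are assumptions made up for illustration; only the b=AS: syntax
(kbit/s) and the "CIF at 7 fps at about 128 kbit/s" data point come from the
discussion above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical encoder settings chosen after reading the remote SDP. */
struct video_cfg { int width, height, fps; };

/* Parse a "b=AS:<kbit/s>" line; returns -1 if the line is something else. */
int parse_bandwidth_as(const char *line)
{
    int kbps;

    assert(line != NULL);
    if (sscanf(line, "b=AS:%d", &kbps) != 1 || kbps <= 0)
        return -1;
    return kbps;
}

/*
 * Pick a configuration from the bandwidth the remote side asked for.
 * The 128 and 512 kbit/s thresholds are illustrative assumptions; an
 * unknown bandwidth (-1) falls back to the most conservative choice.
 */
struct video_cfg choose_cfg(int remote_kbps)
{
    struct video_cfg qcif_low = {176, 144, 7};
    struct video_cfg cif_low  = {352, 288, 7};
    struct video_cfg cif_high = {352, 288, 15};

    if (remote_kbps >= 512)
        return cif_high;
    if (remote_kbps >= 128)
        return cif_low;
    return qcif_low;
}
```

The only point of the sketch is the ordering: choose_cfg() can run only once
the other side's b=AS: value is known, which is after the SDP exchange.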

Relaxing the constraint so that the receiver may answer with an altered
reply, with all the parameters bounded between the highest and the lowest
values it got in the first offer, would probably lead to more corner
cases than you may want to handle.

Post by Simon Morlat
Finally the format I've used in my implementation (see

[a completely nonstandard implementation]

Post by Simon Morlat
Finally, I would expect this draft to tell how to split a big theora
frame in several mtu-sized packets in a way that would make a partially
received frame usable by the decoder. In other words, how to be as safe as
possible in case of packet losses. But I don't know whether this is something
possible, I don't know enough about the internals of theora.

The current draft expects a lost one - lost everything scenario.

Post by Simon Morlat
That's all for my comments. I just want to try to keep the world as simple as
possible and bring my developer experience as well as my user-experience of
video-telephony.
Although I've made reference to RFC2429-bis (H263-1998) I don't consider this
paper as an example to follow, I'm sure we can do better.

Please reread the draft; I'm afraid you missed some sections, and that
means that it isn't crystal clear yet.

Post by Simon Morlat
I don't want linphone to be an out-of-standards video phone, so I would really
like it to implement the draft you are working on. However I would really
like this future RFC to be as clear and simple as possible. I'm really
rfc2190, amr over rtp, mpeg4 over rtp...).
I think with a good RFC, theora would be really superior to MPEG4 in the real
time streaming world.

Theora is quite a good fit for your specific application IMHO, I think
we will find a good way to make it work on linphone.

Post by Simon Morlat
Thanks a lot for reading this, I'm waiting for your feedback.
Also, I'd like to thank Mr. Barbato for all the work he has already done with
this draft.

I'm happy to see that someone started implementing rtp-theora.

Thank you for your feedback

lu

--
Luca Barbato

Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero

Rhys Hawkins

2006-07-20 21:27:46 UTC

On Fri, 21 Jul 2006 00:46:27 +0200

The current draft expects a lost one - lost everything scenario.

I said something similar a little while ago and I still think the theora spec
should aim for greater fault tolerance. This may require consultation with the
theora people to add the necessary hooks into their api to do partial frame
decoding. But at the very least the rtp spec should recommend a robust way of
breaking up the theora bitstream, like macroblocks in H261, or restart intervals
in JPEG.

The "lost one - lost everything" approach will work on the local network, but
not over wide area multicast (for example), and I'm not interested in sending
video to the computer next to me (and I'd be surprised if anybody was:)

Cheers,
Rhys

Luca Barbato

2006-07-20 22:38:46 UTC

Post by Rhys Hawkins
The "lost one - lost everything" approach will work on the local network, but
not over wide area multicast (for example), and I'm not interested in sending
video to the computer next to me (and I'd be surprised if anybody was:)

Well, I could relax the rule but I don't like at all the idea of making
the spec more complex.

lu

--
Luca Barbato

Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero

Rhys Hawkins

2006-07-21 04:36:23 UTC

On Fri, 21 Jul 2006 02:38:08 +0200

Well, I could relax the rule but I don't like at all the idea of making
the spec more complex.

I agree that it would make things more complicated. You would need extra fields
in the headers to describe where a particular chunk of theora data
starts, e.g. macroblock or group-of-blocks number. It would also require modification
to libtheora to support partial frame decoding. It doesn't seem to me that
there is an easy short term solution. However, I believe that the benefits in
terms of robustness to packet loss would outweigh the costs in terms of complexity.

Cheers,
Rhys

Luca Barbato

2006-07-21 07:16:02 UTC

Post by Rhys Hawkins
I agree that it would make things more complicated. You would need extra fields
in the headers to describe where a particular chunk of theora data
starts, e.g. macroblock or group-of-blocks number.

Not really.

Post by Rhys Hawkins
It would also require modification
to libtheora to support partial frame decoding.

Out of the scope of the draft...

Post by Rhys Hawkins
It doesn't seem to me that
there is an easy short term solution.

That's why I didn't put any constraint about how to fragment and assumed
the worst case about default behaviour.

Post by Rhys Hawkins
However, I believe that the benefits in
terms of robustness to packet loss would outweigh the costs in terms of complexity.

Theora gives its best at low bitrates, so you shouldn't really
expect long chains of theora fragments.

I'm afraid that only real-life situations will tell us whether having a finer
fragment rule would improve the situation.

lu

--
Luca Barbato

Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero

Simon Morlat

2006-07-21 08:09:17 UTC

Thanks for your quick feedback!

Post by Luca Barbato
http://svn.xiph.org/trunk/xiph-rtp/

If it is a known fact that the header is fixed size, then it should be written in the rfc.
However I don't think it's a good idea: it may prevent theora developers from evolving the codec.

Post by Simon Morlat
2/ about fragment type. The draft defines 3 types: begin of packet,
continuation of packet, and end of packet. I think this is really very
redundant information: the receiver only needs to know the frontier
between video frames, nothing more. Setting the marker bit of the rtp
header to 1 for the last packet of a video frame is enough and much
simpler.

I think it isn't.

Why ?

We could look at it again, but if it was discarded even before I
joined the development of this rfc, there was probably a reason.

You can find the rfc2429 draft here:
http://www.ietf.org/internet-drafts/draft-ietf-avt-rfc2429-bis-09.txt

Post by Simon Morlat
Furthermore, for the fragmentation algorithm, it is painful to know
whether a fragment is an end-of-packet or a continuation packet.

Why? you just have to check the Fragment type field.

I was talking about the algorithm at the encoding side that splits a theora frame into smaller packets, not the algorithm at the receiving side.

Post by Simon Morlat
-> I used the marker bit to indicate end-of-frames packet.

If you use the marker bit you have a problem telling whether it is a full
packet or not; with the drafted way it is immediate.

What problem? The algorithm takes four lines:
packet=rtp_get();
accumulate_into_internal_buffer(packet);
if (!marker_bit_set(packet))
    process_accumulated_buffer_and_clean_it();

Post by Simon Morlat
3/ I used inband sending of configuration headers. The inline SDP method
has a big problem for me: it forces the SDP offerer to configure its
theora encoder before even knowing about the bandwidth constraints of the
remote side (expressed using the b=<AS>: field of SDP).

No, you should use offer-answer[rfc3264] to have an agreement between
client and server see 6.2

Please explain how the SDP offerer could propose a theora configuration inlined
in the SDP message before even knowing the receiving prerequisites of the other
side. This cannot work. Typically, if the offerer wants to send CIF at a high bit
rate but the other side cannot receive anything other than QCIF at a very low
bitrate, you have NO way to give it the configuration string that fits those
requirements.
The SDP offer/answer is made of only 2 messages!

Taken from RFC3264:
"If the bandwidth attribute is present for a stream, it indicates the
desired bandwidth that the offerer would like to receive. "

If the SDP offerer sends its theora configuration in the SDP offer message, it logically CANNOT take into account the bandwidth attribute sent by the remote side.

Post by Luca Barbato
I think Offer-Answer is a solution, maybe I could relax a bit the
- the encoder side knows already the maximum resolution, its side
maximum bitrate and could set a lower bound between resolution and
bandwidth using heuristics.
- the decoder side will receive an offer with a sufficiently wide set of
possibilities and it will pick the ones it supports
bandwidth/hw/protocol wise.

But how will the decoder side inform the encoder side of the config it has chosen?
Furthermore most streams are full duplex: there are two encoder sides and two decoder sides. We should talk about SDP offerer and answerer.
Let's imagine the SDP offerer suggests 3 theora configs using various bandwidth constraints.
The SDP answerer can eventually tell which one of the configs it chooses. But it also sends a theora stream and will also suggest various theora configs in the same way.
Unfortunately the SDP offerer has no way to indicate the config it chooses, as there is no third message.

As far as I understand, the only SDP offer/answer model that works is the one where each side tells its preferences and constraints about the stream it wishes to receive.
Remember also that many internet accesses are asymmetric: we really need both sides to indicate their receiving capabilities.

Post by Luca Barbato
[a completely nonstandard implementation]

Unfortunately... I want my software to work so that I can see my family and friends. As we only have 512/128 ADSL connections, it's important that each side properly takes into account the receiving capabilities of the other side.

Lastly, I forgot to tell about the possibility offered by the draft to put several theora frames in a single rtp packet. My opinion is that this is completely useless because RTP is done for real-time streaming, and putting several theora frames in the same rtp packet means buffering theora frames before sending them, thus it's no longer real-time.
Although this possibility is often offered by audio codecs to save bandwidth by reducing rtp overhead, it has never been used for video codecs, simply because the huge size of video packets compared to the rtp header makes the bandwidth gain very limited. Thus it's simpler to rely on the underlying protocol (UDP/RTP) to know the size of the video packets, and assume that each RTP packet contains at most 1 video frame.

Simon

Ralph Giles

2006-07-21 15:21:22 UTC

Post by Simon Morlat
If it is a known fact that the header is fixed size, then it should be written in the rfc.
However I don't think it's a good idea: it may prevent theora developers from evolving the codec.

I agree with both of these points, especially because the implicit
length isn't a limitation of the ogg embedding. If we did change
the length of the first header in a future rev of theora, we'd
almost certainly also bump the version number, so you could switch
on that, but it's still a constraint we'd have to remember.

Perhaps more importantly, the current theora spec allows additional
header packets if they are optional for decode (custom colourspace
definitions, application-private data and so on). Having these would
also break the 'only one length field' hack.

So we should either go to 'length field for every packet in the packed
header format' or we need language about how such extra headers cannot
be transmitted this way, and to define the length of the identification
header.

Apparently I'm being careful not to express a decision here. :)

Simon, as regards the fragmentation bits, this was designed for vorbis,
where it's helpful to know you have a truncated packet because the
decoder can handle that for partial decode. You can infer this just
as well from your marker bit scheme, but I think there's more value in
having a consistent scheme between the codecs.

One advantage here would be that if we can use the marker bit for
fragmentation, we could drop the fragmentation *and* packet type
fields and only go back to a 32 bit configuration id. (or better yet,
drop that too! :-) That would be a serious simplification, perhaps worth
losing the fragmented/packed distinction.

Anyway, I want to say thanks to you too, having feedback from actual
implementation is invaluable.

Cheers,
-r

Luca Barbato

2006-07-21 17:06:29 UTC

Post by Ralph Giles
Apparently I'm being careful not to express a decision here. :)

I smell problems if they cannot be placed at the tail of the 3rd header (so
you'll just keep parsing)

Post by Ralph Giles
One advantage here would be that if we can use the marker bit for
fragmentation, we could drop the fragmentation *and* packet type
fields and only go back to a 32 bit configuration id. (or better yet,
drop that too! :-) That would be a serious simplification, perhaps worth
losing the fragmented/packed distinction.

I'm not so happy about it, I'd rather have something consistent between
vorbis and theora that doesn't require a mixer/repeater to decode them
fully.

lu

--
Luca Barbato

Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero

Luca Barbato

2006-07-21 18:05:33 UTC

Post by Simon Morlat
Thanks for your quick feedback!

Just a request: configure your client to break the lines at the 79th
col, thank you

Post by Luca Barbato
http://svn.xiph.org/trunk/xiph-rtp/

If this is a known fact that the header is fixed size, then it should be
written in the rfc.

The rfc assumes that people will read the normative refs, I'd just move
the theora ref to the normative section.

Post by Simon Morlat
However I don't think it's a good idea: it may prevent theora developers
from evolving the codec.

The codec theora is frozen, changes will lead to theora-newversion.

Post by Luca Barbato
I think it isn't.

Why ?

you may have:

- fully enclosed frames per single frame.
- non raw video payloads.

Post by Simon Morlat
http://www.ietf.org/internet-drafts/draft-ietf-avt-rfc2429-bis-09.txt

Post by Simon Morlat
I was talking about the algorithm at the encoding side that splits
a theora frame into smaller packets, not the algorithm at the receiving side.

see my example code. dead easy (too much) ^^;
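
For readers without the repository at hand, the sender-side split under
discussion might look roughly like this. It is a sketch only, not the example
code from xiph-rtp: fragment_frame, emit_fn and the logging callback are
hypothetical names, and building the actual RTP and payload headers is left to
the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Fragment type values as listed in the draft. */
enum frag_type { FRAG_NONE = 0, FRAG_START = 1, FRAG_CONT = 2, FRAG_END = 3 };

/* Hypothetical callback that would wrap one chunk in an RTP packet. */
typedef void (*emit_fn)(enum frag_type type, const unsigned char *data,
                        size_t n, void *user);

/*
 * Split one frame into chunks of at most max_payload bytes and tag each
 * chunk with its fragment type. Returns the number of chunks emitted.
 */
size_t fragment_frame(const unsigned char *frame, size_t len,
                      size_t max_payload, emit_fn emit, void *user)
{
    size_t off = 0, count = 0;

    assert(frame != NULL && emit != NULL && max_payload > 0 && len > 0);

    if (len <= max_payload) {
        emit(FRAG_NONE, frame, len, user);   /* fits in one packet */
        return 1;
    }
    while (off < len) {
        size_t n = len - off;
        enum frag_type type;

        if (n > max_payload)
            n = max_payload;
        if (off == 0)
            type = FRAG_START;
        else if (off + n == len)
            type = FRAG_END;
        else
            type = FRAG_CONT;
        emit(type, frame + off, n, user);
        off += n;
        count++;
    }
    return count;
}

/* Example callback for experiments: records the emitted fragment types. */
struct frag_log { enum frag_type types[16]; size_t n; };

void log_emit(enum frag_type type, const unsigned char *data,
              size_t n, void *user)
{
    struct frag_log *log = user;

    (void)data;
    (void)n;
    if (log->n < 16)
        log->types[log->n++] = type;
}
```

Deciding between "continuation" and "end" only needs the remaining length, so
the sender never has to look ahead beyond the frame it is splitting.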

Post by Simon Morlat
-> I used the marker bit to indicate end-of-frames packet.

If you use the marker bit you have a problem telling whether it is a full
packet or not; with the drafted way it is immediate.

packet=rtp_get();
accumulate_into_internal_buffer(packet);
if (!marker_bit_set(packet))
process_accumulated_buffer_and_clean_it();

you lose something at the start of a fragment chain, what would you do?

If it were just another single packet (marker bit not set) you just
ignore this loss, if you had lost the start of a fragmented frame you'd
feed the decoder with something more or less unexpected.

         1         2 time
12345678901234567890
NMNNNMMMMMMNNMMMNNNN bit Marked or Notmarked
^lost ^lost
_||__|||||||_||||___ _fullframe |frags
FSEFFSCCCCCEFSCCEFFF Fullframe, Startfrag, Contfrag, Endfrag

in my case you know what's lost.

Post by Simon Morlat
3/ I used inband sending of configuration headers. The inline SDP method
has a big problem for me: it forces the SDP offerer to configure its
theora encoder before even knowing about the bandwidth constraints of the
remote side (expressed using the b=<AS>: field of SDP).

No, you should use offer-answer[rfc3264] to have an agreement between
client and server see 6.2

At any time, either agent MAY generate a new offer that updates the
session. However, it MUST NOT generate a new offer if it has
received an offer which it has not yet answered or rejected.
Furthermore, it MUST NOT generate a new offer if it has generated a
prior offer for which it has not yet received an answer or a
rejection. If an agent receives an offer after having sent one, but
before receiving an answer to it, this is considered a "glare"
condition. [rfc3264 - 4 Protocol Operation]
See further:
http://www.ietf.org/internet-drafts/draft-xu-mmusic-sdp-codec-param-01.txt

Post by Simon Morlat
"If the bandwidth attribute is present for a stream, it indicates the
desired bandwidth that the offerer would like to receive. "
If the SDP offerer sends its theora configuration in the sdp offer message,
it logically CANNOT take in account the bandwidth attribute sent by the remote
side.

But if you send _multiple_ offers the client can pick which one they
like most; keep in mind that _nothing_ is preventing you from:

- using adaptive techniques like the ones supported by nemesi/fenice[1]
- using something like codec-param (I'm thinking about adding it soon)
- keeping the offer-answer ballet going till you agree.

Post by Simon Morlat
But how will the decoder side inform the encoder side of the config it has chosen?

The only remaining methods/configurations in the answer.

Post by Simon Morlat
Furthermore most streams are full duplex: there are two encoder sides and two decoder
sides. We should talk about SDP offerer and answerer.

yup

Post by Simon Morlat
Let's imagine the SDP offerer suggests 3 theora configs using various bandwidth constraints.
The SDP answerer can eventually tell which one of the configs it chooses.
But it also sends a theora stream and will also suggest various theora
configs in the same way.
Unfortunately the SDP offerer has no way to indicate the config it chooses,
as there is no third message.

why not?

Post by Simon Morlat
As far as I understand, the only SDP offer/answer model that works is the one
where each side tells its preferences and constraints about the stream it wishes
to receive.
Remember also that many internet accesses are asymmetric: we really need both
sides to indicate their receiving capabilities.

works too.

Post by Luca Barbato
[a completely nonstandard implementation]

see already mentioned rfcs

Post by Simon Morlat
Lastly, I forgot to tell about the possibility offered by the draft to put several
theora frames in a single rtp packet. My opinion is that this is completely useless
because RTP is done for real-time streaming, and putting several theora frames in the
same rtp packet means buffering theora frames before sending them, thus it's no
longer real-time.
Although this possibility is often offered by audio codecs to save bandwidth by
reducing rtp overhead, it has never been used for video codecs, simply because the
huge size of video packets compared to the rtp header makes the bandwidth gain
very limited. Thus it's simpler to rely on the underlying protocol
(UDP/RTP) to know the size of the video packets, and assume that each RTP packet
contains at most 1 video frame.

You forgot that:
- using rtp-theora for YouTube-like applications is not far from
possible (you want low bitrate with nice but not perfect quality)
- how many microseconds/milliseconds/seconds do you buffer in those not so
real-time applications?

the draft isn't preventing you from disregarding collated packets at all.

Keep in mind that the draft must not be perfect for a single task, but
cover most of the common usages and let you take just the part you need.

Thank you for the comments again =)

lu

--
Luca Barbato

Gentoo/linux Gentoo/PPC
http://dev.gentoo.org/~lu_zero

Simon Morlat

2006-07-24 08:31:11 UTC

Post by Luca Barbato
Just a request: configure your client to break the lines at the 79th
col, thank you

I'm sorry ! It's apparently the case, but perhaps it's buggy... I'm going to
send this one in pure text mode in case it works better.

Post by Luca Barbato
The rfc assumes that people will read the normative refs, I'd just move
the theora ref to the normative section.

I think everything you say in an RFC that helps implementors is really welcome
and avoids misinterpretations and buggy implementations.

Post by Luca Barbato
you lose something at the start of a fragment chain, what would you do?
If it were just another single packet (marker bit not set) you just
ignore this loss, if you had lost the start of a fragmented frame you'd
feed the decoder with something more or less unexpected.
         1         2 time
12345678901234567890
NMNNNMMMMMMNNMMMNNNN bit Marked or Notmarked
^lost ^lost
_||__|||||||_||||___ _fullframe |frags
FSEFFSCCCCCEFSCCEFFF Fullframe, Startfrag, Contfrag, Endfrag
in my case you know what's lost.

But you can do the same thing using the sequence number. Marker bit says: that
packet carries the end of frame. As a consequence a fullframe has the marker
bit (which is not the case in your example). Sequence number discontinuities
can be handled by the application by dropping packets until a marker bit is
received, then restart the accumulate algo I've described.
The only advantage I see of your technique (I've just realized it) is that you
can differentiate a FullFrame from an "end of frame with begin lost".
But anyway you could also decide to always pass a marked packet to the decoder
(regardless of whether it is a fullframe or just an end of frame with a possibly lost
start) and see what happens. If it is not a fullframe the theora decoder
won't be able to decode it.
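
The marker-bit variant described in this reply (accumulate until a marked
packet, and on a sequence-number gap drop everything up to the next marked
packet) could be sketched as follows. This is an illustration under stated
assumptions, not linphone's code: the names and the buffer size are invented,
packets are assumed to arrive in order, and the marker is taken to mean "last
packet of a frame".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_MAX 65536   /* arbitrary buffer size for this sketch */

struct marker_reasm {
    unsigned char buf[FRAME_MAX];
    size_t len;
    uint16_t next_seq;  /* RTP sequence number expected next */
    int started;        /* 0 until the first packet is seen */
    int skipping;       /* 1 while discarding a damaged frame */
    int complete;       /* 1 if the previous call returned a frame */
};

/*
 * Feed one RTP payload; marker is 1 on the last packet of a frame.
 * Returns 1 when buf/len hold a frame to hand to the decoder, else 0.
 */
int marker_push(struct marker_reasm *m, uint16_t seq, int marker,
                const unsigned char *data, size_t n)
{
    assert(m != NULL && (data != NULL || n == 0));

    if (m->complete) {            /* previous frame was consumed */
        m->len = 0;
        m->complete = 0;
    }
    /* A gap means at least one packet is missing, so whatever frame is
     * in progress can no longer be trusted. */
    if (m->started && seq != m->next_seq) {
        m->skipping = 1;
        m->len = 0;
    }
    m->started = 1;
    m->next_seq = (uint16_t)(seq + 1);   /* wraps at 65535 */

    if (m->skipping) {
        /* Drop until a marked packet closes the damaged frame. */
        if (marker)
            m->skipping = 0;
        return 0;
    }
    if (n > FRAME_MAX - m->len) {        /* oversized: discard frame */
        m->len = 0;
        m->skipping = !marker;
        return 0;
    }
    if (n > 0)
        memcpy(m->buf + m->len, data, n);
    m->len += n;
    m->complete = marker ? 1 : 0;
    return m->complete;
}
```

Note that when the lost packet was itself a marked one, this rule also throws
away the following frame even if it arrived intact, because the receiver cannot
tell where that frame started.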

Post by Luca Barbato
At any time, either agent MAY generate a new offer that updates the
session. However, it MUST NOT generate a new offer if it has
received an offer which it has not yet answered or rejected.
Furthermore, it MUST NOT generate a new offer if it has generated a
prior offer for which it has not yet received an answer or a
rejection. If an agent receives an offer after having sent one, but
before receiving an answer to it, this is considered a "glare"
condition. [rfc3264 - 4 Protocol Operation]

This paragraph talks about renegotiation. This technique is actually handled
by very, very few implementations and was not really designed for complex call
setup, but for call parameter changes during a call, for example port and codec
changes when entering a conference after a simple call, or a
client that decides to stop the video session but keep audio.

I think it is worth having an SDP scheme that allows setting up the session
efficiently and according to the network capabilities with a minimal message
exchange, I mean just an offer and an answer. This is possible in RFC2429-bis
and all audio codecs, why shouldn't it be possible with theora?

Post by Luca Barbato
But if you send _multiple_ offers the client can pick which one they
- use adaptative techs like the ones supported by nemesi/fenice[1]
- use something like codec-param (I'm thinking about adding it soon)

in a=fmtp line ?

Post by Luca Barbato
- keep the offer-answer ballet till you get to agree.

too complex and not the goal of RFC3264.

Post by Simon Morlat
But how will the decoder side inform the encoder side of the config it has chosen?

The only remaining methods/configurations in the answer.

So this assumes a symmetric stream, since only the answerer has the choice of the
format parameters... Not good.

Post by Simon Morlat
Let's imagine the SDP offerer suggests 3 theora configs using various
bandwidth constraints.
The SDP answerer can eventually tell which one of the configs it chooses.
But it also sends a theora stream and will also suggest various
theora configs in the same way.
Unfortunately the SDP offerer has no way to indicate the config it
chooses, as there is no third message.

why not?

It's an offer-answer model, not offer-answer-acknowledgement. A third message would
mean a renegotiation, i.e. the answerer answers and then re-INVITEs with a new
SDP offer, and the initial offerer will then have to answer... a great moment
of coding when implementing this...
Theora risks not being popular if you force implementors to do such things...

Post by Simon Morlat
Lastly, I forgot to tell about the possibility offered by the draft to
put several theora frames in a single rtp packet. My opinion is that this
is completely useless because RTP is done for real-time streaming, and
putting several theora frames in the same rtp packet means buffering theora
frames before sending them, thus it's no longer real-time.
Although this possibility is often offered by audio codecs to save
bandwidth by reducing rtp overhead, it has never been used for video
codecs, simply because the huge size of video packets compared to the rtp
header makes the bandwidth gain very limited. Thus it's
simpler to rely on the underlying protocol (UDP/RTP) to know the size of the
video packets, and assume that each RTP packet contains at most 1 video
frame.

- using rtp-theora for YouTube-like applications is not far from
possible (you want low bitrate with nice but not perfect quality)
- how many microseconds/milliseconds/seconds do you buffer in those not so
real-time applications?

At 15 fps this would make a 66 millisecond buffering. As the webcam usually
delivers frames with more or less 100 milliseconds latency, the result is not
very good. As voice can usually be transmitted with around 60 milliseconds of
latency (soundcard+encoding), this would force the application to delay voice
by 100 additional milliseconds to keep audio and video synchronised. Then
add the network transmission delay, around 50 milliseconds on xDSL, and you obtain
around 210 milliseconds; then you need to add 50 more milliseconds for jitter
compensation at the other end. That is a 260 millisecond end-to-end delay; users won't
accept that. In classic telephony or mobile telephony, it rarely exceeds 100
milliseconds.
In this scenario I agree that the video buffering is not the main cause of
latency, but it's clear that an implementor who needs to reduce the latency
of his telephony application will NEVER enjoy buffering video frames to
put them in a single RTP packet.
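
The arithmetic above can be checked with a few lines. The constants are the
figures quoted in this message; the function and constant names are only
illustrative, and the per-frame interval is truncated to whole milliseconds.

```c
#include <assert.h>

/* Figures quoted in the message above, in milliseconds. */
enum {
    VOICE_CAPTURE_MS = 60,   /* soundcard + encoding */
    AV_SYNC_DELAY_MS = 100,  /* extra voice delay to stay in sync with video */
    NETWORK_MS       = 50,   /* transmission delay on xDSL */
    JITTER_BUFFER_MS = 50    /* jitter compensation at the far end */
};

/* Time between two frames at the given rate, truncated to whole ms. */
int frame_interval_ms(int fps)
{
    assert(fps > 0);
    return 1000 / fps;
}

/* Delay up to and including the network leg. */
int delay_before_jitter_ms(void)
{
    return VOICE_CAPTURE_MS + AV_SYNC_DELAY_MS + NETWORK_MS;
}

/* End-to-end delay once the receiver's jitter buffer is added. */
int end_to_end_ms(void)
{
    return delay_before_jitter_ms() + JITTER_BUFFER_MS;
}
```

Here frame_interval_ms(15) gives the 66 ms of buffering that holding back one
frame would cost, and end_to_end_ms() gives the 260 ms total.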

What, for you, are the benefits of being able to put several theora frames in a
single RTP packet?

Post by Luca Barbato
the draft isn't preventing you from disregarding collated packets at all.

The draft isn't preventing me from transmitting only packets that contain at most one
frame. But if I want to comply with the draft, I MUST be able to deal with
packets that contain several frames... mustn't I? Otherwise my application
will not be able to decode streams from another that uses this
functionality.
Relying on the packetisation already provided by UDP is more natural and makes
this work without the need for a single line of code.

To sum up all this discussion, I'd like this draft to:
- clarify the packed-conf message (limit between header and tables)
- explain an SDP operation that lets each side configure asymmetrically in a
simple offer-answer (only 2 messages) scheme (for me it implies NOT
transmitting inline encoder configuration in SDP, which prevents the offerer from
adapting to the other end).
- use fewer bits to indicate fragmentation (for me 1 bit is enough, 2 if you
wish to indicate begin of frame and end of frame). Whether this bit is the RTP
marker or not is not important.
- assume each RTP packet contains at most one frame.
The last two points aim at much simpler unpacketisation code.

Currently the unpacketisation code needed to implement this draft is more
complex than the RFC2429bis or RFC3016 packetisation, which I think is not
good if we want more and more people like me, or companies, to prefer
open-source technology over heavily patented ones.

Simon

note: I cc'd Aymeric Moizard (***@atosc.org), the libosip/eXosip (SIP and SDP
protocol implementations) author. He told me that he was interested in
this discussion.

Ralph Giles

2006-07-24 14:05:22 UTC

Post by Simon Morlat
But you can do the same thing using the sequence number. Marker bit says: that
packet carries the end of frame. As a consequence a fullframe has the marker
bit (which is not the case in your example). Sequence number discontinuities
can be handled by the application by dropping packets until a marker bit is
received, then restart the accumulate algo I've described.
The only advantage I see of your technique (I've just realized it) is that you
can differentiate a FullFrame from an "end of frame with begin lost".
But anyway you could also decide to always pass a marked packet to the decoder
(regardless of whether it is a fullframe or just an end of frame with a
possibly lost start) and see what happens. If it is not a fullframe the theora
decoder won't be able to decode it.

I think you've pointed out another difference between the schemes here:
latency. With the 2 fragmentation bits you know immediately if you've
lost a frame, while with the marker bit scheme you don't know until you see
the next packet with a marker bit. In a naive implementation you lag
either way, but if you're going to do something to interpolate the
missing frame, you have more time to do that if you don't have to wait
until the next marker bit falls out of the jitter buffer.

Or...can you infer the same thing from the timestamp?

Post by Simon Morlat
[...]
In this scenario I agree that the video buffering is not the main cause of
latency, but it's clear that an implementor who needs to reduce the latency
of his telephony application will NEVER enjoy buffering video frames to
put them in a single RTP packet.

Thanks for the excellent latency analysis. It's always nice to see
numbers. :)

Post by Simon Morlat
What, for you, are the benefits of being able to put several theora frames in
a single RTP packet?

Our other main design goal was to support multicast. Generally speaking,
non-interactive applications are the opposite case, where bandwidth use
is more important than latency. Luca pointed out a "youtube"-like
unicast streaming application, but IP multicast is the case where RTP
transport is absolutely essential.

It is for these situations that we provide the packing mechanism. So,
as you say, if you're implementing a receiver you need to handle this,
but you're free not to use it in your sender if it's not appropriate
for your application.

Post by Simon Morlat
- clarify the packed-conf message (limit between header and tables)

Agreed.

Post by Simon Morlat
- explain an SDP operation that lets each side configure asymmetrically in a
simple offer-answer (only 2 messages) scheme (for me this implies NOT
transmitting inline encoder configuration in SDP, which prevents the offerer
from adapting to the other end).

Sounds reasonable.

Post by Simon Morlat
- use fewer bits to indicate fragmentation (for me 1 bit is enough, 2 if you
wish to indicate begin of frame and end of frame). Whether this bit is the RTP
marker or not is not important.
- assume each RTP packet contains at most one frame.
The last two points aim at much simpler unpacketisation code.

I don't think you've made a convincing case here.

Post by Simon Morlat
Currently the unpacketisation code needed to implement this draft is more
complex than the RFC2429bis or RFC3016 packetisation, which I think is not
good if we want more and more people like me, or companies, to prefer
open-source technology over heavily patented ones.

Doesn't the (switchable) codebook transmission requirement completely
overshadow this? Is your rtp library somehow written in such a way that
it needs significant changes to do packing this way? Is Luca's source
code not helpful here?

-r

Simon Morlat

2006-07-31 06:25:19 UTC

Post by Ralph Giles
latency. With the 2 fragmentation bits you know immediately if you've
lost a frame, while with the marker bit scheme you don't know until you see
the next packet with a marker bit. In a naive implementation you lag
either way, but if you're going to do something to interpolate the
missing frame, you have more time to do that if you don't have to wait
until the next marker bit falls out of the jitter buffer.
Or...can you infer the same thing from the timestamp?

The correct way to detect missing packets is to use the sequence number.
The fragmentation bits let you know, after losing a sequence of packets,
whether you can start decoding immediately (in the case of a begin of
fragment). Without this indication you would wait for the packet after the
next marker bit... or choose to pass the possibly truncated data to the
decoder and see what happens.
So this does not make a big difference.
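
To make the comparison concrete, here is a minimal sketch of the marker-bit
approach: accumulate payloads, emit a frame on the marker bit, and after a
sequence gap discard everything up to and including the next marked packet.
The RtpPacket type and the class name are hypothetical, not taken from any RTP
library, and packets are assumed to arrive in order (for example from a jitter
buffer).

```python
from dataclasses import dataclass


@dataclass
class RtpPacket:
    """Hypothetical minimal packet; not a type from any real RTP library."""
    seq: int        # 16-bit RTP sequence number
    marker: bool    # set on the last packet of a video frame
    payload: bytes


class MarkerBitDepacketizer:
    """Reassemble frames using only the sequence number and the marker bit.

    Assumes the first packet seen is the start of a frame.
    """

    def __init__(self):
        self._expected_seq = None
        self._parts = []
        self._resync = False  # True while discarding up to the next marker

    def push(self, pkt):
        """Return a complete frame as bytes, or None if nothing is ready."""
        if self._expected_seq is not None and pkt.seq != self._expected_seq:
            # Sequence gap: the frame in progress can no longer be trusted.
            self._parts = []
            self._resync = True
        self._expected_seq = (pkt.seq + 1) & 0xFFFF  # handle 16-bit wraparound

        if self._resync:
            # Drop packets until a marked one goes by, then start afresh.
            if pkt.marker:
                self._resync = False
            return None

        self._parts.append(pkt.payload)
        if pkt.marker:
            frame = b"".join(self._parts)
            self._parts = []
            return frame
        return None
```

Note that after a gap this sketch also throws away an intact frame whenever
the lost packet was the previous frame's last one; that is the case where a
begin-of-fragment indication would allow decoding to resume immediately.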

Post by Ralph Giles
Our other main design goal was to support multicast. Generally speaking,
non-interactive applications are the opposite case, where bandwidth use
is more important than latency. Luca pointed out a "youtube"-like
unicast streaming application, but IP multicast is the case where RTP
transport is absolutely essential.

I understand this. My feeling is that the bandwidth gain is small compared to
the additional complexity needed at the receiver side to be compliant with the
draft.
I had a little question: when receiving multiple frames in a single RTP
packet, should we assume that it's always Full-Fragment ?

Post by Simon Morlat
- clarify the packed-conf message (limit between header and tables)

Agreed.

Post by Simon Morlat
- explain an SDP operation that lets each side configure asymmetrically
in a simple offer-answer (only 2 messages) scheme (for me this implies
NOT transmitting inline encoder configuration in SDP, which prevents the
offerer from adapting to the other end).

Sounds reasonable.

Post by Simon Morlat
- use fewer bits to indicate fragmentation (for me 1 bit is enough, 2 if
you wish to indicate begin of frame and end of frame). Whether this bit
is the RTP marker or not is not important.
- assume each RTP packet contains at most one frame.
The last two points aim at much simpler unpacketisation code.

I don't think you've made a convincing case here.

I think you'll get more and more feedback along these lines from telephony
developers... Perhaps they'll be more convincing than me!

Post by Simon Morlat
Currently the unpacketisation code needed to implement this draft is more
complex than the RFC2429bis or RFC3016 packetisation, which I think is not
good if we want more and more people like me, or companies, to prefer
open-source technology over heavily patented ones.

The problem is not a programming one. I could have implemented the draft
already (with probably twice as many lines as my "lightweight" version).

My problem is that I think complexity is justified only when it is needed to
solve a particular problem.

* To solve the problem of frame fragmentation, I think one bit is enough. The
four-item enum proposed in this draft is full of redundancy and makes the
implementation a bit more complex for no added value.

* You tried to solve the problem of RTP header overhead by proposing a
multi-frame per RTP packet structure. That's ok, but my feeling is that the
RTP header overhead is not a problem at all. So the complexity of handling
multiple frames per packet is not justified, for me.

* One big problem for me is setting up an RTP session through SDP with 2
messages: for me the theora draft is "too simple", as it does not explain
clearly how to achieve this.
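
For reference on the first point, this is roughly what handling the four
fragment types involves on the receiving side. It is only a sketch: the
numeric values of the enum are assumed (check the draft for the actual
encoding), the class and method names are made up for illustration, and
packets are assumed to arrive in order.

```python
from enum import IntEnum


class FragType(IntEnum):
    """The four fragment types; the numeric values are ASSUMED here."""
    NOT_FRAGMENTED = 0
    START = 1
    CONTINUATION = 2
    END = 3


class FragTypeDepacketizer:
    """Reassemble frames from (sequence number, fragment type, payload)."""

    def __init__(self):
        self._expected_seq = None
        self._parts = None  # None means "no frame in progress"

    def push(self, seq, frag, payload):
        """Return a complete frame as bytes, or None if nothing is ready."""
        lost = self._expected_seq is not None and seq != self._expected_seq
        self._expected_seq = (seq + 1) & 0xFFFF  # handle 16-bit wraparound
        if lost:
            self._parts = None  # the frame in progress is incomplete

        if frag == FragType.NOT_FRAGMENTED:
            self._parts = None
            return payload
        if frag == FragType.START:
            # A start fragment is always a safe point to resume after loss.
            self._parts = [payload]
            return None
        if self._parts is None:
            return None  # continuation or end of a frame whose start was lost

        self._parts.append(payload)
        if frag == FragType.END:
            frame = b"".join(self._parts)
            self._parts = None
            return frame
        return None
```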

The reason why I think I might not be totally wrong is that this draft has
chosen different design principles than mpeg4 and h263, and I don't see any
reason for that.

I'll try to implement the draft fully in a future release, with some
workaround for SDP until the draft clarifies that part.

Simon

Luca Barbato

2006-07-31 17:18:35 UTC

Post by Simon Morlat
So this does not make a big difference.

Agree

Post by Simon Morlat
I had a little question: when receiving multiple frames in a single RTP
packet, should we assume that it's always Full-Fragment ?

You always collate complete frames.
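
Since only complete frames are collated, splitting a collated payload is a
plain loop. A minimal sketch, assuming each frame is preceded by a 16-bit
big-endian length field and that the payload header has already been stripped
(both are assumptions here; check the draft for the exact layout). The
function name is illustrative.

```python
import struct


def split_collated_frames(payload):
    """Split a collated payload into a list of complete frames.

    ASSUMED layout: repeated [16-bit big-endian length][frame bytes].
    """
    frames = []
    offset = 0
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from(">H", payload, offset)
        offset += 2
        if offset + length > len(payload):
            raise ValueError("frame length exceeds payload")
        frames.append(payload[offset:offset + length])
        offset += length
    return frames
```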

Post by Simon Morlat
- clarify the packed-conf message (limit between header and tables)

Agreed.

Something will appear soon.