B-Frames in DirectShow

June 8th, 2008. Posted in DirectShow.
Ever wondered about B-frame support in DirectShow? Or do you think that DirectShow is perfect? Unfortunately it's not. Read the rest of the article for more info on this issue.
Frame types
It can be said that all modern video codecs reduce the number of bits required to describe a video sequence by exploiting spatial and temporal redundancy. A frame encoded using only the spatial compression tools is usually called an I frame (intra frame). In most cases this type of frame can also be referred to as a 'key frame' because it is not derived from any other frame and can be used as a reference frame for the following frames. Frame types that also utilize the temporal compression tools are called 'inter frames' or P frames (predicted) because they are derived from at least one other frame. The latest codecs such as H.264 offer the possibility to use multiple reference frames, but the general rule for P frames is that they are derived only from previous frames in chronological order. See figure 1.

Figure 1. Simple frame structure
With the appearance of MPEG-2, a new frame type was introduced: the bi-directionally predicted frame, or B frame. As the name suggests, the frame is derived from at least two other frames, one from the past and one from the future (Figure 2).

Figure 2. Sequence containing a B frame
Since the B1 frame is derived from the I0 and P2 frames, both of them must be available to the decoder before the decoding of B1 can start. This means the transmission/decoding order is not the same as the presentation order. That's why there are two types of time stamps for video frames: PTS (presentation time stamp) and DTS (decoding time stamp).
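For the sequence in Figure 2 the two orders look like this (a plain illustration; the subscripts are the frame numbers used above):

    Presentation order:           I0  B1  P2
    Transmission/decoding order:  I0  P2  B1

P2 is transmitted and decoded before B1 even though it is presented after it, which is why a single time stamp per frame is not enough.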
Containers
Unfortunately not all existing file containers are suitable for storing encoded video containing B frames, because some lack the capability to store two types of time stamps. The AVI file is the typical example of a container that allows only one time stamp per access unit. MP4 and Matroska support B frames natively, which is why they are more suitable for storing the latest high quality video.
DirectShow
Unfortunately the DirectShow technology also lacks the ability to store PTS/DTS information for media samples, and various techniques are used to work around this limitation.
AVI files with DivX/XviD
Depending on the authoring tool and codec used to create the AVI file, the content might be stored in one of several ways:
- Frames are stored in decoding order and the time stamps represent DTS.
- Frames are stored in a "packed bitstream" mode where multiple encoded frames are stored in one AVI sample. This mode also implies the usage of empty "delay frames". See figure 3.

Figure 3. AVI samples containing B frames
To find out more about the AVI and VFW way of dealing with B frames you can read this post at the doom9 board. All of these techniques require that both encoder and decoder know what they are doing. If you are developing a filter that would be capable of reading and processing an encoded stream containing B frames, you should be aware of this.
Guess the times by yourself
If you are certain that the timestamps passed along with the IMediaSample object are PTS (and not DTS), you can implement a simple algorithm to determine the DTS value. You will need one variable to remember the maximum timestamp value seen so far; it will help with the timestamp generation. This simple algorithm has one disadvantage: it assumes you CAN discard the first received frame (which should not be a problem, e.g. for network transmissions that run for a long time and need to be restarted only on special occasions).
PTS_IN - PTS of the input frame
PTS_OUT - PTS of the output frame
DTS_OUT - DTS of the output frame
TEMP_TS - Temporary timestamp value
The algorithm is as follows (a C++ sketch is given after the steps):
1. Receive a frame and get its PTS_IN value.
2. If it was the first frame, assign TEMP_TS = PTS_IN and discard the frame. Jump to step 1.
3. If (PTS_IN < TEMP_TS) then { DTS_OUT = PTS_IN } else { DTS_OUT = TEMP_TS; TEMP_TS = PTS_IN; }
4. Assign PTS_OUT = PTS_IN.
5. Deliver the frame.
6. Jump to step 1.
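Translated into C++, the steps above might look like this. This is a minimal sketch: the DtsGuesser class name is made up, and REFERENCE_TIME stands for DirectShow's 100 ns time unit.

    #include <cstdint>

    typedef int64_t REFERENCE_TIME;   // 100 ns units, as in DirectShow

    class DtsGuesser
    {
    public:
        DtsGuesser() : m_hasTempTs(false), m_tempTs(0) { }

        // Returns false for the first frame, which must be discarded.
        // Otherwise fills ptsOut/dtsOut for the frame to be delivered.
        bool OnFrame(REFERENCE_TIME ptsIn,
                     REFERENCE_TIME *ptsOut,
                     REFERENCE_TIME *dtsOut)
        {
            if (!m_hasTempTs) {
                m_tempTs = ptsIn;         // step 2: remember and discard
                m_hasTempTs = true;
                return false;
            }
            if (ptsIn < m_tempTs) {       // step 3: a B frame
                *dtsOut = ptsIn;          // its PTS can serve as DTS
            } else {
                *dtsOut = m_tempTs;
                m_tempTs = ptsIn;         // new maximum PTS seen so far
            }
            *ptsOut = ptsIn;              // step 4: PTS passes through
            return true;                  // step 5: deliver the frame
        }

    private:
        bool           m_hasTempTs;   // first frame seen yet?
        REFERENCE_TIME m_tempTs;      // maximum PTS seen so far
    };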
Consider the following sequence: I0, P2, B1, P4, B3, P6, B5

    Input Frame   Temp_TS      Output Frame         Temp_TS
    (PTS_IN)      (at start)   (PTS_OUT, DTS_OUT)   (at end)
    I (0)         -            -                    0
    P (2)         0            P (2,0)              2
    B (1)         2            B (1,1)              2
    P (4)         2            P (4,2)              4
    B (3)         4            B (3,3)              4
    P (6)         4            P (6,4)              6
    B (5)         6            B (5,5)              6
Simple, huh? It will also work nicely if the first received frame is not an I frame, and also if some frames are dropped, lost or missing from the sequence. The only two problems are that you must be receiving PTS with the input frames and that you lose one frame at the start.
The "my-beloved" solution
Another solution to the timestamp issue could be to introduce PTS/DTS timestamps into the media sample object. Since media samples are instances of CUnknown(IUnknown)-derived classes, they can also expose interfaces. That means we could define an IMediaSample3 interface containing Get/SetTimePTS/DTS(REFERENCE_TIME *, REFERENCE_TIME *) methods so that both times could be attached to the media sample. The problem is that all filters supporting this new transport would have to provide custom allocators as well, and it would be difficult to ensure compatibility with other 3rd party components that might not have a clue about this mechanism. However, if I were to develop a solution containing filters made by one party for the Encoding/Mux/Demux/Decode parts of the graph, I would go for this one for sure.
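A minimal sketch of what such an interface declaration might look like is shown below. The interface name and methods follow the proposal above, but the IID would have to be newly generated; nothing here is part of the official DirectShow headers.

    #include <streams.h>    // DirectShow base classes, IMediaSample2

    // A new GUID would have to be generated for the interface:
    // DEFINE_GUID(IID_IMediaSample3, /* ... new GUID ... */);

    DECLARE_INTERFACE_(IMediaSample3, IMediaSample2)
    {
        // Attach both timestamps to the media sample
        STDMETHOD(SetTimePTSDTS)(THIS_
            REFERENCE_TIME *pPTS,
            REFERENCE_TIME *pDTS
            ) PURE;

        // Retrieve both timestamps from the media sample
        STDMETHOD(GetTimePTSDTS)(THIS_
            REFERENCE_TIME *pPTS,
            REFERENCE_TIME *pDTS
            ) PURE;
    };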
7 Responses to “B-Frames in DirectShow”
By Sina™ on Jun 9, 2008
Hi Igor…
How you doing?
Any news about LAME filter?
By Zion on Jul 15, 2008
Hi,
I have a question..
Why do you need such an algorithm to obtain the DTS order from the PTS stamps of the input frames? Isn't the DTS stamp just the sequence of frames?
By Igor Janos on Jul 15, 2008
Yes, DTS is "just a sequence of frames", but for every frame the PTS must be greater than or equal to the DTS. If you just started to count frames as you receive them, the condition might not be fulfilled (e.g. stamping the decode-order sequence I0, P2, B1 as DTS 0, 1, 2 would give B1 a DTS of 2 but a PTS of only 1).
By Jamie Fenton on Nov 30, 2008
Two ideas popped into my head when I read your blog entry:
1) DirectShow uses a 100 ns clock resolution - an application can thus hide a code inside the jitter, particularly in video signals where jitter is a feature and not a bug :-). I.e., truncate the timestamps to 0.1 ms and use the remaining 3 digits of precision to hide flags. A form of temporary timestamping like you describe.
2) Use an empirical trick like trying to decode both ways and taking whichever route is less ghastly. This requires the codec to be bulletproof (which it has to be anyway these days) and enough preroll space/time. So we have a variant on the "discard first frame" theme here.
By Igor Janos on Dec 1, 2008
Nice ideas indeed, although I like the second one much more than the first. Using dirty tricks with timestamp precision can lead to incompatibility issues when used with 3rd party filters.
In my latest filter, the x264 encoder, I've tried to implement a derived allocator class that creates IMediaSampleEx samples, which provide methods for setting additional PTS, DTS and Clock values. This makes the filter compatible with filters that can only read one timestamp value as well as with filters that can take advantage of the extended sample interface. IMHO this might prove useful.
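For illustration, a hypothetical sketch of such an extended sample, derived from the stock CMediaSample of the DirectShow base classes. The IMediaSampleEx interface, its IID and the method names are assumptions following the comment above, not part of the actual filter.

    // Hypothetical extended sample; IMediaSampleEx and IID_IMediaSampleEx
    // are assumptions, CMediaSample comes from the DirectShow base classes.
    class CMediaSampleEx : public CMediaSample, public IMediaSampleEx
    {
    public:
        CMediaSampleEx(TCHAR *pName, CBaseAllocator *pAllocator,
                       HRESULT *phr, LPBYTE pBuffer, LONG length) :
            CMediaSample(pName, pAllocator, phr, pBuffer, length),
            m_rtPTS(0), m_rtDTS(0) { }

        // Expose the extended interface next to the stock IMediaSample2
        STDMETHODIMP QueryInterface(REFIID riid, void **ppv)
        {
            if (riid == IID_IMediaSampleEx) {
                *ppv = static_cast<IMediaSampleEx*>(this);
                AddRef();
                return S_OK;
            }
            return CMediaSample::QueryInterface(riid, ppv);
        }
        STDMETHODIMP_(ULONG) AddRef()  { return CMediaSample::AddRef(); }
        STDMETHODIMP_(ULONG) Release() { return CMediaSample::Release(); }

        // IMediaSampleEx - both timestamps travel with the sample
        STDMETHODIMP SetTimePTSDTS(REFERENCE_TIME *pPTS, REFERENCE_TIME *pDTS)
            { m_rtPTS = *pPTS; m_rtDTS = *pDTS; return S_OK; }
        STDMETHODIMP GetTimePTSDTS(REFERENCE_TIME *pPTS, REFERENCE_TIME *pDTS)
            { *pPTS = m_rtPTS; *pDTS = m_rtDTS; return S_OK; }

    private:
        REFERENCE_TIME m_rtPTS;   // presentation time stamp
        REFERENCE_TIME m_rtDTS;   // decoding time stamp
    };

A derived allocator would then override CMemAllocator::Alloc to create CMediaSampleEx instances instead of plain CMediaSample objects.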
By Jamie Fenton on Dec 6, 2008
The derived allocator approach is probably the best way to go - Microsoft recommends something like it as a back channel for managing dynamic format changes.
Even better would be for Microsoft to add a property list to their media sample API, so you could tag your properties, and I could tag mine, to the same sample and know that your upstream would get it, as would mine. Then DirectShow could get away from the "lots of independent allocators" architecture and closer to a "shared-pool with routing" architecture that only copies something, changes formats, etc., when it really needs to.
One way to get that benefit started would be to release the design/code for such a facility as open-source, and code what you can to it. (Which I know you have done with your x264 encoder). Still, having it be separate, minimal, and free to use by anybody might get it officially approved.
By Igor Janos on Dec 6, 2008
The mechanism of shared allocators is implemented in Trans-In-Place filters. However, it seems that this usage scenario is not very common. In a typical situation - playback of an AVI file - you usually have an async source and a splitter, which use their own specific way of data delivery. Then you have the Splitter->Decoder connection, which uses the classic memory allocator, and the Decoder->Video Renderer connection, which works best with the allocator provided by the renderer filter, so the frame is decoded and copied only once, directly into video memory.
As for the open-source design - yup, that's what I'm trying to do. I also have a set of muxers nearly finished, so let's hope the mechanism will become popular.