Why so many H.264 encoders are bad

Filed under: H.264, fail, psychovisual optimizations, rate-distortion optimization

If one works long enough with a large number of H.264 encoders, one might notice that a large number of them are pretty much awful. This of course shouldn’t be a surprise: Sturgeon’s Law says that “90% of everything is crap”. It’s also exacerbated by the fact that H.264 is the most widely-accepted video standard in years and has spawned a huge amount of software that implements it, thus generating more mediocre implementations.

But even this doesn’t really explain the massive gap between good and bad H.264 encoders. Good H.264 encoders, like x264, can beat previous-generation encoders like Xvid visually at half the bitrate in many cases. Yet bad H.264 encoders are often so terrible that they lose to MPEG-2! The disparity wasn’t nearly this large with previous standards… and there’s a good reason for this.

H.264 offers a great variety of compression features, more than any previous standard. This also greatly increases the number of ways that encoder developers can shoot themselves in the foot. In this post I’ll go through a sampling of these. Most of the problems stem from the single fact that blurriness seems good when using mean squared error as a mode decision metric.

Since this post has gotten linked a good bit outside the technical community, I’ll elaborate slightly on some basic terminology that underlies the concepts in this post.

RD = lambda * bits + distortion, a measure of how “good” a decision is. Lambda is how valuable bits are relative to quality (distortion). If something costs very few bits, for example, it might be able to get away with more distortion. Distortion is measured via a mode decision metric, the most common being sum of squared errors.

Visual energy is the amount of apparent detail in an image or video. Part of the job of a good encoder is to retain energy so that the image doesn’t look blurry.
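
To make these two concrete, here is a minimal sketch in C of how an encoder might score a candidate mode, using sum of squared errors as the distortion metric (the function names are illustrative, not x264’s actual API):

    #include <stdint.h>

    /* Sum of squared errors between a source block and its reconstruction. */
    static uint64_t ssd_4x4(const uint8_t *src, int src_stride,
                            const uint8_t *rec, int rec_stride)
    {
        uint64_t ssd = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++) {
                int d = src[y*src_stride + x] - rec[y*rec_stride + x];
                ssd += d * d;
            }
        return ssd;
    }

    /* RD cost = lambda * bits + distortion: the encoder picks whichever
     * candidate minimizes this. */
    static uint64_t rd_cost(uint64_t distortion, unsigned bits, unsigned lambda)
    {
        return (uint64_t)lambda * bits + distortion;
    }

Nothing in this cost cares whether the reconstruction keeps its visual energy; it only cares that the numerical error is small and the bits are few, which is exactly the loophole the rest of this post is about.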

i16x16 macroblocks

The good: i16x16 is very appealing as a mode: it is phenomenally cheap bit-wise due to its hierarchical DC transform. In flatter areas of the frame, this usually makes it cost less than a dozen or even half a dozen bits per macroblock. As a result, RD mode decision loves this mode.
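
For reference, the hierarchical DC transform works roughly as follows: each of the sixteen 4×4 luma blocks in the macroblock is transformed, and their sixteen DC coefficients are then gathered into a 4×4 array and run through a second, Hadamard-style transform, so a nearly flat macroblock collapses into a handful of coefficients. A simplified C sketch (quantization and normalization omitted):

    /* Secondary 4x4 Hadamard transform applied to the sixteen luma DC
     * coefficients of an i16x16 macroblock (in place, unnormalized). */
    static void hadamard_dc_4x4(int dc[4][4])
    {
        int tmp[4][4];
        /* rows */
        for (int i = 0; i < 4; i++) {
            int s01 = dc[i][0] + dc[i][1], d01 = dc[i][0] - dc[i][1];
            int s23 = dc[i][2] + dc[i][3], d23 = dc[i][2] - dc[i][3];
            tmp[i][0] = s01 + s23;
            tmp[i][1] = s01 - s23;
            tmp[i][2] = d01 - d23;
            tmp[i][3] = d01 + d23;
        }
        /* columns */
        for (int i = 0; i < 4; i++) {
            int s01 = tmp[0][i] + tmp[1][i], d01 = tmp[0][i] - tmp[1][i];
            int s23 = tmp[2][i] + tmp[3][i], d23 = tmp[2][i] - tmp[3][i];
            dc[0][i] = s01 + s23;
            dc[1][i] = s01 - s23;
            dc[2][i] = d01 - d23;
            dc[3][i] = d01 + d23;
        }
    }

After quantization, a flat macroblock typically ends up with only a couple of nonzero coefficients here and nothing else, which is why the mode is so cheap.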

The bad: It looks like crap. i16x16 is atrocious at maintaining visual energy: it almost never has any AC coefficients when it is used, three out of four of its prediction modes code nearly no energy at all, and the deblocker tends to blur out any details left anyways. Combined with a lack of adaptive quantization, this is the prime cause of ugly 16×16 blocks in flat areas in encodes by crappy H.264 encoders. While the mode isn’t inherently bad, it’s over-emphasized in the spec and makes a great trap for RD to fall into.

Bilinear qpel

The good: Qpel is of course a good thing for compression, and H.264’s qpel is particularly unique in that it is designed for encoder performance. The hpel filter is slow (6-tap filter), but can be precalculated, while the qpel is simple and can be done on-the-fly (bilinear).
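
A simplified sketch of the two luma interpolation steps described above (clipping and border handling glossed over): half-pel samples come from the 6-tap (1, -5, 20, 20, -5, 1) filter, while quarter-pel samples are just rounded averages of the two nearest integer/half-pel samples.

    #include <stdint.h>

    static inline int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    /* H.264's 6-tap half-pel filter, shown horizontally; the same filter
     * is applied vertically.  'p' must have at least 2 valid pixels on
     * the left and 3 on the right. */
    static uint8_t hpel_sample(const uint8_t *p)
    {
        int v = p[-2] - 5*p[-1] + 20*p[0] + 20*p[1] - 5*p[2] + p[3];
        return clamp255((v + 16) >> 5);
    }

    /* Quarter-pel samples are bilinear: a rounded average of the two
     * nearest samples -- cheap, but blurry. */
    static uint8_t qpel_sample(uint8_t a, uint8_t b)
    {
        return (a + b + 1) >> 1;
    }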

The bad: Bilinear interpolation is blurry, thus losing visual energy. But of course RD mode decision loves blurriness and so will pick it happily. Furthermore, the most naive motion search method (fullpel, one iteration of hpel, one iteration of qpel) tends to bias towards qpel instead of hpel. While qpel is still very useful, its overuse is yet another trap for encoders.

4×4 transform

The good: The 4×4 transform is great for coding edges efficiently and helps form the backbone of the highly efficient i4x4 intra mode. It also doesn’t need as fancy an entropy coder (for CAVLC at least) as an 8×8 transform would, thus allowing smaller VLC tables.

The bad: It’s blurry! It has a lower quantization precision at the same quantizer (compared to the 8×8 transform); combined with decimation, this results in lots of uncoded blocks, yet another trap for RD. It’s terrible at coding textured areas, especially when the details in the texture are larger than the transform itself. It also gets deblocked more than 8×8. While the adaptive transform is good news, the fact that 4×4 was the default (and 8×8 added later) is likely an artifact of the entire specification process being done while optimizing for CIF-resolution videos.
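
For illustration, here is the 4×4 forward core transform itself, an integer approximation of the DCT whose scaling is folded into quantization (omitted here). Note that it only ever sees a 4×4 window of residual, which is why texture coarser than the transform fares so poorly:

    #include <stdint.h>

    /* H.264 4x4 forward core transform of a residual block (scaling and
     * quantization omitted). */
    static void dct4x4(const int16_t in[4][4], int16_t out[4][4])
    {
        int16_t tmp[4][4];
        /* rows */
        for (int i = 0; i < 4; i++) {
            int s0 = in[i][0] + in[i][3], d0 = in[i][0] - in[i][3];
            int s1 = in[i][1] + in[i][2], d1 = in[i][1] - in[i][2];
            tmp[i][0] = s0 + s1;
            tmp[i][1] = 2*d0 + d1;
            tmp[i][2] = s0 - s1;
            tmp[i][3] = d0 - 2*d1;
        }
        /* columns */
        for (int i = 0; i < 4; i++) {
            int s0 = tmp[0][i] + tmp[3][i], d0 = tmp[0][i] - tmp[3][i];
            int s1 = tmp[1][i] + tmp[2][i], d1 = tmp[1][i] - tmp[2][i];
            out[0][i] = s0 + s1;
            out[1][i] = 2*d0 + d1;
            out[2][i] = s0 - s1;
            out[3][i] = d0 - 2*d1;
        }
    }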

Biprediction

The good: Biprediction is at the core of any modern video format: B-frames vastly improve compression efficiency, especially in lower-motion scenes. Biprediction singlehandedly makes possible the high number of skip blocks in B-frames in most sane-bitrate H.264 encodes.

The bad: It’s bilinear interpolation again, so it’s blurry, which acts as a nice RD trap yet again. This makes biprediction get overused even in non-constant areas of the image, such as film grain, ensuring blurry grain in B-frames and clear grain in P-frames (nicely alternating as such).
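
For reference, default (unweighted) biprediction is simply a rounded average of the two motion-compensated predictions; grain that doesn’t line up between the two references gets averaged away. A minimal sketch:

    #include <stdint.h>

    /* Default H.264 biprediction: average the list-0 and list-1
     * motion-compensated predictions with rounding (weighted prediction
     * not shown). */
    static void bipred_avg(uint8_t *dst, const uint8_t *pred0,
                           const uint8_t *pred1, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (pred0[i] + pred1[i] + 1) >> 1;
    }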

One should note of course that B-frames and thus biprediction are not at all unique to H.264; this has been an ongoing problem for many years and tends to be exacerbated by lower bitrates.

h/v/dc intra prediction modes

The good: These modes are critical to the intra prediction system. DC is similar to the old-style intra coding before spatial intra prediction, and the other two are very useful for straight edges. These three tend to be overall the most common intra prediction modes.

The bad: They retain energy terribly. The other intra prediction modes (planar and ddl/ddr/vr/hd/vl/hu) effectively predict frequencies that are difficult to code with a DCT, thus increasing visual energy in the resulting reconstructed image. But h/v/dc don’t really do this. Furthermore, because of how the mode prediction system works, they tend to be the cheapest modes to signal (in terms of bits).
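
To see why these three retain so little energy, here is what they actually do for a 4×4 block (a simplified sketch; availability rules for missing neighbors omitted). Each mode fills the block with copies of a row, a column, or a single average value, so no detail can appear inside the block itself:

    #include <stdint.h>

    /* 'top' is the row of 4 reconstructed pixels above the block, 'left'
     * the column of 4 pixels to its left. */
    static void pred4x4_vertical(uint8_t dst[4][4], const uint8_t top[4])
    {
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                dst[y][x] = top[x];
    }

    static void pred4x4_horizontal(uint8_t dst[4][4], const uint8_t left[4])
    {
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                dst[y][x] = left[y];
    }

    static void pred4x4_dc(uint8_t dst[4][4], const uint8_t top[4],
                           const uint8_t left[4])
    {
        int sum = 4; /* rounding */
        for (int i = 0; i < 4; i++)
            sum += top[i] + left[i];
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                dst[y][x] = sum >> 3;
    }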

Of course, x264 effectively uses all of these features without most of the aforementioned problems. Developers of other encoders: take note.

18 Responses to “Why so many H.264 encoders are bad”

  1. JoeH Says:

    Amazing post as always. All the CUDA encoders available fall into this category. I would love to see a post from you about your OpenCL (or whatever technology you would use) plans for using video cards to speed up x264’s output (obviously splitting up the work between the card and the CPU). I can’t trust anyone else will do it right….

  2. cb Says:

    “Of course, x264 effectively uses all of these features without most of the aforementioned problems.”

    How?

  3. Dark Shikari Says:

    @cb

    By taking energy into account during RD optimization, x264 avoids falling into low-error but low-energy modes.
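
    A rough sketch of the idea (illustrative only, not x264’s actual code): add a penalty for the difference in energy between the source block and its reconstruction, so that a candidate with low numerical error but much less detail no longer looks free.

        #include <stdint.h>

        /* Illustrative psy-style cost: 'energy' would be some measure of
         * AC detail (e.g. a SATD-like sum) of the source and of the
         * reconstruction.  None of these names are x264's. */
        static uint64_t psy_rd_cost(uint64_t ssd, unsigned bits, unsigned lambda,
                                    uint64_t src_energy, uint64_t rec_energy,
                                    unsigned psy_strength)
        {
            uint64_t energy_diff = src_energy > rec_energy ?
                                   src_energy - rec_energy :
                                   rec_energy - src_energy;
            return ssd + psy_strength * energy_diff + (uint64_t)lambda * bits;
        }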

  4. danx0r Says:

    what would be the best way to understand x264’s energy-retaining R/D approach? (I assume RTFC, but if it’s based on published research that would be extremely helpful)

  5. Dark Shikari Says:

    @danx0r

    I’ve never seen published research on the topic, though I haven’t looked that hard.

    Check encoder/rdo.c for some basic information.

  6. Gonzo Bumm Says:

    I really would like to read a beginner’s tutorial by you about how to use x264; I have read many other tutorials and all of them seem to have misunderstandings of concepts or problems with the many options. There are also many GUIs for x264 whose authors seem not to understand what they are doing. It would be great to have the one and only real reference. Thanks!

  7. Dark Shikari Says:

    @Gonzo

    1) x264 --help (it even has example usage!)
    2) x264 --longhelp
    3) x264 --fullhelp
    4) http://mewiki.project357.com/wiki/X264_Settings (slightly outdated at times)

  8. Sarang Says:

    Hi,

    I understand that this may be a novice question, but thought you would be the best one to answer:

    1) We want to minimize post (frequency) transform energy after Motion Estimation.
    2) However, currently ME is done in the spatial domain, although SAD equals the DC term and can give some correlation with minimized transform terms.
    3) For now, if we ignore the extremely expensive computational cost, can’t ME be done in the frequency domain? This would give the exact bit cost, and would (hopefully) give absolutely minimum distortion.
    4) So is this correct: by minimizing the error term of transformed blocks, we can minimize the objective bit cost AND the “Subjective Distortion”, as well as approximating measures like SSIM?

  9. Dark Shikari Says:

    @Sarang

    Yes, it’s called --me tesa.
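
    Roughly speaking, tesa is an exhaustive search scored with a Hadamard-transformed metric (sum of absolute transformed differences) rather than plain SAD. A rough sketch of such a metric (illustrative, not x264’s actual implementation):

        #include <stdint.h>
        #include <stdlib.h>

        /* 4x4 SATD: Hadamard-transform the difference block and sum the
         * absolute values of the result (normalization omitted). */
        static int satd_4x4(const uint8_t *a, int a_stride,
                            const uint8_t *b, int b_stride)
        {
            int d[4][4], tmp[4][4], sum = 0;
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++)
                    d[y][x] = a[y*a_stride + x] - b[y*b_stride + x];
            for (int i = 0; i < 4; i++) {   /* rows */
                int s01 = d[i][0] + d[i][1], d01 = d[i][0] - d[i][1];
                int s23 = d[i][2] + d[i][3], d23 = d[i][2] - d[i][3];
                tmp[i][0] = s01 + s23; tmp[i][1] = s01 - s23;
                tmp[i][2] = d01 - d23; tmp[i][3] = d01 + d23;
            }
            for (int i = 0; i < 4; i++) {   /* columns + sum */
                int s01 = tmp[0][i] + tmp[1][i], d01 = tmp[0][i] - tmp[1][i];
                int s23 = tmp[2][i] + tmp[3][i], d23 = tmp[2][i] - tmp[3][i];
                sum += abs(s01 + s23) + abs(s01 - s23)
                     + abs(d01 - d23) + abs(d01 + d23);
            }
            return sum;
        }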

  10. Esurnir Says:

    Would a --me tumh be a stupid idea? I assume you already tested it and thought “it is”, but I’m just throwing the idea out there.

  11. Dark Shikari Says:

    @Esurnir

    Yup, we tested it on all the various modes. It didn’t help much and the speed cost was very high, so we restricted it to esa only.

  12. Shevach Riabtsev Says:

    Regarding intra prediction: in H.264 (as well as in AVS) intra prediction is performed in the pixel domain, while in MPEG-4 intra prediction is partly executed in the frequency domain (DC/AC prediction). Therefore H.264 intra prediction fails on weak, noisy material due to loss of correlation. On the other hand, if intra prediction were executed in the frequency domain, then at least the low-frequency components would remain correlated. It is worth mentioning that the dynamic range of intra prediction in the frequency domain would be much larger than in the pixel domain.

  13. Pengvado Says:

    So spatial intra prediction works great for normal content, and only fails when there is no correlation to predict from. Whereas AC prediction is uniformly useless (it makes about 0.1% bitrate difference in MPEG4). Yet another win for H.264.

  14. Shevach Riabtsev Says:

    I would like to stress the following point:
    the spatial correlation is usually good on non-noisy content. On noisy content, the spatial correlation between neighboring pixels is expected to deteriorate, while the correlation of the low-frequency AC coefficients remains high, since noise mostly affects the high frequencies (although DCT leakage of high-frequency coefficients might slightly impact the low-frequency harmonics).
    I suppose that on noisy content the MPEG-4 AC prediction shows a gain of more than 0.1%.

  15. Pengvado Says:

    Nope, AC prediction is just as useless at predicting noise as it is for clean content.

  16. skal Says:

    you didn’t talk about in-loop deblocking strength.
    0:0 is too high for my taste, but that’s just me…

  17. Jeremy Noring Says:

    Thanks, this post was really interesting.

    I have a follow-up question: what is a good way of evaluating an encoder’s output? I have an embedded encoder; is there some way to see if it does any of the aforementioned encoding faux pas based on the output?

    Any general strategies you know of here would be welcome. Great blog too, I love it.

  18. Dark Shikari Says:

    @Jeremy

    An embedded encoder is going to be extremely minimal, generally: at best it’ll do SAD mode decision, deadzone quantization, and other extremely simple algorithms. Don’t expect much out of it; at best you’ll get something similar to x264 with --preset veryfast --profile baseline --tune psnr. No point in bothering trying to do fancy evaluation, IMO.