When a Simple 1.9 MB Image Produced Two Different HEIC Results: My Story
You know that quiet moment in testing when you think you finally nailed the benchmark? I had that. I dropped a 1.9 megabyte sample photo into a batch of HEIC converters expecting consistent results. Instead, one converter returned a tidy 1.2 megabyte HEIC and another spat out something closer to 1.9 MB. Same input. Same pixel dimensions. Same visual check in the viewer. Different file sizes. I blinked. Then I went down a rabbit hole.
At first I blamed the obvious suspects - user error, one-off corruption, the universe being cruel. Meanwhile I had a release deadline and a presentation that insisted on "average compression numbers." As it turned out, that single mismatch exposed a host of assumptions in my testing methodology. That moment changed the way I design HEIC converter tests for good.
Why File Size Variance in HEIC Converters Breaks Real-World Testing
Here's the hard reality: one number - file size - looks simple, but it masks a lot of complexity. When two tools produce wildly different HEIC sizes from the same original, you can't treat either result as "correct" without digging deeper. Tools differ in:
- Encoder engine - Apple's proprietary encoder vs libheif, x265 through ffmpeg, or third-party services
- Default quality parameters and quality-to-bitrate mapping (a quality 80 in one encoder is not a quality 80 in another)
- Chroma subsampling, bit depth, and color profile handling
- Metadata retention - Exif, thumbnails, GPS tags, depth maps, or Live Photo components
- Preservation of auxiliary images - e.g., HEIF containers can hold multiple images, image sequences, or alpha channels
So that 1.9 MB to 1.2 MB story was not about a single "better" converter. It was about inconsistent defaults and hidden baggage carried in the file. If you judge converters by a single sample and one number, you'll be misled.
Why Traditional HEIC Converter Tests Often Miss Critical Differences
Most testing setups follow an easy path: pick one image, run converters with default settings, record sizes and eyeball the results. It feels efficient. It also feels wrong after you learn how these encoders behave. Here are the complications I found that made the naive approach fail:

1. Metadata can outweigh pixel data
One converter I tested kept full Exif, maker notes, and a 300 KB embedded thumbnail. Another stripped everything by default. Result: nearly a 25% size difference, purely from metadata. Your "compressed file size" metric becomes a proxy for metadata policy, not compression efficiency.
2. Different encoders map quality numbers differently
A quality setting of "80" in tool A is not the same as "80" in tool B. Encoders map the quality knob to quantization differently, and often nonlinearly. So comparing sizes at "quality 80" is meaningless unless you align on an objective bitrate or perceptual metric.
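When I need an apples-to-apples size comparison, I now drive every converter at an explicit target bitrate instead of its quality knob. Here is a minimal harness sketch; the converter names and flag spellings below are placeholders for whatever CLIs you actually benchmark:

```python
import subprocess
from pathlib import Path

# Hypothetical command templates for two converters under test; substitute
# the real CLIs you benchmark. {src}, {dst}, {bitrate} are filled per run.
CONVERTERS = {
    "converter_a": ["converter-a", "--bitrate", "{bitrate}", "{src}", "{dst}"],
    "converter_b": ["converter-b", "-b", "{bitrate}", "-o", "{dst}", "{src}"],
}

def encode(name: str, src: str, bitrate: str) -> int:
    """Run one converter at an explicit bitrate; return output size in bytes."""
    dst = f"{Path(src).stem}.{name}.heic"
    cmd = [part.format(src=src, dst=dst, bitrate=bitrate)
           for part in CONVERTERS[name]]
    subprocess.run(cmd, check=True)
    return Path(dst).stat().st_size

for name in CONVERTERS:
    size = encode(name, "sample.jpg", "1M")
    print(f"{name}: {size / 1e6:.2f} MB at a matched target bitrate")
```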
3. HEIF's container features are optional but impactful
HEIF can store multiple images, depth maps, or alpha layers. Some converters include a depth map or retain the Live Photo frame, increasing size without changing the visible main image. If you only compare final file size, you cannot tell whether the extra bytes are useful data or junk.
4. Color conversion steals subtle details
Converters that convert color profiles differently can produce files that look similar on casual inspection but differ in pixel values. That difference affects how compressible the image is. An encoder that changes the color space to a profile that compresses better may produce smaller files but also change color fidelity.
5. Noise and detail distribution matters
The same scene compressed with different pre-processing - denoising or sharpening - will yield different sizes. An encoder that applies a mild denoise can reduce file size significantly but also hide detail. So size versus perceived quality is always a trade-off.
How I Changed My HEIC Testing Method and Found the Root Causes
After that 1.9 MB vs 1.2 MB incident I rewired my testing process to avoid superficial comparisons. Here are the concrete steps I implemented. You can reuse this framework immediately.
Step 1 - Create a representative image set
One sample image is a trap. Build a suite that includes:
- High-detail landscape (lots of fine textures)
- Portrait with skin tones and soft gradients
- Low-detail flat color with gradients (to expose banding)
- High-ISO noisy photo
- Transparent PNG converted to HEIC with alpha to test container features
- Live Photo / image sequence to test multi-image behavior
I use 12 images as a minimum. That gives you statistical spread and reveals corner cases.
Step 2 - Normalize inputs and log everything
Normalize the inputs before conversion: same pixel dimensions, same color space (or explicitly test color conversions), and a clean copy without unintended metadata. For each run record:
- Tool name and exact version
- Command line or GUI options
- OS and library versions (libheif, ffmpeg, x265)
- Time of conversion and runtime flags
This led to one of my eye-openers: two conversions on different days gave different outputs because an updated library silently changed defaults.
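A small logging wrapper makes this painless. A minimal sketch, assuming a CLI converter whose version flag you know (the flag spelling varies by tool, so it is a parameter here):

```python
import json
import platform
import subprocess
from datetime import datetime, timezone

def run_and_log(tool: str, args: list[str], version_flag: str = "--version",
                logfile: str = "runs.jsonl") -> None:
    """Run one conversion and append a reproducibility record to a JSONL log."""
    res = subprocess.run([tool, version_flag], capture_output=True, text=True)
    version = (res.stdout or res.stderr).strip().splitlines()[0]
    subprocess.run([tool, *args], check=True)  # the conversion itself
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "version": version,  # catches libraries that silently change defaults
        "args": args,
        "os": platform.platform(),
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example with a placeholder converter:
# run_and_log("your-converter", ["in.jpg", "out.heic"])
```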
Step 3 - Strip and measure metadata separately
Before declaring "smaller is better", split file size into components: visual image payload vs metadata. You can use tools to extract Exif, thumbnails, and auxiliary HEIF items. My first 1.2 MB victory evaporated when I saw that roughly 300 KB of the difference was metadata the tool had stripped. Good to know, but not the same as better compression.
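The simplest split I found: write a metadata-stripped copy and diff the sizes. A sketch assuming exiftool is installed; note this will not account for HEIF auxiliary items such as depth maps, which need a HEIF-aware tool:

```python
import subprocess
from pathlib import Path

def metadata_overhead(heic_path: str) -> int:
    """Approximate metadata bytes by diffing against a metadata-stripped copy."""
    src = Path(heic_path)
    stripped = src.with_name(src.stem + ".stripped" + src.suffix)
    # exiftool's "-all=" removes Exif/XMP/etc.; "-o" writes to a new file
    subprocess.run(["exiftool", "-all=", "-o", str(stripped), str(src)],
                   check=True, capture_output=True)
    return src.stat().st_size - stripped.stat().st_size

# overhead = metadata_overhead("converter_b_output.heic")
# print(f"metadata overhead: {overhead / 1024:.0f} KB")
```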
Step 4 - Use objective visual quality metrics
Do not rely solely on eyeballing. Use SSIM and PSNR to measure distortion. For perceptual alignment, use MS-SSIM or perceptual metrics appropriate to still images. For video-derived HEVC encoders, VMAF can be instructive though it is tuned for video. Run metric vs size curves across quality settings to compare encoders on equal footing.
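For still images I compute SSIM and PSNR straight from the decoded pixels. A sketch assuming the pillow-heif and scikit-image packages; the decoder plugin is an assumption, so swap in your own HEIC decode step if you use something else:

```python
import numpy as np
from PIL import Image
from pillow_heif import register_heif_opener
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

register_heif_opener()  # lets Pillow open .heic files

def quality_metrics(original_path: str, heic_path: str) -> tuple[float, float]:
    """Return (SSIM, PSNR) between the original image and the decoded HEIC."""
    ref = np.asarray(Image.open(original_path).convert("RGB"))
    test = np.asarray(Image.open(heic_path).convert("RGB"))
    return (structural_similarity(ref, test, channel_axis=-1),
            peak_signal_noise_ratio(ref, test))

# ssim, psnr = quality_metrics("original.png", "converted.heic")
# print(f"SSIM {ssim:.4f}, PSNR {psnr:.1f} dB")
```

Sweep this across quality settings and plot metric vs size, and you get the equal-footing curves described above.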
Step 5 - Compare with a visual difference map
Create difference images showing pixel-wise delta between the original and decoded HEIC. That highlights subtle color shifts or banding missed by simple inspection. A converter that produces slightly lower size but leaves visible banding in skies is not a winner for most real applications.
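Difference maps are a few lines once you can decode the HEIC; amplifying the delta makes banding and chroma shifts jump out. Same pillow-heif assumption as above, the images must share dimensions, and the gain of 8 is an arbitrary choice:

```python
import numpy as np
from PIL import Image
from pillow_heif import register_heif_opener

register_heif_opener()

def difference_map(original_path: str, heic_path: str, out_path: str,
                   gain: int = 8) -> None:
    """Write an amplified per-pixel delta image; brighter pixels = larger error."""
    ref = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.int16)
    test = np.asarray(Image.open(heic_path).convert("RGB"), dtype=np.int16)
    delta = np.abs(ref - test) * gain  # amplify subtle banding and color shifts
    Image.fromarray(np.clip(delta, 0, 255).astype(np.uint8)).save(out_path)

# difference_map("original.png", "converted.heic", "diff.png")
```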
Step 6 - Test real workflows
Finally, test converters in your real workflow. If your app strips metadata after conversion, then initial metadata retention matters less. If you serve images directly to mobile clients, color profile handling matters more. Tailor your tests to what you actually ship.
From 1.9 MB to 1.2 MB: The Protocol That Stopped False Positives
Adopting the above protocol changed our conclusions. What looked like inconsistent conversion behavior was actually a combination of metadata handling and encoder default differences. After we enforced alignment - same encoder settings, stripped metadata, and objective metrics - the results lined up predictably.
| Tool | Output Size (sample) | Metadata Size | Encoder/Settings | Observation |
|---|---|---|---|---|
| Converter A (Desktop) | 1.20 MB | 0.05 MB | libheif default | Smaller payload, decent SSIM 0.98 |
| Converter B (Online) | 1.90 MB | 0.60 MB | Apple encoder, full metadata kept | Large metadata (thumbs, depth), best perceived color |
| ffmpeg + x265 | 1.25 MB | 0.02 MB | -qscale 28, 4:2:0 | Good trade-off, slightly warmer toning |

Numbers vary with the image, but the important part is what the numbers represent. Once you separate metadata from pixel payload and align encoder parameters, size comparisons are meaningful and repeatable.
Practical Tips You Can Use Today
- Always record tool versions. One patch can change defaults and bite you in production.
- Compare at a fixed target bitrate rather than opaque "quality" settings when possible.
- Strip metadata if your product does not need it, and measure before and after.
- Include noisy and low-detail images in your suite to see how denoising and banding behave.
- Automate difference map generation and SSIM measurement so tests are objective and repeatable.
Mini Checklist for a Robust HEIC Test Run
- Use at least 10 diverse images.
- Fix resolution and color space, or test both explicitly.
- Log tool, version, args, and OS.
- Extract and log metadata size separately.
- Compute SSIM and create a difference map.
- Document perceived artifacts and the target use case.

Quick Self-Assessment Quiz - Is Your HEIC Testing Rig Solid?
Score 1 point per "Yes". 5-6: solid. 3-4: improve quickly. 0-2: stop and revise your methodology now.
- Do you use more than one image for benchmarking?
- Do you log tool and library versions for each run?
- Do you separate metadata size from payload size?
- Do you compute objective metrics like SSIM/PSNR?
- Do you test conversion settings that match your production workflow?
- Do you inspect difference maps for subtle artifacts?

What I Learned - The Bigger Picture
That 1.9 MB to 1.2 MB mismatch could have been dismissed as a fluke. Instead it forced a protocol upgrade. The bigger lesson is about assumptions. Compression is not just about bits. It's about what those bits represent: image payload, metadata, container choices, and encoder behavior. When you test like an engineer who expects surprises, you design repeatable, defensible comparisons.

As a final note - if you want reproducible results, pick a reference pipeline and lock it down. Define your acceptable quality thresholds in measurable terms, and design tests that reflect the user experience you deliver. For us, that discipline led to faster decisions, fewer surprises in production, and better conversations with stakeholders. It also meant I got to tell one smug product manager that "smaller" doesn't always equal "better". They were not impressed. I was.
Resources and Next Steps
If you want a starter script to automate the steps above - extraction of metadata, SSIM calculation, difference map generation, and a sample suite - tell me which platform you use (macOS, Linux, Windows) and which encoder tools you have available (ffmpeg, libheif, ImageMagick). I can share a ready-to-run pipeline that replicates the methodology I describe here.