# Autoencoder based image compression: can the learning be quantization independent?

Thierry Dumas, Aline Roumy and Christine Guillemot

INRIA Rennes Bretagne-Atlantique

## Abstract

This paper explores the problem of learning transforms for image compression via autoencoders. Usually, the rate-distortion performance of an image compression scheme is tuned by varying the quantization step size. For autoencoders, this would in principle require learning one transform per rate-distortion point, each at a given quantization step size. Here, we show that comparable performance can be obtained with a single learned transform. The different rate-distortion points are then reached by varying the quantization step size at test time. This avoids training one autoencoder per rate-distortion point, which saves considerable training time.

## Method

An autoencoder for image compression is displayed in Figure 1. An encoder $$g_{e}$$, parametrized by $$\boldsymbol{\theta}$$, computes a representation $$\mathbf{Y}$$ from the image $$\mathbf{X}$$. The representation is quantized, yielding $$\hat{\mathbf{Y}} = \mathcal{Q} \left( \mathbf{Y} \right)$$. Then, a decoder $$g_{d}$$, parametrized by $$\boldsymbol{\phi}$$, provides a reconstruction $$\hat{\mathbf{X}}$$ of $$\mathbf{X}$$. Both $$g_{e}$$ and $$g_{d}$$ are convolutional, so $$\mathbf{Y}$$ is a stack of $$m$$ feature maps of size $$h \times w$$. The parameters $$\boldsymbol{\theta}$$ and $$\boldsymbol{\phi}$$ are learned by minimizing a rate-distortion objective function.
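For orientation, a rate-distortion objective of this kind is commonly written in a Lagrangian form such as the one below. This is a generic sketch, not necessarily the exact objective used in this work: the squared-error distortion, the rate term $$R$$ and the trade-off weight $$\gamma$$ are notations assumed here for illustration.

$$\min_{\boldsymbol{\theta}, \boldsymbol{\phi}} \; \mathbb{E} \left[ \left\| \mathbf{X} - g_{d} \left( \mathcal{Q} \left( g_{e} \left( \mathbf{X}; \boldsymbol{\theta} \right) \right); \boldsymbol{\phi} \right) \right\|_{F}^{2} + \gamma \, R \left( \hat{\mathbf{Y}} \right) \right]$$

Under this form, a larger $$\gamma$$ favors a lower rate at the cost of a higher distortion, which is why training one autoencoder per rate-distortion point amounts to training one autoencoder per value of $$\gamma$$.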

This work is structured around two questions:

1. During the training stage, does the learning of the quantization matter?
2. At test time, is it efficient to quantize the coefficients obtained with the learned transform using quantization step sizes which differ from those in the training stage?

To answer Question 1, we compare two approaches for allocating the bits among the feature maps of $$\mathbf{Y}$$. In the $$1^{\text{st}}$$ approach, $$\mathcal{Q}$$ applies a uniform scalar quantization with trainable step size $$\delta_{i} \in \mathbb{R}_{+}^{*}$$ to the $$i^{\text{th}}$$ feature map of $$\mathbf{Y}$$. $$\boldsymbol{\theta}$$, $$\boldsymbol{\phi}$$ and $$\{ \delta_{1}, ..., \delta_{m} \}$$ are learned jointly. In the $$2^{\text{nd}}$$ approach, $$\mathcal{Q}$$ is the uniform scalar quantization with step size 1. A normalization at the encoder side, parametrized by $$\boldsymbol{\varphi}_{e}$$, is inserted between $$g_{e}$$ and $$\mathcal{Q}$$, and a normalization at the decoder side, parametrized by $$\boldsymbol{\varphi}_{d}$$, is inserted between $$\mathcal{Q}$$ and $$g_{d}$$. $$\boldsymbol{\theta}$$, $$\boldsymbol{\varphi}_{e}$$, $$\boldsymbol{\varphi}_{d}$$ and $$\boldsymbol{\phi}$$ are learned jointly.

To answer Question 2, we analyze the characteristics of the distribution of $$\mathbf{Y}$$ after the training.
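
The quantizer of the $$1^{\text{st}}$$ approach can be illustrated with a minimal NumPy sketch. It only shows the forward operation, one step size per feature map, and not how the step sizes are trained. The function name `quantize_feature_maps` and the `(m, h, w)` array layout are assumptions made for this example.

```python
import numpy as np


def quantize_feature_maps(y, deltas):
    """Uniform scalar quantization with one step size per feature map.

    y      : array of shape (m, h, w), the stack of m feature maps.
    deltas : sequence of m strictly positive step sizes.
    Returns an array of shape (m, h, w) whose i-th map only contains
    integer multiples of deltas[i].
    """
    y = np.asarray(y, dtype=np.float64)
    deltas = np.asarray(deltas, dtype=np.float64)
    if deltas.shape != (y.shape[0],):
        raise ValueError("expected one step size per feature map")
    if np.any(deltas <= 0.0):
        raise ValueError("step sizes must be strictly positive")
    # Broadcast each step size over the h x w positions of its map.
    steps = deltas[:, None, None]
    # Note: np.round rounds exact halves to the nearest even integer.
    return steps * np.round(y / steps)


# Two feature maps holding the same values, quantized with steps 1 and 0.5.
y = np.array([[[0.3, 1.2]], [[0.3, 1.2]]])
y_hat = quantize_feature_maps(y, [1.0, 0.5])
print(y_hat)
```

With all step sizes set to 1, the same function reduces to the plain rounding used in the $$2^{\text{nd}}$$ approach, where the normalizations take over the role of scaling each feature map.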

## Results - $$1^{\text{st}}$$ test set

The $$1^{\text{st}}$$ test set contains 24 luminance images created from the Kodak suite [KodakPage].

- For the orange curve, one autoencoder is trained per compression rate and the quantization step size is 1 during the training stage. At test time, the quantization step size is also 1.
- For the green curve, a unique autoencoder is trained and the quantization step sizes are learned. At test time, the quantization step sizes vary.
- For the red curve, a unique autoencoder is trained and the quantization step size is 1 during the training stage. At test time, the quantization step size varies.
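
To illustrate how varying the step size at test time moves along the rate axis, the sketch below quantizes a fixed set of coefficients with several step sizes and estimates the rate as the empirical entropy of the quantization indices. It is not the evaluation code behind the curves: the Laplacian samples stand in for a trained feature map, and the names `empirical_entropy` and `rate_sweep` are hypothetical.

```python
import numpy as np


def empirical_entropy(indices):
    """Empirical entropy, in bits per symbol, of an array of integers."""
    _, counts = np.unique(indices, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def rate_sweep(y, step_sizes):
    """Entropy of the quantization indices of y for each step size."""
    rates = []
    for step in step_sizes:
        indices = np.round(y / step).astype(np.int64)
        rates.append((step, empirical_entropy(indices)))
    return rates


# Assumption: synthetic Laplacian coefficients replace a real feature map.
rng = np.random.default_rng(0)
y = rng.laplace(loc=0.0, scale=1.0, size=20000)

for step, bits in rate_sweep(y, [0.5, 1.0, 2.0]):
    print(f"step size {step:.1f}: about {bits:.2f} bits per coefficient")
```

Larger step sizes merge more coefficients into the same index, so the estimated rate drops. The matching distortion would be measured on the reconstruction produced by the decoder, which this sketch leaves out.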

Each table below contains two crops of a luminance image and their reconstructions via the compression algorithms in Figure 2. The crops are displayed at twice their real size. At the bottom of each column, the pair (rate, PSNR) reports the rate and the PSNR obtained when the whole luminance image is compressed with the corresponding algorithm.
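
As a reminder of how such a pair can be computed, here is a minimal sketch. It assumes 8-bit luminance images, so a peak value of 255, and a rate expressed in bits per pixel. The text above does not state the rate unit, so that part is an assumption, and the function names are hypothetical.

```python
import numpy as np


def psnr(reference, reconstruction, peak=255.0):
    """PSNR in dB between two images with the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        # Identical images: the PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))


def bits_per_pixel(num_bits, height, width):
    """Rate in bits per pixel for a bitstream of num_bits bits."""
    return num_bits / (height * width)


# Toy example: a flat image and a copy shifted by one gray level.
image = np.zeros((64, 64), dtype=np.uint8)
shifted = image + 1
print(bits_per_pixel(1024, 64, 64), psnr(image, shifted))
```

Converting to `float64` before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise produce.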

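| $$1^{\text{st}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.439, 29.180) | (0.301, 27.684) | (0.311, 27.727) | (0.367, 27.636) | (0.325, 28.877) |

| $$4^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.200, 33.479) | (0.231, 34.294) | (0.239, 34.311) | (0.234, 33.721) | (0.228, 35.409) |

| $$6^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.233, 28.590) | (0.240, 28.968) | (0.248, 29.022) | (0.310, 29.675) | (0.237, 30.387) |

| $$9^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.143, 32.482) | (0.145, 32.827) | (0.151, 32.899) | (0.181, 33.502) | (0.112, 33.973) |

| $$10^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.219, 33.933) | (0.235, 34.689) | (0.244, 34.631) | (0.210, 33.515) | (0.221, 36.532) |

| $$14^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.357, 30.604) | (0.401, 31.253) | (0.413, 31.255) | (0.330, 29.666) | (0.245, 30.123) |

| $$17^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.160, 31.461) | (0.167, 31.969) | (0.170, 31.958) | (0.180, 31.665) | (0.148, 32.834) |

| $$18^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.390, 29.766) | (0.438, 30.534) | (0.450, 30.443) | (0.388, 29.605) | (0.292, 30.005) |

| $$22^{\text{nd}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.275, 31.792) | (0.316, 32.533) | (0.331, 32.587) | (0.282, 31.647) | (0.153, 31.158) |

| $$24^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.393, 29.259) | (0.442, 30.097) | (0.453, 30.000) | (0.412, 29.630) | (0.310, 30.253) |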

## Results - $$2^{\text{nd}}$$ test set

The $$2^{\text{nd}}$$ test set contains 100 luminance images created from the BSDS300 dataset [BSDSPage].

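| $$4^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.253, 32.971) | (0.177, 31.741) | (0.184, 31.843) | (0.208, 31.394) | (0.176, 32.509) |

| $$9^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.339, 30.748) | (0.239, 29.516) | (0.243, 29.464) | (0.294, 29.573) | (0.253, 30.625) |

| $$28^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.270, 32.311) | (0.185, 31.032) | (0.190, 31.195) | (0.270, 31.618) | (0.184, 31.677) |

| $$39^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.447, 32.763) | (0.370, 32.110) | (0.381, 32.052) | (0.445, 31.511) | (0.481, 34.040) |

| $$47^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.549, 27.336) | (0.611, 27.967) | (0.620, 27.854) | (0.663, 27.700) | (0.516, 28.260) |

| $$66^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.340, 30.977) | (0.233, 29.600) | (0.237, 29.565) | (0.288, 29.631) | (0.235, 30.550) |

| $$76^{\text{th}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.252, 32.032) | (0.171, 30.945) | (0.175, 30.956) | (0.257, 31.675) | (0.157, 31.431) |

| $$91^{\text{st}}$$ luminance image | orange | green | red | JPEG2000 | H.265 |
| --- | --- | --- | --- | --- | --- |
| (rate, PSNR) | (0.390, 32.612) | (0.322, 32.163) | (0.339, 32.262) | (0.361, 31.505) | (0.400, 33.482) |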