Dictionaries
ZstdCompressionDict
class zstandard.ZstdCompressionDict(data, dict_type=0, k=0, d=0)
Represents a computed compression dictionary.
Instances are obtained by calling train_dictionary() or by passing bytes obtained from another source into the constructor.
Instances can be constructed from bytes:
>>> dict_data = zstandard.ZstdCompressionDict(data)
It is possible to construct a dictionary from any data. If the data doesn’t begin with a magic header, it will be treated as a prefix dictionary. Prefix dictionaries allow compression operations to reference raw data within the dictionary.
It is possible to force the use of prefix dictionaries or to require a dictionary header:
>>> dict_data = zstandard.ZstdCompressionDict(data, dict_type=zstandard.DICT_TYPE_RAWCONTENT)
>>> dict_data = zstandard.ZstdCompressionDict(data, dict_type=zstandard.DICT_TYPE_FULLDICT)
You can see how many bytes are in the dictionary by calling len():
>>> dict_data = zstandard.train_dictionary(size, samples)
>>> dict_size = len(dict_data)  # will not be larger than ``size``
Once you have a dictionary, you can pass it to the objects performing compression and decompression:
>>> import io
>>> dict_data = zstandard.train_dictionary(131072, samples)
>>> cctx = zstandard.ZstdCompressor(dict_data=dict_data)
>>> for source_data in input_data:
...     compressed = cctx.compress(source_data)
...     # Do something with compressed data.
...
>>> dctx = zstandard.ZstdDecompressor(dict_data=dict_data)
>>> for compressed_data in input_data:
...     buffer = io.BytesIO()
...     with dctx.stream_writer(buffer) as decompressor:
...         decompressor.write(compressed_data)
...         # Do something with raw data in ``buffer``.
Dictionaries have unique integer IDs. You can retrieve this ID via the dict_id() method:
>>> dict_id = dict_data.dict_id()
You can obtain the raw data in the dict (useful for persisting and constructing a ZstdCompressionDict later) via as_bytes():
>>> dict_data = zstandard.train_dictionary(size, samples)
>>> raw_data = dict_data.as_bytes()
By default, when a ZstdCompressionDict is attached to a ZstdCompressor, each ZstdCompressor performs work to prepare the dictionary for use. This is fine if only 1 compression operation is being performed or if the ZstdCompressor is being reused for multiple operations. But if multiple ZstdCompressor instances are being used with the dictionary, this can add overhead.
It is possible to precompute the dictionary so it can readily be consumed by multiple ZstdCompressor instances:
>>> d = zstandard.ZstdCompressionDict(data)
>>> # Precompute for compression level 3.
>>> d.precompute_compress(level=3)
>>> # Precompute with specific compression parameters.
>>> params = zstandard.ZstdCompressionParameters(...)
>>> d.precompute_compress(compression_params=params)
Note
When a dictionary is precomputed, the compression parameters used to precompute the dictionary overwrite some of the compression parameters specified to ZstdCompressor.
Parameters:
- data – Dictionary data.
- dict_type – Type of dictionary. One of the DICT_TYPE_* constants.
as_bytes()
Obtain the bytes representation of the dictionary.
dict_id()
Obtain the integer ID of the dictionary.
precompute_compress(level=0, compression_params=None)
Precompute a dictionary so it can be used by multiple compressors.
Calling this method on an instance that will be used by multiple ZstdCompressor instances will improve performance.
Training Dictionaries
Unless using prefix dictionaries, dictionary data is produced by training on existing data using the train_dictionary() function.
zstandard.train_dictionary(dict_size, samples, k=0, d=0, f=0, split_point=0.0, accel=0, notifications=0, dict_id=0, level=0, steps=0, threads=0)
Train a dictionary from sample data using the COVER algorithm.
A compression dictionary of size dict_size will be created from the iterable of samples. A ZstdCompressionDict holding the trained dictionary will be returned.
The dictionary training mechanism is known as cover. More details about it are available in the paper Effective Construction of Relative Lempel-Ziv Dictionaries (authors: Liao, Petri, Moffat, Wirth).
The cover algorithm takes parameters k and d. These are the segment size and dmer size, respectively. The dictionary instance returned by this function has k and d attributes containing the values for these parameters. If a ZstdCompressionDict is constructed from raw bytes data (a content-only dictionary), the k and d attributes will be 0.
The segment and dmer size parameters to the cover algorithm can either be specified manually or train_dictionary() can try multiple values and pick the best one, where best means the smallest compressed data size. This latter mode is called optimization mode.
Under the hood, this function always calls ZDICT_optimizeTrainFromBuffer_fastCover(). See the corresponding C library documentation for more.
If neither steps nor threads is defined, defaults for d, steps, and level will be used that are equivalent to what ZDICT_trainFromBuffer() would use.
Parameters:
- dict_size – Target size in bytes of the dictionary to generate.
- samples – A list of bytes holding samples the dictionary will be trained from.
- k – Segment size. Constraint: 0 < k. Reasonable range: [16, 2048+].
- d – dmer size. Constraint: 0 < d <= k. Reasonable range: [6, 16].
- f – Log of size of frequency array. Constraint: 0 < f <= 31. 1 means default (20).
- split_point – Fraction of samples used for training. Only used for optimization. The first (number of samples * split_point) samples will be used for training. The last (number of samples * (1 - split_point)) samples will be used for testing. 0 means default (0.75). 1.0 means all samples are used for both training and testing.
- accel – Acceleration level. Constraint: 0 < accel <= 10. Higher means faster and less accurate. 0 means default (1).
- dict_id – Integer dictionary ID for the produced dictionary. Default is 0, which uses a random value.
- steps – Number of steps through k values to perform when trying parameter variations.
- threads – Number of threads to use when trying parameter variations. Default is 0, which means to use a single thread. A negative value can be specified to use as many threads as there are detected logical CPUs.
- level – Integer target compression level when trying parameter variations.
- notifications – Controls writing of informational messages to stderr. 0 (the default) means to write nothing. 1 writes errors. 2 writes progression info. 3 writes more details. 4 writes all info.