Compression APIs¶

ZstdCompressor¶

class zstandard.ZstdCompressor(level=3, dict_data=None, compression_params=None, write_checksum=None, write_content_size=None, write_dict_id=None, threads=0)¶

Create an object used to perform Zstandard compression.
Each instance is essentially a wrapper around a ZSTD_CCtx from zstd's C API.

An instance can compress data various ways. Instances can be used multiple times. Each compression operation will use the compression parameters defined at construction time.

compression_params is mutually exclusive with level, write_checksum, write_content_size, write_dict_id, and threads.

Assume that each ZstdCompressor instance can only handle a single logical compression operation at a time, i.e. if you call a method like stream_reader() to obtain multiple objects derived from the same ZstdCompressor instance and attempt to use them simultaneously, errors will likely occur.

If you need to perform multiple logical compression operations and you can't guarantee those operations are temporally non-overlapping, you need to obtain multiple ZstdCompressor instances.

Unless specified otherwise, assume that no two methods of ZstdCompressor instances can be called from multiple Python threads simultaneously. In other words, assume instances are not thread safe unless stated otherwise.

Parameters:
- level – Integer compression level. Valid values are all negative integers through 22. Lower values generally yield faster operations with lower compression ratios. Higher values are generally slower but compress better. The default is 3, which is what the zstd CLI uses. Negative levels effectively engage --fast mode from the zstd CLI.
- dict_data – A ZstdCompressionDict to be used to compress with dictionary data.
- compression_params – A ZstdCompressionParameters instance defining low-level compression parameters. If defined, this will overwrite the level argument.
- write_checksum – If True, a 4 byte content checksum will be written with the compressed data, allowing the decompressor to perform content verification.
- write_content_size – If True (the default), the decompressed content size will be included in the header of the compressed data. This data will only be written if the compressor knows the size of the input data.
- write_dict_id – Determines whether the dictionary ID will be written into the compressed data. Defaults to True. Only adds content to the compressed data if a dictionary is being used.
- threads – Number of threads to use to compress data concurrently. When set, compression operations are performed on multiple threads. The default value (0) disables multi-threaded compression. A value of -1 means to set the number of threads to the number of detected logical CPUs.
chunker(size=-1, chunk_size=131591)¶

Create an object for iteratively compressing to same-sized chunks.

This API is similar to ZstdCompressor.compressobj() but has better performance properties.

Parameters:
- size – Size in bytes of data that will be compressed.
- chunk_size – Size of compressed chunks.

Returns: ZstdCompressionChunker.
compress(data)¶

Compress data in a single operation.

This is the simplest mechanism to perform compression: simply pass in a value and get a compressed value back. It is also the mechanism most prone to abuse.

The input and output values must fit in memory, so passing in very large values can result in excessive memory usage. For this reason, one of the streaming based APIs is preferred for larger values.

Parameters: data – Source data to compress.
Returns: Compressed data.

>>> cctx = zstandard.ZstdCompressor()
>>> compressed = cctx.compress(b"data to compress")
compressobj(size=-1)¶

Obtain a compressor exposing the Python standard library compression API.

See ZstdCompressionObj for the full documentation.

Parameters: size – Size in bytes of data that will be compressed.
Returns: ZstdCompressionObj
copy_stream(ifh, ofh, size=-1, read_size=131072, write_size=131591)¶

Copy data between 2 streams while compressing it.

Data will be read from ifh, compressed, and written to ofh. ifh must have a read(size) method. ofh must have a write(data) method.

>>> cctx = zstandard.ZstdCompressor()
>>> with open(input_path, "rb") as ifh, open(output_path, "wb") as ofh:
...     cctx.copy_stream(ifh, ofh)

It is also possible to declare the size of the source stream:

>>> cctx = zstandard.ZstdCompressor()
>>> cctx.copy_stream(ifh, ofh, size=len_of_input)

You can also specify the sizes of the chunks that are read() from the source stream and write()'n to the destination stream:

>>> cctx = zstandard.ZstdCompressor()
>>> cctx.copy_stream(ifh, ofh, read_size=32768, write_size=16384)

The stream copier returns a 2-tuple of bytes read and written:

>>> cctx = zstandard.ZstdCompressor()
>>> read_count, write_count = cctx.copy_stream(ifh, ofh)

Parameters:
- ifh – Source stream to read from.
- ofh – Destination stream to write to.
- size – Size in bytes of the source stream. If defined, compression parameters will be tuned for this size.
- read_size – Chunk size that the source stream should be read() in.
- write_size – Chunk size that should be write()'n to the destination stream.

Returns: 2-tuple of ints of bytes read and written, respectively.
frame_progression()¶

Return information on how much work the compressor has done.

Returns a 3-tuple of (ingested, consumed, produced).

>>> cctx = zstandard.ZstdCompressor()
>>> (ingested, consumed, produced) = cctx.frame_progression()
memory_size()¶

Obtain the memory usage of this compressor, in bytes.

>>> cctx = zstandard.ZstdCompressor()
>>> memory = cctx.memory_size()
multi_compress_to_buffer(data, threads=-1)¶

Compress multiple pieces of data as a single function call.

(Experimental. Not yet supported by the CFFI backend.)

This function is optimized to perform multiple compression operations as quickly as possible with as little overhead as possible.

Data to be compressed can be passed as a BufferWithSegmentsCollection, a BufferWithSegments, or a list containing bytes-like objects. Each element of the container will be compressed individually using the configured parameters on the ZstdCompressor instance.

The threads argument controls how many threads to use for compression. The default is 0, which means to use a single thread. Negative values use the number of logical CPUs in the machine.

The function returns a BufferWithSegmentsCollection. This type represents N discrete memory allocations, each holding 1 or more compressed frames.

Output data is written to shared memory buffers. This means that unlike regular Python objects, a reference to any object within the collection keeps the shared buffer, and therefore the memory backing it, alive. This can have undesirable effects on process memory usage.

The API and behavior of this function is experimental and will likely change. Known deficiencies include:

- If asked to use multiple threads, it will always spawn that many threads, even if the input is too small to use them. It should automatically lower the thread count when the extra threads would just add overhead.
- The buffer allocation strategy is fixed. There is room to make it dynamic, perhaps even to allow one output buffer per input, facilitating a variation of the API to return a list without the adverse effects of shared memory buffers.

Parameters: data – Source to read discrete pieces of data to compress from. Can be a BufferWithSegmentsCollection, a BufferWithSegments, or a list[bytes].

Returns: BufferWithSegmentsCollection holding compressed data.
read_to_iter(reader, size=-1, read_size=131072, write_size=131591)¶

Read uncompressed data from a reader and return an iterator of compressed data.

Returns an iterator of compressed data produced from reading from reader.

This method provides a mechanism to stream compressed data out of a source as an iterator of data chunks.

Uncompressed data will be obtained from reader by calling its read(size) method or by reading a slice (if reader conforms to the buffer protocol). The source data will be streamed into a compressor. As compressed data is available, it will be exposed to the iterator.

Data is read from the source in chunks of read_size. Compressed chunks are at most write_size bytes. Both values default to the zstd input and output defaults, respectively.

If reading from the source via read(), read() will be called until it raises or returns an empty bytes (b""). It is perfectly valid for the source to deliver fewer bytes than were requested by read(size).

The caller is partially in control of how fast data is fed into the compressor by how it consumes the returned iterator. The compressor will not consume from the reader unless the caller consumes from the iterator.

>>> cctx = zstandard.ZstdCompressor()
>>> for chunk in cctx.read_to_iter(fh):
...     # Do something with emitted data.

read_to_iter() accepts a size argument declaring the size of the input stream:

>>> cctx = zstandard.ZstdCompressor()
>>> for chunk in cctx.read_to_iter(fh, size=some_int):
...     pass

You can also control the size that data is read() from the source and the ideal size of output chunks:

>>> cctx = zstandard.ZstdCompressor()
>>> for chunk in cctx.read_to_iter(fh, read_size=16384, write_size=8192):
...     pass

read_to_iter() does not give direct control over the sizes of chunks fed into the compressor. Instead, chunk sizes will be whatever the object being read from delivers. These will often be of a uniform size.

Parameters:
- reader – Stream providing data to be compressed.
- size – Size in bytes of input data.
- read_size – Controls how many bytes are read() from the source.
- write_size – Controls the output size of emitted chunks.

Returns: Iterator of bytes.
stream_reader(source, size=-1, read_size=131072, closefd=True)¶

Wrap a readable source with a stream from which compressed data can be read.

This will produce an object conforming to the io.RawIOBase interface which can be read() from to retrieve compressed data from a source.

The source object can be any object with a read(size) method or an object that conforms to the buffer protocol.

See ZstdCompressionReader for type documentation and usage examples.

Parameters:
- source – Object to read source data from.
- size – Size in bytes of the source object.
- read_size – How many bytes to request when read()'ing from the source.
- closefd – Whether to close the source stream when the returned stream is closed.

Returns: ZstdCompressionReader.
stream_writer(writer, size=-1, write_size=131591, write_return_read=True, closefd=True)¶

Create a stream that will write compressed data into another stream.

The argument to stream_writer() must have a write(data) method. As compressed data is available, write() will be called with the compressed data as its argument. Many common Python types implement write(), including open file handles and io.BytesIO.

See ZstdCompressionWriter for more documentation, including usage examples.

Parameters:
- writer – Stream to write compressed data to.
- size – Size in bytes of data to be compressed. If set, it will be used to influence compression parameter tuning and could result in the size being written into the header of the compressed data.
- write_size – How much data to write() to writer at a time.
- write_return_read – Whether write() should return the number of bytes that were consumed from the input.
- closefd – Whether to close() the writer when this stream is closed.

Returns: ZstdCompressionWriter.
ZstdCompressionWriter¶
class zstandard.ZstdCompressionWriter(compressor, writer, source_size, write_size, write_return_read, closefd=True)¶

Writable compressing stream wrapper.

ZstdCompressionWriter is a write-only stream interface for writing compressed data to another stream.

This type conforms to the io.RawIOBase interface and should be usable by any type that operates against a file-object (typing.BinaryIO in Python type hinting speak). Only methods that involve writing will do useful things.

As data is written to this stream (e.g. via write()), that data is sent to the compressor. As compressed data becomes available from the compressor, it is sent to the underlying stream by calling its write() method.

Both write() and flush() return the number of bytes written to the object's write(). In many cases, small inputs do not accumulate enough data to cause a write and write() will return 0.

Calling close() will mark the stream as closed and subsequent I/O operations will raise ValueError (per the documented behavior of io.RawIOBase). close() will also call close() on the underlying stream if such a method exists and the instance was constructed with closefd=True.

Instances are obtained by calling ZstdCompressor.stream_writer().

Typical usage is as follows:

>>> cctx = zstandard.ZstdCompressor(level=10)
>>> compressor = cctx.stream_writer(fh)
>>> compressor.write(b"chunk 0\n")
>>> compressor.write(b"chunk 1\n")
>>> compressor.flush()
>>> # Receiver will be able to decode ``chunk 0\nchunk 1\n`` at this point.
>>> # Receiver is also expecting more data in the zstd *frame*.
>>>
>>> compressor.write(b"chunk 2\n")
>>> compressor.flush(zstandard.FLUSH_FRAME)
>>> # Receiver will be able to decode ``chunk 0\nchunk 1\nchunk 2``.
>>> # Receiver is expecting no more data, as the zstd frame is closed.
>>> # Any future calls to ``write()`` at this point will construct a new
>>> # zstd frame.
Instances can be used as context managers. Exiting the context manager is the equivalent of calling close(), which is equivalent to calling flush(zstandard.FLUSH_FRAME):

>>> cctx = zstandard.ZstdCompressor(level=10)
>>> with cctx.stream_writer(fh) as compressor:
...     compressor.write(b'chunk 0')
...     compressor.write(b'chunk 1')
...     ...

Important

If flush(FLUSH_FRAME) is not called, emitted data doesn't constitute a full zstd frame and consumers of this data may complain about malformed input. It is recommended to use instances as a context manager to ensure frames are properly finished.

If the size of the data being fed to this streaming compressor is known, you can declare it before compression begins:

>>> cctx = zstandard.ZstdCompressor()
>>> with cctx.stream_writer(fh, size=data_len) as compressor:
...     compressor.write(chunk0)
...     compressor.write(chunk1)
...     ...

Declaring the size of the source data allows compression parameters to be tuned. And if write_content_size is used, it also results in the content size being written into the frame header of the output data.

The size of chunks being write()'n to the destination can be specified:

>>> cctx = zstandard.ZstdCompressor()
>>> with cctx.stream_writer(fh, write_size=32768) as compressor:
...     ...

To see how much memory is being used by the streaming compressor:

>>> cctx = zstandard.ZstdCompressor()
>>> with cctx.stream_writer(fh) as compressor:
...     ...
...     byte_size = compressor.memory_size()

The total number of bytes written so far is exposed via tell():

>>> cctx = zstandard.ZstdCompressor()
>>> with cctx.stream_writer(fh) as compressor:
...     ...
...     total_written = compressor.tell()

stream_writer() accepts a write_return_read boolean argument to control the return value of write(). When False, write() returns the number of bytes that were write()'n to the underlying object. When True (the default), write() returns the number of bytes read from the input that were subsequently written to the compressor. True is the proper behavior for write() as specified by the io.RawIOBase interface.
close()¶

closed¶

fileno()¶
flush(flush_mode=0)¶

Evict data from the compressor's internal state and write it to the inner stream.

Calling this method may result in 0 or more write() calls to the inner stream.

This method will also call flush() on the inner stream, if such a method exists.

Parameters: flush_mode – How to flush the zstd compressor. zstandard.FLUSH_BLOCK will flush data already sent to the compressor but not emitted to the inner stream. The stream is still writable after calling this. This is the default behavior. See documentation for other zstandard.FLUSH_* constants for more flushing options.

Returns: Integer number of bytes written to the inner stream.
isatty()¶

memory_size()¶

read(size=-1)¶

readable()¶

readall()¶

readinto(b)¶

readline(size=-1)¶

readlines(hint=-1)¶

seek(offset, whence=None)¶

seekable()¶

tell()¶

truncate(size=None)¶

writable()¶
write(data)¶

Send data to the compressor and possibly to the inner stream.
writelines(lines)¶
ZstdCompressionReader¶

class zstandard.ZstdCompressionReader(compressor, source, read_size, closefd=True)¶

Readable compressing stream wrapper.
ZstdCompressionReader is a read-only stream interface for obtaining compressed data from a source.

This type conforms to the io.RawIOBase interface and should be usable by any type that operates against a file-object (typing.BinaryIO in Python type hinting speak).

Instances are neither writable nor seekable (even if the underlying source is seekable). readline() and readlines() are not implemented because they don't make sense for compressed data. tell() returns the number of compressed bytes emitted so far.

Instances are obtained by calling ZstdCompressor.stream_reader().

In this example, we open a file for reading and then wrap that file handle with a stream from which compressed data can be read():

>>> with open(path, 'rb') as fh:
...     cctx = zstandard.ZstdCompressor()
...     reader = cctx.stream_reader(fh)
...     while True:
...         chunk = reader.read(16384)
...         if not chunk:
...             break
...
...         # Do something with compressed chunk.
Instances can also be used as context managers:

>>> with open(path, 'rb') as fh:
...     cctx = zstandard.ZstdCompressor()
...     with cctx.stream_reader(fh) as reader:
...         while True:
...             chunk = reader.read(16384)
...             if not chunk:
...                 break
...
...             # Do something with compressed chunk.

When the context manager exits or close() is called, the stream is closed, underlying resources are released, and future operations against the compression stream will fail.

stream_reader() accepts a size argument specifying how large the input stream is. This is used to adjust compression parameters so they are tailored to the source size. e.g.:

>>> with open(path, 'rb') as fh:
...     cctx = zstandard.ZstdCompressor()
...     with cctx.stream_reader(fh, size=os.stat(path).st_size) as reader:
...         ...

If the source is a stream, you can specify how large read() requests to that stream should be via the read_size argument. It defaults to zstandard.COMPRESSION_RECOMMENDED_INPUT_SIZE. e.g.:

>>> with open(path, 'rb') as fh:
...     cctx = zstandard.ZstdCompressor()
...     # Will perform fh.read(8192) when obtaining data to feed into the
...     # compressor.
...     with cctx.stream_reader(fh, read_size=8192) as reader:
...         ...
close()¶

closed¶

flush()¶

isatty()¶

next()¶

read(size=-1)¶

read1(size=-1)¶

readable()¶

readall()¶

readinto(b)¶

readinto1(b)¶

readline()¶

readlines()¶

seekable()¶

tell()¶

writable()¶

write(data)¶

writelines(ignored)¶
ZstdCompressionObj¶

class zstandard.ZstdCompressionObj¶

A compressor conforming to the API in Python's standard library.
This type implements an API similar to compression types in Python's standard library such as zlib.compressobj and bz2.BZ2Compressor. This enables existing code targeting the standard library API to swap in this type to achieve zstd compression.

Important

The design of this API is not ideal for optimal performance.

The reason performance is not optimal is because the API is limited to returning a single buffer holding compressed data. When compressing data, we don't know how much data will be emitted. So in order to capture all this data in a single buffer, we need to perform buffer reallocations and/or extra memory copies. This can add significant overhead depending on the size or nature of the compressed data and how often your application calls this type.

If performance is critical, consider an API like ZstdCompressor.stream_reader(), ZstdCompressor.stream_writer(), ZstdCompressor.chunker(), or ZstdCompressor.read_to_iter(), which result in less overhead managing buffers.

Instances are obtained by calling ZstdCompressor.compressobj().

Here is how this API should be used:

>>> cctx = zstandard.ZstdCompressor()
>>> cobj = cctx.compressobj()
>>> data = cobj.compress(b"raw input 0")
>>> data = cobj.compress(b"raw input 1")
>>> data = cobj.flush()

Or to flush blocks:

>>> cctx = zstandard.ZstdCompressor()
>>> cobj = cctx.compressobj()
>>> data = cobj.compress(b"chunk in first block")
>>> data = cobj.flush(zstandard.COMPRESSOBJ_FLUSH_BLOCK)
>>> data = cobj.compress(b"chunk in second block")
>>> data = cobj.flush()

For best performance results, keep input chunks under 256KB. This avoids extra allocations for a large output object.

It is possible to declare the input size of the data that will be fed into the compressor:

>>> cctx = zstandard.ZstdCompressor()
>>> cobj = cctx.compressobj(size=6)
>>> data = cobj.compress(b"foobar")
>>> data = cobj.flush()
compress(data)¶

Send data to the compressor.

This method receives bytes to feed to the compressor and returns bytes constituting zstd compressed data.

The zstd compressor accumulates bytes and the returned bytes may be substantially smaller or larger than the size of the input data on any given call. The returned value may be the empty byte string (b"").

Parameters: data – Data to write to the compressor.
Returns: Compressed data.
flush(flush_mode=0)¶

Emit data accumulated in the compressor that hasn't been outputted yet.

The flush_mode argument controls how to end the stream.

zstandard.COMPRESSOBJ_FLUSH_FINISH (the default) ends the compression stream and finishes a zstd frame. Once this type of flush is performed, compress() and flush() can no longer be called. This type of flush must be called to end the compression context. If not called, the emitted data may be incomplete and may not be readable by a decompressor.

zstandard.COMPRESSOBJ_FLUSH_BLOCK will flush a zstd block. This ensures that all data fed to this instance will have been emitted and can be decoded by a decompressor. Flushes of this type can be performed multiple times. The next call to compress() will begin a new zstd block.

Parameters: flush_mode – How to flush the zstd compressor.
Returns: Compressed data.
ZstdCompressionChunker¶

class zstandard.ZstdCompressionChunker(compressor, chunk_size)¶

Compress data to uniformly sized chunks.
This type allows you to iteratively feed chunks of data into a compressor and produce output chunks of uniform size.

compress(), flush(), and finish() all return an iterator of bytes instances holding compressed data. The iterator may be empty. Callers MUST iterate through all elements of the returned iterator before performing another operation on the object or else the compressor's internal state may become confused. This can result in an exception being raised or malformed data being emitted.

All chunks emitted by compress() will have a length of the configured chunk size.

flush() and finish() may return a final chunk smaller than the configured chunk size.

Instances are obtained by calling ZstdCompressor.chunker().

Here is how the API should be used:

>>> cctx = zstandard.ZstdCompressor()
>>> chunker = cctx.chunker(chunk_size=32768)
>>>
>>> with open(path, 'rb') as fh:
...     while True:
...         in_chunk = fh.read(32768)
...         if not in_chunk:
...             break
...
...         for out_chunk in chunker.compress(in_chunk):
...             # Do something with output chunk of size 32768.
...
...     for out_chunk in chunker.finish():
...         # Do something with output chunks that finalize the zstd frame.
This compressor type is often a better alternative to ZstdCompressor.compressobj() because it has better performance properties.

compressobj() will emit output data as it is available. This results in a stream of output chunks of varying sizes. The consistency of the output chunk size with chunker() is more appropriate for many usages, such as sending compressed data to a socket.

compressobj() may also perform extra memory reallocations in order to dynamically adjust the sizes of the output chunks. Since chunker() output chunks are all the same size (except for flushed or final chunks), there is less memory allocation/copying overhead.
compress(data)¶

Feed new input data into the compressor.

Parameters: data – Data to feed to the compressor.
Returns: Iterator of bytes representing chunks of compressed data.
finish()¶

Signals the end of input data.

No new data can be compressed after this method is called.

This method will flush buffered data and finish the zstd frame.

Returns: Iterator of bytes of compressed data.
flush()¶

Flushes all data currently in the compressor.

Returns: Iterator of bytes of compressed data.