Estimate compressibility of file

Is there a quick and dirty way of estimating gzip-compressibility of a file without having to fully compress it with gzip?

I could, in bash, do

    bc <<<"scale=2;$(gzip -c file | wc -c)/$(wc -c <file)"

This gives me the compression factor without having to write the gz file to disk; this way I can avoid replacing a file on disk with its gz version if the resultant disk space savings do not justify the trouble. But with this approach the file is indeed fully put through gzip; it's just that the output is piped to wc rather than written to disk.

Is there a way to get a rough compressibility estimate for a file without having gzip work on all its contents?
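
For reference, the full-compression check above can be wrapped in a small shell function (a sketch; gzratio is a made-up name and file is a placeholder):

    gzratio() {
      bc <<<"scale=2;$(gzip -c "$1" | wc -c)/$(wc -c <"$1")"
    }
    gzratio file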

compression gzip

asked Sep 16 '14 at 16:48 by iruvar

3 Answers

You could try compressing one out of every 10 blocks, for instance, to get an idea:

    perl -MIPC::Open2 -nE 'BEGIN{$/=4096;open2(*I,*O,"gzip|wc -c")}
    if ($. % 10 == 1) {print O $_; $l+=length}
    END{close O; $c = <I>; say $c/$l}'

(here with 4K blocks).
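
Since perl's -n loop reads from the files named on the command line (or from standard input), the one-liner can be invoked as follows, with file being a placeholder name:

    perl -MIPC::Open2 -nE 'BEGIN{$/=4096;open2(*I,*O,"gzip|wc -c")}
    if ($. % 10 == 1) {print O $_; $l+=length}
    END{close O; $c = <I>; say $c/$l}' file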

answered Sep 16 '14 at 18:23 by Stéphane Chazelas

Here's a (hopefully equivalent) Python version of Stéphane Chazelas's solution:

    python -c "
    import zlib
    from itertools import islice
    from functools import partial
    import sys
    with open(sys.argv[1]) as f:
        compressor = zlib.compressobj()
        t, z = 0, 0.0
        for chunk in islice(iter(partial(f.read, 4096), ''), 0, None, 10):
            t += len(chunk)
            z += len(compressor.compress(chunk))
        z += len(compressor.flush())
        print z/t
    " file
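
Note that the above is Python 2 (the print statement and the '' sentinel assume python is Python 2). A minimal Python 3 sketch of the same idea, reading in binary mode, with file again a placeholder name:

    python3 -c "
    import sys, zlib
    from itertools import islice
    from functools import partial
    with open(sys.argv[1], 'rb') as f:
        compressor = zlib.compressobj()
        t, z = 0, 0
        # compress one 4K chunk out of every 10, as in the Perl version
        for chunk in islice(iter(partial(f.read, 4096), b''), 0, None, 10):
            t += len(chunk)
            z += len(compressor.compress(chunk))
        z += len(compressor.flush())
        print(z / t if t else 0)
    " file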

answered Sep 16 '14 at 19:14 by iruvar, edited Apr 13 '17 at 12:36

• I don't know if that's equivalent (as python is not my cup of coffee ;-b), but that gives slightly different results (even for very large files where the gzip header size overhead can be ignored), possibly because zlib.compressobj uses different settings than gzip (I find that it's closer to gzip -3), or maybe because zlib.compressobj compresses each chunk in isolation (as opposed to the stream as a whole). In any case, both approaches should be good enough. I find that perl is slightly faster.
  – Stéphane Chazelas, Sep 17 '14 at 11:55

• @StéphaneChazelas, judging by the documentation, compressobj should be acting on the stream as a whole. I find that both the Python and Perl solutions produce approximately the same result on my data files (they differ in the second decimal place; that's good enough for me). Thanks for the great idea.
  – iruvar, Sep 17 '14 at 12:48

I had a multi-gigabyte file and I wasn't sure if it was compressed, so I test-compressed the first 10M bytes:

    head -c 10000000 large_file.bin | gzip | wc -c

It's not perfect, but it worked well for me.
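
To turn that raw compressed size into a ratio like the one in the question, divide by the sample size; here is a sketch reusing the question's bc approach (large_file.bin is the same placeholder name):

    bc <<<"scale=2;$(head -c 10000000 large_file.bin | gzip | wc -c)/10000000"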

answered 26 mins ago by aidan