Estimate compressibility of file
Is there a quick and dirty way of estimating gzip-compressibility of a file without having to fully compress it with gzip?



I could, in bash, do



bc <<<"scale=2;$(gzip -c file | wc -c)/$(wc -c <file)"


This gives me the compression factor without having to write the gz file to disk; this way I can avoid replacing a file on disk with its gz version if the resultant disk space savings do not justify the trouble. But with this approach the file is indeed fully put through gzip; it's just that the output is piped to wc rather than written to disk.
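A minimal sketch of how that check can drive the decision, assuming a hypothetical 0.9 cutoff (the threshold and the literal name file are illustrative, not part of the original command):

ratio=$(bc <<<"scale=2;$(gzip -c file | wc -c)/$(wc -c <file)")
if [ "$(bc <<<"$ratio < 0.9")" -eq 1 ]; then
    gzip file    # savings look worthwhile: replace file with file.gz
else
    printf 'ratio %s: not worth compressing\n' "$ratio" >&2
fi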



Is there a way to get a rough compressibility estimate for a file without having gzip work on all its contents?
compression gzip

asked Sep 16 '14 at 16:48 by iruvar
3 Answers

You could try compressing one out of every 10 blocks, for instance, to get an idea:



perl -MIPC::Open2 -nE 'BEGIN{$/=\4096;open2(*I,*O,"gzip|wc -c")}
if ($. % 10 == 1) {print O $_; $l+=length}
END{close O; $c = <I>; say $c/$l}' < file


          (here with 4K blocks).
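The same sampling idea can also be sketched in plain shell, assuming GNU dd and bash (rough, because the last sampled block may be shorter than 4096 bytes):

f=file bs=4096 step=10
blocks=$(( ($(wc -c <"$f") + bs - 1) / bs ))     # total number of blocks
compressed=$(
    for ((i = 0; i < blocks; i += step)); do     # every 10th block
        dd if="$f" bs="$bs" skip="$i" count=1 2>/dev/null
    done | gzip | wc -c
)
sampled=$(( (blocks + step - 1) / step * bs ))   # approximate sampled byte count
bc <<<"scale=2; $compressed/$sampled"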






answered Sep 16 '14 at 18:23 by Stéphane Chazelas
Here's a (hopefully equivalent) Python version of Stéphane Chazelas's solution:



python -c "
import zlib
from itertools import islice
from functools import partial
import sys
# Python 2: sample every 10th 4K chunk and print compressed/sampled ratio
with open(sys.argv[1]) as f:
    compressor = zlib.compressobj()
    t, z = 0, 0.0
    for chunk in islice(iter(partial(f.read, 4096), ''), 0, None, 10):
        t += len(chunk)
        z += len(compressor.compress(chunk))
    z += len(compressor.flush())
    print z/t
" file





answered Sep 16 '14 at 19:14 by iruvar (edited Apr 13 '17 at 12:36 by Community)
• I don't know if that's equivalent (as python is not my cup of coffee ;-b), but it gives slightly different results (even for very large files where the gzip header size overhead can be ignored), possibly because zlib.compressobj uses different settings than gzip (I find it's closer to gzip -3), or maybe because zlib.compressobj compresses each chunk in isolation (as opposed to the stream as a whole). In any case, both approaches should be good enough. I find that perl is slightly faster.
  – Stéphane Chazelas
  Sep 17 '14 at 11:55
• @StéphaneChazelas, judging by the documentation, compressobj should be acting on the stream as a whole. I find that both the Python and Perl solutions produce approximately the same result on my data files (they differ in the second decimal place; that's good enough for me). Thanks for the great idea!
  – iruvar
  Sep 17 '14 at 12:48
            I had a multi-gigabyte file and I wasn't sure if it was compressed, so I test-compressed the first 10M bytes:



            head -c 10000000 large_file.bin | gzip | wc -c


It's not perfect but it worked well for me.
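To turn that into a ratio, divide by the sample size (a sketch; 10000000 is the byte count from the command above, and it assumes the file is at least that large):

n=10000000
bc <<<"scale=2; $(head -c "$n" large_file.bin | gzip | wc -c)/$n"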






answered 26 mins ago by aidan