How to OCR a PDF file and get the text stored within the PDF?











First, apologies if this has been asked before: I searched through the existing posts for a while but could not find an answer.

I am looking for a way on Fedora to OCR a multipage, non-searchable PDF and turn it into a new PDF that has a text layer on top of the page images. On Mac OS X or Windows we could use Adobe Acrobat, but what about Linux, specifically Fedora?

https://snippets.webaware.com.au/howto/pdf-ocr-linux/ seems to describe a solution, but unfortunately I am already lost at the step of retrieving exact-image.







command-line pdf ocr














edited Aug 6 at 3:34 by Eduard Florinescu
asked Aug 4 '16 at 15:39 by ingli












  • There is a problem with the nice pdfocr script that the page you are linking to recommends: it relies upon pdftk which is essentially deprecated (for two reasons, its dependence on libgcj and on iText5+). So a different solution is needed anyway...
    – Maxim
    Mar 14 '17 at 6:04




























3 Answers
          accepted






The best and easiest way I have found is pypdfocr, a Python module. It leaves the original PDF untouched:

    pypdfocr your_document.pdf

When it finishes you will have a second file, your_document_ocr.pdf, with searchable text. The tool does not change the image quality; the file grows slightly because of the added text overlay.

The command is simple enough that no GUI is needed. Installing pypdfocr takes a little more typing:

    sudo dnf -y install tesseract
    pip install pypdfocr

Update, 3 November 2018:

pypdfocr has not been maintained since 2016, and I have noticed some problems as a result. ocrmypdf (also a Python module) does a similar job and can be used like this:

    ocrmypdf in.pdf out.pdf

To install:

    pip install ocrmypdf

or

    sudo apt install ocrmypdf       # Ubuntu
    sudo dnf -y install tesseract   # Fedora: this installs only the Tesseract engine; ocrmypdf itself still comes from pip as above
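
For a whole folder of scans, the ocrmypdf call can be wrapped in a small shell loop. This is only a sketch: the function names and the `_ocr.pdf` output naming are assumptions made for illustration (the naming is borrowed from pypdfocr), and it assumes ocrmypdf and the English Tesseract data are installed. `--skip-text` asks ocrmypdf to leave pages that already contain text alone.

```shell
# Derive the output name: scan.pdf -> scan_ocr.pdf (naming is an assumption).
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# OCR a single file, skipping pages that already carry text.
ocr_one() {
    src=$1
    out=$(ocr_output_name "$src")
    ocrmypdf -l eng --skip-text "$src" "$out"
}

# OCR every PDF in a directory (default: the current directory).
ocr_dir() {
    for f in "${1:-.}"/*.pdf; do
        [ -e "$f" ] || continue                    # no PDFs matched the glob
        case $f in *_ocr.pdf) continue ;; esac     # do not re-process our own output
        ocr_one "$f"
    done
}
```

After sourcing these functions, `ocr_dir ~/scans` would process every PDF in that directory and write the results next to the originals.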





edited 2 days ago
answered Feb 3 at 19:23 by Eduard Florinescu
























After learning that tesseract can now also produce searchable PDFs, I found the script pdfsandwich: http://www.tobias-elze.de/pdfsandwich/

After installing the dependencies (this might not be the complete list)

    sudo dnf install svn ocaml unpaper tesseract

I followed the project's guide for compiling from source, quoted here:

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as a .tar.bz2 package from the download area on the project website or check them out by subversion:

    svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

    cd pdfsandwich
    ./configure
    make
    sudo make install

This now allows me to run

    pdfsandwich multipaged-non-searchable.pdf

resulting in a searchable PDF.
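
pdfsandwich also accepts a few options that are handy for multipage scans. The wrapper below is a sketch, not the project's recommended usage: `sandwich_pdf`, `run`, and the `DRY_RUN` switch are made-up names, and `-lang` and `-o` are the options as I understand them from the pdfsandwich manual, so check the man page of your installed version. The `_ocr.pdf` default mirrors what pdfsandwich is documented to choose on its own.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# sandwich_pdf INPUT [LANG] [OUTPUT]
# LANG is a Tesseract language code (assumed default: eng).
sandwich_pdf() {
    src=$1
    lang=${2:-eng}
    out=${3:-${src%.pdf}_ocr.pdf}
    run pdfsandwich -lang "$lang" -o "$out" "$src"
}
```

With `DRY_RUN=1` set, `sandwich_pdf scan.pdf deu` only prints the command it would run, which makes it easy to inspect before processing a large file.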







edited Mar 20 '17 at 16:07
answered Aug 4 '16 at 15:39 by ingli












              • for a related, but separate question, building on this one, see unix.stackexchange.com/questions/306051/…
                – ingli
                Aug 27 '16 at 18:25










              • FWIW: pdfsandwich is also available in Ubuntu's apt package repository. Other distros might have it as well.
                – Laurence Gonsalves
                Mar 14 at 6:25










              • unix.stackexchange.com/questions/471985/… any suggestions
                – Deepak Umredkar
                Sep 28 at 6:23










              • Just came across fedoramagazine.org/4-cool-new-projects-try-copr-october-2018 showing a COPR package for fedora that packages pdfsandwich
                – ingli
                Oct 26 at 8:59


















An easy tool available in Ubuntu is 'ocrfeeder'. It can generate PDFs with the OCR text overlaid on the original documents. It uses Tesseract plus other OCR engines (not sure which) and also provides image rotation, 'unpaper' cleanup, and so on.

• http://live.gnome.org/OCRFeeder

• https://github.com/GNOME/ocrfeeder
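
Whichever tool produces the output, it is worth confirming that the overlaid text layer is really there. A rough check, assuming `pdftotext` from poppler-utils is installed; the helper names are made up for this example:

```shell
# has_text: succeed if stdin contains at least one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]'
}

# pdf_has_text_layer FILE: extract the text to stdout ("-") and test it.
# Pages without a text layer yield only whitespace and form feeds,
# which has_text treats as "no text".
pdf_has_text_layer() {
    pdftotext "$1" - 2>/dev/null | has_text
}
```

For example, `pdf_has_text_layer out.pdf && echo searchable` would print "searchable" only when some text can be extracted from out.pdf.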







answered Oct 18 at 4:14 by jdpipe






























                       
