How to OCR a PDF file and get the text stored within the PDF?

First, apologies if this has been asked before. I searched through the existing posts for a while but could not find an answer.

I am looking for a way, on Fedora, to OCR a multipage non-searchable PDF and turn it into a new PDF file that contains the text layer on top of the image. On Mac OS X or Windows we could use Adobe Acrobat, but what about Linux, specifically Fedora?

https://snippets.webaware.com.au/howto/pdf-ocr-linux/ seems to describe a solution, but unfortunately I am already lost at the step of retrieving exact-image.

edited Aug 6 at 3:34

Eduard Florinescu

3,104103851

asked Aug 4 '16 at 15:39

ingli

314419

There is a problem with the nice pdfocr script that the page you are linking to recommends: it relies upon pdftk which is essentially deprecated (for two reasons, its dependence on libgcj and on iText5+). So a different solution is needed anyway...
– Maxim
Mar 14 '17 at 6:04

command-line pdf ocr


3 Answers

accepted

The best and easiest way is to use pypdfocr, a Python module; it does not change the original PDF.

pypdfocr your_document.pdf

At the end you will have a second file, your_document_ocr.pdf, with searchable text, just as you want. The tool does not change the image quality; the file grows slightly because of the added text overlay.
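
One way to confirm that the text layer is really there is to try extracting it. The sketch below is an addition, not part of the original answer: it assumes pdftotext from poppler-utils is installed, and has_text_layer is a made-up helper name.

```shell
# Hypothetical helper: succeed if the PDF contains any extractable text.
# Assumes pdftotext (poppler-utils) is on the PATH.
has_text_layer() {
    # With "-" as the output file, pdftotext writes the text to stdout.
    # grep succeeds as soon as it sees one non-whitespace character.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

For example, `has_text_layer your_document_ocr.pdf && echo searchable`. To see the recognised text itself, `pdftotext your_document_ocr.pdf -` prints it to the terminal.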

The pypdfocr command is simple enough that no GUI is needed.
Installing pypdfocr takes a little more typing:

sudo dnf -y install tesseract 

pip install pypdfocr

Update, 3 November 2018:

pypdfocr has been unsupported since 2016, and I noticed some problems caused by the lack of maintenance. ocrmypdf (module) does a similar job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

sudo apt install ocrmypdf  # Ubuntu

sudo dnf -y install tesseract  # Fedora (OCR engine used by ocrmypdf)
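
For a whole folder of scans, the ocrmypdf call above can be wrapped in a small loop. This is only a sketch under a few assumptions, not something from the original answer: the function name ocr_folder and the _ocr suffix are made up for illustration, the Tesseract data for the chosen language is assumed to be installed, and the variables are global because plain POSIX sh has no local. The -l (language) and --skip-text (leave pages that already contain text alone) options are ocrmypdf options.

```shell
# Hypothetical helper: OCR every PDF in a directory with ocrmypdf.
# Usage: ocr_folder DIR [LANG]      (LANG defaults to eng)
ocr_folder() {
    dir=$1
    lang=${2:-eng}
    for src in "$dir"/*.pdf; do
        # If the directory has no PDFs, the pattern stays unexpanded; skip it.
        [ -e "$src" ] || continue
        # Do not re-process files written by an earlier run of this helper.
        case $src in *_ocr.pdf) continue ;; esac
        dst=${src%.pdf}_ocr.pdf
        # Keep going with the next file if one of them fails.
        ocrmypdf -l "$lang" --skip-text "$src" "$dst" || echo "failed: $src" >&2
    done
}
```

Something like `ocr_folder ~/scans` would then write a name_ocr.pdf next to each name.pdf and leave the originals untouched.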

edited 2 days ago

answered Feb 3 at 19:23

Eduard Florinescu

3,104103851

After learning that tesseract can now also produce searchable PDFs, I found the script pdfsandwich: http://www.tobias-elze.de/pdfsandwich/

After installing the dependencies (this might not be the complete list)

sudo dnf install svn ocaml unpaper tesseract

I followed the script's guide for compiling from source:

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

cd pdfsandwich

./configure

make

sudo make install

and this now allows me to run

pdfsandwich multipaged-non-searchable.pdf

resulting in a searchable PDF.
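
To show roughly what happens underneath, here is a hand-rolled sketch of the same idea using tesseract's PDF output directly. It is not how pdfsandwich is implemented and it skips the unpaper clean-up step; it assumes pdftoppm and pdfunite from poppler-utils are installed, and ocr_with_tesseract is a made-up helper name.

```shell
# Hypothetical helper: OCR a scanned PDF page by page with tesseract.
# Usage: ocr_with_tesseract INPUT.pdf OUTPUT.pdf [LANG]
ocr_with_tesseract() {
    src=$1
    dst=$2
    lang=${3:-eng}
    work=$(mktemp -d) || return 1
    status=0
    # Render every page to a 300 dpi PNG: page-1.png, page-2.png, ...
    pdftoppm -r 300 -png "$src" "$work/page" || status=1
    for img in "$work"/page-*.png; do
        [ "$status" -eq 0 ] || break
        # The trailing "pdf" selects tesseract's PDF output, which writes
        # the page image plus an invisible text layer to ${img%.png}.pdf.
        tesseract "$img" "${img%.png}" -l "$lang" pdf || status=1
    done
    if [ "$status" -eq 0 ]; then
        # Join the single-page PDFs, in page order, into one file.
        pdfunite "$work"/page-*.pdf "$dst" || status=1
    fi
    rm -rf "$work"
    return "$status"
}
```

It would be called as `ocr_with_tesseract multipaged-non-searchable.pdf searchable.pdf`. In practice pdfsandwich or ocrmypdf are the more comfortable choice, since they also handle preprocessing and many corner cases.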

edited Mar 20 '17 at 16:07

answered Aug 4 '16 at 15:39

ingli

314419

for a related, but separate question, building on this one, see unix.stackexchange.com/questions/306051/…
– ingli
Aug 27 '16 at 18:25

FWIW: pdfsandwich is also available in Ubuntu's apt package repository. Other distros might have it as well.
– Laurence Gonsalves
Mar 14 at 6:25

unix.stackexchange.com/questions/471985/… any suggestions
– Deepak Umredkar
Sep 28 at 6:23

Just came across fedoramagazine.org/4-cool-new-projects-try-copr-october-2018, which shows a COPR package for Fedora that packages pdfsandwich
– ingli
Oct 26 at 8:59

An easy tool available in Ubuntu is 'ocrfeeder'. It allows the generation of PDFs with OCR text overlaid on the original documents. It makes use of Tesseract plus other OCR engines (not sure which) and provides image rotation, 'unpaper', etc. as well.

http://live.gnome.org/OCRFeeder

https://github.com/GNOME/ocrfeeder

answered Oct 18 at 4:14

jdpipe

1113


