how to OCR a pdf file and get the text stored within pdf?
up vote
10
down vote
favorite
First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find an answer.
I am interested in a solution for Fedora to OCR a multipage non-searchable pdf and turn it into a new pdf file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but what about Linux, specifically Fedora?
https://snippets.webaware.com.au/howto/pdf-ocr-linux/ seems to describe a solution - but unfortunately I am already lost at the step of retrieving exact-image.
command-line pdf ocr
There is a problem with the nice pdfocr script that the page you are linking to recommends: it relies upon pdftk which is essentially deprecated (for two reasons, its dependence on libgcj and on iText5+). So a different solution is needed anyway...
– Maxim
Mar 14 '17 at 6:04
edited Aug 6 at 3:34
Eduard Florinescu
asked Aug 4 '16 at 15:39
ingli
3 Answers
up vote
5
down vote
accepted
The best and easiest way out there is to use pypdfocr; it doesn't change the original pdf. pypdfocr is a Python module.
pypdfocr your_document.pdf
At the end you will have another file, your_document_ocr.pdf, the way you want it, with searchable text. The app doesn't change the quality of the image; it only increases the file size a bit by adding the overlay text.
I think the command is simple enough that it doesn't need any GUI.
Installing pypdfocr takes a little more typing:
sudo dnf -y install tesseract
pip install pypdfocr
Update, 3 November 2018:
pypdfocr has not been supported since 2016, and I noticed some problems caused by its not being maintained. ocrmypdf (also a Python module) does a similar job and can be used like this:
ocrmypdf in.pdf out.pdf
To install:
pip install ocrmypdf
or
sudo apt install ocrmypdf  # Ubuntu
sudo dnf -y install tesseract  # Fedora: installs Tesseract only; get ocrmypdf itself via pip as above
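If you have a whole folder of scans, the single ocrmypdf command above can be wrapped in a loop. This is only a sketch: the function names `ocr_output_name` and `ocr_directory` and the `_ocr.pdf` suffix are my own choices for illustration (the suffix mirrors what pypdfocr produces), and it uses nothing beyond the plain `ocrmypdf in.pdf out.pdf` form.

```shell
#!/bin/sh
# Derive "<name>_ocr.pdf" from "<name>.pdf" (hypothetical naming helper).
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Run ocrmypdf on every PDF in a directory, skipping earlier outputs.
ocr_directory() {
    dir=$1
    for in_pdf in "$dir"/*.pdf; do
        [ -e "$in_pdf" ] || continue            # the glob matched nothing
        case $in_pdf in *_ocr.pdf) continue ;; esac
        out_pdf=$(ocr_output_name "$in_pdf")
        ocrmypdf "$in_pdf" "$out_pdf" || echo "failed: $in_pdf" >&2
    done
}

# Usage (after sourcing this file):
#   ocr_directory ./scans
```

The originals are left untouched, and a failure on one file is reported without stopping the rest of the batch.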
up vote
5
down vote
After learning that tesseract can now also produce searchable pdfs, I found the script pdfsandwich: http://www.tobias-elze.de/pdfsandwich/
After installing the dependencies (this might not be the complete list)
sudo dnf install subversion ocaml unpaper tesseract
I followed the project's guide for compiling from source:
Compile from sources
pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:
svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich
If OCaml is installed on your system, you can compile and install as follows:
cd pdfsandwich
./configure
make
sudo make install
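Since the dependency list may not be complete, it can help to check which tools are actually on the PATH before running ./configure. A minimal sketch: `missing_tools` is a hypothetical helper name, and the tool names in the usage comment are just the ones from the dnf line, not a verified list of what pdfsandwich needs.

```shell
#!/bin/sh
# Print each named command that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage (after sourcing this file):
#   missing=$(missing_tools svn ocaml unpaper tesseract)
#   [ -z "$missing" ] || echo "still missing: $missing"
```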
This now allows me to run
pdfsandwich multipaged-non-searchable.pdf
resulting in a searchable pdf.
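To check that the result really is searchable, one option is to try extracting text from it. The sketch below assumes `pdftotext` from poppler-utils is installed (that tool is not part of the answer), the helper names are made up for illustration, and `out.pdf` stands in for whatever file your OCR tool produced.

```shell
#!/bin/sh
# Count the words pdftotext can extract from a PDF. A scanned,
# non-searchable PDF typically yields 0.
pdf_word_count() {
    pdftotext "$1" - 2>/dev/null | wc -w | tr -d ' '
}

# Succeed (exit status 0) only if at least one word can be extracted.
has_text_layer() {
    [ "$(pdf_word_count "$1")" -gt 0 ]
}

# Usage (after sourcing this file):
#   has_text_layer out.pdf && echo "searchable"
```

A non-zero word count only shows that some text layer exists; it says nothing about how accurate the OCR is.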
for a related, but separate question, building on this one, see unix.stackexchange.com/questions/306051/…
– ingli
Aug 27 '16 at 18:25
FWIW: pdfsandwich is also available in Ubuntu's apt package repository. Other distros might have it as well.
– Laurence Gonsalves
Mar 14 at 6:25
unix.stackexchange.com/questions/471985/… any suggestions
– Deepak Umredkar
Sep 28 at 6:23
Just came across fedoramagazine.org/4-cool-new-projects-try-copr-october-2018 showing a COPR package for fedora that packages pdfsandwich
– ingli
Oct 26 at 8:59
up vote
1
down vote
An easy tool available in Ubuntu is 'ocrfeeder'. It allows the generation of PDFs with OCR text overlaid on the original documents. It makes use of Tesseract plus other OCR engines (not sure which) and also provides image rotation, 'unpaper', etc.
- http://live.gnome.org/OCRFeeder
- https://github.com/GNOME/ocrfeeder
edited 2 days ago
answered Feb 3 at 19:23
Eduard Florinescu
edited Mar 20 '17 at 16:07
answered Aug 4 '16 at 15:39
ingli
answered Oct 18 at 4:14
jdpipe