Downloading nested pdf files with wget
I am trying to download dozens of PDF files located on pages linked from here:
http://machineknittingetc.com/passap.html?limit=all
Each PDF is referred to by a URL ending with /downloadable/download/sample/sample_id/[some three digit number]/.
I have tried these:
wget -r -l2 -A.pdf http://machineknittingetc.com/passap.html?limit=all
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.pdf"
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.###"
It doesn't get the PDFs.
Does it have something to do with the server not being indexed to allow me to access the URLs like a file hierarchy? Is there a way to make it work?
linux wget
edited Jan 2 '17 at 8:05
Kusalananda
asked Jan 2 '17 at 7:47
Kallaste
That is much better, thank you.
– Kallaste
Jun 9 '17 at 20:02
3 Answers
@rajaganesh87
You are guessing at the directory link numbers, and your code does not work for the actual links reachable from the base link http://machineknittingetc.com/passap.html?limit=all and the (.pdf) files corresponding to it.
The problem is that you are being blocked by the robots.txt file and that you are using the dot (.) in -A .pdf
Try the command below, which I tested and which works.
wget -np -nd -r -l2 -A pdf -e robots=off 'http://machineknittingetc.com/passap.html?limit=all'
Hope this helps.
answered May 26 '17 at 8:19
Jason Swartz
Does this work for you?
#!/bin/bash
# Request every sample ID from 000 to 175 in turn.
for i in {000..175}
do
    wget "http://machineknittingetc.com/downloadable/download/sample/sample_id/$i"
done
answered Jan 2 '17 at 8:51
rajaganesh87
Yes, thanks! But it gets a lot more than the links on that page. Apparently the downloadable subpath has a lot of files. I will look for a range for the files I want (hopefully they are not randomly numbered) and see if I can alter it.
– Kallaste
Jan 2 '17 at 9:21
I really should have thought of that.
– Kallaste
Jan 2 '17 at 9:23
No, it seems the files I want are not numbered predictably. Some are consecutive but then they start to jump around. Short of passing it a list of each file path, this will not work. Is there no way to do it with the wget filters?
– Kallaste
Jan 2 '17 at 9:37
@Kallaste In that case, get the HTML using wget, grep it for the document numbers, and then download from that list.
– rajaganesh87
Jan 4 '17 at 11:26
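The approach suggested in the comment above (fetch the HTML, grep it for the document links, then download from the resulting list) might be sketched roughly as follows. This is only a sketch under assumptions: it assumes the download links appear as absolute URLs in the HTML of the page being fetched, and the function names extract_sample_urls and download_samples and the file name sample_urls.txt are hypothetical, chosen here for illustration. If the links only appear on the pages linked from the index, the same extraction would need to be run on each of those pages.

```shell
#!/bin/bash
# Sketch only: extract_sample_urls and download_samples are hypothetical
# helper names, and sample_urls.txt is an arbitrary scratch file.

# Read HTML on standard input and print each unique absolute URL that ends in
# /downloadable/download/sample/sample_id/<digits>/ (trailing slash optional).
extract_sample_urls() {
    grep -oE 'https?://[^"'\'' <>]*/downloadable/download/sample/sample_id/[0-9]+/?' | sort -u
}

# Fetch one page, save the matching links to a list, then download each entry.
# --content-disposition asks wget to use the server-suggested file name, if any.
download_samples() {
    local index_url=$1
    wget -q -O - "$index_url" | extract_sample_urls > sample_urls.txt
    wget --content-disposition -i sample_urls.txt
}
```

After sourcing the functions, one possible invocation would be download_samples 'http://machineknittingetc.com/passap.html?limit=all'. Keeping the extraction step separate makes it possible to inspect sample_urls.txt before anything is downloaded.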
@rajaganesh87 Do you have a bash script that uses wget to download the PDF books for school, college and high school from the JSP page http://cnp.com.tn/CNP1/web/french/biblio/man-eleves.jsp
Thank you for your help.
answered 1 hour ago
Karim Bn Abdlaziz
If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From Review
– Jeff Schaller
59 mins ago