Wget download entire site archive [on hold]
A beloved forum of mine seems to be hitting EOL. It went offline in the last week with no likelihood of coming back up.
Both Google Cache and Archive.org still have the pages saved for now. I have yet to figure out how to get this to work with a site on archive.org, though. Sometimes older and sometimes newer pages are cached for different parts of the site, so the URL keeps changing.
Typically I do the following:
wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains website.org \
    --no-parent \
    www.website.org/tutorials/html/
Is there a way to get this command to work for archive.org?
asked 2 days ago by William
edited 2 days ago by Rui F Ribeiro
put on hold as too broad by Rui F Ribeiro, mosvy, sam, X Tian, RalfFriedl yesterday
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
The blog entry suggests you retrieve the result of a search as a csv file of page ids, and pass that to wget. – meuh, 2 days ago
@meuh Wouldn't that qualify as an answer if you copy-pasted the steps here? (If you do, ping me and I'll come back and upvote) ;-) – Fabby, 2 days ago
@Meuh me too (FWIW) – roaima, 2 days ago
1 Answer
The blog entry suggests you retrieve the result of a search as a csv file of page identifiers, and pass that to wget.
Using their example, use the search form and enter suitable terms such as collection:prelinger AND subject:"Health and hygiene" until you have a list of pages you want. Then click on the Advanced Search option below the search bar and select the option CSV format, and ensure only identifier is set in Fields to return. (I had to fix the Query field to replace the escaped quote with a real quote character.) Change the Number of results to the number found by your search.
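The same search can also be fetched from the command line. This is only a sketch, assuming the advancedsearch.php endpoint and the q / fl[] / rows / output parameters that the Advanced Search form submits; adjust the query and row count to your own search:
# Hypothetical example query; URL-encode your own search terms.
curl -s 'https://archive.org/advancedsearch.php?q=collection%3Aprelinger+AND+subject%3A%22Health+and+hygiene%22&fl%5B%5D=identifier&rows=500&output=csv' -o search.csv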
Click search, and you will download a file search.csv. Remove the first title line of this file, and remove all the enclosing double quotes on each line.
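One way to do that cleanup, as a sketch (GNU sed assumed; search.csv is the file the search form produces):
# Drop the header line and strip the double quotes, editing search.csv in place.
sed -i '1d; s/"//g' search.csv
# Quick check: each remaining line should be a bare identifier.
head -n 3 search.csv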
Pass the file to wget with
wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 -i search.csv -B 'http://archive.org/download/'
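Before running the full mirror, it can be worth sanity-checking a single item. This is just an illustrative check (bash process substitution assumed), not part of the blog's recipe:
# --spider only verifies that the resolved URL exists; wget joins the -B base
# URL with each identifier, giving http://archive.org/download/<identifier>.
wget --spider -nv -B 'http://archive.org/download/' -i <(head -n 1 search.csv)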
Note that there is a discussion in the comments about --cut-dirs=2 rather than 1.
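To illustrate the difference with a made-up path: -nH drops the host directory, and --cut-dirs=2 also drops the download/<identifier> components, so a file fetched from
http://archive.org/download/some_identifier/some_file.pdf
is saved as some_file.pdf, whereas --cut-dirs=1 would save it as some_identifier/some_file.pdf, keeping one directory per item.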
This was put in an example script, but now there is also a Python utility, described here.
answered 2 days ago by meuh, edited 2 days ago
Forgive me, but where can I select everything for example.com? – William, 2 days ago
Typically, you have to find the collection that an example page is in, and use that with further filtering if possible. I don't see any obvious "url=example.com" type of search result. – meuh, 2 days ago