Wget download entire site archive [on hold]

A beloved forum of mine seems to have hit EOL. It went offline in the last week and looks unlikely to ever come back up.



Both Google Cache and Archive.org still have the pages saved, for now. I have yet to figure out how to make my usual approach work with a site on archive.org, though: the archive holds older captures for some parts of the site and newer captures for others, so the URL keeps changing.



Typically I do the following:



wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains website.org \
    --no-parent \
    www.website.org/tutorials/html/


Is there a way to get this command to work for archive.org?

put on hold as too broad by Rui F Ribeiro, mosvy, sam, X Tian, RalfFriedl yesterday


  • The blog entry suggests you retrieve the result of a search as a csv file of page ids, and pass that to wget.
    – meuh
    2 days ago






  • @meuh Wouldn't that qualify as an answer if you copy-pasted the steps here? (If you do, ping me and I'll come back and upvote) ;-)
    – Fabby
    2 days ago

  • @Meuh me too (FWIW)
    – roaima
    2 days ago

edited 2 days ago by Rui F Ribeiro
asked 2 days ago by William

1 Answer

The blog entry suggests you retrieve the result of a search as a csv file of page identifiers, and pass that to wget.



Using their example, enter suitable terms into the search form, such as collection:prelinger AND subject:"Health and hygiene", until you have the list of pages you want. Then click the Advanced Search option below the search bar, select the CSV format option, and make sure only identifier is set in Fields to return. (I had to fix the Query field, replacing the escaped quotes with real quote characters.) Change the Number of results to the number found by your search.
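
If you prefer the command line to the web form, roughly the same query can be sent straight to the advancedsearch.php endpoint that the Advanced Search page submits to. Treat this as a sketch: the parameters below (fl[]=identifier, rows, output=csv) mirror what the form generated for me and may need adjusting.

# Fetch the identifiers for the same search directly into search.csv
# (adjust rows= to the number of results your search reports)
wget -O search.csv 'https://archive.org/advancedsearch.php?q=collection%3Aprelinger+AND+subject%3A%22Health+and+hygiene%22&fl%5B%5D=identifier&rows=500&output=csv'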



Click Search, and a file named search.csv will be downloaded. Remove the first header line from this file, and strip the enclosing double quotes from each line.
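
One way to do that cleanup from the shell, assuming the downloaded file is named search.csv as above:

# Drop the header row and the surrounding double quotes, keeping the name search.csv
tail -n +2 search.csv | tr -d '"' > identifiers.tmp && mv identifiers.tmp search.csv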



Pass the file to wget with



wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 -i search.csv -B 'http://archive.org/download/'
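
For reference, here is the same command again with each option summarised in comments. These are all standard wget flags taken from its manual; nothing here is specific to archive.org.

#   -r              recurse into links found in the downloaded pages
#   -H              allow recursion to span across hosts
#   -nc             no-clobber: skip files that already exist locally
#   -np             never ascend to the parent directory
#   -nH             do not create a top-level directory named after the host
#   --cut-dirs=2    drop the first two directory components from the saved paths
#   -e robots=off   ignore robots.txt
#   -l1             limit the recursion depth to 1
#   -i search.csv   read the list of entries from search.csv
#   -B <base>       resolve each entry in the list against this base URL
wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 -i search.csv -B 'http://archive.org/download/'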


Note that there is a discussion in the comments about --cut-dirs=2 rather than 1.
This was put into an example script, but there is now also a Python utility, described here.






  • Forgive me but where can I select everything for example.com
    – William
    2 days ago












  • Typically, you have to find the collection that an example page is in, and use that with further filtering if possible. I don't see any obvious "url=example.com" type of search result.
    – meuh
    2 days ago

answered 2 days ago by meuh, edited 2 days ago