Wget download entire site archive [on hold]



A beloved forum of mine seems to have reached its end of life. It went offline in the last week, with no likelihood of coming back up.



Both Google Cache and Archive.org still have the pages saved for now. I have yet to figure out how to make this work with a site on archive.org, though. It sometimes caches older and sometimes newer snapshots for different parts of the site, so the URL keeps changing.



Typically I do the following:



wget \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains website.org \
    --no-parent \
    www.website.org/tutorials/html/


Is there a way to get this command to work for archive.org?
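
For reference, a hedged sketch of the same kind of command pointed at a single Wayback Machine snapshot follows. The timestamp 20180101000000 and the domain website.org are placeholders, not values from this question, and recursive retrieval through web.archive.org tends to wander across snapshots and hosts, so treat it as an illustration only:

wget \
    --recursive \
    --no-parent \
    --page-requisites \
    --convert-links \
    --html-extension \
    "https://web.archive.org/web/20180101000000/http://www.website.org/tutorials/html/"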



asked 2 days ago by William, edited 2 days ago by Rui F Ribeiro




put on hold as too broad by Rui F Ribeiro, mosvy, sam, X Tian, RalfFriedl yesterday


Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.






  • The blog entry suggests you retrieve the result of a search as a csv file of page ids, and pass that to wget.
    – meuh
    2 days ago






  • @meuh Wouldn't that qualify as an answer if you copy-pasted the steps here? (If you do, ping me and I'll come back and upvote) ;-)
    – Fabby
    2 days ago






  • @Meuh me too (FWIW)
    – roaima
    2 days ago



1 Answer



The blog entry suggests you retrieve the result of a search as a CSV file of page identifiers and pass that to wget.



Using their example, use the search form and enter suitable terms such as collection:prelinger AND subject:"Health and hygiene" until you have a list of pages you want. Then click the Advanced Search option below the search bar, select the CSV format option, and make sure only identifier is set in Fields to return. (I had to fix the Query field, replacing the escaped quote with a real " character.) Change the Number of results to the number found by your search.



Click search, and you will download a file search.csv. Remove the first title line of this file, and remove all the enclosing double quotes on each line.
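
A minimal sketch of that cleanup step, assuming the download really is named search.csv and writing the cleaned list to a new file identifiers.txt (the output name is just a placeholder; you could also edit search.csv in place):

tail -n +2 search.csv | tr -d '"' > identifiers.txt

If you keep the identifiers in a separate file like this, pass -i identifiers.txt to wget below instead of -i search.csv.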



Pass the file to wget with



wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 -i search.csv -B 'http://archive.org/download/'


Note that there is a discussion in the comments about --cut-dirs=2 rather than 1.
This was put in an example script, but now there is also a Python utility, described here.
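
Putting the steps above together, a sketch of what such an example script might look like (a reconstruction from this answer's description, not the author's actual script; the file names are placeholders):

#!/bin/sh
# Clean the Advanced Search export: drop the header line and strip the quotes.
tail -n +2 search.csv | tr -d '"' > identifiers.txt
# Each identifier in the list is resolved against the -B base URL, so wget
# fetches http://archive.org/download/<identifier> for every line.
wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 \
    -B 'http://archive.org/download/' -i identifiers.txt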



answered 2 days ago by meuh (edited 2 days ago)



  • Forgive me, but where can I select everything for example.com?
    – William
    2 days ago



  • Typically, you have to find the collection that an example page is in, and use that with further filtering if possible. I don't see any obvious "url=example.com" type of search result.
    – meuh
    2 days ago