Mass convert thousands of downloaded (with wget) HTML documents to DOCX

I would like to process and convert all the HTML files that I downloaded from a URL with wget.

I want to convert a complete website to DOCX format. We are talking about 3000 HTML documents downloaded from the URL. Running Pandoc by hand on each file is tedious, so the task needs to be automated.

Can this be done automatically in some way?

edited yesterday

Kurt Pfeifle

35828

asked Apr 24 at 0:38

user3127939

Do you want 3000 stand alone word docs or do you want one massive doc with internal links, etc?
– ivanivan
Apr 24 at 0:44

Downloading the URL with wget gives me 3000 HTML files. I would like them as independent documents if possible, with the content of each web page in DOCX.
– user3127939
Apr 24 at 2:08

Look at the headless document conversion option of OpenOffice/LibreOffice.
– ivanivan
Apr 24 at 2:17

What's the use if I cannot agree with wget? You are not answering my question.
– user3127939
Apr 24 at 12:10

wget html pandoc

1 Answer
1. Convert after downloading

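What's the problem with using Pandoc on your saved HTML files?

Assuming your HTML files are all in a directory named wget-html, you could do the following:

 cd wget-html

 find . -type f -name "*.html" -exec \
   pandoc \
     --from=html \
     --to=docx \
     --toc \
     --standalone \
     --output={}.docx \
     {} \;

This will create a DOCX file for each "path/to/some.html", named "path/to/some.html.docx". Note that replacing {} inside a longer argument such as --output={}.docx works with GNU find; POSIX leaves that behavior implementation-defined, so other find implementations may pass the string through unchanged.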
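If the run has to be restartable, or the output should be named "some.docx" rather than "some.html.docx", the same idea can be wrapped in a small script. The following is only a sketch under assumptions: the function names (docx_name, convert_one, convert_tree) and the CONVERTER variable are made up for illustration, the downloaded files are assumed to end in .html or .htm, and file names are assumed not to contain newlines.

```sh
#!/bin/sh
# Sketch: convert every HTML file under a directory to a DOCX file next to it.
# CONVERTER is a hypothetical override hook (default: pandoc) so the loop can
# be tried out with a stub command before running the real conversion.

# Print the output name: strip a trailing .html/.htm and append .docx.
docx_name() {
    case "$1" in
        *.html) printf '%s.docx\n' "${1%.html}" ;;
        *.htm)  printf '%s.docx\n' "${1%.htm}" ;;
        *)      printf '%s.docx\n' "$1" ;;
    esac
}

# Convert a single file. Existing output is left alone, so an interrupted
# run can simply be started again.
convert_one() {
    out=$(docx_name "$1")
    [ -e "$out" ] && return 0
    "${CONVERTER:-pandoc}" --from=html --to=docx --standalone \
        --output="$out" "$1" < /dev/null
}

# Walk the tree below "$1" and convert each HTML file. Failures are reported
# on stderr and do not stop the loop.
# Assumption: no file name contains a newline character.
convert_tree() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) -print |
    while IFS= read -r f; do
        convert_one "$f" || printf 'failed: %s\n' "$f" >&2
    done
}
```

After sourcing the script, "convert_tree wget-html" would convert everything below the wget-html directory. Skipping files whose DOCX already exists is a design choice for large batches: with 3000 documents, a single bad page should not force the whole run to start over.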

2. Convert while downloading

If you want to achieve this, say so. But first please indicate which exact wget command you were using.

answered yesterday

Kurt Pfeifle

35828
