Mass convert thousands of downloaded (with wget) HTML documents to DOCX












I would like to process and convert all the HTML files that I downloaded from a URL with wget.

I want to convert a complete web page to DOCX format. We are talking about 3000 HTML documents downloaded from the URL. Running Pandoc by hand on each file is tedious without automation.

Can this be done automatically in some way?


  • Do you want 3000 standalone Word docs, or do you want one massive doc with internal links, etc.?
    – ivanivan
    Apr 24 at 0:44

  • Downloading the URL with wget creates 3000 HTML files. I would like to keep them independent if possible, with the content of each web page in DOCX.
    – user3127939
    Apr 24 at 2:08

  • Look at the headless document conversion option for OpenOffice/LibreOffice.
    – ivanivan
    Apr 24 at 2:17

  • What's the use if I cannot agree with wget? You are not answering my question.
    – user3127939
    Apr 24 at 12:10

wget html pandoc






edited yesterday by Kurt Pfeifle
asked Apr 24 at 0:38 by user3127939

1 Answer

1. Convert after downloading



What's the problem with using Pandoc on your saved HTML files?

Assuming your HTML files are all in a directory named wget-html, you could do the following:

    cd wget-html

    find . -name "*.html" -exec \
        pandoc \
            --from=html \
            --to=docx \
            --toc \
            --standalone \
            --output={}.docx \
            {} \;

This will create a DOCX file for each "path/to/some.html" named "path/to/some.html.docx". Note that GNU find substitutes {} even when it is part of a longer argument such as --output={}.docx; other find implementations may not.
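If you would rather get "path/to/some.docx" as the output name and be able to rerun the job after an interruption, something along these lines might work. It is only a sketch in plain sh: the function name convert_tree and the CONVERTER environment variable are made up for illustration (they are not Pandoc features), and it assumes pandoc is on your PATH unless CONVERTER is exported to point elsewhere.

```shell
#!/bin/sh
# Sketch only: convert_tree and CONVERTER are illustrative names, not pandoc features.

# Convert every *.html below directory $1 (default: .) to a sibling *.docx.
convert_tree() {
    # "-exec sh -c ... sh {} +" hands the file names to a small inner script,
    # so names containing spaces survive and {} is never embedded in an argument.
    find "${1:-.}" -type f -name '*.html' -exec sh -c '
        for src do
            dst="${src%.html}.docx"        # some.html -> some.docx
            [ -e "$dst" ] && continue      # already converted: skip, so reruns resume
            "${CONVERTER:-pandoc}" --from=html --to=docx --standalone \
                --output="$dst" "$src" ||
                printf "failed: %s\n" "$src" >&2
        done
    ' sh {} +
}
```

After sourcing the file, running `convert_tree wget-html` would walk the whole download directory. Files that fail to convert are reported on stderr and the loop carries on with the rest, which seems preferable to aborting when there are 3000 documents.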



2. Convert while downloading



If you want to achieve this, say so. But first please indicate which exact wget command you were using.






answered yesterday by Kurt Pfeifle