simple command to strip header and footer from a file












I want a command to strip the XML header and footer from a file:



<?xml version="1.0" encoding="UTF-8"?>
<conxml>
<MsgPain001>
<HashValue>A9C72997C702A2F841B0EEEC3BD274DE1CB7BEA4B813E030D068CB853BCFECA6</HashValue>
<HashAlgorithm>SHA256</HashAlgorithm>
<Document>
...
</Document>
<Document>
...
</Document>
</MsgPain001>
</conxml>


...



Should become just



<Document>
...
</Document>
<Document>
...
</Document>


(Note the indentation: the indent of the first Document tag should be stripped off.)



This sounds like a job for a (greedy) regex:



<Document>.*</Document>


But I can't get it to work because of the line feeds.



I need it in a pipe to compute a hash over the contained documents.










      sed regular-expression






      edited 29 mins ago









      Rui F Ribeiro

      asked Oct 20 '11 at 13:34









      Bastl

          2 Answers
            Using sed:



             sed -n '/<Document>/,/<\/Document>/ p' yourfile.xml


            Explanation:





            • -n makes sed silent: it suppresses the default output, so nothing is printed unless explicitly requested,


            • /pattern/ selects lines that contain the specified pattern,


            • a,b (the comma) forms a range: the action applies to every line from a through b (here a and b are the two patterns above),


            • p stands for print; it is the action performed on the lines in that range.
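
As a quick check of the range address described above, here is a minimal sketch run against a made-up sample. The sample content and the function name extract_documents are assumptions for illustration only; they are not part of the original answer.

```shell
# Hypothetical sample input mirroring the structure shown in the question.
sample='<?xml version="1.0" encoding="UTF-8"?>
<conxml>
 <MsgPain001>
  <HashAlgorithm>SHA256</HashAlgorithm>
  <Document>
   first
  </Document>
  <Document>
   second
  </Document>
 </MsgPain001>
</conxml>'

# Read XML on stdin and print only the lines from each line containing
# <Document> up to the next line containing </Document>.
# Note the escaped slash in the closing tag: / is the address delimiter.
extract_documents() {
  sed -n '/<Document>/,/<\/Document>/p'
}

printf '%s\n' "$sample" | extract_documents
```

With this sample the header and footer lines are dropped and the six Document lines are printed with their original indentation.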




            Edit: If you'd like to additionally strip the whitespace before <Document>, it can be done this way:



             sed -ne '/ <Document>/s/^ *//' -e '/<Document>/,/<\/Document>/ p' yourfile.xml
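
Note that the command above removes the leading spaces from every line containing " <Document>", not only from the first one. If only the first output line should lose its indent, as the example in the question suggests, one possible variant is a second sed stage that touches line 1 only. This is a sketch under that assumption; the sample data and the function name extract_and_trim_first are made up for illustration.

```shell
# Hypothetical sample input with two indented Document elements.
sample=' <MsgPain001>
  <Document>
   first
  </Document>
  <Document>
   second
  </Document>
 </MsgPain001>'

# Stage 1: keep only the <Document> ... </Document> line ranges.
# Stage 2: strip leading whitespace from the first remaining line only.
extract_and_trim_first() {
  sed -n '/<Document>/,/<\/Document>/p' |
  sed '1s/^[[:space:]]*//'
}

printf '%s\n' "$sample" | extract_and_trim_first
```

Here the first line comes out as "<Document>" with no indent, while the second "  <Document>" keeps its two spaces.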














            edited Oct 20 '11 at 14:36

























            answered Oct 20 '11 at 13:45









            rozcietrzewiacz














            • Thanks, I'm a sed noob. What about the indenting whitespace? What does the ',' do?

              – Bastl
              Oct 20 '11 at 13:50













            • It works with whitespace, as well as any other characters, surrounding <Document>. See the update to my answer for a more detailed explanation.

              – rozcietrzewiacz
              Oct 20 '11 at 13:59













            • Good, that's nearly perfect. Now I need to strip off the preceding whitespace from the first line. Is that possible inside your command?

              – Bastl
              Oct 20 '11 at 14:06













            • Yes, though it'll be a bit more complicated - see update. (At this point, I am not sure if it is the simplest way.)

              – rozcietrzewiacz
              Oct 20 '11 at 14:36








            • @Bastl Note that if there's any text between </Document> and the next <Document>, it'll be stripped.

              – Gilles
              Oct 20 '11 at 17:29



















                To prevent text from being stripped between </Document> and the next <Document> you may have to use a series of sed commands (cf. Gilles' comment above).



                Essentially, the first sed collects the entire file in the hold space (so that the file contents can be treated as a single string) and marks the first <Document> and the last </Document> tag; the later stages use those markers for further processing.



                # version 1
                # marker: HERE
                cat file.xml |
                sed -n '1h;1!H;${;g;s/\(<Document>.*<\/Document>\)/HERE\1HERE/g;p;}' |
                sed -n -e '/HERE<Document>/,/<\/Document>HERE/ p' |
                sed -e 's/^ *HERE\(<Document>\)/\1/' -e 's/\(<\/Document>\)HERE *$/\1/'

                # version 2 (using the Bash shell)
                # marker: $'\001'
                cat file.xml |
                sed -n $'1h;1!H;${;g;s/\\(<Document>.*<\\/Document>\\)/\001\\1\001/g;p;}' |
                sed -n -e $'/\001<Document>/,/<\\/Document>\001/ p' |
                sed -e $'s/^ *\001//' -e $'s/\001 *$//' |
                cat -vet   # debugging only: shows non-printing characters; drop this stage before hashing
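
To see the difference between the two approaches, here is a small sketch that runs both on the same made-up input, which has a comment between the two documents. The sample and the function names by_range and by_markers are assumptions for illustration; by_markers is a lightly simplified variant of version 1 that reads stdin, and it relies on sed letting "." match the newlines embedded in the pattern space (GNU sed does).

```shell
# Hypothetical sample with a comment between the two documents.
sample='<conxml>
 <MsgPain001>
  <Document>
   a
  </Document>
  <!-- between -->
  <Document>
   b
  </Document>
 </MsgPain001>
</conxml>'

# Line-range approach: the range restarts at each <Document> line,
# so anything between </Document> and the next <Document> is dropped.
by_range() {
  sed -n '/<Document>/,/<\/Document>/p'
}

# Slurp-and-mark approach: collect the whole input in the hold space,
# wrap the span from the first <Document> to the last </Document> in
# HERE markers, print the marked line range, then remove the markers
# (and the indent in front of the first marker).
by_markers() {
  sed -n '1h;1!H;${g;s/\(<Document>.*<\/Document>\)/HERE\1HERE/;p;}' |
  sed -n '/HERE<Document>/,/<\/Document>HERE/p' |
  sed -e 's/^ *HERE//' -e 's/HERE *$//'
}

echo '--- by_range ---'
printf '%s\n' "$sample" | by_range
echo '--- by_markers ---'
printf '%s\n' "$sample" | by_markers
```

On this input by_range prints six lines without the comment, while by_markers prints seven lines including it. As with the original, the marker approach breaks if the literal text HERE already occurs next to a Document tag in the data.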


                ... but I guess all this could be done more elegantly (& reliably) using xmlstarlet!
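
Since the goal in the question is to feed the extracted documents into a hash, here is a minimal sketch of the complete pipe. It assumes that one of sha256sum, shasum or openssl is installed, and it assumes the hash is meant to cover exactly the bytes sed emits; the question does not say which bytes the HashValue covers, so treat that as a guess. The function names and the sample are made up for illustration.

```shell
# Print the SHA-256 digest (hex) of stdin, using whichever tool is available.
sha256_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | cut -d' ' -f1
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 | cut -d' ' -f1
  else
    # openssl prints "...(stdin)= <hex>"; keep only the hex part.
    openssl dgst -sha256 | sed 's/^.*= *//'
  fi
}

# Extract the Document line ranges from stdin and hash the result.
hash_documents() {
  sed -n '/<Document>/,/<\/Document>/p' | sha256_stdin
}

# Hypothetical sample input.
sample='<conxml>
 <MsgPain001>
  <Document>
   first
  </Document>
 </MsgPain001>
</conxml>'

printf '%s\n' "$sample" | hash_documents
```

Any whitespace difference (indentation, line endings) changes the digest, so the extraction step has to reproduce the expected bytes exactly before the result can be compared with the HashValue element.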







                answered Oct 21 '11 at 12:48









                jon
