Process Unicode files with BOM correctly with POSIX tools












Trying to use grep today, I ran into the familiar problem of the Byte Order Mark (BOM) in a Unicode file (UTF-8, in this case). Specifically, I was trying to find a file beginning with XYZ with the pattern grep '^XYZ', but of course grep treated the BOM as three separate characters and did not match the first line of the file if the first line started with XYZ. I even tried to update the regular expression to ignore spaces ('^[[:space:]]*XYZ'), but to no avail.



Other questions have dealt with converting files or targeting the BOM specifically, but I want to know if POSIX tools have a general option to handle Unicode files correctly. If grep handled the Unicode file correctly, it would consider the file contents to start after the BOM and match XYZ on the first line just like any other line.










  • POSIX doesn't have a concept of bytes in text files that aren't part of characters. grep "^($(printf '\xef\xbb\xbf'))?XYZ" would work, modulo other zero-width no-break spaces starting later lines.
    – Michael Homer
    Jan 3 '18 at 1:08










  • ... and so I don't think there's an option or even a compliant locale that could do it, but I'm less certain about locales. Arguably this behaviour is correct, though unhelpful for your use case.
    – Michael Homer
    Jan 3 '18 at 1:11
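
The pattern suggested in the comments can be tried out with a small script. This is only a sketch under a few assumptions: it uses grep -E, because the optional group (...)? is extended-regex syntax, and it builds the BOM with the octal escapes that POSIX printf defines rather than \x hex escapes, which not every printf accepts. The file name sample.txt and its contents are invented for illustration.

```sh
# Build the three-byte UTF-8 BOM (EF BB BF) with portable octal escapes.
bom=$(printf '\357\273\277')

# Hypothetical test file: a BOM, then a first line starting with XYZ.
printf '\357\273\277XYZ first\nabc\nXYZ third\n' > sample.txt

# Match XYZ at the start of a line, optionally preceded by a BOM.
# Prints the first and third lines of sample.txt.
grep -E "^(${bom})?XYZ" sample.txt
```

As the comment notes, this also matches a zero-width no-break space at the start of any later line, not only a true BOM on line 1.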
















grep regular-expression posix unicode






asked Jan 2 '18 at 23:56









palswim

3 Answers














The Unicode Consortium has an FAQ that includes the question "How should I deal with BOMs?" Its answer includes:




Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.




and




Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.




Note that UTF-8 is always of known endianness, because it has no endianness. So as long as you know the text is UTF-8, "the BOM should not be used."



Even cat will return incorrect results when using a BOM unnecessarily, as the BOMs of all files but the first will be treated as zero-width non-breaking spaces. But, the power of UNIX lies in filters.
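
The cat behaviour is easy to observe. In this sketch (the file names and contents are made up), both inputs start with a BOM, and both BOMs are still present after concatenation, the second one now sitting in the middle of the stream.

```sh
bom=$(printf '\357\273\277')

# Two hypothetical files, each beginning with a UTF-8 BOM.
printf '\357\273\277one\n' > first.txt
printf '\357\273\277two\n' > second.txt

# Count the lines of the concatenation that still contain a BOM.
# Prints 2: cat copies bytes through and strips nothing.
cat first.txt second.txt | grep -c "$bom"
```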



For operations on a single file or stream, sed "1s/^$(printf '\357\273\277')//" in a pipeline will strip a BOM if present, leaving BOM-less streams intact.
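
A minimal sketch of that filter applied to the original grep problem follows. The name strip_bom_filter and the sample input are assumptions made here for illustration.

```sh
# strip_bom_filter (hypothetical name): remove a UTF-8 BOM from the
# start of line 1 of standard input; everything else passes unchanged.
strip_bom_filter() { sed "1s/^$(printf '\357\273\277')//"; }

# With the BOM removed, the plain anchored pattern matches line 1.
# Prints 1 (one matching line).
printf '\357\273\277XYZ\n' | strip_bom_filter | grep -c '^XYZ'

# Input without a BOM passes through untouched. Prints XYZ.
printf 'XYZ\n' | strip_bom_filter
```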



For operations with multiple files, a shell with process substitution (like Bash, but unfortunately not POSIX shell) is useful:



sb() { sed "1s/^$(printf '\357\273\277')//" "$@" ; }
cat <(sb file1) <(sb file2) …
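
Where process substitution is not available, a loop gives a similar result in plain POSIX sh. This is a sketch rather than a drop-in replacement: cat_nobom is a name invented here, the sample files are made up, and the variables it sets are global because POSIX sh has no local.

```sh
# cat_nobom (hypothetical helper): concatenate files, removing a leading
# UTF-8 BOM from each one. sed is started once per file, so "line 1"
# means the first line of that file.
cat_nobom() {
  bom=$(printf '\357\273\277')
  for f in "$@"; do
    sed "1s/^${bom}//" "$f"
  done
}

# Example with two invented files, both starting with a BOM.
printf '\357\273\277one\n' > bom_a.txt
printf '\357\273\277two\n' > bom_b.txt
cat_nobom bom_a.txt bom_b.txt
```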





  • Interesting; how, then, does any random text file mark itself as UTF-8 or ASCII without a BOM?
    – palswim
    Jan 5 '18 at 20:43






  • Note that this answer assumes you know the encoding. Plain ASCII is identical to UTF-8. "Extended ASCII" (aka arbitrary 8-bit character sets) frequently uses sequences that would be invalid in UTF-8, so the distinction isn't difficult.
    – Fox
    Jan 5 '18 at 21:56










  • Ah, right, the whole using 7 out of the 8 bits deal.
    – palswim
    Jan 10 '18 at 20:03










  • I think you misunderstood the meaning of the quoted text. It refers to out-of-band signalling: when you already know the stream encoding, there's no need to add an in-band encoding marker. The part I'd have quoted from your source is this: "Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything." It's usually possible to guess the right one, but the purpose of the BOM is exactly to avoid this guesswork. The real issue is that POSIX tools have no concept of Unicode semantics.
    – Matyas Koszik
    4 hours ago
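
The validity check mentioned in these comments can be approximated with iconv. POSIX specifies the utility, but the codeset names it accepts are implementation-defined, so the sketch assumes UTF-8 is a recognised name. is_utf8 is a hypothetical helper and the sample files are invented.

```sh
# is_utf8 (hypothetical helper): succeed if the file decodes as UTF-8.
# Converting UTF-8 to UTF-8 makes iconv fail on invalid byte sequences.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

printf 'caf\303\251\n' > good.txt    # "café" encoded as UTF-8
printf 'caf\351\n' > latin1.txt      # "café" encoded as ISO-8859-1

for f in good.txt latin1.txt; do
  if is_utf8 "$f"; then
    echo "$f: valid UTF-8"
  else
    echo "$f: not valid UTF-8"
  fi
done
```

A pure ASCII file also passes this check, which is consistent with plain ASCII being identical to UTF-8.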



















From the other answer, it sounds like I was dealing with files with an improper BOM signature.



So, the answer is that POSIX tools handle Unicode (UTF-8) files correctly already.



If you have bad Unicode, of course they don't handle it correctly, but you can use the BOM targeting from other questions to deal with superfluous BOM signatures.




















    Most POSIX tools operate on bytes, and not characters. Unicode signalling is meaningless to them, so it'll be treated like any other piece of data.
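
A hex dump shows the BOM as the bytes a tool receives. The sketch below uses dd and od on invented sample files; has_bom is a hypothetical helper, not a standard utility.

```sh
# has_bom (hypothetical helper): succeed if the file starts with EF BB BF.
has_bom() {
  first3=$(dd if="$1" bs=1 count=3 2>/dev/null | od -An -tx1 | tr -d ' \t\n')
  [ "$first3" = "efbbbf" ]
}

printf '\357\273\277XYZ\n' > with_bom.txt
printf 'XYZ\n' > without_bom.txt

# The dump begins with the bytes ef bb bf, followed by 58 59 5a 0a.
od -An -tx1 with_bom.txt

has_bom with_bom.txt && echo "with_bom.txt starts with a BOM"
has_bom without_bom.txt || echo "without_bom.txt has no BOM"
```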






      answered Jan 3 '18 at 5:09









      Fox

          answered Feb 11 '18 at 0:18









          palswim

                  answered 3 hours ago









                  Matyas Koszik
