Distinguish ASCII from UTF-8 characters in the same file
On Ubuntu 18.04, I created a dummy text file containing just one non-ASCII character, è, encoded in UTF-8. All the other characters are ASCII:
$ cat dummyfile
Hello
Helloè
This is the resulting hexdump:
$ hexdump -C dummyfile
00000000 48 65 6c 6c 6f 0a 48 65 6c 6c 6f c3 a8 0a |Hello.Hello...|
0000000e
The file is identified as
$ file dummyfile
dummyfile: UTF-8 Unicode text
Each character is represented by a single byte, except for è, which is encoded as the two bytes c3 a8. How can the file contents be interpreted correctly if the number of bytes used to represent each character is not constant?
My guess: when the parser encounters a byte greater than 7F, the last ASCII value (as c3 is here), it is forced to read at least one more byte to determine which character to print. Is that right?
text-processing unicode character-encoding ascii
asked 6 hours ago by BowPark
I think that you haven't quite expressed the question that you mean to ask. Your question actually seems to be two questions: "How does file know that this is UTF-8, when it could instead be an old 8-bit encoding?" followed by "How does a UTF-8 decoder know where multiple-byte sequences begin and end?"
– JdeBP
5 hours ago
@JdeBP Maybe unconsciously the actual questions were the ones you wrote (even though I only used file as a further verification). DopeGhoti's answer addresses the second one. For the first one, maybe file looks for bytes "whose high order bit is set" and can then guess whether the encoding is UTF-8.
– BowPark
5 hours ago
The file command on Ubuntu, as one of its tests, reads the first 96KiB of the file and checks whether there are any non-ASCII well-formed UTF-8 characters in it.
– Mark Plotnick
4 hours ago
1 Answer
From the BSD manual, section 5, the page on UTF8 reads:
DESCRIPTION
The UTF-8 encoding represents UCS-4 characters as a sequence of octets, using between 1 and 6 for each character. It is backwards
compatible with ASCII, so 0x00-0x7f refer to the ASCII character set.
The multibyte encoding of non-ASCII characters consist entirely of bytes whose high order bit is set. The actual encoding is
represented by the following table:
[0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
[0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
[0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] ->
    1110bbbb, 10bbbbbb, 10bbbbbb
[0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] ->
    11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
    111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
    1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
If more than a single representation of a value exists (for example,
0x00; 0xC0 0x80; 0xE0 0x80 0x80), the shortest representation
is always used. Longer ones are detected as an error as they pose a
potential security risk, and destroy the 1:1 character:octet sequence mapping.
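This table is what makes the guess in the question work: the lead byte alone tells a decoder how many bytes the sequence occupies. Below is a minimal Python sketch of that idea, not the implementation of any particular tool. The function names `sequence_length` and `split_utf8` are made up for illustration, and the sketch assumes only the 1- to 4-byte forms permitted by current UTF-8 (RFC 3629), even though the table also lists the older 5- and 6-byte forms.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged from its lead byte."""
    if lead < 0x80:            # 0bbbbbbb: plain ASCII, one byte
        return 1
    if lead >> 5 == 0b110:     # 110bbbbb: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110bbbb: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110bbb: four-byte sequence
        return 4
    # 10bbbbbb is a continuation byte and cannot start a sequence;
    # 0xF8 and above are the obsolete 5- and 6-byte lead bytes.
    raise ValueError(f"invalid lead byte 0x{lead:02x}")


def split_utf8(data: bytes) -> list:
    """Split a byte string into one bytes object per encoded character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead byte must have the form 10bbbbbb.
        if len(chunk) != n or any(b >> 6 != 0b10 for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        chunks.append(chunk)
        i += n
    return chunks


# The 14 bytes shown in the hexdump in the question.
data = bytes.fromhex("48656c6c6f0a48656c6c6fc3a80a")
chunks = split_utf8(data)
print(len(data), len(chunks))                   # 14 13
print(chunks[11], chunks[11].decode("utf-8"))   # b'\xc3\xa8' è
```

Note that this sketch only checks the bit patterns; it does not reject overlong forms such as 0xC0 0x80. A strict decoder does: in Python, `b"\xc0\x80".decode("utf-8")` raises `UnicodeDecodeError`.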
From the Linux manual, section 7, the page on UTF8 similarly reads:
DESCRIPTION
[... UTF-8 is situationally better than UCS-2 in part because i]n addition, the majority of UNIX tools expect ASCII files and can't read 16-bit words as characters without major modifications. [...]
The UTF-8 encoding of Unicode and UCS does not have these problems and is the common way in which Unicode is used on UNIX-style operating systems.
Properties
The UTF-8 encoding has the following nice properties:
- UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
So it's not really possible to distinguish ASCII from UTF-8 because, in a UTF-8 file, ASCII is UTF-8: every ASCII byte is also a valid one-byte UTF-8 sequence. file looks at the first 96KiB of a file and tries to determine what it is. Because it finds at least one well-formed multibyte UTF-8 sequence, it reports the file as UTF-8, which is a strict superset of ASCII.
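As a rough illustration of that conclusion, here is a Python sketch that labels a byte string along the same lines. It is not how file is actually implemented: the function name `classify` and the returned labels are assumptions made for this example, and the 96 KiB window simply mirrors the figure quoted above.

```python
import codecs


def classify(data: bytes, limit: int = 96 * 1024) -> str:
    """Label a byte string as ASCII, UTF-8, or neither (hypothetical labels)."""
    sample = data[:limit]
    if all(b < 0x80 for b in sample):
        # Only 7-bit bytes: the same bytes are valid ASCII and valid UTF-8.
        return "ASCII text"
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multibyte sequence cut off by the size limit;
        # if the whole input fits in the window, it must be complete.
        decoder.decode(sample, final=len(data) <= limit)
    except UnicodeDecodeError:
        return "not UTF-8 (perhaps a legacy 8-bit encoding)"
    return "UTF-8 text"


print(classify(b"Hello\n"))                      # ASCII text
print(classify("Helloè\n".encode("utf-8")))      # UTF-8 text
print(classify("Helloè\n".encode("latin-1")))    # not UTF-8 (perhaps a legacy 8-bit encoding)
```

The Latin-1 case fails because è is the single byte 0xE8 there, which looks like the lead byte of a three-byte sequence but is followed by 0x0A rather than a continuation byte.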
answered 6 hours ago by DopeGhoti, edited 2 hours ago
Thank you. Ubuntu does not have the same man page. The corresponding one is in section 7, and it is not as concise and clear as yours, which can be found in FreeBSD instead.
– BowPark
4 hours ago
I've added a similar citation from the Linux manual (7) to go along with the BSD manual (5) one.
– DopeGhoti
2 hours ago
Thank you so much!
– BowPark
1 hour ago