Distinguish ASCII from UTF-8 characters in the same file

On Ubuntu 18.04, I created a dummy text file containing just one non-ASCII UTF-8 character, è. All the other characters are ASCII:

$ cat dummyfile
Hello
Helloè

This is the resulting hexdump:

$ hexdump -C dummyfile
00000000  48 65 6c 6c 6f 0a 48 65  6c 6c 6f c3 a8 0a        |Hello.Hello...|
0000000e

The file is identified as:

$ file dummyfile
dummyfile: UTF-8 Unicode text

Each character is represented by a single byte, except è, which is encoded as the two bytes c3 a8. How can the file contents be interpreted correctly if the number of bytes used for each character is not constant?

My guess: when the parser encounters a byte greater than 7F, the last ASCII code (as is the case for c3), it is forced to read at least one more byte to determine which character to print. Is that right?

asked 6 hours ago

BowPark

I think that you haven't quite expressed the question that you mean to ask. Your question actually seems to be two questions: "How does file know that this is UTF-8, when it could instead be an old 8-bit encoding?" followed by "How does a UTF-8 decoder know where multi-byte sequences begin and end?"

– JdeBP
5 hours ago

@JdeBP Maybe, unconsciously, the actual questions were the ones you wrote (even though I only used file as a further check). DopeGhoti's answer addresses the second one. For the first one, maybe file looks for bytes "whose high order bit is set" and can then guess that the encoding is UTF-8.

– BowPark
5 hours ago

The file command on Ubuntu, as one of its tests, reads the first 96KiB of the file and checks whether there are any non-ASCII well-formed UTF-8 characters in it.

– Mark Plotnick
4 hours ago

text-processing unicode character-encoding ascii


1 Answer

From the BSD manual, section 5, the page on UTF-8 reads:

DESCRIPTION

The UTF-8 encoding represents UCS-4 characters as a sequence of octets, using between 1 and 6 for each character. It is backwards compatible with ASCII, so 0x00-0x7f refer to the ASCII character set.

The multibyte encoding of non-ASCII characters consist entirely of bytes whose high order bit is set. The actual encoding is represented by the following table:

 [0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
 [0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
 [0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] -> 1110bbbb, 10bbbbbb, 10bbbbbb
 [0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] -> 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
 [0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
 [0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] -> 1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

If more than a single representation of a value exists (for example, 0x00; 0xC0 0x80; 0xE0 0x80 0x80), the shortest representation is always used. Longer ones are detected as an error as they pose a potential security risk, and destroy the 1:1 character:octet sequence mapping.
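To make the table concrete, here is a minimal Python sketch of a decoder that picks the sequence length from the lead byte, which is essentially what the question guessed. The function names `sequence_length` and `decode` are made up for this illustration. It assumes the modern 1-to-4-byte limit from RFC 3629 rather than the 6-byte forms in the table, and its checks are simplified: it rejects the overlong lead bytes 0xC0 and 0xC1 but does not catch every overlong or surrogate sequence that a production decoder must reject.

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes the sequence starting with `lead` occupies."""
    if lead < 0x80:            # 0bbbbbbb: plain ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110bbbbb (0xC0 and 0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110bbbb
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110bbb, capped so values stay <= U+10FFFF
        return 4
    raise ValueError(f"invalid lead byte 0x{lead:02x}")


def decode(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points (simplified checks)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if n == 1:
            cp = lead
        else:
            # Keep only the payload bits of the lead byte: 5, 4 or 3 of them.
            cp = lead & (0xFF >> (n + 1))
            tail = data[i + 1:i + n]
            if len(tail) != n - 1:
                raise ValueError("truncated sequence")
            for b in tail:
                if b & 0xC0 != 0x80:   # continuation bytes look like 10bbbbbb
                    raise ValueError(f"bad continuation byte 0x{b:02x}")
                cp = (cp << 6) | (b & 0x3F)
        code_points.append(cp)
        i += n
    return code_points


# The 14 bytes shown in the question's hexdump.
raw = bytes.fromhex("48656c6c6f0a48656c6c6fc3a80a")
text = "".join(chr(cp) for cp in decode(raw))
print(text == raw.decode("utf-8"))  # prints True
```

For the bytes c3 a8, the lead byte c3 matches 110bbbbb, so exactly one continuation byte follows, and the payload bits 00011 and 101000 combine to 0xE8, the code point of è.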

From the Linux manual, section 7, the page on UTF-8 similarly reads:

DESCRIPTION

[... UTF-8 is situationally better than UCS-2 in part because i]n addition, the majority of UNIX tools expect ASCII files and can't read 16-bit words as characters without major modifications. [...]

The UTF-8 encoding of Unicode and UCS does not have these problems and is the common way in which Unicode is used on UNIX-style operating systems.

Properties

The UTF-8 encoding has the following nice properties:

UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.

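So it's not really possible to distinguish ASCII from UTF-8 because, in a UTF-8 file, ASCII is UTF-8: every ASCII byte is also a valid one-byte UTF-8 sequence. For anything else, the leading bits of the first byte tell the decoder how many continuation bytes follow, as the table shows. file looks at the first 96KiB of a file and tries to determine what it is. Because it finds at least one well-formed non-ASCII UTF-8 sequence there, it reports the file as UTF-8, which is a strict superset of ASCII.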
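As a rough illustration of that kind of check, the following Python sketch classifies a byte string and lists the non-ASCII characters in decoded text. It is not how file is actually implemented; `classify` and `non_ascii` are names invented for this example, and the sketch simply relies on Python's strict built-in UTF-8 decoder.

```python
def classify(data: bytes) -> str:
    """Label bytes as 'ASCII', 'UTF-8' or 'not UTF-8' (illustrative only)."""
    if all(b < 0x80 for b in data):
        return "ASCII"          # also valid UTF-8, but nothing beyond ASCII
    try:
        data.decode("utf-8")    # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return "not UTF-8"
    return "UTF-8"


def non_ascii(text: str) -> list:
    """Return (index, character) pairs for characters outside ASCII."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0x7F]


utf8_bytes = bytes.fromhex("48656c6c6f0a48656c6c6fc3a80a")  # from the hexdump
latin1_bytes = b"Hello\xe8\n"                                # è in ISO-8859-1

print(classify(b"Hello\nHello\n"))            # ASCII
print(classify(utf8_bytes))                   # UTF-8
print(classify(latin1_bytes))                 # not UTF-8
print(non_ascii(utf8_bytes.decode("utf-8")))  # [(11, 'è')]
```

The ISO-8859-1 case fails because 0xE8 looks like the lead byte of a three-byte sequence, yet the next byte, 0x0A, is not a continuation byte.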

edited 2 hours ago

answered 6 hours ago

DopeGhoti


Thank you. Ubuntu does not have the same man page. The corresponding one is in section 7, and it is not as concise and clear as the one you quoted, which can be found in FreeBSD.

– BowPark
4 hours ago

I've added a similar citation from the Linux manual (7) to go along with the BSD manual (5) one.

– DopeGhoti
2 hours ago

Thank you so much!

– BowPark
1 hour ago


