Why is bash removing other digits?

Given this (the bracket expression is not intended to be a range, but an explicit list):

$ a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

$ echo "${a//[0123456789]}"

  ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९

Bash is incorrectly (IMO) removing the digits ٠١٢٣٤٥٦٧٨٩ (the second group).

The characters are all different (hand formatted):

$ for c in $(echo "$a" | grep -o .); do printf '\U%04x ' "'$c"; done; echo

\U0030 \U0031 \U0032 \U0033 \U0034 \U0035 \U0036 \U0037 \U0038 \U0039

\U0660 \U0661 \U0662 \U0663 \U0664 \U0665 \U0666 \U0667 \U0668 \U0669

\U06f0 \U06f1 \U06f2 \U06f3 \U06f4 \U06f5 \U06f6 \U06f7 \U06f8 \U06f9

\U07c0 \U07c1 \U07c2 \U07c3 \U07c4 \U07c5 \U07c6 \U07c7 \U07c8 \U07c9

\U0966 \U0967 \U0968 \U0969 \U096a \U096b \U096c \U096d \U096e \U096f

Which correspond to:

0123456789   # Hindu-Arabic numerals (ASCII digits)

٠١٢٣٤٥٦٧٨٩   # ARABIC-INDIC

۰۱۲۳۴۵۶۷۸۹   # EXTENDED ARABIC-INDIC/PERSIAN

߀߁߂߃߄߅߆߇߈߉  # NKO DIGIT

०१२३४५६७८९   # DEVANAGARI

To rule out problems with pasting from this website, the same Unicode content can also be assigned to the variable a using Unicode escapes:

a=$(echo -e '\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0038\u0039 \u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669 \u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9 \u07c0\u07c1\u07c2\u07c3\u07c4\u07c5\u07c6\u07c7\u07c8\u07c9 \u0966\u0967\u0968\u0969\u096a\u096b\u096c\u096d\u096e\u096f')

Or using $'...' strings, which accept escapes directly:

a=$'\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0038\u0039 \u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669 \u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9 \u07c0\u07c1\u07c2\u07c3\u07c4\u07c5\u07c6\u07c7\u07c8\u07c9 \u0966\u0967\u0968\u0969\u096a\u096b\u096c\u096d\u096e\u096f'

Other shells do not behave like bash (hand formatted):

$ for sh in zsh ksh lksh mksh bash; do $sh -c 'a="0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९"; echo "$0 : ${a//[0123456789]}"' $sh; done

zsh  :  ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९

ksh  :  ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९

lksh :  ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९

mksh :  ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९

bash :   ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९

The bash sort order is:

$ mkdir test1; cd test1; IFS=$' \t\n'

$ touch $(echo "$a" | grep -o .)

$ printf '%s' *; echo

߃߇߆߁߂߅߉߄߀߈0٠०۰1١१۱٢2२۲3٣३۳٤4४۴٥5५۵٦6६۶7٧७۷8٨८۸٩9९۹



$ locale

LANG=en_US.utf8

LANGUAGE=

LC_CTYPE="en_US.utf8"

LC_NUMERIC="en_US.utf8"

LC_TIME="en_US.utf8"

LC_COLLATE="en_US.utf8"

LC_MONETARY="en_US.utf8"

LC_MESSAGES="en_US.utf8"

LC_PAPER="en_US.utf8"

LC_NAME="en_US.utf8"

LC_ADDRESS="en_US.utf8"

LC_TELEPHONE="en_US.utf8"

LC_MEASUREMENT="en_US.utf8"

LC_IDENTIFICATION="en_US.utf8"

LC_ALL=

It doesn't seem to be applying the sort order to remove characters.

It shouldn't anyway (IMO) as the characters are being explicitly listed.

So: Why?

I am using bash 4.4.12 here. It also fails with 3.0, 3.2, 4.0, 4.1, 4.4.23 and 5.0, but not with 2.0.1 or 2.0.5, so it seems that a change in 3.0 introduced the issue.

edited 2 days ago by Filipe Brandenburger

asked Nov 23 at 18:33 by Isaac

Probably this depends on the LANG setting of your environment?
– Romeo Ninov
Nov 23 at 18:38

@RomeoNinov it depends on LC_COLLATE not on LANG
– mosvy
Nov 23 at 18:40

Yes, LC_COLLATE is changing it. That's why I posted the sort order that bash is using (en_US.utf8 if it needs to be known). But that doesn't match the result anyway. ... . And, if the characters are being explicitly given: Why should the collation be applied?
– Isaac
Nov 23 at 18:45

It shouldn't. It's a bug.
– mosvy
Nov 23 at 19:12

MacOS Sierra 10.12.4, Bash 4.4.12(1)-release and en_US.UTF-8: a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'; b="${a//[0123456789]}"; echo "${#a} ${#b}" outputs 54 44 which is what you would expect. I do see your issue on Ubuntu 17.10, 4.4.12(1)-release and en_US.UTF-8 where the output of my command is 54 34. I spotted a report here for C.UTF-8 but I don't know if the underlying issue is relevant.
– Dennis Williamson
Nov 24 at 1:11

bash locale unicode

1 Answer

I managed to reproduce this problem on Ubuntu 17.10 (glibc 2.26) and on Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc 2.28).

The problem is with the localedata, more specifically the LC_COLLATE data for en_US.utf8 (actually, that collation data comes from an ISO 14651 file which is included in most locales, so it probably affects all other utf8 locales as well.)

The localedata comes from glibc and the bug seems to be present there (though distros customize that data fairly heavily, so it's possible other distros with glibc <2.28 might not have the issue.)

In fact, the glibc 2.28 announcement starts listing new features with:

The localization data for ISO 14651 is updated to match the 2016
Edition 4 release of the standard, this matches data provided by
Unicode 9.0.0. This update introduces significant improvements to the
collation of Unicode characters.

Looking at the commits, this was a huge overhaul of the localedata, so that is probably what fixed the bug!

In short, the issue with collation of these two symbols (U0030, which is '0', and U0660, which is the Arabic-Indic zero '٠') is that they sort exactly the same, when compared using strcoll(3), which can be demonstrated with this short test using sort (which uses strcoll under the hood):

ubuntu-18.04$ { echo 0; echo -e '\u0660'; echo 0; } | sort

0

٠

0

And on glibc 2.28:

ubuntu-18.10$ { echo 0; echo -e '\u0660'; echo 0; } | sort

0

0

٠

As you can see, on the older glibc, it's not reordering the Arabic-Indic zero '٠', neither before nor after the '0', which proves they collate the same.
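The same check can be scripted without sort. The following is only a minimal sketch, assuming bash 4.1 or later: the helper name collates_equal is hypothetical, and it relies on the < and > operators of [[ ]], which compare strings using the current locale's collation order. Two different strings for which neither < nor > holds collate equally. The Arabic-Indic zero is written as its UTF-8 bytes so that the script does not depend on how the shell decodes \u escapes.

```shell
#!/usr/bin/env bash
# collates_equal A B: succeed when A and B are different strings that the
# current locale's collation order does not distinguish.
# (Hypothetical helper name, for illustration only.)
collates_equal() {
    [[ $1 != "$2" ]] || return 1     # identical strings are not interesting here
    [[ ! $1 < $2 && ! $1 > $2 ]]     # neither one sorts before the other
}

zero='0'
arabic_zero=$'\331\240'              # UTF-8 bytes of U+0660, ARABIC-INDIC DIGIT ZERO

if collates_equal "$zero" "$arabic_zero"; then
    echo "0 and U+0660 collate equally in this locale"
else
    echo "0 and U+0660 collate differently in this locale"
fi
```

Which branch is taken depends on the locale data installed on the machine, which is the point of the check.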

Looking at the glibc sources, we can understand why the problem happens.

In the glibc 2.27 sources for ISO 14651, the following definitions can be found:

<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0

<U0660> <0>;<BAS>;<MIN>;IGNORE

<U06F0> <0>;<PCL>;<MIN>;IGNORE

<U0966> <0>;"<BAS><NUM>";"<MIN><MIN>";IGNORE

So both '0' (U+0030) and '٠' (U+0660) expand to the exact same sequence (<0>;<BAS>;<MIN>;IGNORE), which means that strcoll will treat them as equal. (This also explains why the other characters such as U+06F0 and U+0966 are not affected, since their expansions are different.)

Looking at the glibc 2.28 sources for ISO 14651, the following definitions are now found:

<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO

<U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO

<U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO

<U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO

<U0966> <S0030>;<BASE>;<MIN>;<U0966> % DEVANAGARI DIGIT ZERO

The fourth field is now always filled with the code point itself, which means they will have a defined sort order, even if the first few fields match. While the change for <U0660> was not introduced in this particular commit, its description explains the idea:

[...] putting the code point of the character into the fourth level
instead of “IGNORE”. Without that change, all such characters
would compare equal which would make a wcscoll test case fail.
It is better to have a clearly defined sort order even for characters
like this so it is good to use the code point as a tie-break.

localedata/locales/iso14651_t1_common: Use the code point of a
character in the fourth collation level instead of IGNORE for all
entries which have IGNORE on all 4 levels.

So hopefully this explains the bug with localedata in glibc <2.28 and the fix in glibc 2.28.

Regarding bash, if you look at the source code, you'll see that it handles a single character (0) in a bracket expression ([0]) the same as if it was a range with the character as both start and end ([0-0]):

cstart = cend = FOLD (cstart);

Then later it compares the current character with that range using RANGECMP:

if (RANGECMP (test, cstart, forcecoll) >= 0 && RANGECMP (test, cend, forcecoll) <= 0)

  goto matched;

And then RANGECMP (defined as rangecmp_wc in multi-byte mode) uses wcscoll(3) (which is the wide-character version of strcoll):

return (wcscoll (s1, s2));

The fact that bash uses a range comparison for a single character (as a shortcut, to share a bit of the code with handling of ranges) makes it so that it accepts all characters that sort the same as well as the original character.
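To make the difference concrete, here is a rough model of the two strategies written in bash itself. It is not bash's actual C code: both function names are hypothetical, and the collation-aware comparison is approximated with the < and > operators of [[ ]] instead of wcscoll.

```shell
#!/usr/bin/env bash
# Rough model of two ways to test one character against a
# single-character bracket expression such as [0].
# Both function names are hypothetical; this is not bash's real code.

# Range-style test: treat [0] as [0-0] and compare by collation order,
# mirroring the cstart = cend shortcut described above.
matches_as_range() {
    local test=$1 cstart=$2 cend=$2
    [[ ! $test < $cstart && ! $test > $cend ]]
}

# Straight comparison: only the very same character matches.
matches_exactly() {
    [[ $1 == "$2" ]]
}

arabic_zero=$'\331\240'   # UTF-8 bytes of U+0660, ARABIC-INDIC DIGIT ZERO

for fn in matches_as_range matches_exactly; do
    if "$fn" "$arabic_zero" 0; then
        echo "$fn: U+0660 matches [0]"
    else
        echo "$fn: U+0660 does not match [0]"
    fi
done
```

With the affected locale data one would expect only the range-style test to report a match. Where the two characters collate differently (glibc 2.28, or the C locale), neither test should.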

Other shells probably don't have this problem because they do a straight comparison if a range is not involved.

The reason this issue first appeared in bash 3.0 is that bash 3.0 introduced multi-byte (Unicode) support, which involved refactoring all this code and probably brought in the locale-aware comparisons that trigger the issue.

UPDATE: This issue was reported as a bug to the bash project by @Isaac.

WORKAROUND: If upgrading to a distro that uses glibc 2.28 is unfeasible, a possible workaround is to use LC_COLLATE=C.utf8 or POSIX.utf8 which define a "trivial" sort order where no codepoints will sort the same. Considering the issue is with collation, setting LC_COLLATE only is enough. Testing this workaround on Ubuntu 17.10 and 18.04 showed it was enough to fix this problem.
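A minimal sketch of confining a workaround like this to a single expansion, assuming bash. The helper name strip_ascii_digits is hypothetical, and it uses plain LC_ALL=C instead of LC_COLLATE=C.utf8. That is a swapped-in variant, chosen because the C locale exists everywhere and LC_ALL overrides any other locale setting. In the C locale bash matches byte by byte, and no byte of a multi-byte UTF-8 sequence is an ASCII digit, so only 0-9 are removed.

```shell
#!/usr/bin/env bash
# strip_ascii_digits STRING: print STRING with only the ASCII digits removed.
# Hypothetical helper; forces the C locale for this one expansion.
strip_ascii_digits() {
    local LC_ALL=C                       # byte-wise matching inside this function only
    printf '%s\n' "${1//[0123456789]}"
}

# "12", U+0660 (bytes \331\240), "x", U+06F0 (bytes \333\260)
sample=$'12\331\240x\333\260'
strip_ascii_digits "$sample"
```

The trade-off of this variant is that everything inside the function sees bytes rather than characters, so it is only suitable when the pattern consists of ASCII characters, as it does here.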

edited 2 hours ago

answered 2 days ago by Filipe Brandenburger

In the text "Looking at the glibc 2.27 sources for ISO 14651, the following definitions are now found:" the version number should be 2.28, right? (Can't edit as it's too short of a change)
– Grisha Levit
2 days ago

@GrishaLevit Indeed! Thanks for spotting that. I just edited it to fix it. Cheers!
– Filipe Brandenburger
2 days ago

Sorry, but the cstart = cend = FOLD (cstart); code only applies to a collating symbol (something written as [ [.cz.] ], that is, between [. and .] inside the brackets), not to a bracket expression in general.
– Isaac
2 days ago

@Isaac I don't think it is, I think [[.cz.]] is handled a few lines above with p = PARSE_COLLSYM (p, &pc);. But that code is pretty complicated, so I'm not 100% sure I found all the right places... I'm still fairly confident that there are some range comparisons for a single character range, since there are cases where cstart = cend, and that would explain why characters that collate the same would look like a "match". That other shells probably implement that differently would explain why they wouldn't be affected by the issue.
– Filipe Brandenburger
2 days ago

Yes, it seems that it should apply to all characters inside a [ ]. I just don't understand why. A collating table should be a total order, or confirming that a character is absent will fail. If a and b sort equal, then a sorted list a a c doesn't confirm that b is absent from the list, as the sort order could have been b a c.
– Isaac
2 days ago

|
show 7 more comments

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f483743%2fwhy-is-bash-removing-other-digits%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
3
down vote

I managed to reproduce this problem on Ubuntu 17.10 (glibc 2.26) and on Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc 2.28)

The localedata comes from glibc and the bug seems to be present there (though distros customize that data fairly heavily, so it's possible other distros with glibc <2.28 might not have the issue.)

In fact, the glibc 2.28 announcement starts listing new features with:

The localization data for ISO 14651 is updated to match the 2016
Edition 4 release of the standard, this matches data provided by
Unicode 9.0.0. This update introduces significant improvements to the
collation of Unicode characters.

Looking at the commits, it's a huge overhaul on the localedata, so that's probably what fixed the bug!

ubuntu-18.04$ { echo 0; echo -e 'u0660'; echo 0; } | sort

0

٠

0

And on glibc 2.28:

ubuntu-18.10$ { echo 0; echo -e 'u0660'; echo 0; } | sort

0

0

٠

As you can see, on the older glibc, it's not reordering the Arabic-Indic zero '٠', neither before nor after the '0', which proves they collate the same.

Looking at the glibc sources, we can understand why the problem happens.

In the glibc 2.27 sources for ISO 14651, the following definitions can be found:

<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0

<U0660> <0>;<BAS>;<MIN>;IGNORE

<U06F0> <0>;<PCL>;<MIN>;IGNORE

<U0966> <0>;"<BAS><NUM>";"<MIN><MIN>";IGNORE

Looking at the glibc 2.28 sources for ISO 14651, the following definitions are now found:

<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO

<U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO

<U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO

<U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO

<U0966> <S0030>;<BASE>;<MIN>;<U0966> % DEVANAGARI DIGIT ZERO

[...] putting the code point of the character into the fourth level
instead of “IGNORE”. Without that change, all such characters
would compare equal which would make a wcscoll test case fail.
It is better to have a clearly defined sort order even for characters
like this so it is good to use the code point as a tie-break.

localedata/locales/iso14651_t1_common: Use the code point of a
character in the fourth collation level instead of IGNORE for all
entries which have IGNORE on all 4 levels.

So hopefully this explains the bug with localedata in glibc <2.28 and the fix in glibc 2.28.

cstart = cend = FOLD (cstart);

Then later it compares the current character with that range using RANGECMP:

if (RANGECMP (test, cstart, forcecoll) >= 0 && RANGECMP (test, cend, forcecoll) <= 0)

  goto matched;

And then RANGECMP (defined to rangecmp_wc in multi-byte mode) uses wcscoll(3) (which is the multi-byte version of strcoll):

return (wcscoll (s1, s2));

Other shells probably don't have this problem because they do a straight comparison if a range is not involved.

UPDATE: This issue was reported as a bug to the bash project by @Isaac.

edited 2 hours ago

answered 2 days ago

Filipe Brandenburger

6,4101727

In the text "Looking at the glibc 2.27 sources for ISO 14651, the following definitions are now found:" the version number should be 2.28, right? (Can't edit as it's too short of a change)
– Grisha Levit
2 days ago

@GrishaLevit Indeed! Thanks for spotting that. I just edited it to fix it. Cheers!
– Filipe Brandenburger
2 days ago

Sorry but the cstart = cend = FOLD (cstart); code only apply to a collating symbol (something written as [ [.cz.] ], that is between [. and .] inside the brackets) not in general to a bracket expression.
– Isaac
2 days ago

@Isaac I don't think it is, I think [[.cz.]] is handled a few lines above with p = PARSE_COLLSYM (p, &pc);. But that code is pretty complicated, so I'm not 100% sure I found all the right places... I'm still fairly confident that there are some range comparisons for a single character range, since there are cases where cstart = cend, and that would explain why characters that collate the same would look like a "match". That other shells probably implement that differently would explain why they wouldn't be affected by the issue.
– Filipe Brandenburger
2 days ago

Yes, it seems that it should apply to all characters inside a [ ]. I just not understand why. A collating table should be a total order or the confirmation that a character is absent will fail. If a and b sort equal then a sorted list where a a c doesn't confirm that b is absent in the list as the sort order could have been b a c.
– Isaac
2 days ago

|
show 7 more comments

up vote
3
down vote

I managed to reproduce this problem on Ubuntu 17.10 (glibc 2.26) and on Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc 2.28)

The localedata comes from glibc and the bug seems to be present there (though distros customize that data fairly heavily, so it's possible other distros with glibc <2.28 might not have the issue.)

In fact, the glibc 2.28 announcement starts listing new features with:

The localization data for ISO 14651 is updated to match the 2016
Edition 4 release of the standard, this matches data provided by
Unicode 9.0.0. This update introduces significant improvements to the
collation of Unicode characters.

Looking at the commits, it's a huge overhaul on the localedata, so that's probably what fixed the bug!

ubuntu-18.04$ { echo 0; echo -e 'u0660'; echo 0; } | sort

0

٠

0

And on glibc 2.28:

ubuntu-18.10$ { echo 0; echo -e 'u0660'; echo 0; } | sort

0

0

٠

As you can see, on the older glibc, it's not reordering the Arabic-Indic zero '٠', neither before nor after the '0', which proves they collate the same.

Looking at the glibc sources, we can understand why the problem happens.

In the glibc 2.27 sources for ISO 14651, the following definitions can be found:

<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0

<U0660> <0>;<BAS>;<MIN>;IGNORE

<U06F0> <0>;<PCL>;<MIN>;IGNORE

<U0966> <0>;"<BAS><NUM>";"<MIN><MIN>";IGNORE

Looking at the glibc 2.28 sources for ISO 14651, the following definitions are now found:

<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO

<U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO

<U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO

<U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO

<U0966> <S0030>;<BASE>;<MIN>;<U0966> % DEVANAGARI DIGIT ZERO

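The fourth field now holds the character's own code point instead of IGNORE. The glibc commit that made this change explains it in its message, quoted here together with its summary line:

[...] putting the code point of the character into the fourth level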
instead of “IGNORE”. Without that change, all such characters
would compare equal which would make a wcscoll test case fail.
It is better to have a clearly defined sort order even for characters
like this so it is good to use the code point as a tie-break.

localedata/locales/iso14651_t1_common: Use the code point of a
character in the fourth collation level instead of IGNORE for all
entries which have IGNORE on all 4 levels.

So hopefully this explains the bug with localedata in glibc <2.28 and the fix in glibc 2.28.
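To see which side of that change a given machine is on, the glibc version can be compared against 2.28. This is a rough sketch only: `glibc_older_than_2_28` is a hypothetical helper, and since distros customize the locale data, the version number is a hint rather than proof that the bug is present or absent.

```shell
# Hypothetical helper: succeeds when a "MAJOR.MINOR" glibc version
# string is older than 2.28, the release with the updated locale data.
glibc_older_than_2_28() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -lt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -lt 28 ]; }
}

# Usage: `getconf GNU_LIBC_VERSION` prints e.g. "glibc 2.27" on glibc
# systems; with other C libraries it fails and nothing is reported.
if version=$(getconf GNU_LIBC_VERSION 2>/dev/null); then
    version=${version#glibc }
    if glibc_older_than_2_28 "$version"; then
        echo "glibc $version: locale data predates the 2.28 update"
    else
        echo "glibc $version: includes the 2.28 locale data update"
    fi
fi
```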

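As for why bash is affected: when a bracket expression lists a single character rather than a range, the bash matcher appears to treat it as a range whose two ends are that same character:

cstart = cend = FOLD (cstart);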

Then later it compares the current character with that range using RANGECMP:

if (RANGECMP (test, cstart, forcecoll) >= 0 && RANGECMP (test, cend, forcecoll) <= 0)

  goto matched;

And then RANGECMP (defined as rangecmp_wc in multi-byte mode) uses wcscoll(3), which is the wide-character counterpart of strcoll(3):

return (wcscoll (s1, s2));

Other shells probably don't have this problem because they do a straight comparison if a range is not involved.
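One possible way to get that kind of straight comparison out of bash is to run the expansion in the C locale, where matching is byte-wise and the collation tables are not consulted. The sketch below assumes that byte-wise matching is acceptable for the data at hand; `strip_ascii_digits` is a hypothetical name, not something from bash or from the question.

```shell
# Hypothetical helper: remove only the ASCII digits 0-9 from a string by
# running the bash expansion in the C locale, so that matching is done
# byte by byte rather than through locale collation.
strip_ascii_digits() {
    LC_ALL=C bash -c 'printf "%s\n" "${1//[0123456789]/}"' _ "$1"
}

# Usage: U+0660 (UTF-8 bytes D9 A0) is kept, the ASCII zero is removed.
sample=$(printf 'x\331\240y0')
strip_ascii_digits "$sample"
```

Running the expansion in a child bash keeps the locale change from leaking into the calling shell, at the cost of one extra process per call.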

UPDATE: This issue was reported as a bug to the bash project by @Isaac.

edited 2 hours ago

answered 2 days ago

Filipe Brandenburger

In the text "Looking at the glibc 2.27 sources for ISO 14651, the following definitions are now found:" the version number should be 2.28, right? (Can't edit as it's too short of a change)
– Grisha Levit
2 days ago

@GrishaLevit Indeed! Thanks for spotting that. I just edited it to fix it. Cheers!
– Filipe Brandenburger
2 days ago

Sorry, but the cstart = cend = FOLD (cstart); code only applies to a collating symbol (something written as [ [.cz.] ], that is, between [. and .] inside the brackets), not to a bracket expression in general.
– Isaac
2 days ago

@Isaac I don't think it is, I think [[.cz.]] is handled a few lines above with p = PARSE_COLLSYM (p, &pc);. But that code is pretty complicated, so I'm not 100% sure I found all the right places... I'm still fairly confident that there are some range comparisons for a single character range, since there are cases where cstart = cend, and that would explain why characters that collate the same would look like a "match". That other shells probably implement that differently would explain why they wouldn't be affected by the issue.
– Filipe Brandenburger
2 days ago

Yes, it seems that it should apply to all characters inside a [ ]. I just don't understand why. A collating table should be a total order, or confirming that a character is absent will fail. If a and b sort equal, then a sorted list such as a a c doesn't confirm that b is absent from the list, as the sort order could have been b a c.
– Isaac
2 days ago

