Why is bash removing other digits?

Given this (the bracket expression is intended as an explicit list, not a range):



$ a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
$ echo "${a//[0123456789]}"
۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९


Bash is incorrectly (IMO) removing the digits ٠١٢٣٤٥٦٧٨٩ (the second group).





The characters are all different (hand formatted):



$ for c in $(echo "$a" | grep -o .); do printf '\\U%04x ' "'$c"; done; echo
\U0030 \U0031 \U0032 \U0033 \U0034 \U0035 \U0036 \U0037 \U0038 \U0039
\U0660 \U0661 \U0662 \U0663 \U0664 \U0665 \U0666 \U0667 \U0668 \U0669
\U06f0 \U06f1 \U06f2 \U06f3 \U06f4 \U06f5 \U06f6 \U06f7 \U06f8 \U06f9
\U07c0 \U07c1 \U07c2 \U07c3 \U07c4 \U07c5 \U07c6 \U07c7 \U07c8 \U07c9
\U0966 \U0967 \U0968 \U0969 \U096a \U096b \U096c \U096d \U096e \U096f
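
As an aside, a listing like this can also be produced without relying on word splitting or grep. The following is only a sketch: codepoints is a hypothetical helper name, and it assumes a UTF-8 locale so that bash indexes the string by character rather than by byte.

```bash
# codepoints: hypothetical helper, not part of the original question.
# Prints the code point of every character in its first argument.
codepoints() {
    local s=$1 i cp out=()
    for ((i = 0; i < ${#s}; i++)); do
        # A leading quote makes printf use the character's numeric value.
        printf -v cp 'U+%04X' "'${s:i:1}"
        out+=("$cp")
    done
    echo "${out[*]}"
}

codepoints 'A0'    # prints: U+0041 U+0030
```

Calling codepoints "$a" should then list all five groups of digits on one line, spaces included (as U+0020).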


Which correspond to:



0123456789   # Hindu-Arabic numerals (ASCII digits)
٠١٢٣٤٥٦٧٨٩ # ARABIC-INDIC
۰۱۲۳۴۵۶۷۸۹ # EXTENDED ARABIC-INDIC/PERSIAN
߀߁߂߃߄߅߆߇߈߉ # NKO DIGIT
०१२३४५६७८९ # DEVANAGARI


To rule out problems with pasting from this website, the same Unicode content can also be assigned to the variable a using Unicode escapes:



a=$(echo -e '\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0038\u0039 \u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669 \u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9 \u07c0\u07c1\u07c2\u07c3\u07c4\u07c5\u07c6\u07c7\u07c8\u07c9 \u0966\u0967\u0968\u0969\u096a\u096b\u096c\u096d\u096e\u096f')


Or using the $'...' strings which accept escapes directly:



a=$'\u0030\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0038\u0039 \u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669 \u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9 \u07c0\u07c1\u07c2\u07c3\u07c4\u07c5\u07c6\u07c7\u07c8\u07c9 \u0966\u0967\u0968\u0969\u096a\u096b\u096c\u096d\u096e\u096f'




Other shells do not behave like bash (hand formatted):



$ for sh in zsh ksh lksh mksh bash; do $sh -c 'a="0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९"; echo "$0 : ${a//[0123456789]}" $sh'; done
zsh : ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९
ksh : ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९
lksh : ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९
mksh : ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९
bash : ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९




The bash sort order is:



$ mkdir test1; cd test1; IFS=$' \t\n'
$ touch $(echo "$a" | grep -o .)
$ printf '%s' *; echo
߃߇߆߁߂߅߉߄߀߈0٠०۰1١१۱٢2२۲3٣३۳٤4४۴٥5५۵٦6६۶7٧७۷8٨८۸٩9९۹

$ locale
LANG=en_US.utf8
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=


It doesn't seem to be applying the sort order to remove characters.



It shouldn't anyway (IMO) as the characters are being explicitly listed.



So: Why?





I am using bash 4.4.12 here. It also fails with 3.0, 3.2, 4.0, 4.1, 4.4.23 and 5.0, but not with 2.0.1 or 2.0.5, so it seems that a change in 3.0 caused the issue.










  • 1




    Probably this depends on the LANG setting of your environment?
    – Romeo Ninov
    Nov 23 at 18:38






  • 2




    @RomeoNinov it depends on LC_COLLATE not on LANG
    – mosvy
    Nov 23 at 18:40










  • Yes, LC_COLLATE is changing it. That's why I posted the sort order that bash is using (en_US.utf8, if it needs to be known). But that doesn't match the result anyway. And if the characters are being given explicitly, why should the collation be applied?
    – Isaac
    Nov 23 at 18:45






  • 1




    It shouldn't. It's a bug.
    – mosvy
    Nov 23 at 19:12










  • MacOS Sierra 10.12.4, Bash 4.4.12(1)-release and en_US.UTF-8: a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'; b="${a//[0123456789]}"; echo "${#a} ${#b}" outputs 54 44 which is what you would expect. I do see your issue on Ubuntu 17.10, 4.4.12(1)-release and en_US.UTF-8 where the output of my command is 54 34. I spotted a report here for C.UTF-8 but I don't know if the underlying issue is relevant.
    – Dennis Williamson
    Nov 24 at 1:11

















bash locale unicode














edited 2 days ago by Filipe Brandenburger

asked Nov 23 at 18:33 by Isaac




















1 Answer













I managed to reproduce this problem on Ubuntu 17.10 (glibc 2.26) and on Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc 2.28).



The problem is with the localedata, more specifically the LC_COLLATE data for en_US.utf8 (actually, that collation data comes from an ISO 14651 file which is included in most locales, so it probably affects all other utf8 locales as well.)



The localedata comes from glibc and the bug seems to be present there (though distros customize that data fairly heavily, so it's possible other distros with glibc <2.28 might not have the issue.)



In fact, the glibc 2.28 announcement starts listing new features with:




The localization data for ISO 14651 is updated to match the 2016
Edition 4 release of the standard, this matches data provided by
Unicode 9.0.0. This update introduces significant improvements to the
collation of Unicode characters.




Looking at the commits, it's a huge overhaul of the localedata, so that's probably what fixed the bug!



In short, the issue with the collation of these two characters (U+0030, which is '0', and U+0660, which is the Arabic-Indic zero '٠') is that they sort exactly the same when compared using strcoll(3). This can be demonstrated with a short test using sort (which uses strcoll under the hood):



ubuntu-18.04$ { echo 0; echo -e '\u0660'; echo 0; } | sort
0
٠
0


And on glibc 2.28:



ubuntu-18.10$ { echo 0; echo -e '\u0660'; echo 0; } | sort
0
0
٠


As you can see, the older glibc leaves the Arabic-Indic zero '٠' where it was, moving it neither before nor after the '0' lines, which shows that the two characters collate as equal.
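
A similar check can be made from within bash itself. This is a minimal sketch, assuming bash 4.1 or later, where the < and > operators of [[ ]] compare strings using the current locale's collation. collate_equal is a hypothetical helper name.

```bash
# collate_equal: hypothetical helper; succeeds when neither argument sorts
# before the other under the current locale's collation rules.
collate_equal() {
    [[ ! "$1" < "$2" && ! "$1" > "$2" ]]
}

# $'\331\240' is U+0660 (Arabic-Indic zero) written as its UTF-8 bytes.
if collate_equal 0 $'\331\240'; then
    echo "affected: '0' and U+0660 collate as equal"
else
    echo "not affected: '0' and U+0660 collate differently"
fi
```

On an affected system (glibc older than 2.28 with a UTF-8 locale) I would expect the first message, and the second one on glibc 2.28 or in the C locale.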



Looking at the glibc sources, we can understand why the problem happens.



In the glibc 2.27 sources for ISO 14651, the following definitions can be found:



<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0660> <0>;<BAS>;<MIN>;IGNORE
<U06F0> <0>;<PCL>;<MIN>;IGNORE
<U0966> <0>;"<BAS><NUM>";"<MIN><MIN>";IGNORE


So both '0' (U+0030) and '٠' (U+0660) expand to the exact same sequence (<0>;<BAS>;<MIN>;IGNORE), which means that strcoll will treat them as equal. (This also explains why other characters such as U+06F0 and U+0966 are not affected, since their expansions are different.)



Looking at the glibc 2.28 sources for ISO 14651, the following definitions are now found:



<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
<U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO
<U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO
<U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO
<U0966> <S0030>;<BASE>;<MIN>;<U0966> % DEVANAGARI DIGIT ZERO


The fourth field is now always filled with the code point itself, which means they will have a defined sort order, even if the first few fields match. While the change for <U0660> was not introduced in this particular commit, its description explains the idea:




[...] putting the code point of the character into the fourth level
instead of “IGNORE”. Without that change, all such characters
would compare equal which would make a wcscoll test case fail.
It is better to have a clearly defined sort order even for characters
like this so it is good to use the code point as a tie-break.




  • localedata/locales/iso14651_t1_common: Use the code point of a
    character in the fourth collation level instead of IGNORE for all
    entries which have IGNORE on all 4 levels.




So hopefully this explains the bug with localedata in glibc <2.28 and the fix in glibc 2.28.





Regarding bash, if you look at the source code, you'll see that it handles a single character (0) in a bracket expression ([0]) the same as if it were a range with that character as both start and end ([0-0]):



cstart = cend = FOLD (cstart);


Then later it compares the current character with that range using RANGECMP:



if (RANGECMP (test, cstart, forcecoll) >= 0 && RANGECMP (test, cend, forcecoll) <= 0)
  goto matched;


And then RANGECMP (defined to rangecmp_wc in multi-byte mode) uses wcscoll(3) (which is the wide-character version of strcoll):



return (wcscoll (s1, s2));


The fact that bash uses a range comparison for a single character (as a shortcut, to share a bit of the code with handling of ranges) makes it so that it accepts all characters that sort the same as well as the original character.
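
The effect of that shortcut can be imitated in shell. The sketch below is an emulation, not bash's actual code: matches_range is a hypothetical function that applies the same two comparisons as the C snippet above, using the locale-aware < and > of [[ ]] (bash 4.1 or later assumed) in place of wcscoll.

```bash
# matches_range: hypothetical emulation of the range test quoted above.
matches_range() {
    local test=$1 cstart=$2 cend=$3
    # test >= cstart && test <= cend, both in collation order
    [[ ! "$test" < "$cstart" && ! "$test" > "$cend" ]]
}

matches_range b a c && echo "b is inside [a-c]"
matches_range 0 0 0 && echo "0 is inside [0-0]"
# When cstart equals cend, any character that collates equal to it also
# passes both comparisons, which is how U+0660 can end up matching [0].
```

A shell that compares a listed character for plain equality would not have this property, which is consistent with the other shells not showing the issue.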



Other shells probably don't have this problem because they do a straight comparison if a range is not involved.



The reason this issue first appeared in bash 3.0 is that bash 3.0 introduced support for multi-byte characters (Unicode), which led to a refactoring of all this code and probably to the use of the locale-aware comparisons that are connected to the issue.



UPDATE: This issue was reported as a bug to the bash project by @Isaac.





WORKAROUND: If upgrading to a distro that uses glibc 2.28 is unfeasible, a possible workaround is to use LC_COLLATE=C.utf8 or POSIX.utf8 which define a "trivial" sort order where no codepoints will sort the same. Considering the issue is with collation, setting LC_COLLATE only is enough. Testing this workaround on Ubuntu 17.10 and 18.04 showed it was enough to fix this problem.
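
If C.utf8 is not installed, another option may be to switch to the plain C locale just for the expansion. This is an assumption on my part and was not part of the tests on Ubuntu 17.10 and 18.04 mentioned above: in the C locale bash matches byte by byte, and the bytes of a UTF-8 multi-byte character never coincide with ASCII digits, so only 0-9 should be removed. strip_ascii_digits is a hypothetical helper name.

```bash
# strip_ascii_digits: hypothetical helper. The body runs in a subshell
# (note the parentheses) so the locale change does not leak to the caller.
strip_ascii_digits() (
    LC_ALL=C
    printf '%s\n' "${1//[0123456789]/}"
)

strip_ascii_digits 'a1b2'    # prints: ab
```

The subshell costs a fork per call, so for tight loops it may be preferable to set LC_COLLATE once for the whole script, as described in the workaround.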







  • In the text "Looking at the glibc 2.27 sources for ISO 14651, the following definitions are now found:" the version number should be 2.28, right? (Can't edit as it's too short of a change)
    – Grisha Levit
    2 days ago










  • @GrishaLevit Indeed! Thanks for spotting that. I just edited it to fix it. Cheers!
    – Filipe Brandenburger
    2 days ago










  • Sorry, but the cstart = cend = FOLD (cstart); code only applies to a collating symbol (something written as [ [.cz.] ], that is, between [. and .] inside the brackets), not to a bracket expression in general.
    – Isaac
    2 days ago










  • @Isaac I don't think it is, I think [[.cz.]] is handled a few lines above with p = PARSE_COLLSYM (p, &pc);. But that code is pretty complicated, so I'm not 100% sure I found all the right places... I'm still fairly confident that there are some range comparisons for a single character range, since there are cases where cstart = cend, and that would explain why characters that collate the same would look like a "match". That other shells probably implement that differently would explain why they wouldn't be affected by the issue.
    – Filipe Brandenburger
    2 days ago










  • Yes, it seems that it should apply to all characters inside a [ ]. I just don't understand why. A collating table should be a total order, or the confirmation that a character is absent will fail. If a and b sort equal, then a sorted list a a c doesn't confirm that b is absent from the list, as the sort order could have been b a c.
    – Isaac
    2 days ago











Your Answer








StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f483743%2fwhy-is-bash-removing-other-digits%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
3
down vote













I managed to reproduce this problem on Ubuntu 17.10 (glibc 2.26) and on Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc 2.28)



The problem is with the localedata, more specifically the LC_COLLATE data for en_US.utf8 (actually, that collation data comes from an ISO 14651 file which is included in most locales, so it probably affects all other utf8 locales as well.)



The localedata comes from glibc and the bug seems to be present there (though distros customize that data fairly heavily, so it's possible other distros with glibc <2.28 might not have the issue.)



In fact, the glibc 2.28 announcement starts listing new features with:




The localization data for ISO 14651 is updated to match the 2016
Edition 4 release of the standard, this matches data provided by
Unicode 9.0.0. This update introduces significant improvements to the
collation of Unicode characters.




Looking at the commits, it's a huge overhaul on the localedata, so that's probably what fixed the bug!



In short, the issue with collation of these two symbols (U0030, which is '0', and U0660, which is the Arabic-Indic zero '٠') is that they sort exactly the same, when compared using strcoll(3), which can be demonstrated with this short test using sort (which uses strcoll under the hood):



ubuntu-18.04$ { echo 0; echo -e 'u0660'; echo 0; } | sort
0
٠
0


And on glibc 2.28:



ubuntu-18.10$ { echo 0; echo -e 'u0660'; echo 0; } | sort
0
0
٠


As you can see, on the older glibc, it's not reordering the Arabic-Indic zero '٠', neither before nor after the '0', which proves they collate the same.



Looking at the glibc sources, we can understand why the problem happens.



In the glibc 2.27 sources for ISO 14651, the following definitions can be found:



<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0660> <0>;<BAS>;<MIN>;IGNORE
<U06F0> <0>;<PCL>;<MIN>;IGNORE
<U0966> <0>;"<BAS><NUM>";"<MIN><MIN>";IGNORE


So both '0' (u0030) and '٠' (u0660) expand to the exact same sequence (<0>;<BAS>;<MIN>;IGNORE) which means that strcoll will treat them the same. (This also explains why the other characters such as u06f0 and u0966 are not affected, since their expansion is different.)



Looking at the glibc 2.28 sources for ISO 14651, the following definitions are now found:



<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
<U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO
<U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO
<U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO
<U0966> <S0030>;<BASE>;<MIN>;<U0966> % DEVANAGARI DIGIT ZERO


The fourth field is now always filled with the code point itself, which means they will have a defined sort order, even if the first few fields match. While the change for <U0660> was not introduced in this particular commit, its description explains the idea:




[...] putting the code point of the character into the fourth level
instead of “IGNORE”. Without that change, all such characters
would compare equal which would make a wcscoll test case fail.
It is better to have a clearly defined sort order even for characters
like this so it is good to use the code point as a tie-break.




  • localedata/locales/iso14651_t1_common: Use the code point of a
    character in the fourth collation level instead of IGNORE for all
    entries which have IGNORE on all 4 levels.




So hopefully this explains the bug with localedata in glibc <2.28 and the fix in glibc 2.28.





Regarding bash, if you look at the source code, you'll see that it handles a single character (0) in a bracket expression ([0]) the same as if it was a range with the character as both start and end ([0-0]):



cstart = cend = FOLD (cstart);


Then later it compares the current character with that range using RANGECMP:



if (RANGECMP (test, cstart, forcecoll) >= 0 && RANGECMP (test, cend, forcecoll) <= 0)
goto matched;


And then RANGECMP (defined to rangecmp_wc in multi-byte mode) uses wcscoll(3) (which is the multi-byte version of strcoll):



return (wcscoll (s1, s2));


The fact that bash uses a range comparison for a single character (as a shortcut, to share a bit of the code with handling of ranges) makes it so that it accepts all characters that sort the same as well as the original character.



Other shells probably don't have this problem because they do a straight comparison if a range is not involved.



The reason why this issue started appearing on bash 3.0 is that bash 3.0 introduced support for multi-byte (Unicode), which ended up refactoring all this code and probably using locale-aware comparisons, which are connected to the issue.



UPDATE: This issue was reported as a bug to the bash project by @Isaac.





WORKAROUND: If upgrading to a distro that ships glibc 2.28 is not feasible, a possible workaround is to set LC_COLLATE=C.utf8 or POSIX.utf8, which define a "trivial" sort order in which no two code points sort the same. Since the issue is with collation, setting LC_COLLATE alone is enough. Testing on Ubuntu 17.10 and 18.04 showed that this workaround fixes the problem.
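The workaround can be scoped to a single expansion rather than the whole session. This is a minimal sketch under stated assumptions: the function name strip_ascii_digits is hypothetical, and it uses the plain C locale for LC_COLLATE (assumed here to order characters as trivially as C.utf8 does, and chosen because it exists on every system). LC_ALL is unset inside the subshell because it would otherwise override LC_COLLATE.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove only the ASCII digits 0-9 from "$1".
# The body runs in a subshell ( ... ), so the locale change does not
# leak into the calling shell.
strip_ascii_digits() (
    unset LC_ALL        # LC_ALL, if set, takes precedence over LC_COLLATE
    LC_COLLATE=C        # trivial collation: no two distinct characters compare equal
    printf '%s\n' "${1//[0123456789]}"
)

a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹'
strip_ascii_digits "$a"
```

With this in place, only the first group of digits should be removed; the Arabic-Indic group that the original command also stripped is left intact.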






  • In the text "Looking at the glibc 2.27 sources for ISO 14651, the following definitions are now found:" the version number should be 2.28, right? (Can't edit as it's too short of a change)
    – Grisha Levit
    2 days ago










  • @GrishaLevit Indeed! Thanks for spotting that. I just edited it to fix it. Cheers!
    – Filipe Brandenburger
    2 days ago










  • Sorry, but the cstart = cend = FOLD (cstart); code only applies to a collating symbol (something written as [[.cz.]], that is, between [. and .] inside the brackets), not to bracket expressions in general.
    – Isaac
    2 days ago










  • @Isaac I don't think it is, I think [[.cz.]] is handled a few lines above with p = PARSE_COLLSYM (p, &pc);. But that code is pretty complicated, so I'm not 100% sure I found all the right places... I'm still fairly confident that there are some range comparisons for a single character range, since there are cases where cstart = cend, and that would explain why characters that collate the same would look like a "match". That other shells probably implement that differently would explain why they wouldn't be affected by the issue.
    – Filipe Brandenburger
    2 days ago










  • Yes, it seems that it should apply to all characters inside a [ ]. I just don't understand why. A collating table should be a total order; otherwise, confirming that a character is absent will fail. If a and b sort equal, then a sorted list reading a a c doesn't confirm that b is absent from the list, since the sort order could just as well have been b a c.
    – Isaac
    2 days ago















edited 2 hours ago

























answered 2 days ago









Filipe Brandenburger












