Why does string.Compare("A","a") return 1?

Question
Answer
-
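Return value of string.Compare(StrA, StrB):
1. Less than zero: StrA precedes StrB in the sort order.
2. Zero: StrA occurs in the same position as StrB in the sort order.
3. Greater than zero: StrA follows StrB in the sort order.
Whether the result is negative or positive therefore depends on what the sort order actually is: Unicode code point order, or something else. Judging from the .NET documentation, the sort order is most likely derived from a weight table.
In the Windows 10 version of the table, each weight consists of 4 bytes: Script Member (SM), Alphabetic Weight (AW), Diacritic Weight (DW), and Case Weight (CW).
SM and AW together form the 2-byte Unicode Weight (UW); in East Asian locales DW is also promoted into the UW, giving a 3-byte weight.
The Unicode Weight is compared first, then the diacritic weight, and finally the case weight.
As the figure shows, "A" and "a" have the same UW and diacritic weight, "0E 02 02", so after the first two levels the characters in the first position of the two strings tie (a result of zero so far).
The CW values then break the tie: "A" is 0x12 and "a" is 0x02. Since 0x02 precedes 0x12, "a" sorts before "A", so when case is not ignored string.Compare("A","a") returns a value greater than zero.
- Marked as answer by 朋友的朋友, January 27, 2021 3:24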
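The level-by-level comparison described in this answer can be sketched as follows. This is a minimal model in Python (chosen only because it is easy to run; it does not call .NET), using the two weight entries quoted above. The `WEIGHTS` table and the `compare` function are hypothetical names for illustration, and the real NLS implementation handles many more cases.

```python
# Illustrative model of the multi-level comparison described above.
# (SM, AW, DW, CW) values are the ones quoted in the answer, in hex.
WEIGHTS = {
    "A": (0x0E, 0x02, 0x02, 0x12),
    "a": (0x0E, 0x02, 0x02, 0x02),
}

def compare(str_a, str_b, ignore_case=False):
    """Return -1, 0 or 1: Unicode Weight first, then diacritic, then case."""
    wa = [WEIGHTS[ch] for ch in str_a]
    wb = [WEIGHTS[ch] for ch in str_b]
    levels = [
        lambda w: (w[0], w[1]),  # Unicode Weight: SM + AW
        lambda w: w[2],          # Diacritic Weight
    ]
    if not ignore_case:
        levels.append(lambda w: w[3])  # Case Weight
    for level in levels:
        ka = [level(w) for w in wa]
        kb = [level(w) for w in wb]
        if ka != kb:
            return -1 if ka < kb else 1
    return 0

print(compare("A", "a"))                    # 1
print(compare("A", "a", ignore_case=True))  # 0
print(compare("a", "A"))                    # -1
```

Only the third level distinguishes the two strings, which is why the result flips to zero once the case weight is skipped.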
All replies
-
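Hi 朋友的朋友,
The String.Compare method compares the positions of two strings in the sort order: it returns a value less than zero if the first string precedes the second, a value greater than zero if it follows the second, and zero if the two occupy the same position.
In Unicode code point order, A is at U+0041 and a is at U+0061.
For the picture above, you can search Wikipedia for the list of Unicode characters to look up the position of any specific character.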
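As a quick check of the code point values, here is a small Python sketch. `ordinal_compare` is a hypothetical helper that compares by code point only; it is not how the default string.Compare overload orders strings. Note that pure code point order puts "A" before "a", which by itself would predict a negative result rather than the positive one observed.

```python
# Code points of the two characters, in decimal and hexadecimal.
for ch in ("A", "a"):
    print(ch, ord(ch), f"U+{ord(ch):04X}")

def ordinal_compare(str_a, str_b):
    """Compare by code point only, ignoring any linguistic weight table."""
    ka = [ord(ch) for ch in str_a]
    kb = [ord(ch) for ch in str_b]
    return (ka > kb) - (ka < kb)

print(ordinal_compare("A", "a"))  # -1: "A" (U+0041) precedes "a" (U+0061)
```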
Best Regards,
Jack
MSDN Community Support
- Proposed as answer by ThankfulHeartModerator, November 9, 2020 10:15
- Marked as answer by 朋友的朋友, January 21, 2021 0:06
- Unmarked as answer by 朋友的朋友, January 21, 2021 0:36
-
; Weights are 4 bytes: SM, AW, DW & CW. Together the SM + AW are "Unicode Weight"
; SM - Script Member
; AW - Alphabetic Weight
; DW - Diacritic Weight
; CW - Case Weight
;
; For the most part, strings are sorted with all of the "Unicode Weights" being primary, then
; the diacritic and case weights as being secondary and tertiary weights. (a few more extra
; weights follow that).
;
; The sort keys generated are the "Unicode Weight" of the entire string, delimited by 01, then
; the Diacritic (or secondary) weight, then another 01 delimiter, then the Case (or tertiary)
; weight. The code also adds two more sections, again delimited by 01. Weights are 0 terminated.
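The sort key layout quoted above (Unicode Weights of the whole string, a 01 delimiter, the diacritic weights, another 01, the case weights, two more 01-delimited sections, and a 00 terminator) can be sketched in Python. `build_sort_key` is a hypothetical helper, not an NLS API. Dropping trailing weights of 2 from a section is an assumption inferred from the Äpple example in the Unisort.txt header, where the diacritic and case sections hold only the first character's weight.

```python
def build_sort_key(weights):
    """weights: list of (SM, AW, DW, CW) tuples, one per character."""
    def strip_trailing_default(values):
        # Assumption: trailing minimum weights (2) are omitted from a section.
        values = list(values)
        while values and values[-1] == 2:
            values.pop()
        return values

    key = []
    for sm, aw, _, _ in weights:   # Unicode Weight of the entire string
        key += [sm, aw]
    key.append(0x01)               # delimiter
    key += strip_trailing_default(w[2] for w in weights)  # diacritic section
    key.append(0x01)               # delimiter
    key += strip_trailing_default(w[3] for w in weights)  # case section
    key += [0x01, 0x01]            # two further sections, left empty here
    key.append(0x00)               # terminator
    return bytes(key)

upper_a = build_sort_key([(0x0E, 0x02, 0x02, 0x12)])
lower_a = build_sort_key([(0x0E, 0x02, 0x02, 0x02)])
print(upper_a.hex(" "))   # 0e 02 01 01 12 01 01 00
print(lower_a.hex(" "))   # 0e 02 01 01 01 01 00
print(upper_a > lower_a)  # True
```

Under these assumptions the two keys first differ in the case section (0x12 versus the 01 delimiter), so a plain byte-wise comparison of the keys puts "A" after "a".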
-
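I'm a bit curious: where do all these tables of yours come from?
Microsoft's official website offers the Windows 10 Sorting Weight Table file for download; it is about 20 MB. The beginning of the file reads as follows: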
; Unisort.txt
; This file contains all the NLS sorting data. It is UTF-8 for some of the comments
;
; The initial table is our default table (nominally GUID 00000001-57EE-1E5C-00B4-D0000BB1E11E)
; Weights are 4 bytes: SM, AW, DW & CW. Together the SM + AW are "Unicode Weight"
; SM - Script Member
; AW - Alphabetic Weight
; DW - Diacritic Weight
; CW - Case Weight
;
; For the most part, strings are sorted with all of the "Unicode Weights" being primary, then
; the diacritic and case weights as being secondary and tertiary weights. (a few more extra
; weights follow that).
;
; The sort keys generated are the "Unicode Weight" of the entire string, delimited by 01, then
; the Diacritic (or secondary) weight, then another 01 delimiter, then the Case (or tertiary)
; weight. The code also adds two more sections, again delimited by 01. Weights are 0 terminated.
;
; So for Äpple the weights would be 0e 02 0e 7e 0e 7e 0e 48 0e 21 01 13 01 12 01 01 00
; 0e 02 0e 7e 0e 7e 0e 48 0e 21 "apple" Unicode Weight
; 01 delimeter
; 13 "ä" diacritic weight
; 01
; 12 "A" case weight
; 01 01 00 (two sections with no weight in this string and terminator)
;
; Special cases:
; Our sort keys are delimited by 1 and terminated by 0, so no weights by use 0 or 1.
; In these tables, 0 0 0 0 is used as a flag for "no weight". Those characters are effectively
; ignored by NLS.
;
; East Asian locales may have "3 byte weights". In that case, DW is promoted to be included
; in UW, so UW then is SM, AW, and DW. This only happens if 192 <= SW <= 239.
; PUA are always 3 byte weights
;
; SM values:
; 0 - reserved for no weight code points
; 1 - flag for non-spacing marks, those weights are added to the previous base character's weights
; (Note: DW are a minimum of 2, so all characters have 2 added to their DW, however
; non-spacing characters presume the base already had that 2, so these weights are only
; the additional diacritic difference, without the base 2
; example: A has a DW of 2, ̈ has a DW of 17, so Ä has a DW of 19 (2 + 17))
; 2 - Unused ??? reserved because our minimums are 2 (Does SortGetSortKey generate any 2s?)
; 3 -
; 14 - Latin script
;
; AW values:
; 0-1 - reserved for delimiter/terminator
; 2 - "first letter in script". Eg: the AW for A.
;
; DW Values:
; 0-1 - reserved for delimiter/terminator
;
; CW Values:
; 0-1 - reserved for delimiter/terminator
; 0x01 bit - Full Width (if set). (1 == Full Width, 0 == Half/Normal Width)
; 0x02 bit - Set by default, can be cleared in some case (if another higher bit is set)
; 0x04 bit - Super/Subscript?
; 0x08 bit -
; 0x10 bit - Upper Case. (16 == upper case, 0 == lower case)
; 0x20 bit -
; 0x40/0x80 bits - Reserved for nlstrans, which uses these as flags for characters that may compress
;
; Flags to NLSTrans
; After the GUID
; HAS_3_BYTE_WEIGHTS - This ID has 3 byte weights. Must be tagged on the EXCEPTION AND on the COMPRESSION
; LINGUISTIC_CASING - Tagged (only) on the EXCEPTION table that applies to the linguistic case.
; Should also have another untagged EXCEPTION table for non-linguistic casing.
; SORTKEY DEFAULT
; Characters in this table are sorted by: SM, AW, DW, CW, Codepoint #
; Weightless code points (these will be ignored when comparing strings)
0x0000 0 0 0 0 ;<control>
0x00ad 0 0 0 0 ;Soft Hyphen
0x034f 0 0 0 0 ;Combining Grapheme Joiner
0x0640 0 0 0 0 ;Arabic Tatweel
0x0ecc 0 0 0 0 ;Lao Cancellation Mark
0x1806 0 0 0 0 ;Mongolian Todo Soft Hyphen
0x180b 0 0 0 0 ;Mongolian Free Variation Selector One
0x180c 0 0 0 0 ;Mongolian Free Variation Selector Two
0x180d 0 0 0 0 ;Mongolian Free Variation Selector Three
0x200c 0 0 0 0 ;Zero Width Non-joiner
0x200d 0 0 0 0 ;Zero Width Joiner
0x200e 0 0 0 0 ;Left-to-right Mark
0x200f 0 0 0 0 ;Right-to-left Mark
0x202a 0 0 0 0 ;Left-to-right Embedding
0x202b 0 0 0 0 ;Right-to-left Embedding
0x202c 0 0 0 0 ;Pop Directional Formatting
0x202d 0 0 0 0 ;Left-to-right Override
0x202e 0 0 0 0 ;Right-to-left Override
0x2060 0 0 0 0 ;Word Joiner
0x2061 0 0 0 0 ;Function Application
0x2062 0 0 0 0 ;Invisible Times
0x2063 0 0 0 0 ;Invisible Separator
0x2064 0 0 0 0 ;INVISIBLE PLUS
0x206a 0 0 0 0 ;Inhibit Symmetric Swapping
0x206b 0 0 0 0 ;Activate Symmetric Swapping
0x206c 0 0 0 0 ;Inhibit Arabic Form Shaping
0x206d 0 0 0 0 ;Activate Arabic Form Shaping
0x206e 0 0 0 0 ;National Digit Shapes
0x206f 0 0 0 0 ;Nominal Digit Shapes
0x3190 0 0 0 0 ;Ideographic Annotation Linking Mark
0x3191 0 0 0 0 ;Ideographic Annotation Reverse Mark
0xdb40 0 0 0 0 ;Variation Selector High Surrogate in range 0xE0100-0xE01EF
0xfe00 0 0 0 0 ;Variation Selector-1
0xfe01 0 0 0 0 ;Variation Selector-2
0xfe02 0 0 0 0 ;Variation Selector-3
0xfe03 0 0 0 0 ;Variation Selector-4
0xfe04 0 0 0 0 ;Variation Selector-5
0xfe05 0 0 0 0 ;Variation Selector-6
0xfe06 0 0 0 0 ;Variation Selector-7
0xfe07 0 0 0 0 ;Variation Selector-8
0xfe08 0 0 0 0 ;Variation Selector-9
0xfe09 0 0 0 0 ;Variation Selector-10
0xfe0a 0 0 0 0 ;Variation Selector-11
0xfe0b 0 0 0 0 ;Variation Selector-12
0xfe0c 0 0 0 0 ;Variation Selector-13
0xfe0d 0 0 0 0 ;Variation Selector-14
0xfe0e 0 0 0 0 ;Variation Selector-15
0xfe0f 0 0 0 0 ;Variation Selector-16
0xfeff 0 0 0 0 ;ZWNBSP - Byte Order Mark
0xfff9 0 0 0 0 ;Interlinear Annotation Anchor
0xfffa 0 0 0 0 ;Interlinear Annotation Separator
0xfffb 0 0 0 0 ;Interlinear Annotation Terminator
0xfffc 0 0 0 0 ;Object Replacement Character
0xfffd 0 0 0 0 ;Replacement Character
; 1 - flag for non-spacing marks, these weights are added to the previous base character's weights
; (Note: DW are a minimum of 2, so all characters have 2 added to their DW, however
; non-spacing characters presume the base already had that 2, so these weights are only
; the additional diacritic difference, without the base 2
; example: A has a DW of 2, ̈ has a DW of 17, so Ä has a DW of 19 (2 + 17))
; We ignore leading/trailing DWs of 2 when we construct the sort key in order to save memory.
;CP SCRIPT ALPHA DIACRITIC CASING COMMENT
;0x0b70 had a DW of 0 in sort version 00060304, but we ignore leading/trailing DWs <= 2. Hence we changed the DW to 3 in version 00060305.
0x0b70 1 0 3 0 ;Odia Isshar
;The following six characters had a DW of 1 in sort version 00060304, but we ignore leading/trailing DWs <= 2. Hence we changed the DW to 3 in version 00060305.
0x303c 1 0 3 0 ;Masu Mark
0x303d 1 0 3 0 ;Part Alternation Mark
0x303e 1 0 3 0 ;Ideographic Variation Indicator
0x3099 1 0 3 0 ;Non-Spacing Kana Daku-On
0x309b 1 0 3 0 ;Kana Daku-On
0xff9e 1 0 3 0 ;Halfwidth Kana Daku-On
;0x05a2 had a DW of 2 in sort version 00060304, but we ignore leading/trailing DWs <= 2. Hence we changed the DW to 3 in version 00060305.
0x05a2 1 0 3 0 ;Hebrew Accent Atnah Hafukh
;The following characters had a DW of 2 in sort version 00060304, but we ignore leading/trailing DWs <= 2.
;Nonspacing marks in the same script are already taking DW 3, hence we changed the DW to 4 in version 00060305.
0x309a 1 0 4 0 ;Non-Spacing Kana HanDaku-On
0x309c 1 0 4 0 ;Kana HanDaku-On
0xff9f 1 0 4 0 ;Halfwidth Kana HanDaku-On
;The following five characters had a DW of 3 prior to updating the above DWs.
0x030d 1 0 3 0 ;Non-Spacing Vertical Line Above
0x0e48 1 0 3 0 ;Thai Tone Mai Ek
0x1a74 1 0 3 0 ;TAI THAM SIGN MAI KANG
0x1c37 1 0 3 0 ;LEPCHA SIGN NUKTA
0xa92b 1 0 3 0 ;KAYAH LI TONE PLOPHU
0x0591 1 0 4 0 ;Hebrew Accent Etnahta
0x09bc 1 0 4 0 ;Bengali Sign Nukta
0x0a3c 1 0 4 0 ;Gurmukhi Sign Nukta
0x0b3c 1 0 4 0 ;Odia Sign Nukta
0x0cbc 1 0 4 0 ;Kannada Sign Nukta
0x0cd5 1 0 4 0 ;Kannada Length Mark
0x0e49 1 0 4 0 ;Thai Tone Mai Tho
0x1a75 1 0 4 0 ;TAI THAM SIGN TONE-1
0xa92c 1 0 4 0 ;KAYAH LI TONE CALYA
0x0592 1 0 5 0 ;Hebrew Accent Tipeha
0x0cd6 1 0 5 0 ;Kannada Ai Length Mark
0x0e4a 1 0 5 0 ;Thai Tone Mai Tri
0x1a76 1 0 5 0 ;TAI THAM SIGN TONE-2
0xa92d 1 0 5 0 ;KAYAH LI TONE CALYA PLOPHU
0x0593 1 0 6 0 ;Hebrew Accent Dehi
0x093c 1 0 6 0 ;Devanagari Sign Nukta
0x0a71 1 0 6 0 ;Gurmukhi Addak
0x0e4b 1 0 6 0 ;Thai Tone Mai Chattawa
0x1a77 1 0 6 0 ;TAI THAM SIGN KHUEN TONE-3
0x0594 1 0 7 0 ;Hebrew Accent Mahapakh
0x0971 1 0 7 0 ;DEVANAGARI SIGN HIGH SPACING DOT
0x0e4d 1 0 7 0 ;Thai Nikkhahit
0x1a78 1 0 7 0 ;TAI THAM SIGN KHUEN TONE-4
0x302a 1 0 7 0 ;Ideographic Level Tone Mark
0x302e 1 0 7 0 ;Hangul Single Dot Tone Mark
0x0595 1 0 8 0 ;Hebrew Accent Yetiv
0x0e47 1 0 8 0 ;Thai Vowel Sign Mai Tai Khu
0x1a79 1 0 8 0 ;TAI THAM SIGN KHUEN TONE-5
0x302b 1 0 8 0 ;Ideographic Rising Tone Mark
0x302f 1 0 8 0 ;Hangul Double Dot Tone Mark
0x0596 1 0 9 0 ;Hebrew Accent Tevir
0x0e4c 1 0 9 0 ;Thai Thanthakhat
0x1a7a 1 0 9 0 ;TAI THAM SIGN RA HAAM
0x302c 1 0 9 0 ;Ideographic Departing Tone Mark
0x0597 1 0 10 0 ;Hebrew Accent Munah
0x07EB 1 0 10 0 ;NKO COMBINING SHORT HIGH TONE
0x0951 1 0 10 0 ;Devanagari Stress Sign Udatta
0x0a01 1 0 10 0 ;Gurmukhi Sign Adak Bindi
0x0a02 1 0 10 0 ;Gurmukhi Sign Bindi
0x1a7b 1 0 10 0 ;TAI THAM SIGN MAI SAM
0x0598 1 0 11 0 ;Hebrew Accent Merkha
0x07EC 1 0 11 0 ;NKO COMBINING SHORT LOW TONE
0x0abc 1 0 11 0 ;Gujarati Sign Nukta
0x0f84 1 0 11 0 ;Tibetan Mark Halanta
0x1a7c 1 0 11 0 ;TAI THAM SIGN KHUEN-LUE KARAN
0x2d7f 1 0 11 0 ;TIFINAGH CONSONANT JOINER
0x302d 1 0 11 0 ;Ideographic Entering Tone Mark
0x02b9 1 0 12 0 ;Modifier Prime
0x0301 1 0 12 0 ;Non-Spacing Acute Accent
0x0341 1 0 12 0 ;Non-Spacing Acute Tone Mark
0x0599 1 0 12 0 ;Hebrew Accent Merkha Kefula
0x07ED 1 0 12 0 ;NKO COMBINING SHORT RISING TONE
0x0952 1 0 12 0 ;Devanagari Stress Sign Anudatta
0x1a7f 1 0 12 0 ;TAI THAM COMBINING CRYPTOGRAMMIC DOT
0x0300 1 0 13 0 ;Non-Spacing Grave Accent
0x0340 1 0 13 0 ;Non-Spacing Grave Tone Mark
0x059a 1 0 13 0 ;Hebrew Accent Darga
0x07EE 1 0 13 0 ;NKO COMBINING LONG DESCENDING TONE
0x0c55 1 0 13 0 ;Telugu Length Mark
0x0f71 1 0 13 0 ;Tibetan Vowel Sign Aa
0x0307 1 0 14 0 ;Non-Spacing Dot Above
0x059b 1 0 14 0 ;Hebrew Accent Yerah Ben Yomo
0x07EF 1 0 14 0 ;NKO COMBINING LONG HIGH TONE
0x0953 1 0 14 0 ;Devanagari Grave Accent
0x0a70 1 0 14 0 ;Gurmukhi Tippi
0x0c56 1 0 14 0 ;Telugu Ai Length Mark
0x059c 1 0 15 0 ;Hebrew Accent Segol
0x07F0 1 0 15 0 ;NKO COMBINING LONG LOW TONE
0x0f39 1 0 15 0 ;Tibetan Mark Tsa -phru
0x0302 1 0 16 0 ;Non-Spacing Circumflex
0x059d 1 0 16 0 ;Hebrew Accent Shalshelet
0x07F1 1 0 16 0 ;NKO COMBINING LONG RISING TONE
0x0954 1 0 16 0 ;Devanagari Acute Accent
0x0308 1 0 17 0 ;Non-Spacing Diaeresis
0x059e 1 0 17 0 ;Hebrew Accent Zaqef Qatan
0x07F2 1 0 17 0 ;NKO COMBINING NASALIZATION MARK
0x0f7f 1 0 17 0 ;Tibetan Sign Rnam Bcad
0x030c 1 0 18 0 ;Non-Spacing Caron (Hacek)
0x059f 1 0 18 0 ;Hebrew Accent Zaqef Gadol
0x0306 1 0 19 0 ;Non-Spacing Breve
0x05a0 1 0 19 0 ;Hebrew Accent Revia
0x07F3 1 0 19 0 ;NKO COMBINING DOUBLE DOT ABOVE
0x0f85 1 0 19 0 ;Tibetan Mark Paluta
0x035d 1 0 20 0 ;Combining Double Breve
0x0304 1 0 21 0 ;Non-Spacing Macron
0x035e 1 0 21 0 ;Combining Double Macron
0x035f 1 0 21 0 ;Combining Double Macron Below
0x05a1 1 0 21 0 ;Hebrew Accent Zarqa
0x0f88 1 0 21 0 ;Tibetan Sign Lce Tsa Can
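
To make the file excerpt above easier to read, here is a small Python sketch that parses one table entry, adds a non-spacing mark's diacritic weight to a base character, and decodes the case weight bits listed in the header. `parse_entry` and `describe_case_weight` are hypothetical helpers written for this thread, not part of any Microsoft tool. The sketch assumes the numeric columns are decimal, which agrees with the header's "2 + 17 = 19" example and the hex 13 in the Äpple sort key.

```python
def parse_entry(line):
    """Parse one table line such as '0x0308 1 0 17 0 ;Non-Spacing Diaeresis'."""
    fields, _, comment = line.partition(";")
    cp, sm, aw, dw, cw = fields.split()
    return {
        "codepoint": int(cp, 16),
        "SM": int(sm), "AW": int(aw), "DW": int(dw), "CW": int(cw),
        "name": comment.strip(),
    }

def describe_case_weight(cw):
    """Name the CW bits that the file header documents."""
    flags = []
    if cw & 0x01:
        flags.append("full width")
    if cw & 0x02:
        flags.append("default bit")
    if cw & 0x10:
        flags.append("upper case")
    return flags

diaeresis = parse_entry("0x0308 1 0 17 0 ;Non-Spacing Diaeresis")
BASE_A_DW = 2  # DW of "A", as stated in the file header
combined_dw = BASE_A_DW + diaeresis["DW"]

print(hex(combined_dw))            # 0x13, the diacritic byte in the Äpple key
print(describe_case_weight(0x12))  # ['default bit', 'upper case']
print(describe_case_weight(0x02))  # ['default bit']
```

Read this way, the CW of "A" (0x12) is the default bit plus the upper case bit, while the CW of "a" (0x02) is the default bit alone.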