Why does string.Compare("A","a") return 1?

Answers

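  • Return value of string.Compare(StrA, StrB):
    1. Less than zero: StrA precedes StrB in the sort order.
    2. Zero: StrA occurs in the same position as StrB in the sort order.
    3. Greater than zero: StrA follows StrB in the sort order.

    Whether the result is less than or greater than zero therefore depends on what the sort order is: Unicode code point order, or something else.

    Judging from the .NET documentation, the sort order is most likely derived from a weight table.
    In the Windows 10 version of the table, each weight consists of 4 bytes: Script Member (SM), Alphabetic Weight (AW), Diacritic Weight (DW), and Case Weight (CW).
    SM and AW together form the 2-byte Unicode Weight (UW); in East Asian locales DW is also promoted into UW, giving a 3-byte weight.
    The Unicode Weights are compared first, then the diacritic weights, and finally the case weights.

    In the weight table, "A" and "a" have the same UW and diacritic weight, "0E 02 02", so the characters at the first position tie at these levels (a result of zero so far).
    The CW values are then compared: "A" is 12 and "a" is 2 (hexadecimal sort-key bytes). 2 precedes 12 and 12 follows 2, so when case is not ignored, string.Compare("A","a") returns a value greater than zero.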
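
The tiered comparison described above can be sketched roughly as follows. This is a minimal Python model, not the .NET or NLS implementation: the two-entry `WEIGHTS` table and the `compare` function are hypothetical, and only the weight values for "A" and "a" come from this post.

```python
# (SM, AW, DW, CW) weights as quoted in this thread for "A" and "a".
# This two-entry table is illustrative; the real NLS table is far larger.
WEIGHTS = {
    "A": (0x0E, 0x02, 0x02, 0x12),
    "a": (0x0E, 0x02, 0x02, 0x02),
}


def compare(str_a, str_b):
    """Hypothetical tiered comparison: Unicode Weight, then DW, then CW."""
    wa = [WEIGHTS[c] for c in str_a]
    wb = [WEIGHTS[c] for c in str_b]
    tiers = (
        lambda w: (w[0], w[1]),  # Unicode Weight = SM + AW (primary)
        lambda w: w[2],          # Diacritic Weight (secondary)
        lambda w: w[3],          # Case Weight (tertiary)
    )
    for tier in tiers:
        ka = [tier(w) for w in wa]
        kb = [tier(w) for w in wb]
        if ka != kb:
            # The first tier that differs decides the result.
            return 1 if ka > kb else -1
    return 0


print(compare("A", "a"))  # 1: tie on UW and DW, then CW 0x12 follows 0x02
print(compare("a", "A"))  # -1
print(compare("A", "A"))  # 0
```

The model only decides the sign; the documentation promises "less than zero" or "greater than zero", not specifically -1 or 1.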

    Tuesday, 26 January 2021 5:53 AM

All replies

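  • Hi 朋友的朋友,

    The String.Compare method compares the positions of two strings: if the first string comes before the second, it returns 1; otherwise it returns 0; and if the two are in the same position, it returns …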

    In Unicode code point order, A is at U+0041 and a is at U+0061.

    To find the position of a specific character, search Wikipedia for the list of Unicode characters.

    Best Regards,

    Jack



    Monday, 2 November 2020 2:04 AM
    Moderator
  • Thanks, I get it now: it uses Unicode order. Influenced by other languages, I had always assumed it compared ASCII code values.
    Thursday, 21 January 2021 12:08 AM
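  • I read the MSDN documentation again and now I'm a bit confused. It seems to say that when A comes before B, the return value is less than 0.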
    Thursday, 21 January 2021 12:36 AM
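  • The String.Compare method does not compare the positions of two strings;

    it compares the characters at corresponding positions in the two strings.

    If the characters at every corresponding position are the same, String.Compare returns 0.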

    Thursday, 21 January 2021 1:42 AM
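  • Thank you for the reply. The case where it returns 0 is now fairly clear. What is still unclear is, for a single character, when this method returns 1 and when it returns -1.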
    Thursday, 21 January 2021 2:03 AM
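  • Whether the function returns 1 or -1 is determined by the compiler. I suggest finding and analyzing the source code of the Compare function directly, which is more reliable than any documentation.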
    Thursday, 21 January 2021 8:47 AM
  • ; Weights are 4 bytes: SM, AW, DW & CW.  Together the SM + AW are "Unicode Weight"
    ;   SM - Script Member
    ;   AW - Alphabetic Weight
    ;   DW - Diacritic Weight
    ;   CW - Case Weight
    ;
    ; For the most part, strings are sorted with all of the "Unicode Weights" being primary, then
    ; the diacritic and case weights as being secondary and tertiary weights.  (a few more extra
    ; weights follow that).
    ;
    ; The sort keys generated are the "Unicode Weight" of the entire string, delimited by 01, then
    ; the Diacritic (or secondary) weight, then another 01 delimiter, then the Case (or tertiary)
    ; weight.  The code also adds two more sections, again delimited by 01.  Weights are 0 terminated.
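
The sort-key layout in the comment above can be sketched in Python as follows. This is a hedged illustration, not the NLS code: `WEIGHTS`, `trim`, and `sort_key` are hypothetical names, the table holds only the two characters discussed in this thread, and dropping trailing minimum weights (2) from the diacritic and case sections is an assumption inferred from the Äpple example in the Unisort.txt header.

```python
# Illustrative (SM, AW, DW, CW) weights for the two characters in this thread.
WEIGHTS = {
    "A": (0x0E, 0x02, 0x02, 0x12),
    "a": (0x0E, 0x02, 0x02, 0x02),
}
MIN_WEIGHT = 0x02  # assumed minimum weight, trimmed when trailing


def trim(section):
    """Drop trailing minimum weights (an assumption, see the lead-in)."""
    while section and section[-1] == MIN_WEIGHT:
        section = section[:-1]
    return section


def sort_key(text):
    """Build UW, 01, DW, 01, CW, 01, 01, 00 for the given text."""
    w = [WEIGHTS[c] for c in text]
    unicode_weights = [b for sm, aw, _, _ in w for b in (sm, aw)]
    diacritic = trim([dw for _, _, dw, _ in w])
    case = trim([cw for _, _, _, cw in w])
    # Two further sections are left empty here, then the 00 terminator.
    return bytes(unicode_weights + [1] + diacritic + [1] + case + [1, 1, 0])


def hexdump(key):
    return " ".join(f"{b:02x}" for b in key)


print(hexdump(sort_key("A")))            # 0e 02 01 01 12 01 01 00
print(hexdump(sort_key("a")))            # 0e 02 01 01 01 01 00
print(sort_key("A") > sort_key("a"))     # True
```

Comparing the two keys byte by byte, the first difference is in the case section (12 against the 01 delimiter), which is why "A" sorts after "a" in this model.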
    Monday, 25 January 2021 7:45 AM
  • Just curious: where do all of these tables come from?
    Wednesday, 27 January 2021 3:26 AM
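  • Just curious: where do all of these tables come from?

    The Windows 10 Sorting Weight Table file can be downloaded from Microsoft's official site; it is about 20 MB. The beginning of the file reads as follows: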

    ; Unisort.txt
    ; This file contains all the NLS sorting data.  It is UTF-8 for some of the comments
    ;
    ; The initial table is our default table (nominally GUID 00000001-57EE-1E5C-00B4-D0000BB1E11E)
    ; Weights are 4 bytes: SM, AW, DW & CW.  Together the SM + AW are "Unicode Weight"
    ;   SM - Script Member
    ;   AW - Alphabetic Weight
    ;   DW - Diacritic Weight
    ;   CW - Case Weight
    ;
    ; For the most part, strings are sorted with all of the "Unicode Weights" being primary, then
    ; the diacritic and case weights as being secondary and tertiary weights.  (a few more extra
    ; weights follow that).
    ;
    ; The sort keys generated are the "Unicode Weight" of the entire string, delimited by 01, then
    ; the Diacritic (or secondary) weight, then another 01 delimiter, then the Case (or tertiary)
    ; weight.  The code also adds two more sections, again delimited by 01.  Weights are 0 terminated.
    ;
    ; So for Äpple the weights would be 0e 02 0e 7e 0e 7e 0e 48 0e 21 01 13 01 12 01 01 00
    ;	0e 02 0e 7e 0e 7e 0e 48 0e 21    "apple" Unicode Weight
    ;   01                               delimeter
    ;   13                               "ä" diacritic weight
    ;   01
    ;   12                               "A" case weight
    ;   01 01 00                         (two sections with no weight in this string and terminator)
    ;
    ; Special cases:
    ;   Our sort keys are delimited by 1 and terminated by 0, so no weights by use 0 or 1.
    ;   In these tables, 0 0 0 0 is used as a flag for "no weight".  Those characters are effectively
    ;   ignored by NLS.
    ;
    ;   East Asian locales may have "3 byte weights".  In that case, DW is promoted to be included
    ;        in UW, so UW then is SM, AW, and DW.  This only happens if 192 <= SW <= 239.
    ;   PUA are always 3 byte weights
    ;
    ; SM values:
    ;   0 - reserved for no weight code points
    ;   1 - flag for non-spacing marks, those weights are added to the previous base character's weights
    ;       (Note: DW are a minimum of 2, so all characters have 2 added to their DW, however
    ;        non-spacing characters presume the base already had that 2, so these weights are only
    ;        the additional diacritic difference, without the base 2
    ;        example: A has a DW of 2, ̈  has a DW of 17, so Ä has a DW of 19 (2 + 17))
    ;   2 - Unused ??? reserved because our minimums are 2 (Does SortGetSortKey generate any 2s?)
    ;   3 - 
    ;   14 - Latin script
    ;
    ; AW values:
    ;   0-1 - reserved for delimiter/terminator
    ;   2 - "first letter in script".  Eg: the AW for A.
    ;
    ; DW Values:
    ;   0-1 - reserved for delimiter/terminator
    ;
    ; CW Values:
    ;   0-1 - reserved for delimiter/terminator
    ;   0x01 bit - Full Width (if set). (1 == Full Width, 0 == Half/Normal Width)
    ;   0x02 bit - Set by default, can be cleared in some case (if another higher bit is set)
    ;   0x04 bit - Super/Subscript?
    ;   0x08 bit -
    ;   0x10 bit - Upper Case.  (16 == upper case, 0 == lower case)
    ;   0x20 bit -
    ;   0x40/0x80 bits - Reserved for nlstrans, which uses these as flags for characters that may compress
    ;
    ; Flags to NLSTrans
    ; After the GUID
    ;     HAS_3_BYTE_WEIGHTS - This ID has 3 byte weights.  Must be tagged on the EXCEPTION AND on the COMPRESSION
    ;     LINGUISTIC_CASING  - Tagged (only) on the EXCEPTION table that applies to the linguistic case.
    ;                          Should also have another untagged EXCEPTION table for non-linguistic casing.
    ;
    SORTKEY
    
    	DEFAULT				; Characters in this table are sorted by: SM, AW, DW, CW, Codepoint #
    
    ; Weightless code points (these will be ignored when comparing strings)
    0x0000	0	0	0	0	;<control>
    0x00ad	0	0	0	0	;Soft Hyphen
    0x034f	0	0	0	0	;Combining Grapheme Joiner
    0x0640	0	0	0	0	;Arabic Tatweel
    0x0ecc	0	0	0	0	;Lao Cancellation Mark
    0x1806	0	0	0	0	;Mongolian Todo Soft Hyphen
    0x180b	0	0	0	0	;Mongolian Free Variation Selector One
    0x180c	0	0	0	0	;Mongolian Free Variation Selector Two
    0x180d	0	0	0	0	;Mongolian Free Variation Selector Three
    0x200c	0	0	0	0	;Zero Width Non-joiner
    0x200d	0	0	0	0	;Zero Width Joiner
    0x200e	0	0	0	0	;Left-to-right Mark
    0x200f	0	0	0	0	;Right-to-left Mark
    0x202a	0	0	0	0	;Left-to-right Embedding
    0x202b	0	0	0	0	;Right-to-left Embedding
    0x202c	0	0	0	0	;Pop Directional Formatting
    0x202d	0	0	0	0	;Left-to-right Override
    0x202e	0	0	0	0	;Right-to-left Override
    0x2060	0	0	0	0	;Word Joiner
    0x2061	0	0	0	0	;Function Application
    0x2062	0	0	0	0	;Invisible Times
    0x2063	0	0	0	0	;Invisible Separator
    0x2064	0	0	0	0	;INVISIBLE PLUS
    0x206a	0	0	0	0	;Inhibit Symmetric Swapping
    0x206b	0	0	0	0	;Activate Symmetric Swapping
    0x206c	0	0	0	0	;Inhibit Arabic Form Shaping
    0x206d	0	0	0	0	;Activate Arabic Form Shaping
    0x206e	0	0	0	0	;National Digit Shapes
    0x206f	0	0	0	0	;Nominal Digit Shapes
    0x3190	0	0	0	0	;Ideographic Annotation Linking Mark
    0x3191	0	0	0	0	;Ideographic Annotation Reverse Mark
    0xdb40	0	0	0	0	;Variation Selector High Surrogate in range 0xE0100-0xE01EF
    0xfe00	0	0	0	0	;Variation Selector-1
    0xfe01	0	0	0	0	;Variation Selector-2
    0xfe02	0	0	0	0	;Variation Selector-3
    0xfe03	0	0	0	0	;Variation Selector-4
    0xfe04	0	0	0	0	;Variation Selector-5
    0xfe05	0	0	0	0	;Variation Selector-6
    0xfe06	0	0	0	0	;Variation Selector-7
    0xfe07	0	0	0	0	;Variation Selector-8
    0xfe08	0	0	0	0	;Variation Selector-9
    0xfe09	0	0	0	0	;Variation Selector-10
    0xfe0a	0	0	0	0	;Variation Selector-11
    0xfe0b	0	0	0	0	;Variation Selector-12
    0xfe0c	0	0	0	0	;Variation Selector-13
    0xfe0d	0	0	0	0	;Variation Selector-14
    0xfe0e	0	0	0	0	;Variation Selector-15
    0xfe0f	0	0	0	0	;Variation Selector-16
    0xfeff	0	0	0	0	;ZWNBSP - Byte Order Mark
    0xfff9	0	0	0	0	;Interlinear Annotation Anchor
    0xfffa	0	0	0	0	;Interlinear Annotation Separator
    0xfffb	0	0	0	0	;Interlinear Annotation Terminator
    0xfffc	0	0	0	0	;Object Replacement Character
    0xfffd	0	0	0	0	;Replacement Character
    
    ;   1 - flag for non-spacing marks, these weights are added to the previous base character's weights
    ;       (Note: DW are a minimum of 2, so all characters have 2 added to their DW, however
    ;        non-spacing characters presume the base already had that 2, so these weights are only
    ;        the additional diacritic difference, without the base 2
    ;        example: A has a DW of 2, ̈  has a DW of 17, so Ä has a DW of 19 (2 + 17))
    ;        We ignore leading/trailing DWs of 2 when we construct the sort key in order to save memory.
    
    ;CP	SCRIPT	ALPHA	DIACRITIC	CASING	COMMENT
    ;0x0b70 had a DW of 0 in sort version 00060304, but we ignore leading/trailing DWs <= 2. Hence we changed the DW to 3 in version 00060305.
    0x0b70	1	0	3	0	;Odia Isshar
    ;The following six characters had a DW of 1 in sort version 00060304, but we ignore leading/trailing DWs <= 2. Hence we changed the DW to 3 in version 00060305.
    0x303c	1	0	3	0	;Masu Mark
    0x303d	1	0	3	0	;Part Alternation Mark
    0x303e	1	0	3	0	;Ideographic Variation Indicator
    0x3099	1	0	3	0	;Non-Spacing Kana Daku-On
    0x309b	1	0	3	0	;Kana Daku-On
    0xff9e	1	0	3	0	;Halfwidth Kana Daku-On
    ;0x05a2 had a DW of 2 in sort version 00060304, but we ignore leading/trailing DWs <= 2. Hence we changed the DW to 3 in version 00060305.
    0x05a2	1	0	3	0	;Hebrew Accent Atnah Hafukh
    ;The following characters had a DW of 2 in sort version 00060304, but we ignore leading/trailing DWs <= 2. 
    ;Nonspacing marks in the same script are already taking DW 3, hence we changed the DW to 4 in version 00060305.
    0x309a	1	0	4	0	;Non-Spacing Kana HanDaku-On
    0x309c	1	0	4	0	;Kana HanDaku-On
    0xff9f	1	0	4	0	;Halfwidth Kana HanDaku-On
    ;The following five characters had a DW of 3 prior to updating the above DWs.
    0x030d	1	0	3	0	;Non-Spacing Vertical Line Above 
    0x0e48	1	0	3	0	;Thai Tone Mai Ek 
    0x1a74	1	0	3	0	;TAI THAM SIGN MAI KANG 
    0x1c37	1	0	3	0	;LEPCHA SIGN NUKTA 
    0xa92b	1	0	3	0	;KAYAH LI TONE PLOPHU 
    0x0591	1	0	4	0	;Hebrew Accent Etnahta 
    0x09bc	1	0	4	0	;Bengali Sign Nukta 
    0x0a3c	1	0	4	0	;Gurmukhi Sign Nukta 
    0x0b3c	1	0	4	0	;Odia Sign Nukta 
    0x0cbc	1	0	4	0	;Kannada Sign Nukta 
    0x0cd5	1	0	4	0	;Kannada Length Mark 
    0x0e49	1	0	4	0	;Thai Tone Mai Tho 
    0x1a75	1	0	4	0	;TAI THAM SIGN TONE-1 
    0xa92c	1	0	4	0	;KAYAH LI TONE CALYA 
    0x0592	1	0	5	0	;Hebrew Accent Tipeha 
    0x0cd6	1	0	5	0	;Kannada Ai Length Mark 
    0x0e4a	1	0	5	0	;Thai Tone Mai Tri 
    0x1a76	1	0	5	0	;TAI THAM SIGN TONE-2 
    0xa92d	1	0	5	0	;KAYAH LI TONE CALYA PLOPHU 
    0x0593	1	0	6	0	;Hebrew Accent Dehi 
    0x093c	1	0	6	0	;Devanagari Sign Nukta 
    0x0a71	1	0	6	0	;Gurmukhi Addak 
    0x0e4b	1	0	6	0	;Thai Tone Mai Chattawa 
    0x1a77	1	0	6	0	;TAI THAM SIGN KHUEN TONE-3 
    0x0594	1	0	7	0	;Hebrew Accent Mahapakh 
    0x0971	1	0	7	0	;DEVANAGARI SIGN HIGH SPACING DOT 
    0x0e4d	1	0	7	0	;Thai Nikkhahit 
    0x1a78	1	0	7	0	;TAI THAM SIGN KHUEN TONE-4 
    0x302a	1	0	7	0	;Ideographic Level Tone Mark 
    0x302e	1	0	7	0	;Hangul Single Dot Tone Mark 
    0x0595	1	0	8	0	;Hebrew Accent Yetiv 
    0x0e47	1	0	8	0	;Thai Vowel Sign Mai Tai Khu 
    0x1a79	1	0	8	0	;TAI THAM SIGN KHUEN TONE-5 
    0x302b	1	0	8	0	;Ideographic Rising Tone Mark 
    0x302f	1	0	8	0	;Hangul Double Dot Tone Mark 
    0x0596	1	0	9	0	;Hebrew Accent Tevir 
    0x0e4c	1	0	9	0	;Thai Thanthakhat 
    0x1a7a	1	0	9	0	;TAI THAM SIGN RA HAAM 
    0x302c	1	0	9	0	;Ideographic Departing Tone Mark 
    0x0597	1	0	10	0	;Hebrew Accent Munah 
    0x07EB	1	0	10	0	;NKO COMBINING SHORT HIGH TONE 
    0x0951	1	0	10	0	;Devanagari Stress Sign Udatta 
    0x0a01	1	0	10	0	;Gurmukhi Sign Adak Bindi 
    0x0a02	1	0	10	0	;Gurmukhi Sign Bindi 
    0x1a7b	1	0	10	0	;TAI THAM SIGN MAI SAM 
    0x0598	1	0	11	0	;Hebrew Accent Merkha 
    0x07EC	1	0	11	0	;NKO COMBINING SHORT LOW TONE 
    0x0abc	1	0	11	0	;Gujarati Sign Nukta 
    0x0f84	1	0	11	0	;Tibetan Mark Halanta 
    0x1a7c	1	0	11	0	;TAI THAM SIGN KHUEN-LUE KARAN 
    0x2d7f	1	0	11	0	;TIFINAGH CONSONANT JOINER 
    0x302d	1	0	11	0	;Ideographic Entering Tone Mark 
    0x02b9	1	0	12	0	;Modifier Prime 
    0x0301	1	0	12	0	;Non-Spacing Acute Accent 
    0x0341	1	0	12	0	;Non-Spacing Acute Tone Mark 
    0x0599	1	0	12	0	;Hebrew Accent Merkha Kefula 
    0x07ED	1	0	12	0	;NKO COMBINING SHORT RISING TONE 
    0x0952	1	0	12	0	;Devanagari Stress Sign Anudatta 
    0x1a7f	1	0	12	0	;TAI THAM COMBINING CRYPTOGRAMMIC DOT 
    0x0300	1	0	13	0	;Non-Spacing Grave Accent 
    0x0340	1	0	13	0	;Non-Spacing Grave Tone Mark 
    0x059a	1	0	13	0	;Hebrew Accent Darga 
    0x07EE	1	0	13	0	;NKO COMBINING LONG DESCENDING TONE 
    0x0c55	1	0	13	0	;Telugu Length Mark 
    0x0f71	1	0	13	0	;Tibetan Vowel Sign Aa 
    0x0307	1	0	14	0	;Non-Spacing Dot Above 
    0x059b	1	0	14	0	;Hebrew Accent Yerah Ben Yomo 
    0x07EF	1	0	14	0	;NKO COMBINING LONG HIGH TONE 
    0x0953	1	0	14	0	;Devanagari Grave Accent 
    0x0a70	1	0	14	0	;Gurmukhi Tippi 
    0x0c56	1	0	14	0	;Telugu Ai Length Mark 
    0x059c	1	0	15	0	;Hebrew Accent Segol 
    0x07F0	1	0	15	0	;NKO COMBINING LONG LOW TONE 
    0x0f39	1	0	15	0	;Tibetan Mark Tsa -phru 
    0x0302	1	0	16	0	;Non-Spacing Circumflex 
    0x059d	1	0	16	0	;Hebrew Accent Shalshelet 
    0x07F1	1	0	16	0	;NKO COMBINING LONG RISING TONE 
    0x0954	1	0	16	0	;Devanagari Acute Accent 
    0x0308	1	0	17	0	;Non-Spacing Diaeresis 
    0x059e	1	0	17	0	;Hebrew Accent Zaqef Qatan 
    0x07F2	1	0	17	0	;NKO COMBINING NASALIZATION MARK 
    0x0f7f	1	0	17	0	;Tibetan Sign Rnam Bcad 
    0x030c	1	0	18	0	;Non-Spacing Caron (Hacek) 
    0x059f	1	0	18	0	;Hebrew Accent Zaqef Gadol 
    0x0306	1	0	19	0	;Non-Spacing Breve 
    0x05a0	1	0	19	0	;Hebrew Accent Revia 
    0x07F3	1	0	19	0	;NKO COMBINING DOUBLE DOT ABOVE 
    0x0f85	1	0	19	0	;Tibetan Mark Paluta 
    0x035d	1	0	20	0	;Combining Double Breve 
    0x0304	1	0	21	0	;Non-Spacing Macron 
    0x035e	1	0	21	0	;Combining Double Macron 
    0x035f	1	0	21	0	;Combining Double Macron Below 
    0x05a1	1	0	21	0	;Hebrew Accent Zarqa 
    0x0f88	1	0	21	0	;Tibetan Sign Lce Tsa Can 
    

    Wednesday, 27 January 2021 5:38 AM
  • OK, thank you!
    Wednesday, 27 January 2021 10:57 AM