How Sorting Order Depends on Runtime and Operating System
This blog post was originally published on the JetBrains .NET blog.
In Rider, we have unit tests that enumerate the files in your project and dump a sorted list of them. In one of our test projects, we had the following files: jquery-1.4.1.js, jquery-1.4.1.min.js, and jquery-1.4.1-vsdoc.js. On Windows, .NET Framework, .NET Core, and Mono all produce the same sorted list:
jquery-1.4.1.js
jquery-1.4.1.min.js
jquery-1.4.1-vsdoc.js
On Unix, Mono also produces the same list, so we had a consistent list of files across all environments. However, once we migrated to .NET Core, we discovered that the sorting order had changed to:
jquery-1.4.1-vsdoc.js
jquery-1.4.1.js
jquery-1.4.1.min.js
After a quick investigation, we realized that the problem was related to the . and - symbols. The example above can be simplified to the following minimal repro:
var list = new List<string> { "a.b", "a-b" };
Console.WriteLine(string.Join(" ", list.OrderBy(x => x)));
.NET Framework, Mono, and .NET Core on Windows print a.b a-b. However, .NET Core on Unix considers a-b smaller than a.b and prints a-b a.b. Thus, the sorting order depends on the runtime and operating system that you use.
In our codebase, we fixed this problem with the help of StringComparer.Ordinal. Instead of list.OrderBy(x => x), in the example above we would write list.OrderBy(x => x, StringComparer.Ordinal). This guarantees a consistent string order that doesn’t depend on the environment.
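As a minimal sketch of that fix applied to the file names from the beginning of this post (the variable names here are illustrative and not taken from the Rider codebase), the ordinal sort could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var files = new List<string>
{
    "jquery-1.4.1.js",
    "jquery-1.4.1.min.js",
    "jquery-1.4.1-vsdoc.js",
};

// Ordinal comparison looks only at UTF-16 code units:
// '-' (0x2D) comes before '.' (0x2E), and 'j' comes before 'm'.
var sorted = files.OrderBy(x => x, StringComparer.Ordinal).ToList();

Console.WriteLine(string.Join(Environment.NewLine, sorted));
// jquery-1.4.1-vsdoc.js
// jquery-1.4.1.js
// jquery-1.4.1.min.js
```

Note that the ordinal order puts the -vsdoc file first, which is the order .NET Core printed on Unix in the example above. What you gain is that the result is the same everywhere.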
We also started to wonder about the other kinds of string sorting “phenomena” we might find by switching between runtimes and operating systems. Let’s find out!
Collecting more data
We took a simple set of characters, .-'!a, and built all possible two-character combinations from them:
var chars = ".-'!a".ToCharArray();
var strings = new List<string>();
for (int i = 0; i < chars.Length; i++)
    for (int j = 0; j < chars.Length; j++)
        strings.Add(chars[i].ToString() + chars[j]);
Next, we compared these combinations to each other on different combinations of runtimes (.NET Framework, .NET Core, Mono) and operating systems (Windows, Linux, macOS):
using (var writer = new StreamWriter(filename))
{
    foreach (var a in strings)
        foreach (var b in strings)
            writer.WriteLine(a.CompareTo(b));
}
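If you want to repeat the experiment, here is one possible variation of the dump, offered as a sketch rather than the exact code we ran. It labels each line with the compared pair and normalizes the results with Math.Sign, since only the sign of a CompareTo result is meaningful and the magnitude may differ between environments. It also records the ordinal result next to the culture-sensitive one. The output file name is a hypothetical choice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

var chars = ".-'!a".ToCharArray();
var strings = new List<string>();
foreach (var first in chars)
    foreach (var second in chars)
        strings.Add(first.ToString() + second);

var lines = new List<string>();
foreach (var a in strings)
    foreach (var b in strings)
    {
        // Normalize both results to -1, 0, or 1 so dumps from
        // different environments can be compared line by line.
        int cultureSign = Math.Sign(a.CompareTo(b));
        int ordinalSign = Math.Sign(string.CompareOrdinal(a, b));
        lines.Add($"{a} {b} {cultureSign} {ordinalSign}");
    }

// Hypothetical output location; any writable path would do.
var path = Path.Combine(Path.GetTempPath(), "compare-dump.txt");
File.WriteAllLines(path, lines);
```

Running this on each runtime and operating system and comparing the resulting files with a diff tool should highlight the pairs whose culture-sensitive result changes, while the ordinal column stays the same.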
We discovered three different cases in which the CompareTo results are not consistent. To illustrate them, we took 4 string pairs from each group and built the following diagram for you:
In the previous post, where we discussed socket implementations in different environments, we showed the source code for all relevant cases. This time, we suggest you do this exercise yourself. Try digging into the source code of all runtimes to find explanations for the above picture. For a bonus challenge, do your own experiments with CultureInfo.CurrentCulture and learn more about how the sorting order depends on the system locale. It would be great if you could share your findings with the community! To give you further inspiration for this kind of research, we want to show you a few more interesting facts.
More tricky cases
Sorting order can be pretty tricky, even if you are only working within one environment. A great example of unexpected behavior can be found in this StackOverflow question, where developers discuss the following code snippet:
"+".CompareTo("-")
Returns: 1
"+1".CompareTo("-1")
Returns: -1
As you can see, "+" is greater than "-", while "+1" is less than "-1". The best answer quotes the following paragraph from Microsoft Docs:
The comparison uses the current culture to obtain culture-specific information such as casing rules and the alphabetic order of individual characters. For example, a culture could specify that certain combinations of characters be treated as a single character, or uppercase and lowercase characters be compared in a particular way, or that the sorting order of a character depends on the characters that precede or follow it.
If we continue to read the documentation, we will see that there are overloads of string.Compare that take System.Globalization.CompareOptions as one of the arguments. Here is the most common overload:
public static int Compare(string strA, string strB, CultureInfo culture, CompareOptions options);
The CompareOptions flag enum defines the string comparison rules. Here are the most interesting values:
- IgnoreKanaType: Indicates that the string comparison must ignore the Kana type. Kana type refers to Japanese hiragana and katakana characters, which represent phonetic sounds in the Japanese language. Hiragana is used for native Japanese expressions and words, while katakana is used for words borrowed from other languages, such as "computer" or "Internet". A phonetic sound can be expressed in both hiragana and katakana. If this value is selected, the hiragana character for one sound is considered equal to the katakana character for the same sound.
- IgnoreNonSpace: Indicates that the string comparison must ignore non-spacing combining characters, such as diacritics. The Unicode Standard defines combining characters as characters that are combined with base characters to produce a new character. Non-spacing combining characters do not occupy a spacing position by themselves when rendered.
- IgnoreSymbols: Indicates that the string comparison must ignore symbols, such as white-space characters, punctuation, currency symbols, the percent sign, mathematical symbols, the ampersand, and so on.
- IgnoreWidth: Indicates that the string comparison must ignore character width. For example, Japanese katakana characters can be written as full-width or half-width. If this value is selected, the katakana characters written as full-width are considered equal to the same characters written as half-width.
- Ordinal: Indicates that the string comparison must use the successive Unicode UTF-16 encoded values of the string (code unit by code unit comparison), leading to a fast comparison, but one that is culture-insensitive. A string starting with a code unit XXXX₁₆ comes before a string starting with YYYY₁₆, if XXXX₁₆ is less than YYYY₁₆. This value cannot be combined with other CompareOptions values and must be used alone.
- StringSort: Indicates that the string comparison must use the string sort algorithm. In a string sort, the hyphen and the apostrophe, as well as other non-alphanumeric symbols, come before alphanumeric characters.
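To see how these options affect the "+" and "-" example, a small sketch like the following could be used. It calls the overload shown above with CultureInfo.InvariantCulture, which is our choice for illustration. Only the Ordinal results are annotated with expected output, because the culture-sensitive results may differ between runtimes and operating systems.

```csharp
using System;
using System.Globalization;

var culture = CultureInfo.InvariantCulture;

// Ordinal: pure code-unit comparison. '+' (0x2B) is less than '-' (0x2D),
// so both pairs compare the same way.
int ordinalShort = string.Compare("+", "-", culture, CompareOptions.Ordinal);
int ordinalLong = string.Compare("+1", "-1", culture, CompareOptions.Ordinal);
Console.WriteLine($"Ordinal: {Math.Sign(ordinalShort)} {Math.Sign(ordinalLong)}");
// Ordinal: -1 -1

// Culture-sensitive options: the results depend on the environment,
// so no expected output is given here.
var optionsToTry = new[]
{
    CompareOptions.None,
    CompareOptions.StringSort,
    CompareOptions.IgnoreSymbols,
};
foreach (var options in optionsToTry)
{
    int shortResult = string.Compare("+", "-", culture, options);
    int longResult = string.Compare("+1", "-1", culture, options);
    Console.WriteLine($"{options}: {Math.Sign(shortResult)} {Math.Sign(longResult)}");
}
```

Comparing the printed lines across environments is a quick way to check which options give you the behavior you expect.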
Globalization invariant mode
In .NET Core 2.0+, there is a feature called Globalization invariant mode, which uses the Ordinal sorting rule for all string comparisons by default. It can be enabled by setting the DOTNET_SYSTEM_GLOBALIZATION_INVARIANT environment variable to true or 1. Let's enable this mode and run examples similar to those from the previous section:
Console.WriteLine(string.Compare("-", "+"));
Console.WriteLine(string.Compare("-x", "+x"));
Now it prints a new result:
2
2
Some developers may think that it’s a good idea to always enable this mode to avoid problems with inconsistent sorting. Note, however, that in this mode you will get poor globalization support: a lot of features will be affected, including all CultureInfo-specific logic, string operations, internationalized domain names (IDN) support, and even time zone display names on Linux. If you want to enable it, carefully read the documentation first.
It’s worth mentioning that if you don’t control the environment of your application, there is a chance that users will enable it manually. This could significantly affect any .NET Core application!
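If your application relies on culture-sensitive behavior, it may be worth logging at startup whether this variable is set. The sketch below is only an approximation: RequestsInvariantMode is a hypothetical helper, it accepts exactly the two values mentioned above, and it inspects only the environment variable, so it would miss any other way of enabling the mode.

```csharp
using System;

// Hypothetical helper: treats the values named in this post ("true" or "1")
// as a request for Globalization invariant mode.
static bool RequestsInvariantMode(string value) =>
    value == "true" || value == "1";

var raw = Environment.GetEnvironmentVariable("DOTNET_SYSTEM_GLOBALIZATION_INVARIANT");

Console.WriteLine(RequestsInvariantMode(raw)
    ? "Globalization invariant mode was requested via the environment."
    : "The environment variable does not request invariant mode.");
```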
Conclusion
Here are a few practical recommendations that can help you avoid tricky and painful bugs in the future:
- If you want to achieve consistent string comparison across different runtimes and operating systems, always use StringComparer.Ordinal.
- If you don't use StringComparer.Ordinal, always keep in mind that the sorting order may depend on runtime, operating system, current culture, and environment variables.
- Try to do your own experiments and learn more about sorting rules in .NET. This time we decided to leave out the detailed explanations and instead encourage you to explore them for yourself. After all, this is the best way to learn something new and improve your programming skills!