How Sorting Order Depends on Runtime and Operating System


This blog post was originally published on the JetBrains .NET blog.

In Rider, we have unit tests that enumerate files in your project and dump a sorted list of these files. In one of our test projects, we had the following files: jquery-1.4.1.js, jquery-1.4.1.min.js, jquery-1.4.1-vsdoc.js. On Windows, .NET Framework, .NET Core, and Mono produce the same sorted list:

jquery-1.4.1.js
jquery-1.4.1.min.js
jquery-1.4.1-vsdoc.js

On Unix, Mono also produces the same list, so we had a consistent list of files across all environments. However, once we migrated to .NET Core, we discovered that the sorting order had changed to:

jquery-1.4.1-vsdoc.js
jquery-1.4.1.js
jquery-1.4.1.min.js

After a quick investigation, we realized that the problem was related to the . and - symbols. The example above can be simplified to the following minimal repro:

var list = new List<string> { "a.b", "a-b" };
Console.WriteLine(string.Join(" ", list.OrderBy(x => x)));

.NET Framework, Mono, and .NET Core on Windows print a.b a-b. However, .NET Core on Unix considers a-b smaller than a.b and prints a-b a.b. Thus, the sorting order depends on the runtime and operating system that you use.

In our codebase, we fixed this problem with the help of StringComparer.Ordinal. Instead of list.OrderBy(x => x), in the example above we would write list.OrderBy(x => x, StringComparer.Ordinal). This guarantees a consistent string order that doesn’t depend on the environment.

We also started to wonder about the other kinds of string sorting “phenomena” we might find by switching between runtimes and operating systems. Let’s find out!
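To make the StringComparer.Ordinal fix concrete, here is a minimal sketch applied to the three jQuery file names from the beginning of the post. The class and method names (OrdinalSortDemo, SortOrdinal) are made up for this illustration and are not taken from the Rider codebase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrdinalSortDemo
{
    // Hypothetical helper: sorts by UTF-16 code unit values,
    // so no culture or platform collation rules are involved.
    public static List<string> SortOrdinal(IEnumerable<string> names) =>
        names.OrderBy(x => x, StringComparer.Ordinal).ToList();

    public static void Main()
    {
        var files = new[]
        {
            "jquery-1.4.1.js",
            "jquery-1.4.1.min.js",
            "jquery-1.4.1-vsdoc.js"
        };

        foreach (var file in SortOrdinal(files))
            Console.WriteLine(file);

        // '-' (U+002D) precedes '.' (U+002E), so this prints:
        // jquery-1.4.1-vsdoc.js
        // jquery-1.4.1.js
        // jquery-1.4.1.min.js
    }
}
```

Note that the ordinal order happens to match the .NET Core on Unix result shown earlier, not the order produced on Windows, so existing test baselines may need to be regenerated once after switching to an ordinal comparer.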

Collecting more data

We took a simple set of characters, .-'!a, and built all possible two-character combinations from them:

var chars = ".-'!a".ToCharArray();
var strings = new List<string>();
for (int i = 0; i < chars.Length; i++)
    for (int j = 0; j < chars.Length; j++)
        strings.Add(chars[i].ToString() + chars[j]);

Next, we compared these combinations to each other on different combinations of runtimes (.NET Framework, .NET Core, Mono) and operating systems (Windows, Linux, macOS):

using (var writer = new StreamWriter(filename))
{
    foreach (var a in strings)
        foreach (var b in strings)
        writer.WriteLine(a.CompareTo(b));
}
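If you want to repeat the experiment, the two snippets above can be combined into one program along the following lines. This is only a sketch under a few assumptions: the names CompareDump, BuildStrings, and Dump are invented here, the output goes to the console instead of a file, and the result is passed through Math.Sign because CompareTo only documents the sign of its return value, not its magnitude. A header line records which runtime and OS produced the dump.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

public static class CompareDump
{
    // Builds every two-character combination of the given alphabet.
    public static List<string> BuildStrings(string alphabet)
    {
        var strings = new List<string>();
        foreach (var first in alphabet)
            foreach (var second in alphabet)
                strings.Add(first.ToString() + second);
        return strings;
    }

    // Writes one line per ordered pair: both strings and the sign of CompareTo.
    public static void Dump(TextWriter writer, IReadOnlyList<string> strings)
    {
        // Header identifying the environment that produced this dump.
        writer.WriteLine(
            $"# {RuntimeInformation.FrameworkDescription} on {RuntimeInformation.OSDescription}");

        foreach (var a in strings)
            foreach (var b in strings)
                // Math.Sign collapses any negative or positive value to -1 or 1.
                writer.WriteLine($"{a} {b} {Math.Sign(a.CompareTo(b))}");
    }

    public static void Main()
    {
        Dump(Console.Out, BuildStrings(".-'!a"));
    }
}
```

Redirecting the output to a file on each runtime and OS combination and then diffing the files is one straightforward way to spot the inconsistent pairs.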

We discovered three different cases in which the CompareTo results are not consistent. To illustrate them, we took 4 string pairs from each group and built the following diagram for you:


In the previous post where we discussed socket implementations in different environments, we showed the source code for all relevant cases. This time, we suggest you do this exercise yourself. Try digging into the source code of all runtimes to find explanations for the above picture. For a bonus challenge, do your own experiments with CultureInfo.CurrentCulture and learn more about how the sorting order depends on the system locale. It would be great if you could share your findings with the community! To give you further inspiration for this kind of research, we want to show you a few more interesting facts.

More tricky cases

Sorting order can be pretty tricky, even if you are only working within one environment. A great example of unexpected behavior can be found in this Stack Overflow question, where developers discuss the following code snippet:

"+".CompareTo("-")
Returns: 1

"+1".CompareTo("-1")
Returns: -1

As you can see, "+" is greater than "-", while "+1" is less than "-1". The best answer quotes the following paragraph from Microsoft Docs:

The comparison uses the current culture to obtain culture-specific information such as casing rules and the alphabetic order of individual characters. For example, a culture could specify that certain combinations of characters be treated as a single character, or uppercase and lowercase characters be compared in a particular way, or that the sorting order of a character depends on the characters that precede or follow it.

If we continue to read the documentation, we will see that there are overloads of string.Compare that take System.Globalization.CompareOptions as one of the arguments. Here is the most common overload:

public static int Compare(string strA, string strB, CultureInfo culture, CompareOptions options);
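As a starting point for experiments with this overload, the sketch below compares the "+" and "-" strings from the Stack Overflow example under a handful of CompareOptions values. The helper name CompareSign and the choice of CultureInfo.InvariantCulture are assumptions made for this illustration. No expected output is listed for the culture-sensitive options, because that is exactly the part that may differ between runtimes and operating systems; only CompareOptions.Ordinal is guaranteed to report "+" (U+002B) as smaller than "-" (U+002D) in both cases.

```csharp
using System;
using System.Globalization;

public static class CompareOptionsDemo
{
    // Hypothetical helper: returns only the sign (-1, 0, or 1) of the comparison.
    public static int CompareSign(string a, string b, CompareOptions options) =>
        Math.Sign(string.Compare(a, b, CultureInfo.InvariantCulture, options));

    public static void Main()
    {
        var optionsToTry = new[]
        {
            CompareOptions.None,
            CompareOptions.StringSort,
            CompareOptions.IgnoreSymbols,
            CompareOptions.Ordinal
        };

        foreach (var options in optionsToTry)
        {
            // Results for the culture-sensitive options depend on the environment.
            Console.WriteLine(
                $"{options}: \"+\" vs \"-\" = {CompareSign("+", "-", options)}, " +
                $"\"+1\" vs \"-1\" = {CompareSign("+1", "-1", options)}");
        }
    }
}
```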

The CompareOptions flag enum defines the string comparison rules. Here are the most interesting values:

Try to play with these values, and find examples of string lists that can be sorted differently depending on the above flags. This kind of experiment is a great way to learn more about the runtime and to become more aware of pitfalls related to string sorting. This is not the end of our adventure, however. There is one more global option that can completely change the behavior of string comparison!

Globalization invariant mode

In .NET Core 2.0+, there is a feature called Globalization invariant mode, which uses the Ordinal sorting rule for all string comparisons by default. It can be enabled by setting the DOTNET_SYSTEM_GLOBALIZATION_INVARIANT environment variable to true or 1. Let's enable this mode and run a variation of the examples from the previous section:

Console.WriteLine(string.Compare("-", "+"));
Console.WriteLine(string.Compare("-x", "+x"));

Now it prints a new result:

2
2

Some developers may think that it’s a good idea to always enable this by default to avoid problems with inconsistent sorting. Note that in this mode, you will get poor globalization support: a lot of features will be affected, including all CultureInfo-specific logic, string operations, internationalized domain names (IDN) support, and even time zone display names on Linux. If you want to enable it, carefully read the documentation first. It’s worth mentioning that if you don’t control the environment of your application, there is a chance that users will enable it manually. This could significantly affect any .NET Core application!
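Since users can turn this mode on without your knowledge, an application may want to at least report it at startup. The following is a rough sketch that only inspects the environment variable mentioned above; the class and helper names are hypothetical, and the check deliberately does not cover any other way the mode might be configured.

```csharp
using System;

public static class InvariantModeCheck
{
    // Hypothetical helper: treats "1" or "true" (any casing) as an enabling value,
    // following the description of the environment variable in this post.
    public static bool IsEnablingValue(string value) =>
        value != null &&
        (value == "1" || value.Equals("true", StringComparison.OrdinalIgnoreCase));

    public static void Main()
    {
        var value = Environment.GetEnvironmentVariable("DOTNET_SYSTEM_GLOBALIZATION_INVARIANT");

        if (IsEnablingValue(value))
        {
            // Assumption: a warning on stderr is enough; a real application
            // would probably route this through its own logging.
            Console.Error.WriteLine(
                "Warning: globalization invariant mode was requested via the environment; " +
                "culture-sensitive string comparisons will behave like ordinal ones.");
        }
    }
}
```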

Conclusion

Here are a few practical recommendations that can help you avoid tricky and painful bugs in the future: