How to Sort Unicode Strings Alphabetically in Python

The post content below is copied from Real Python.

Sorting strings in Python is something that you can almost take for granted if you do it all the time. Yet, it can become surprisingly tricky considering all the edge cases lurking in the vast Unicode standard. Understanding Unicode isn’t an easy feat, so prepare yourself for a whirlwind tour of surprising edge cases and effective ways of dealing with them.

Python pioneered and popularized a robust sorting algorithm called Timsort, which now ships with several major programming languages, including Java, Rust, and Swift. When you call sorted() or list.sort(), Python uses this algorithm under the hood to rearrange elements. As long as your sequence contains comparable elements of compatible types, you can sort numbers, strings, and other data types in the expected order:

>>> math_constants = [6.28, 2.72, 3.14]
>>> sorted(math_constants)
[2.72, 3.14, 6.28]

>>> fruits = ["orange", "banana", "lemon", "apple"]
>>> sorted(fruits)
['apple', 'banana', 'lemon', 'orange']

>>> people = [("John", "Doe"), ("Anna", "Smith")]
>>> sorted(people)
[('Anna', 'Smith'), ('John', 'Doe')]

Unless you specify otherwise, Python sorts these elements by value in ascending order, comparing them pairwise. Each pair must contain elements that are comparable using the less than (<) operator, which is the only comparison that sorted() and list.sort() rely on. Sometimes, this operator is undefined for specific data types like complex numbers or between two distinct types. In these cases, a comparison will fail:

>>> 4 + 2j < 2 + 4j
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'complex' and 'complex'

>>> 3.14 < "orange"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'float' and 'str'
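When the default ascending order isn't what you want, or when the elements don't support the comparison at all, sorted() accepts the optional key and reverse arguments. The sketch below reuses the fruits list from above. The complex numbers are made-up sample values, and ordering them by magnitude is only one possible choice:

```python
fruits = ["orange", "banana", "lemon", "apple"]

# Reverse the default ascending order
descending = sorted(fruits, reverse=True)
print(descending)  # ['orange', 'lemon', 'banana', 'apple']

# Sort by a derived value instead of the element itself. The sort is
# stable, so words of equal length keep their original relative order.
by_length = sorted(fruits, key=len)
print(by_length)  # ['lemon', 'apple', 'orange', 'banana']

# Complex numbers don't support <, but their magnitudes (floats) do
numbers = [3 + 4j, 1 + 1j, 2j]
by_magnitude = sorted(numbers, key=abs)
print(by_magnitude)  # [(1+1j), 2j, (3+4j)]
```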

In most cases, you’ll be working with homogeneous sequences comprising elements of the same type, so you’ll rarely run into this problem in the wild. However, things may start to fall apart once you throw strings with non-Latin characters into the mix, such as letters with diacritics or accents:

>>> polish_names = ["Zbigniew", "Ludmiła", "Żaneta", "Łukasz"]
>>> sorted(polish_names)
['Ludmiła', 'Zbigniew', 'Łukasz', 'Żaneta']

In this example, you run into a common challenge associated with sorting strings. When a string contains characters whose ordinal values extend beyond the usual ASCII range, you might get unexpected results like the ones here. The name Łukasz ends up after Zbigniew, even though the letter Ł (pronounced like the English w) occurs earlier than Z in the Polish alphabet. Why’s that?

Python sorts strings lexicographically by comparing Unicode code points of the individual characters from left to right. It just so happens that the letter Ł has a higher ordinal value in Unicode than the letter Z, making it greater than any of the Latin letters:

>>> ord("Ł")
321

>>> ord("Z")
90

>>> any("Ł" < letter for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
False
Code points are unique yet arbitrary identifiers of Unicode characters, and they don’t inherently agree with the alphabetical order of natural languages. So, sorting by code point generally won’t produce the expected order for languages other than English.
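To see that the default string comparison really is a comparison of code points, you can sort by an explicit tuple of ordinal values and check that the result matches. This is a minimal sketch, and the code_points() helper is a name made up for this illustration:

```python
polish_names = ["Zbigniew", "Ludmiła", "Żaneta", "Łukasz"]

def code_points(text):
    """Return the Unicode code points of the text as a tuple of integers."""
    return tuple(ord(char) for char in text)

# Sorting by explicit code points gives the same result as the default sort
by_code_points = sorted(polish_names, key=code_points)
print(by_code_points == sorted(polish_names))  # True

# The first letters and their code points in hexadecimal notation
first_letters = {name[0]: f"U+{ord(name[0]):04X}" for name in polish_names}
print(first_letters)
# {'Z': 'U+005A', 'L': 'U+004C', 'Ż': 'U+017B', 'Ł': 'U+0141'}
```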

Different cultures follow different rules for sorting strings, even when they share the same alphabet or parts of it with other cultures. For example, the ch digraph is considered two separate letters (c and h) in Polish, but it becomes a stand-alone letter placed between h and i in the Czech alphabet. This is known as a contraction in Unicode.
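The effect of a contraction can be imitated with a hand-written sort key. The following is only a rough sketch, not a real collation implementation: the letter table is a simplified, lowercase-only Czech order without accented letters, and the names CZECH_ORDER, RANK, and czech_key() are made up for this illustration:

```python
import re

# Simplified lowercase Czech letter order without accented letters,
# an assumption made to keep this sketch short
CZECH_ORDER = "a b c d e f g h ch i j k l m n o p q r s t u v w x y z".split()
RANK = {letter: index for index, letter in enumerate(CZECH_ORDER)}

def czech_key(word):
    """Map a word to a list of letter ranks, treating "ch" as one letter."""
    # The alternation tries "ch" first, so the digraph becomes a single token
    tokens = re.findall(r"ch|.", word.lower())
    # Characters missing from the table sort after all known letters
    return [RANK.get(token, len(RANK)) for token in tokens]

words = ["chata", "cena", "hrad", "ihned", "cihla"]
print(sorted(words))
# ['cena', 'chata', 'cihla', 'hrad', 'ihned']
print(sorted(words, key=czech_key))
# ['cena', 'cihla', 'hrad', 'chata', 'ihned']
```

With the custom key, chata moves behind hrad because the ch token ranks between h and i.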

Moreover, the sorting order can sometimes differ within the same culture, depending on the context. For example, most German phone books tend to treat letters with an umlaut (ä, ö, ü) similarly to the ae, oe, and ue letter combinations. However, other contexts, such as German dictionaries, overwhelmingly treat these letters the same as their Latin counterparts (a, o, u).
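The two conventions for umlauts can be approximated with simple translation tables. This is a sketch under stated assumptions: it covers only the three lowercase umlauts mentioned above, the sample words are made up, and the names phone_book_key() and base_letter_key() are hypothetical helpers rather than part of any library:

```python
import unicodedata

# Umlauts expanded to two letters, as in the phone book convention
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})
# Umlauts reduced to their base letters
BASE_LETTER = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

def phone_book_key(text):
    # NFC makes sure each umlaut is a single precomposed character
    return unicodedata.normalize("NFC", text).lower().translate(PHONE_BOOK)

def base_letter_key(text):
    return unicodedata.normalize("NFC", text).lower().translate(BASE_LETTER)

words = ["Ofen", "Ötzi", "Ozean", "Oder"]
print(sorted(words))
# ['Oder', 'Ofen', 'Ozean', 'Ötzi']
print(sorted(words, key=phone_book_key))
# ['Oder', 'Ötzi', 'Ofen', 'Ozean']
print(sorted(words, key=base_letter_key))
# ['Oder', 'Ofen', 'Ötzi', 'Ozean']
```

The same four words end up in three different orders, depending on which rule you pick.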

There’s no universally correct way to sort Unicode strings. You need to tell Python which rules to apply to get the desired ordering. So, how do you sort Unicode strings alphabetically in Python?
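Before reaching for a standardized solution, it can help to see what telling Python the rules might look like in the most naive form. The sketch below ranks characters by their position in a hand-written Polish alphabet. The names POLISH_ORDER, RANK, and polish_key() are made up for this illustration, and the approach handles neither other languages nor characters outside the table in any meaningful way:

```python
import unicodedata

# Lowercase Polish alphabet, plus q, v, and x, which occur in loanwords
POLISH_ORDER = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {letter: index for index, letter in enumerate(POLISH_ORDER)}

def polish_key(text):
    """Map each character to its position in the Polish alphabet."""
    # NFC makes sure accented letters are single precomposed characters
    normalized = unicodedata.normalize("NFC", text).lower()
    # Characters missing from the table sort after all known letters
    return [RANK.get(char, len(RANK)) for char in normalized]

polish_names = ["Zbigniew", "Ludmiła", "Żaneta", "Łukasz"]
print(sorted(polish_names, key=polish_key))
# ['Ludmiła', 'Łukasz', 'Zbigniew', 'Żaneta']
```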

How to Sort Strings Using the Unicode Collation Algorithm (UCA)

The problem of sorting Unicode strings isn’t unique to Python. It’s a common challenge in any programming language or database. To address it, Unicode Technical Standard #10 (UTS #10) describes the collation of Unicode strings, which is a consistent way of comparing two strings to establish their sorting order.

The Unicode Collation Algorithm (UCA) assigns a hierarchy of numeric weights to each character, allowing the creation of binary sort keys that account for accents and other special cases. These keys are defined at four levels that determine various features of a character:

  1. Primary: The base letter
  2. Secondary: The accents
  3. Tertiary: The letter case
  4. Quaternary: Other features

Later in this tutorial, you’ll learn how to leverage these weight levels to customize the Unicode collation algorithm by, for example, ignoring the letter case in case-insensitive sorting.
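To build some intuition for the weight levels, here's a toy sort key loosely modeled on the first three of them. It's not an implementation of the UCA: it uses no collation element table, relies only on the standard unicodedata module, and treats letters without a canonical decomposition, such as Ł, as plain base letters. The toy_sort_key() function and its levels parameter are made up for this illustration:

```python
import unicodedata

def toy_sort_key(text, levels=3):
    """Return a tuple key loosely modeled on the first three UCA levels."""
    # NFD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Primary: base letters, ignoring accents and letter case
    primary = base.casefold()
    # Secondary: the accents, here just the combining marks in order
    secondary = tuple(c for c in decomposed if unicodedata.combining(c))
    # Tertiary: the letter case, where False (lowercase) sorts first
    tertiary = tuple(c.isupper() for c in base)
    return (primary, secondary, tertiary)[:levels]

words = ["Role", "rôle", "roles", "role", "Rôle"]

print(sorted(words, key=toy_sort_key))
# ['role', 'Role', 'rôle', 'Rôle', 'roles']

# Dropping the tertiary level ignores the letter case, so words that
# differ only by case keep their original relative order
print(sorted(words, key=lambda word: toy_sort_key(word, levels=2)))
# ['Role', 'role', 'rôle', 'Rôle', 'roles']
```

Because the levels are compared in order, a difference in base letters always outweighs a difference in accents, which in turn outweighs a difference in letter case.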

While the UCA supplies the Default Unicode Collation Element Table (DUCET), you should generally customize this default collation table to the specific needs of a particular language and application. It’s virtually impossible to ensure the desired sort order for all languages using only one character table. Therefore, software libraries implementing the UCA usually rely on the Common Locale Data Repository (CLDR) to provide such customization.

This repository contains several XML documents with language-specific information. For example, the collation rules for sorting text in Polish explain the relationship between the letters z, ź, and ż in both uppercase and lowercase:

Read the full article at »


October 11, 2023 at 07:30PM
