How to Sort Unicode Strings Alphabetically in Python

October 11, 2023

The post content below is copied from Real Python.

Sorting strings in Python is something that you can almost take for granted if you do it all the time. Yet, it can become surprisingly tricky considering all the edge cases lurking in the vast Unicode standard. Understanding Unicode isn’t an easy feat, so prepare yourself for a whirlwind tour of surprising edge cases and effective ways of dealing with them.

Note: In this tutorial, you’ll often use the term Unicode string to mean any string with non-Latin letters and characters like emoji. However, strings consisting of Basic Latin letters also fall into this category because the underlying ASCII table is a subset of Unicode.

Python pioneered and popularized a robust sorting algorithm called Timsort, which now ships with several major programming languages, including Java, Rust, and Swift. When you call sorted() or list.sort(), Python uses this algorithm under the hood to rearrange elements. As long as your sequence contains comparable elements of compatible types, you can sort numbers, strings, and other data types in the expected order:

>>> math_constants = [6.28, 2.72, 3.14]
>>> sorted(math_constants)
[2.72, 3.14, 6.28]

>>> fruits = ["orange", "banana", "lemon", "apple"]
>>> sorted(fruits)
['apple', 'banana', 'lemon', 'orange']

>>> people = [("John", "Doe"), ("Anna", "Smith")]
>>> sorted(people)
[('Anna', 'Smith'), ('John', 'Doe')]

Unless you specify otherwise, Python sorts these elements by value in ascending order, comparing them pairwise. Each pair must contain elements that are comparable using either the less than (<) or greater than (>) operator. Sometimes, these comparison operators are undefined for specific data types like complex numbers or between two distinct types. In these cases, a comparison will fail:

>>> 4 + 2j < 2 + 4j
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'complex' and 'complex'

>>> 3.14 < "orange"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'float' and 'str'
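To specify a different order, you can pass the key and reverse parameters that sorted() and list.sort() both accept. The following is a minimal sketch, assuming that sorting complex numbers by magnitude is an acceptable substitute for the missing less than operator in your use case:

```python
fruits = ["orange", "banana", "lemon", "apple"]

# reverse=True flips the default ascending order.
descending = sorted(fruits, reverse=True)
print(descending)  # ['orange', 'lemon', 'banana', 'apple']

# key=len compares the lengths instead of the strings themselves.
# The sort is stable, so equally long words keep their input order.
by_length = sorted(fruits, key=len)
print(by_length)  # ['lemon', 'apple', 'orange', 'banana']

# Complex numbers don't support "<", but their magnitudes are floats,
# which do. Sorting by abs() is one possible choice (an assumption here).
numbers = [3 + 4j, 1 + 1j, 2j]
by_magnitude = sorted(numbers, key=abs)
print(by_magnitude)  # [(1+1j), 2j, (3+4j)]
```

The key function runs once per element, and Python then compares the returned values instead of the original elements.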

In most cases, you’ll be working with homogeneous sequences comprising elements of the same type, so you’ll rarely run into this problem in the wild. However, things may start to fall apart once you throw strings with non-Latin characters into the mix, such as letters with diacritics or accents:

>>> polish_names = ["Zbigniew", "Ludmiła", "Żaneta", "Łukasz"]
>>> sorted(polish_names)
['Ludmiła', 'Zbigniew', 'Łukasz', 'Żaneta']

In this example, you run into a common challenge associated with sorting strings. When a string contains characters whose ordinal values extend beyond the usual ASCII range, then you might get unexpected results like what you have here. The name Łukasz ends up after Zbigniew, even though the letter Ł (pronounced like a w sound in English) occurs earlier than Z in the Polish alphabet. Why’s that?

Python sorts strings lexicographically by comparing Unicode code points of the individual characters from left to right. It just so happens that the letter Ł has a higher ordinal value in Unicode than the letter Z, making it greater than any of the Latin letters:

>>> ord("Ł")
321

>>> ord("Z")
90

>>> any("Ł" < letter for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
False
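To see the same rule applied to whole strings, you can compare the default order with an explicit sort by lists of code points. This is only an illustrative sketch, and the helper name code_points() is made up for this example:

```python
def code_points(text):
    """Return the Unicode code points of each character in the text."""
    return [ord(char) for char in text]

polish_names = ["Zbigniew", "Ludmiła", "Żaneta", "Łukasz"]

# The default string comparison and the explicit comparison of code
# point lists produce the same order.
by_default = sorted(polish_names)
by_code_points = sorted(polish_names, key=code_points)
print(by_default == by_code_points)  # True

# Ł (321) is greater than Z (90), so Łukasz lands after Zbigniew.
print(code_points("Łukasz"))  # [321, 117, 107, 97, 115, 122]
```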

Code points are unique yet arbitrary identifiers of Unicode characters, and they don’t inherently agree with the alphabetical order of spoken languages. So, sorting strings by their code points often won’t produce the order that speakers of languages other than English expect.

Different cultures follow different rules for sorting strings, even when they share the same alphabet or parts of it with other cultures. For example, the ch digraph is considered two separate letters (c and h) in Polish, but it becomes a stand-alone letter placed between h and i in the Czech alphabet. This is known as a contraction in Unicode.
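A contraction like this can be imitated with a custom key function that reads ch as a single token. The sketch below is a toy under stated assumptions: it uses a hypothetical simplified alphabet consisting of the Basic Latin letters with ch inserted after h, it ignores Czech accented letters entirely, and the name czech_like_key() is invented for this example:

```python
import string

# Assumption: Basic Latin letters only, with the "ch" digraph placed
# right after "h". Real Czech collation has more rules than this.
LETTERS = list(string.ascii_lowercase)
LETTERS.insert(LETTERS.index("h") + 1, "ch")
WEIGHTS = {letter: weight for weight, letter in enumerate(LETTERS)}

def czech_like_key(word):
    """Turn a word into a list of weights, treating "ch" as one letter."""
    text = word.lower()
    weights = []
    index = 0
    while index < len(text):
        # Greedily match the two-character contraction first.
        token = "ch" if text.startswith("ch", index) else text[index]
        # Unknown characters sort after all known letters.
        weights.append(WEIGHTS.get(token, len(WEIGHTS) + ord(token[0])))
        index += len(token)
    return weights

words = ["chata", "cena", "hrad", "internet", "duha"]
print(sorted(words))
# ['cena', 'chata', 'duha', 'hrad', 'internet']
print(sorted(words, key=czech_like_key))
# ['cena', 'duha', 'hrad', 'chata', 'internet']
```

With the contraction in place, chata moves from its position among the words starting with c to a spot between hrad and internet.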

Note: Language rules occasionally change. For instance, the Spanish alphabet treated the ch and ll digraphs as single letters until 1994, when the official language regulation institution, RAE, declared them separate.

At some point, the accent ordering in French changed from backward to forward everywhere in the world except for Canada, which still sticks to the old tradition:

Region	Accent Ordering
France	cote, coté, côte, côté
Canada	cote, côte, coté, côté

In modern French, words are compared from left to right. The earliest letter with an accent, which is ô in the example above, corresponds to a greater sort key, pushing the word to the end of the list. On the other hand, in Canada, words are compared backward from right to left, so the last accent’s position determines the final order.
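The two accent orderings can be reproduced with a small key function built on the standard unicodedata module. This is a rough sketch rather than an implementation of any official collation rules: it assumes that raw code points of combining marks are good enough as accent weights, and the accent_key() helper with its backward flag is hypothetical:

```python
import unicodedata

def accent_key(word, backward=False):
    """Compare base letters first, then accents forward or backward."""
    base, accents = [], []
    # NFD splits an accented letter into a base letter plus combining marks.
    for char in unicodedata.normalize("NFD", word):
        if unicodedata.combining(char) and accents:
            # Attach the combining mark to the preceding base letter.
            accents[-1] += (ord(char),)
        else:
            base.append(char)
            accents.append(())
    if backward:
        # Backward ordering: the last accent in the word matters most.
        accents.reverse()
    return "".join(base), accents

words = ["côté", "cote", "côte", "coté"]

forward = sorted(words, key=accent_key)
print(forward)  # ['cote', 'coté', 'côte', 'côté']

backward = sorted(words, key=lambda word: accent_key(word, backward=True))
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

All four words share the same base letters, so the accents alone decide the result, and reversing the accent list is what swaps coté and côte.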

Depending on their geographic location worldwide, individuals speaking the same language may expect a slightly different sorting order.

Moreover, the sorting order can sometimes differ within the same culture, depending on the context. For example, most German phone books tend to treat letters with an umlaut (ä, ö, ü) similarly to the ae, oe, and ue letter combinations. However, other countries overwhelmingly treat these letters the same as their Latin counterparts (a, o, u).
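Both conventions for umlauts can be approximated with translation tables. The following sketch only covers the three lowercase umlauts mentioned above, and the names phone_book_key() and base_letter_key() are made up for illustration:

```python
import unicodedata

# Phone book style: expand each umlaut into a two-letter combination.
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})
# Alternative style: treat each umlaut like its base letter.
BASE_LETTER = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

def phone_book_key(word):
    # NFC makes sure that each umlaut is a single precomposed character.
    return unicodedata.normalize("NFC", word).lower().translate(PHONE_BOOK)

def base_letter_key(word):
    return unicodedata.normalize("NFC", word).lower().translate(BASE_LETTER)

names = ["Muth", "Müller", "Mufti"]

phone_book_order = sorted(names, key=phone_book_key)
print(phone_book_order)  # ['Müller', 'Mufti', 'Muth']

base_letter_order = sorted(names, key=base_letter_key)
print(base_letter_order)  # ['Mufti', 'Müller', 'Muth']
```

Müller sorts as mueller in the first case and as muller in the second, which is enough to move it past Mufti.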

There’s no universally correct way to sort Unicode strings. You need to tell Python which rules to apply to get the desired ordering. So, how do you sort Unicode strings alphabetically in Python?
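One naive way to tell Python about the rules is to spell out the alphabet yourself and rank each letter by its position. The sketch below does this for Polish under simplifying assumptions: it ignores letter case, it places any character outside the 32-letter Polish alphabet (such as q, v, or x) after all known letters, and the polish_key() helper is hypothetical. It's not a replacement for a proper collation library:

```python
import unicodedata

# The 32 letters of the Polish alphabet in their alphabetical order.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {letter: rank for rank, letter in enumerate(POLISH_ALPHABET)}

def polish_key(text):
    """Map each character to its position in the Polish alphabet."""
    normalized = unicodedata.normalize("NFC", text).casefold()
    # Characters missing from the alphabet sort after all known letters.
    return [RANK.get(char, len(RANK) + ord(char)) for char in normalized]

polish_names = ["Zbigniew", "Ludmiła", "Żaneta", "Łukasz"]

print(sorted(polish_names))
# ['Ludmiła', 'Zbigniew', 'Łukasz', 'Żaneta']
print(sorted(polish_names, key=polish_key))
# ['Ludmiła', 'Łukasz', 'Zbigniew', 'Żaneta']
```

A hand-written alphabet like this quickly becomes hard to maintain across languages, which is part of the motivation for a standardized algorithm.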

How to Sort Strings Using the Unicode Collation Algorithm (UCA)

The problem of sorting Unicode strings isn’t unique to Python. It’s a common challenge in any programming language or database. To address it, Unicode Technical Standard #10 (UTS #10) describes the collation of Unicode strings, which is a consistent way of comparing two strings to establish their sorting order.

The Unicode Collation Algorithm (UCA) assigns a hierarchy of numeric weights to each character, allowing the creation of binary sort keys that account for accents and other special cases. These keys are defined at four levels that determine various features of a character:

Primary: The base letter
Secondary: The accents
Tertiary: The letter case
Quaternary: Other features

Later in this tutorial, you’ll learn how to leverage these weight levels to customize the Unicode collation algorithm by, for example, ignoring the letter case in case-insensitive sorting.

Note: This hierarchical nature of character weights allows you to compare sort keys incrementally to increase performance. It also helps conserve memory. For example, some implementations of the UCA take advantage of the trie data structure to efficiently store and retrieve the weights for a given string.
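The idea of weight levels can be imitated with a tuple-based sort key. The following is a toy sketch and not the UCA itself: it doesn't use the real collation weights, it assumes that code points are acceptable stand-ins for the primary and secondary weights, it arbitrarily places lowercase before uppercase at the tertiary level, and both the multilevel_key() helper and its strength parameter are invented for this example:

```python
import unicodedata

def multilevel_key(text, strength=3):
    """Build a (primary, secondary, tertiary) key cut off at strength."""
    primary, secondary, tertiary = [], [], []
    for char in unicodedata.normalize("NFD", text):
        if unicodedata.combining(char) and secondary:
            # Secondary level: accents attached to the previous letter.
            secondary[-1] += (ord(char),)
        else:
            primary.append(char.casefold())    # Primary: the base letter
            secondary.append(())               # No accents seen yet
            tertiary.append(char.isupper())    # Tertiary: the letter case
    # Lower levels only matter when all higher levels are equal.
    return (primary, secondary, tertiary)[:strength]

words = ["Résumé", "resume", "Resume", "résumé", "result"]

full = sorted(words, key=multilevel_key)
print(full)
# ['result', 'resume', 'Resume', 'résumé', 'Résumé']

# Dropping the tertiary level ignores the letter case, so words that
# differ only in case keep their original relative order.
caseless = sorted(words, key=lambda word: multilevel_key(word, strength=2))
print(caseless)
# ['result', 'resume', 'Resume', 'Résumé', 'résumé']
```

Because tuples compare element by element, a difference in base letters always wins over a difference in accents, which in turn wins over a difference in case.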

While the UCA supplies the Default Unicode Collation Element Table (DUCET), you should generally customize this default collation table to the specific needs of a particular language and application. It’s virtually impossible to ensure the desired sort order for all languages using only one character table. Therefore, software libraries implementing the UCA usually rely on the Common Locale Data Repository (CLDR) to provide such customization.

This repository contains several XML documents with language-specific information. For example, the collation rules for sorting text in Polish explain the relationship between the letters z, ź, and ż in both uppercase and lowercase.

Read the full article at https://realpython.com/python-sort-unicode-strings/ »
