How does the sorting work in a table?

stefano.menci · March 22, 2018, 3:43pm

How does the tables.order_by("field", ascending=True) work?
I was expecting an alphabetical sorting, but I was wrong. It looks like it strips out all the non alphanumeric characters and is case insensitive. Am I correct?

This is a list of values sorted by the sorted() function:

ABC  DEF 12
ABC  DEF 32
ABC (DEF) 14
ABC (DEF) 34
ABC DEF 11
ABC DEF 31
ABC [DEF] 13
ABC [DEF] 33
ABC.DEF 15
ABC.DEF 35
abc def 21

This is the same list stored in a table and sorted by table.search(tables.order_by("field", ascending=True)):

ABC DEF 11
ABC  DEF 12
ABC [DEF] 13
ABC (DEF) 14
ABC.DEF 15
abc def 21
ABC DEF 31
ABC  DEF 32
ABC [DEF] 33
ABC (DEF) 34
ABC.DEF 35

navigate · March 22, 2018, 5:15pm

I think you may be correct about it being case sensitive.

Assuming you want to use the table in an app someplace, consider using something like

values = app_tables.search(some cirteria)
values = sorted(values, key =lambda x: x.lower().replace(‘[’,‘’))

stefano.menci · March 22, 2018, 9:00pm

My approach is similar, I am just converting more characters:

ORDER_BY_TABLE = {ord(c): None for c in r' !@#$%^&*()_+{}|[]\\`~,<.>/?;:\'"'}
def order_by(txt):
    return txt.lower().translate(ORDER_BY_TABLE)
values = sorted(values, key=lambda x: order_by(x))

The problem is that in the tests made so far my approach works, but without documentation I don’t know if I am doing the right thing or if I’m just lucky. The only way to know it is the hard way: wait for the next crash and fix it.

meredydd · March 27, 2018, 4:29pm

Yep - just to confirm, we use PostgreSQL’s default locale-aware string sorting, which is case-insensitive and only uses spaces and punctuation to break ties - it’s what you’d get if you did an ORDER BY query on a string column.

(I’m open to a feature request for a flag to enable strict “C locale” ordering like Python does. Something like app_tables.foo.search(tables.order_by('foo'), tables.LITERAL_STRING_SORT, ...))

stefano.menci · March 28, 2018, 5:02pm

Thanks for the clarification.

Sorting with localization in Python requires PyICU or other dependencies.
Right now I’m dealing with simple texts and I can force it to be in the ASCII range, so I will try to live with my home made approach and with the current sorting of Anvil.

But the home made approach described above fails because it doesn’t break ties and because the order of PostgreSQL is different from Python even in the ASCII range.

Here is something more reliable:

REMOVE_PUNCTUATION = {ord(c): None for c in r' !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'}
PUNCTUATION_ORDER = str.maketrans(
    r"""`^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ""",
    r""" !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~""")

def order_by(txt):
    try:
        return txt.lower().translate(REMOVE_PUNCTUATION), txt.translate(PUNCTUATION_ORDER)
    except AttributeError:
        return txt  # txt is not a string

And here is how to sort a Python list of dictionaries and an Anvil table over one or more columns and get the same result:

columns = ['col1', 'col2', 'col3']

excel_key = lambda row: [order_by(row[col]) for col in columns]
anvil_key = tuple(tables.order_by(col, ascending=True) for col in columns)

rows_in_excel = sorted(rows_in_excel, key=excel_key)
rows_in_anvil = anvil_table.search(*anvil_key)

p.colbert · April 9, 2018, 10:20pm

Opinion: each sort order provided should be provided in both places: the database, and the programming language. This allows order-dependent operations to move seamlessly between database and programming language, as the need arises, with no confusion, and no additional code.

The principle of least surprise says that existing Python operators <, >, ==, !=,. etc. should agree with the default ordering. This does not preclude having other, database-provided orderings. They would be provided to Python code via equivalent Anvil-supplied string-comparison functions.

If I understand correctly, Unicode is the Web’s preferred character set, and UTF-8 its preferred encoding. Therefore, I personally would be least surprised if the ordering was consistent with UTF-8.

From what I can tell, it is also the ordering with the least overhead, as there should no indirection (lookup tables) involved, but a direct C memcmp() against the unaltered UTF-8 byte streams.

But am I missing something?

stefano.menci · April 9, 2018, 10:33pm

I absolutely agree!

But after spending some time figuring out the function described above, I realized that is not that simple. For example PostgreSQL doesn’t just sort by whatever character set / encoding you decide to pick. It first removes spaces and other stuff, then orders, then breaks ties with whatever had been removed. Well, kinda. That’s why my order_by function returns a tuple, not just one string.

I have never used PostgreSQL, but I don’t think we are the first ones to have this problem. I hope that there is already a library out there that does the job that I tried to do and it does it better. While I wait for the Anvil staff to show us how to use that library (or to make it if doesn’t exist), I will try to live in the ASCII range.