
"TermStrWidth" utility – measure the number of cells a string occupies in the CLI.

Posted: 06 Aug 2024 16:09
by aGerman
What?
This little tool measures the display width of strings in the Windows Terminal / Console.
It is basically a copy of the recently revised code for this purpose in Microsoft's open source repository. All credits go to their Terminal team.
I used this code in order to actually get the same results as the internally performed measurements. For more information refer to the comments in the C source file which is part of the attached zip archive.
Note that at the time of writing, the new algorithm has only been merged into the Terminal/Console code and is not yet part of any official release (apart from recent Canary builds of the Terminal). So please be patient if not all of your test cases already give the right results. I might be a little premature with this utility.

When?
Use it in UTF-8 encoded scripts if you want to tabulate or center text and need to know upfront how much space a string is going to occupy in the window. As I don't know whether new console features will be back-ported to Windows 10, I assume only Windows 11 will get full support. Likewise, prefer the Windows Terminal over the legacy console window.

How?
Just pass the string to measure as argument to the tool:

Code: Select all

TermStrWidth.exe "Test ✅"
If several strings are passed, a list of widths is written, one value per line.
Use a FOR /F loop to process the output.

Code: Select all

for /f %%i in ('TermStrWidth.exe "Test ✅"') do echo width: %%i
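
The measured widths can also be consumed from other languages. As a rough sketch in Python, assuming TermStrWidth.exe is on the PATH and writes one width per line as described above, padding and centering could look like this. The helper names measure_with_tool, pad_right and center are made up for this example and are not part of the tool.

```python
import subprocess


def measure_with_tool(strings, exe="TermStrWidth.exe"):
    """Return the widths reported by the tool, one per string.

    Assumption: the executable is on the PATH and prints one
    integer per line, in the order the strings were passed.
    """
    result = subprocess.run(
        [exe, *strings], capture_output=True, text=True, check=True
    )
    return [int(value) for value in result.stdout.split()]


def pad_right(text, width, target):
    """Append spaces so that text of the measured width fills target cells."""
    return text + " " * max(0, target - width)


def center(text, width, target):
    """Surround text of the measured width with spaces to fill target cells."""
    free = max(0, target - width)
    left = free // 2
    return " " * left + text + " " * (free - left)
```

For a left-aligned column, one would measure all strings in a single call, take the largest width as the target, and pass each string with its own width to pad_right. The padding helpers deliberately take the width as a parameter instead of calling len(), because the number of code points is not the number of cells.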

Why?
Most of the people I see here in the forum, myself included, are used to writing Latin or Cyrillic text. For historical reasons we often still use a single-byte charset and thoughtlessly assume that
1 character
== 1 byte
== 1 code point
== 1 glyph or grapheme cluster
== 1 column for the printed glyph or terminal cluster
However, this does not hold for half of the people in the world. Think about Chinese, Japanese, Korean, Devanagari, or Arabic.
Now that most text editors default to UTF-8 and we are used to including all kinds of symbols and emoji in our text, the aforementioned assumption no longer makes sense for the other half either.
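
To see how quickly the chain of equalities falls apart, here is a small Python illustration (Python is used only for demonstration, the tool itself is written in C). It counts the units of a single emoji sequence, the person raising a hand with light skin tone and male sign, which is drawn as one glyph by a terminal that joins it:

```python
# U+1F64B U+1F3FB U+200D U+2642 U+FE0F: a single emoji on screen
s = "\U0001F64B\U0001F3FB\u200D\u2642\uFE0F"

utf8_bytes = len(s.encode("utf-8"))            # bytes in a UTF-8 encoded script
utf16_units = len(s.encode("utf-16-le")) // 2  # wchar_t units on Windows
code_points = len(s)                           # Unicode code points

print(utf8_bytes, utf16_units, code_points)    # prints: 17 7 5
```

None of the three numbers says anything about the number of columns the string occupies.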
In the *nix world, where UTF-8 has been a quasi-standard since forever, this has been taken into account for a long time. The C functions wcwidth() and wcswidth() were introduced with the POSIX.1-2001 standard to measure the display width. Although this is far better than what we have had on Windows so far, those functions only try to measure the width one code point at a time, which is clearly insufficient.
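
Why a per-code-point sum is insufficient can be shown with a few lines of Python. The function below is only a stand-in written for this illustration, roughly in the spirit of wcswidth(). It is not a reimplementation of any libc, and the rules it applies (0 cells for combining marks and format characters, 2 cells for East Asian Wide or Fullwidth, 1 cell otherwise) are simplifying assumptions.

```python
import unicodedata


def naive_width(text):
    """Sum an estimated width for each code point in isolation."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks and format characters: 0 cells
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2  # wide or fullwidth: 2 cells
        else:
            total += 1  # everything else: 1 cell
    return total


emoji = "\U0001F64B\U0001F3FB\u200D\u2642\uFE0F"
print(naive_width("abc"), naive_width("\u4E2D"), naive_width(emoji))
```

This works for plain Latin text and for a single CJK ideograph. For the emoji sequence, however, the skin tone modifier and the male sign are counted in addition to the base emoji, so the sum is larger than the space taken by the single glyph that a joining terminal draws.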
The Windows Terminal now contains an algorithm that takes the string context into account. This tool contains the same algorithm and preprocessed data tables. I doubt it's perfect. But for lack of any standardization we can't even evaluate how close to perfection it actually is. At least a proposal has already been submitted to the UTC (Unicode Technical Committee, see https://www.unicode.org/L2/L2023/23107- ... -suppt.pdf) on how this should ideally be implemented.
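
What "taking the string context into account" means can be hinted at with a deliberately crude Python sketch that groups code points into clusters before anything is measured. This is NOT the Terminal's algorithm and not a conforming grapheme cluster segmentation. The joining rules below (combining marks, ZWJ, skin tone modifiers, and a Devanagari-only virama rule) are assumptions picked just to cover a few sample strings.

```python
import unicodedata

ZWJ = "\u200D"
VIRAMA = "\u094D"  # Devanagari only; other scripts are ignored in this sketch


def crude_clusters(text):
    """Group code points that are expected to be drawn together."""
    clusters = []
    for ch in text:
        prev = clusters[-1][-1] if clusters else ""
        category = unicodedata.category(ch)
        joins = (
            category in ("Mn", "Mc", "Me")            # combining marks, variation selectors
            or ch == ZWJ                              # joiner attaches to what precedes it
            or "\U0001F3FB" <= ch <= "\U0001F3FF"     # emoji skin tone modifiers
            or prev == ZWJ                            # whatever follows a joiner
            or (prev == VIRAMA and category == "Lo")  # consonant after a virama
        )
        if clusters and joins:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"
emoji = "\U0001F64B\U0001F3FB\u200D\u2642\uFE0F"
print(len(crude_clusters(hindi)), len(crude_clusters(emoji)))  # prints: 2 1
```

Only after such a grouping does it make sense to assign a width, and then per cluster rather than per code point. The real code relies on preprocessed data tables for both steps.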

Steffen

Test scripts included.
TermStrWidth_v1.1.zip

EDIT: I also created a repo on GitHub to provide a C interface for people that are interested. The released tool over there doesn't contain the GCC hack to shrink the binary though.
https://github.com/german-one/wtswidth- ... ring-width

Re: "TermStrWidth" utility – measure the number of cells a string occupies in the CLI.

Posted: 07 Aug 2024 00:13
by miskox
Thank you Steffen. Very good read. It is time (long overdue?) for this change to happen.

Saso

Re: "TermStrWidth" utility – measure the number of cells a string occupies in the CLI.

Posted: 10 Aug 2024 08:46
by aGerman
Thanks Saso!

As this is about predicting the appearance of strings, I should have provided some examples and screenshots. So the attached zip now contains a little script on which the pictures below are based.
Tested on Win 11. The font is Cascadia Code.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*** New behavior: ***
Recent Canary builds of the Terminal already contain the updated code (version 1.22 is where I ultimately expect to find it in 2 or 3 months):
term_canary.png


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*** Old behavior: ***
Right now Terminal v.1.20 is still the stable build. Look at where the closing quotes show up. As you can see the internal width measurement is still wrong in this version (Devanagari and the man raising his hand in particular):
term_1.20.png


The console still has some more quirks. Joining the male symbol to the gender-neutral person emoji doesn't work here:
console.png
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Some explanations about what certain strings contain. First column is UTF-8 bytes in HEX.

Code: Select all

6 code points in हिन्दी:
E0A4B9 | U+0939 | Devanagari Letter Ha
E0A4BF | U+093F | Devanagari Vowel Sign I
E0A4A8 | U+0928 | Devanagari Letter Na
E0A58D | U+094D | Devanagari Sign Virama
E0A4A6 | U+0926 | Devanagari Letter Da
E0A580 | U+0940 | Devanagari Vowel Sign Ii

5 code points in 🙋🏻‍♂️:
F09F998B | U+1F64B | Happy Person Raising One Hand
F09F8FBB | U+1F3FB | Emoji Modifier Fitzpatrick Type-1-2 (a.k.a. Light Skin Tone)
E2808D   | U+200D  | Zero Width Joiner (ZWJ)
E29982   | U+2642  | Male Sign
EFB88F   | U+FE0F  | Variation Selector-16

2 code points in ❤️:
E29DA4 | U+2764 | Heavy Black Heart
EFB88F | U+FE0F | Variation Selector-16
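
Listings like the ones above can be generated for any string. A minimal Python sketch using the standard unicodedata module follows; the function name describe is made up for this example, and the character names come out in upper case as stored in the Unicode database.

```python
import unicodedata


def describe(text):
    """Return (UTF-8 bytes in hex, code point, name) for each code point."""
    rows = []
    for ch in text:
        rows.append((
            ch.encode("utf-8").hex().upper(),
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
        ))
    return rows


for row in describe("\u2764\uFE0F"):
    print(" | ".join(row))
# E29DA4 | U+2764 | HEAVY BLACK HEART
# EFB88F | U+FE0F | VARIATION SELECTOR-16
```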
Steffen

(test script now included in the zip file of the first post)