Improvements to my tools supporting multiple encodings

#1 Post by **jfl** » 02 Jun 2021 15:23

Hello,

I'm still trying to improve the automatic support for multiple text file encodings in my System Tools library.
Two things need improvement now:

The new Windows Terminal can now display all Unicode characters, including those beyond the 16-bit plane 0.
I thought that my method of using UTF-16 for output to the console would automatically support that... But it does not.
On the other hand, when (and only when) the current code page is 65001, writing UTF-8 text in binary mode works, even with emoticons in plane 1, like "👍" = "\U0001F44D" = "\xF0\x9F\x91\x8D".
If anybody has ideas on why UTF-16 does not work (Ex: "👍" = "\U0001F44D" = "\uD83D\uDC4D"), I'm interested!
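
For reference, the two byte sequences quoted above can be reproduced with a few lines of Python (not the language of the System Tools library, just a convenient calculator here). This is a minimal sketch that derives the UTF-16 surrogate pair by hand and prints it next to the UTF-8 form. The helper name to_surrogates is made up for this illustration.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond plane 0 need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


thumbs_up = "\U0001F44D"
utf8_bytes = thumbs_up.encode("utf-8")
utf16_units = to_surrogates(ord(thumbs_up))

print(utf8_bytes.hex(" ").upper())                 # F0 9F 91 8D
print(" ".join(f"{u:04X}" for u in utf16_units))   # D83D DC4D
```

Both forms describe the same code point, so the difference in console behavior has to come from how the output path handles them, not from the data itself.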

I already had support for files both in the Windows system encoding (Ex: Code page 1252 in US versions of Windows), and in the UTF-8 encoding.
I'm now adding code for distinguishing between the above two, and also ASCII, UTF-16, UTF-32, and binary files.
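
To make the categories concrete, here is a rough Python sketch of one possible decision order. It is only an illustration under simple assumptions (BOM first, then NUL bytes, then pure ASCII, then strict UTF-8), not how encoding.exe or MsvcLibX actually decides. The function name guess_encoding and the "UTF-32" label are assumptions; the other labels follow the tool's output.

```python
import codecs


def guess_encoding(data):
    """Classify a byte buffer with a few cheap heuristics (illustrative only)."""
    # UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with FF FE.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if b"\x00" in data:
        return "Binary"        # NUL bytes without a BOM: assume not text
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        data.decode("utf-8")   # strict validation
        return "UTF-8"
    except UnicodeDecodeError:
        return "Windows"       # fall back to the system code page


print(guess_encoding(b"first\r\n"))
print(guess_encoding("clé".encode("utf-8")))
print(guess_encoding("clé".encode("cp1252")))
```

The last step is the weak one: any text that is neither ASCII nor valid UTF-8 ends up as "Windows", whatever code page it really uses.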

For testing this file encoding detection, I wrote a new tool called encoding.exe, that displays the encoding it detects for the files we specify.
I've tested it on my own system (A US version of Windows, despite my living in France) with good results.
I'd appreciate it if other forum users could give it a try on their own text files, and tell me if the results are correct.
Especially if they have non-US versions of Windows! And even more so if it's a non-Latin script!

Encoding.exe will eventually be included in future releases of my System Tools library.
For now, I've put a beta version here: http://jf.larvoire.free.fr/progs/encoding.exe

Usage:

encoding [OPTIONS] [PATHNAME [...]]

It also supports wildcards. Ex:

Code:

C:\JFL\Temp>encoding t*.txt
UTF-8   t.txt
ASCII   t0.txt
ASCII   t1.txt
UTF-16  t16.txt
UTF-8   t2.txt
ASCII   t20.txt
UTF-8   t3.txt
Windows t4.txt
ASCII   t5.txt
ASCII   t6.txt
ASCII   t7.txt
ASCII   t8.txt
ASCII   tab.txt
ASCII   tab2.txt
ASCII   tabs.txt
UTF-8   tb.bat   .txt
ASCII   temp.txt
Windows test.txt
Windows test1.txt
UTF-16  test16.txt
ASCII   test2.txt
UTF-8   test8.txt
UTF-8   test_u8.txt
ASCII   tW.txt

C:\JFL\Temp>dump t1.txt

Offset    00           04           08           0C           0   4    8   C
--------  -----------  -----------  -----------  -----------  -------- --------
00000000  66 69 72 73  74 0D 0A                               first

C:\JFL\Temp>dump t2.txt 0 40

Offset    00           04           08           0C           0   4    8   C
--------  -----------  -----------  -----------  -----------  -------- --------
00000000  4C 65 20 70  61 73 73 61  67 65 20 64  65 20 6C 61  Le passa ge de la
00000010  20 53 61 76  6F 79 61 72  64 65 0D 0A  6C 61 20 6C   Savoyar de  la l
00000020  61 0D 0A 0D  0A 43 27 65  73 74 20 6C  65 20 70 6F  a    C'e st le po
00000030  69 6E 74 20  63 6C C3 A9  20 64 65 20  6C 61 20 74  int cl��  de la t

C:\JFL\Temp>dump t16.txt

Offset    00           04           08           0C           0   4    8   C
--------  -----------  -----------  -----------  -----------  -------- --------
00000000  FF FE 41 00  3D D8 09 DE  42 00                     ��A =� � B

C:\JFL\Temp>
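
As a cross-check of the last dump, the ten bytes of t16.txt can be decoded in Python. This is just a sketch for reading the hex; the bytes are copied from the dump above.

```python
raw = bytes.fromhex("FF FE 41 00 3D D8 09 DE 42 00")

# The "utf-16" codec reads the BOM (FF FE = little-endian) and drops it.
text = raw.decode("utf-16")

for ch in text:
    print(f"U+{ord(ch):04X}")
```

The file therefore holds three characters: "A", one plane 1 character stored as the surrogate pair D83D DE09, and "B".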

Thanks for any feedback,

Jean-François

#2 Post by **aGerman** » 03 Jun 2021 05:16

Jean-François,

Meanwhile the Windows console is almost ready for emojis. However, right now they are still not supported. Related issue:
https://github.com/microsoft/terminal/issues/190

You will likely have luck using the new Windows Terminal. Microsoft is about to give you the opportunity to enable it as the default console host. Development is still ongoing though. See the first bullet point in the 1.9 pre-release notes.
https://github.com/microsoft/terminal/releases

Steffen

#3 Post by **aGerman** » 03 Jun 2021 10:32

I tried encoding.exe. My default ANSI CP is 1252 and the OEM CP is 850.

"a.txt" Windows-1251 encoded file

Code:

Выбрать из массива элемент, называемый опорным. Это может быть любой из элементов массива. От выбора опорного элемента не зависит корректность алгоритма, но в отдельных случаях может сильно зависеть его эффективность (см. ниже).

"b.txt" CP936 encoded

Code:

挑选基准值：從數列中挑出一個元素，稱為「基準」

"c.txt" CP850 encoded

Code:

Zunächst wird die zu sortierende Liste in zwei Teillisten ("linke" und "rechte" Teilliste) getrennt. Dazu wählt Quicksort ein sogenanntes Pivotelement aus der Liste aus.

Script:

Code:

@echo off &setlocal
encoding -com -v "a.txt"
echo(
encoding -libx -v "a.txt"
echo(
encoding -com -v "b.txt"
echo(
encoding -libx -v "b.txt"
echo(
encoding -com -v "c.txt"
echo(
encoding -libx -v "c.txt"
echo(
pause

Output:

Code:

Read 228 input bytes.
Windows' IMultiLanguage2 COM API detected CP: 1251
CP1251  a.txt

Read 228 input bytes.
MsvcLibX detected input type: Unrecognized encoding, possibly binary
Binary  a.txt

Read 46 input bytes.
Windows' IMultiLanguage2 COM API detected CP: 936
CP936   b.txt

Read 46 input bytes.
MsvcLibX detected input type: Windows system code page 1252
Windows b.txt

Read 169 input bytes.
Windows' IMultiLanguage2 COM API detected CP: 936
CP936   c.txt

Read 169 input bytes.
MsvcLibX detected input type: Windows system code page 1252
Windows c.txt

Drücken Sie eine beliebige Taste . . .

Conclusion: The COM interface works surprisingly well. However, it's still nothing you should ever rely on, as the last example proves (c.txt is CP850 but was reported as CP936).
(FWIW I have no doubt that it is able to detect UTF-8 and UTF-16 correctly. That's the reason why I didn't even try it.)
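
Part of the difficulty is that single-byte code pages cannot be validated the way UTF-8 can: almost every byte value is a legal character in each of them. The short Python sketch below illustrates this with one word from the German sample. It only shows the ambiguity and does not reproduce what IMultiLanguage2 or MsvcLibX do internally.

```python
word = "wählt"                          # taken from the German sample text
oem_bytes = word.encode("cp850")        # in CP850, "ä" is the single byte 0x84
misread = oem_bytes.decode("cp1252")    # 0x84 is also a valid CP1252 character

print(oem_bytes.hex(" ").upper())       # 77 84 68 6C 74
print(misread)                          # w„hlt
```

Decoding succeeds in both code pages, so a detector can only guess from how plausible the resulting text looks.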

Steffen

#4 Post by **aGerman** » 06 Jun 2021 06:02

Jean-François,

I spent some hours figuring out how to improve the encoding detection. The ability of IMultiLanguage2::DetectInputCodepage to distinguish between different SBCS and DBCS is quite promising. Attached you'll find the compiled tool, source code and test files. I guess distinguishing between ANSI codepages and the related OEM codepages will still be quite error-prone. E.g. CP 850 is recognized as 1252 even if I give a hint to prefer 850 (while 866 is recognized as 866 rather than 1251).
Maybe you can make some use of it. I'm still of two minds whether encoding detection actually makes sense...

Steffen

(Note: I copied text for the test files from Wikipedia. Adding the license to the file content would have made them useless for the tests. Link to the license text: https://en.wikipedia.org/wiki/Wikipedia ... ed_License)

#5 Post by **aGerman** » 20 Jun 2021 05:00

Jean-François,

Not sure if you're even interested anymore.
I refactored the code a little. Also a couple more test files are now in the updated zip archive above.
UTF-32, UTF-16 (1), and UTF-7 detection is improved while UTF-8 detection is weakened (2). Furthermore, I take into account that we sometimes know the origin of the text in order to define a preference (3).
It's likely that I reinvented the wheel at some point. I intentionally didn't look into your code for the evaluation of the data. Hopefully there is some new idea you can reuse.

Steffen

(1) The IsTextUnicode function can cause a well-known bug, called "Bush hid the facts". Fortunately, working around it is not too difficult.
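
For readers who don't know the bug: an even-length run of plain ASCII bytes is also a formally valid UTF-16LE string, and statistical tests can be fooled into preferring that reading. The Python sketch below shows the misreading and one possible guard. The guard function plausible_ascii is a hypothetical example, not necessarily the workaround used in the attached tool.

```python
data = b"Bush hid the facts"            # 18 ASCII bytes

# Read as UTF-16LE, each byte pair becomes one 16-bit code unit.
as_utf16 = data.decode("utf-16-le")
print([f"U+{ord(c):04X}" for c in as_utf16])


def plausible_ascii(buf):
    """Hypothetical guard: only tab, LF, CR and printable ASCII are present."""
    return all(b in (9, 10, 13) or 0x20 <= b < 0x7F for b in buf)


# Checking this before any UTF-16 heuristic avoids the false positive.
print(plausible_ascii(data))
```

All nine resulting code units fall in the CJK Unified Ideographs block, which is why the text shows up as Chinese characters when misdetected.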

(2) In the original code, UTF-8 was validated. However, we should rather report UTF-8 for text that looks like UTF-8 because it's certainly meant to be UTF-8 in this case. Rejecting it due to an invalid sequence would lead to trying other encodings, which would cause a wrong result. Validating is a task for renderers and converters, which flag an invalid sequence with replacement characters. That's not the purpose of this tool though.
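
The difference between "validates as UTF-8" and "looks like UTF-8" can be sketched in Python as follows. This is an assumption-laden illustration, not the code from the zip archive: the function names and the 10% tolerance are invented, and the scan only checks lead/continuation byte structure (it does not reject overlong forms or surrogates).

```python
def utf8_sequence_stats(data):
    """Return (well-formed multi-byte sequences, stray non-ASCII bytes)."""
    good = bad = 0
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # plain ASCII, not counted either way
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1                 # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2                 # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3                 # lead byte of a 4-byte sequence
        else:
            need = 0                 # stray continuation or invalid lead byte
        tail = data[i + 1:i + 1 + need]
        if need and len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            good += 1
            i += 1 + need
        else:
            bad += 1
            i += 1
    return good, bad


def looks_like_utf8(data, max_bad_ratio=0.1):
    """Lenient test: mostly well-formed sequences, a few stray bytes tolerated."""
    good, bad = utf8_sequence_stats(data)
    return good > 0 and bad <= good * max_bad_ratio


damaged = ("clé " * 20).encode("utf-8") + b"\xff"   # one invalid byte appended
print(looks_like_utf8(damaged))
print(looks_like_utf8("clé très".encode("cp1252")))
```

A strict decoder rejects the damaged buffer because of the single 0xFF byte, while the lenient test still reports it as UTF-8, which matches the intent described in (2).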

(3) DetectInputCodepage does quite a good job with the separation of different single and double byte charsets. However, some encodings are just too similar in the manner the bytes are distributed. And even if this method supports passing flags and a preferred codepage, it doesn't seem to make any use of such information. Based on the tests I made, I decided to write another helper function which retrospectively checks if the preferred codepage fits the codepage reported by DetectInputCodepage. However, I'm sure the list of relations is still incomplete and needs further investigation.
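
The idea of such a retrospective check can be sketched in Python like this. Everything here is hypothetical: the names RELATED_CODEPAGES and reconcile are invented, and the table holds only the two ANSI/OEM pairs mentioned in this thread plus 437 as an assumed extra entry. It is not the relation list from the attached source.

```python
# ANSI code page -> OEM code pages commonly paired with it (incomplete on purpose).
RELATED_CODEPAGES = {
    1252: {850, 437},   # Western European; 437 is an assumption added here
    1251: {866},        # Cyrillic
}


def related(a, b):
    """True if the two code pages are listed as an ANSI/OEM pair."""
    return b in RELATED_CODEPAGES.get(a, ()) or a in RELATED_CODEPAGES.get(b, ())


def reconcile(detected, preferred=None):
    """Trust the caller's preference only when it is a sibling of the detection."""
    if preferred is None or preferred == detected:
        return detected
    return preferred if related(detected, preferred) else detected


print(reconcile(1252, preferred=850))   # detector said 1252, hint says 850
print(reconcile(936, preferred=850))    # unrelated hint is ignored
```

The design choice is that the hint never overrides an unrelated result, it only breaks ties between code pages that the detector cannot tell apart reliably.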

// EDIT: I made a mess with the codepage IDs. LE codepage IDs have to end with 0 and BE codepages with 1. Corrected now ...
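
For reference, the Windows code page identifiers in question are 1200/1201 for UTF-16 LE/BE and 12000/12001 for UTF-32 LE/BE. A small Python sketch mapping a byte order mark to these IDs follows; the table and function names are invented for the example.

```python
import codecs

# Longest BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
BOM_TO_CODEPAGE = [
    (codecs.BOM_UTF32_LE, 12000),
    (codecs.BOM_UTF32_BE, 12001),
    (codecs.BOM_UTF8, 65001),
    (codecs.BOM_UTF16_LE, 1200),
    (codecs.BOM_UTF16_BE, 1201),
]


def codepage_from_bom(data):
    """Return the Windows code page ID implied by a leading BOM, or None."""
    for bom, codepage in BOM_TO_CODEPAGE:
        if data.startswith(bom):
            return codepage
    return None


print(codepage_from_bom(bytes.fromhex("FF FE 41 00")))   # UTF-16LE sample
print(codepage_from_bom(b"no BOM here"))
```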

#6 Post by **jfl** » 20 Jun 2021 11:55

Hi Steffen,

Yes, I'm definitely still interested, thanks for sharing your work like this.
It's just that I have had less time for working on this lately, as it's the high season for one of my other passions. (Here's what I was doing on Wednesday instead of working.)
I'll resume serious work on these encoding issues in a few weeks.

About the IMultiLanguage2 API: My own tests had given disappointing results.
As this seems to be working well for you, I'll have to double-check my code, to see if I might have made a mistake somewhere.

Regards,

Jean-François

#7 Post by **aGerman** » 20 Jun 2021 12:42

Oh, I know the landscape is so beautiful where you live. And I envy you for your paragliding passion. These days I'm trapped within my own four walls due to heavy hay fever. Probably I just forgot that there's still something like "outside".

So, yeah, definitely forget about programming and rather enjoy the summer in your free time!

jfl wrote: "As this seems to be working well for you, I'll have to double-check my code, to see if I might have made a mistake somewhere."

Use DetectInputCodepage only for SBCS and DBCS. It isn't applicable to UTF-32 and UTF-16. And it will even fail to detect UTF-8 and UTF-7. So, you need more than just DetectInputCodepage for this task.

Steffen

DosTips.com