UTF-8 bug

carlos · #16 Post by **carlos** » 06 Mar 2019 15:28

I will try to fix it in a new batch utility that is coming soon.
aGerman, can you please help me? How can I determine whether the input buffer passed to MultiByteToWideChar contains incomplete code points?

#17 Post by **aGerman** » 06 Mar 2019 16:51

Carlos

I'll explain the math behind it here in the thread, but I'd rather share the C implementation via PM or e-mail because it's quite off-topic for the forum.

There are 5 rules that help you:
- UTF-8 characters are limited to a length of 4 code units (bytes).
- ASCII characters (7 low bits used, Most Significant Bit always 0) consist of only one code unit.
- All code units of a multi-byte character have the MSB set to 1.
- The first code unit of a multi-byte character has both the MSB and the second highest bit set to 1.
- The subsequent code units of a multi-byte character have the MSB set to 1 but the second highest bit set to 0.

To check a byte you have two possibilities:
1) Use Logical Right Shift (that is, you have to cast the char type to an unsigned value). Shift right by 7 bits to get either 0 (ASCII) or 1 (multi-byte). Shift right by 6 bits to get either 3 (for the first code unit) or 2 (for the subsequent code units) of a multi-byte character.
2) Use bit masks and bitwise AND for the same tests.
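
As a rough illustration of the two possibilities, here is a minimal C sketch. The enum and function names are made up for this example; `check_methods_agree` just confirms that the shift test and the mask test give the same answer for every byte value.

```c
#include <assert.h>

enum utf8_kind { UTF8_ASCII, UTF8_LEAD, UTF8_CONTINUATION };

/* Method 1: logical right shift on an unsigned value. */
static enum utf8_kind classify_by_shift(unsigned char b)
{
    if ((b >> 7) == 0) return UTF8_ASCII;        /* MSB is 0 */
    return (b >> 6) == 3 ? UTF8_LEAD             /* top two bits are 11 */
                         : UTF8_CONTINUATION;    /* top two bits are 10 */
}

/* Method 2: bit masks and bitwise AND. */
static enum utf8_kind classify_by_mask(unsigned char b)
{
    if ((b & 0x80u) == 0x00u) return UTF8_ASCII;
    return (b & 0xC0u) == 0xC0u ? UTF8_LEAD : UTF8_CONTINUATION;
}

/* Checks that both methods agree for every possible byte value. */
static void check_methods_agree(void)
{
    unsigned v;
    for (v = 0; v <= 0xFFu; ++v)
        assert(classify_by_shift((unsigned char)v) ==
               classify_by_mask((unsigned char)v));
}
```

Taking `unsigned char` as the parameter type is what makes the shift a logical one; with a plain (possibly signed) `char`, a byte such as 0xC3 could be sign-extended before shifting.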

The zip file I uploaded in post #7 contains C source code in which I already implemented the test, along with the removal of the BOM. If you have any questions, just get back to me via PM.
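
For readers following along, the five rules could be applied to the "incomplete code point at the end of the buffer" question roughly as in the sketch below. This is not the code from the zip file. The function names `utf8_expected_len` and `utf8_incomplete_tail` are hypothetical, and the mapping from the first byte to the character length (110xxxxx for 2 bytes, 1110xxxx for 3, 11110xxx for 4) is standard UTF-8 but goes slightly beyond the rules listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Expected total length (in bytes) of a UTF-8 character, judged by its
   first byte. Returns 0 for a continuation byte or an invalid lead byte. */
static size_t utf8_expected_len(unsigned char b)
{
    if ((b & 0x80u) == 0x00u) return 1; /* 0xxxxxxx: ASCII */
    if ((b & 0xE0u) == 0xC0u) return 2; /* 110xxxxx */
    if ((b & 0xF0u) == 0xE0u) return 3; /* 1110xxxx */
    if ((b & 0xF8u) == 0xF0u) return 4; /* 11110xxx */
    return 0;                           /* 10xxxxxx or invalid */
}

/* Number of bytes at the end of buf that belong to a character whose
   remaining bytes are missing (0 if the buffer ends on a character
   boundary, or if no lead byte is found within the last 3 bytes). */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back;
    assert(buf != NULL || len == 0);

    /* A character has at most 4 bytes, so an incomplete one has at most 3. */
    for (back = 1; back <= 3 && back <= len; ++back) {
        unsigned char b = buf[len - back];
        if ((b & 0xC0u) == 0x80u)
            continue; /* continuation byte: keep looking back */
        /* b is ASCII or a lead byte: is its character cut off? */
        return utf8_expected_len(b) > back ? back : 0;
    }
    return 0;
}
```

A caller could then pass only `len - utf8_incomplete_tail(buf, len)` bytes to MultiByteToWideChar and move the held-back bytes to the front of the next chunk. The sketch only looks at the tail; it does not validate the rest of the buffer.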

Steffen

jfl · #18 Post by **jfl** » 12 Mar 2019 03:33

Arriving a bit late in this thread, but hopefully with a few interesting links:

The console always works in UCS-2 mode, whatever the code page you're using.
That is, it records 16-bit Unicode version 1 characters in each cell. ASCII or non-ASCII is irrelevant: ASCII is just the first 128 Unicode characters.
But with just 16 bits, it cannot record or display the characters above U+FFFF, which need 17 to 21 bits and were defined in subsequent Unicode versions.
Microsoft is well aware of that, and is currently redesigning the console to resolve this issue. They've published a very interesting document about their current work on this subject here:
https://blogs.msdn.microsoft.com/comman ... xt-buffer/
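
To make the 16-bit cell limitation concrete, here is a small C sketch of standard UTF-16 encoding (the function name `utf16_encode` is made up for this example, and it says nothing about how the console itself is implemented). Any code point above U+FFFF comes out as two 16-bit code units, a surrogate pair, so it cannot fit into a single 16-bit cell.

```c
#include <assert.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-16. Writes 1 or 2 code units
   to out and returns how many were written. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Precondition: a valid code point that is not itself a surrogate. */
    assert(cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu));

    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;                      /* fits in 16 bits */
        return 1;
    }
    cp -= 0x10000u;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu));   /* low surrogate */
    return 2;
}
```

For example, U+20AC (the euro sign) needs one code unit, while U+1F600 needs the pair 0xD83D 0xDE00.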

They're also aware of the BOM issue, and so hopefully this should be fixed when they deliver the redesigned console (as explained in their blog post above) later this year (?).

The problem with the spurious characters appearing in the middle of the text is different. I think it's completely independent of the above console limitations, and is instead a bug in the UTF-8 to Unicode conversion code in the console output handler.
The good news is that the console team, unlike their cmd.exe colleagues, has a GitHub site for managing issues in their code. I've opened a new bug there, referencing this thread:
https://github.com/Microsoft/console/issues/386

carlos · #19 Post by **carlos** » 12 Mar 2019 04:56

Thanks, aGerman. I'm currently working with Jason Hood on a new utility that includes a feature to solve this bug.
The feature is already done.
We are working on the other features and revisions before I publish it.
I hope to publish the utility in two weeks.

#20 Post by **aGerman** » 12 Mar 2019 05:37

Very interesting

I wasn't aware that Microsoft tracks the console issues on GitHub. Thanks for reporting this bug, Jean-François. I'll contribute to the discussion when I'm back home.

Steffen

DosTips.com
