UTF-8 bug

#1 Post by **aGerman** » 04 Mar 2019 12:22

I'm not sure whether this is an old bug; I can only test on Win10. I tried on my computers at home and at work, and both show it.
It appears when you TYPE a text file in a console window whose code page has been changed to 65001 (UTF-8).

Attachment: utf8_bug.png (6.86 KiB)

The BOM has not been removed but shows up as a blank. Furthermore, the code units are not interpreted correctly, so garbage appears at several positions in the output, even though the file consists entirely of lines containing the sequence ü€ five times each. Test files:

Attachment: bug_test.zip (422 Bytes)

I managed to work around this bug in my own console programs. Unfortunately I can't change the behavior of the Windows commands. They don't respect the boundaries of multi-byte characters.

Steffen

#2 Post by **penpen** » 04 Mar 2019 17:14

aGerman wrote: ↑
04 Mar 2019 12:22
The BOM has not been removed but shows up as blank.

That part is not a bug. Since Unicode 2.0 (I'm unsure, it could also be v1.1) the UTF-8 BOM is the character "ZERO WIDTH NO-BREAK SPACE".
When using raster fonts, a non-zero-width no-break space is allowed, although it is recommended to do otherwise.

penpen

#3 Post by **aGerman** » 04 Mar 2019 17:52

Okay, that might not be a bug in terms of the specification, but it clearly doesn't meet my expectations of a good implementation.

However, the malformed output should have been avoided. It's pretty straightforward to check whether a UTF-8 code unit belongs to a multi-byte character. There is no good reason to interrupt the output before such a character has been written to the console buffer in its entirety.
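To illustrate how simple that check is, here is a minimal Python sketch (not taken from any Windows component; the function name `classify` is made up for this example). The top bits of each byte tell whether it is a standalone ASCII byte, the lead byte of a 2-, 3- or 4-byte sequence, or a continuation byte. Invalid lead bytes such as 0xC0, 0xC1 or 0xF8 and above are not singled out here.

```python
def classify(byte: int) -> str:
    """Classify one UTF-8 code unit by its leading bits (simplified)."""
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx: complete single-byte character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: belongs to a preceding lead byte
    if byte < 0xE0:
        return "lead-2"        # 110xxxxx: starts a 2-byte character
    if byte < 0xF0:
        return "lead-3"        # 1110xxxx: starts a 3-byte character
    return "lead-4"            # 11110xxx: starts a 4-byte character


# The test sequence from the first post: ü is 2 bytes, € is 3 bytes.
for b in "ü€".encode("utf-8"):
    print(f"{b:02X} {classify(b)}")
```

A reader that sees a lead byte near the end of its buffer therefore knows exactly how many continuation bytes are still missing.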

Steffen

#4 Post by **penpen** » 04 Mar 2019 19:05

After viewing the output, it might just be a console display error. The additional characters all look like BOM values, probably indicating how many unknown character parts were found:
The first fragment, at the end of the type command's read buffer, may contain multiple bytes, while the following ones are single bytes (which is why, for example, the character '€' sometimes seems to be replaced by two and sometimes by three BOMs).
I guess the stream object itself handles it that way to return to a valid stream state, but the console didn't know what happened and probably removed the malformed UTF-8 code points (or parts of them).

I tested that theory changing the batch to the following:

Code:

@echo off &setlocal
chcp 65001
>"bug_output.txt" type "bug_test.txt"
chcp 850

I repeated it multiple times, and there were never any errors in the resulting file.
(Although I know that is no final proof.)

penpen

#5 Post by **aGerman** » 05 Mar 2019 02:50

No, there are no BOM characters in the middle of the text. Those are just converted UTF-8 characters that were read incompletely. (I'm able to reproduce this behavior in C.) And yes, there is no problem if you redirect to a file, because in that case no conversion takes place, so it doesn't matter when a missing code unit is written as long as all code units have been written in the end.

I have a pretty good idea of what happens. It might get a little off topic, but I guess it's helpful for understanding.

First of all, as you know the console window buffer consists of character cells. Their content is specified in a CHAR_INFO structure for each cell separately. Basically only character value and color are the information that the structure is able to hold.

Code:

typedef struct _CHAR_INFO {
  union {
    WCHAR UnicodeChar;
    CHAR  AsciiChar;
  } Char;
  WORD  Attributes;
} CHAR_INFO, *PCHAR_INFO;

As you can see, the character value is saved in the Char union. In a union all members (UnicodeChar and AsciiChar in this case) share the same memory space. That means the width is 2 bytes, which is the width of the UnicodeChar member. (Just as a side note: that's the reason why a console window will technically never be able to support Unicode entirely, only the UCS-2 subset.) So a character cell contains only one single value to represent the character. I think it's now quite clear that UTF-8 values can't be used directly to represent characters in the console window. They have to be converted either to their single-byte representation in an ANSI/OEM code page or to 16-bit wide-char values. It's not surprising that the UnicodeChar member is used in the end (I did some tests to prove it).
To summarize: UTF-8 input needs to be converted to UTF-16 (actually only UCS-2) for the console window buffer.
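The 16-bit limit of a cell can be checked with a few lines of Python. This is only a sketch to show the arithmetic; `utf16_units` is a name chosen for this example, and the emoji is just an arbitrary character outside the Basic Multilingual Plane.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit UTF-16 code units needed for one character."""
    return len(ch.encode("utf-16-le")) // 2


# ü, €, and さ fit into a single 16-bit value; U+1F600 needs a surrogate pair.
for ch in ("ü", "€", "さ", "\U0001F600"):
    print(f"U+{ord(ch):04X} needs {utf16_units(ch)} UTF-16 code unit(s)")
```

Anything that needs two code units cannot be stored in a single WCHAR, which is the point of the UCS-2 remark above.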

How would you be able to reproduce the bug? Quite simple. The implementation of the console window reads a certain number of bytes from standard input but doesn't care about the content it read. Thus, the BOM is among the buffered bytes, and the buffer might end somewhere in the middle of a multi-byte UTF-8 character. This chunk of bytes is then converted to UTF-16 (most likely using the function MultiByteToWideChar). The UTF-8 BOM is converted to the UTF-16 BOM and shows up as a blank later on. The end of the chunk might be converted to garbage, as might the beginning of the next chunk. That's what you are seeing in the output.
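The effect can be imitated outside of Windows. The following Python sketch is an analogy, not the actual console code: it converts each fixed-size chunk on its own (as described above) and compares that with a decoder that carries incomplete characters over to the next chunk. The chunk size of 4 and the function names are arbitrary choices for the demonstration.

```python
import codecs

text = "\ufeff" + "ü€" * 5           # BOM followed by the test sequence
data = text.encode("utf-8")


def decode_chunks_naively(raw: bytes, size: int) -> str:
    """Convert every chunk independently, ignoring character boundaries."""
    return "".join(raw[i:i + size].decode("utf-8", errors="replace")
                   for i in range(0, len(raw), size))


def decode_chunks_incrementally(raw: bytes, size: int) -> str:
    """Same chunks, but incomplete trailing bytes are kept for the next call."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(raw[i:i + size]) for i in range(0, len(raw), size)]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


naive = decode_chunks_naively(data, 4)
careful = decode_chunks_incrementally(data, 4)
print("\ufffd" in naive)     # True: split characters became U+FFFD
print(careful == text)       # True: nothing was damaged
```

In both cases the BOM survives as U+FEFF at the start of the result, which matches the blank seen in the console.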

Steffen

#6 Post by **penpen** » 05 Mar 2019 07:17

aGerman wrote: ↑
05 Mar 2019 02:50
No, there are no BOM characters in the middle of the text. That are just converted UTF-8 characters that have been incompletely read.

You were right, it was the "REPLACEMENT CHARACTER" (EF BF BD; around 1-2 am it looked too similar to the BOM, EF BB BF, for me).
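For reference, the two byte sequences can be printed side by side with a short Python snippet (the variable names are just for illustration):

```python
bom = "\ufeff".encode("utf-8")          # ZERO WIDTH NO-BREAK SPACE / BOM
replacement = "\ufffd".encode("utf-8")  # REPLACEMENT CHARACTER

print(bom.hex(" "))          # ef bb bf
print(replacement.hex(" "))  # ef bf bd
```

Only the last two bytes differ, which makes them easy to mix up in a hex dump.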

aGerman wrote: ↑
05 Mar 2019 02:50
I think now it's getting quite clear that UTF-8 values can't be used directly to represent characters in the console window. They have to be converted to either their single-byte representation in an ANSI/OEM code page or 16 bit wide-char values.

As far as I know (I once used Visual Studio 6.0 to trace into it, so this assumes they haven't changed it since), the console can't use any other code page directly either; it supports UCS-2 only. All values are converted to that format.

It seems that single- and double-byte character sets do not suffer from this bug (only multi-byte ones do):

Code:

@echo off
setlocal enableExtensions enableDelayedExpansion

call :init
call :test 932
call :init
call :test 65001
goto :eof

:test
chcp %~1
:next
set "test=!test:~1!"
set /A "i-=1"
@echo(%test% | (
	set /p "input1="
	set /p "input2="
	call echo(%%input1:~%i%%%,%%input2%%
)
if not "!test:~1018!" == "" goto :next
goto :eof

:init
>nul chcp 1250
set "i=1023"
set "test=aaaaaaaa"
set "test=!test:a=%test:a=aaaa%!"
set "test=!test:a=aaaa!ü€"
>nul chcp 932
set "test=%test%‚³"
goto :eof

Result:

Code:

Z:\>bug_test.bat
Aktive Codepage: 932.
a,u?さ
au,?さ
au?,さ
au?さ,
au?さ,
au?さ ,%input2%
au?さ ,%input2%
au?さ ,%input2%
au?さ ,%input2%
Aktive Codepage: 65001.
a,ü€さ
a�,�€さ
aü,€さ
aü�,��さ
aü�,�さ
aü€,さ
aü€�,��
aü€�,�
aü€さ,

I'm unsure why DBCSs seem not to suffer from this bug... .

penpen

#7 Post by **aGerman** » 05 Mar 2019 11:33

penpen wrote: ↑
05 Mar 2019 07:17
"REPLACEMENT CHARACTER"

That sounds exactly like how WideCharToMultiByte() and MultiByteToWideChar() behave. Unsupported characters are either converted to an approximate ASCII character (like ü to u) or replaced with a question mark (ANSI) or the Unicode replacement character (UTF-x).
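Python's codecs show the same two kinds of substitution, which may help to picture it. This is only an analogy using Python's own `errors="replace"` handling, not the Windows functions, and Python has no equivalent of the best-fit mapping from ü to u.

```python
# Encoding direction: characters the target code page lacks become '?'.
encoded = "ü€".encode("ascii", errors="replace")
print(encoded)  # b'??'

# Decoding direction: an incomplete UTF-8 sequence (the first two bytes of €)
# becomes the Unicode replacement character U+FFFD.
decoded = b"\xe2\x82".decode("utf-8", errors="replace")
print([f"U+{ord(c):04X}" for c in decoded])
```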

penpen wrote: ↑
05 Mar 2019 07:17
As far as i know ... the console also can't use any other codepage directly, it supports ucs-2 only.

Interesting. In that case the AsciiChar member still exists for the ...A() functions that internally convert the characters to UTF-16.

penpen wrote: ↑
05 Mar 2019 07:17
I'm unsure why DBCSs seem not to suffer from this bug... .

Guess why. Buffers are usually declared with an even size. Thus, an even number of bytes will always fit.

It's fun to figure out how the CMD internally works. Unfortunately that doesn't help those who are facing this bug. It's up to Microsoft to fix it. The only thing I can offer is a little tool that works around it.
(EDIT: Removed the zip file and replaced it with the download link where the tool will be updated:)
https://sourceforge.net/projects/teew/
Actually the purpose of this utility is a tee-like behavior. But if you omit the file name it writes to the standard output only. E.g.

Code:

chcp 65001
type "utf-8.txt"|TeeW

Steffen

#8 Post by **penpen** » 05 Mar 2019 15:06

aGerman wrote: ↑
05 Mar 2019 11:33

penpen wrote: ↑
05 Mar 2019 07:17
I'm unsure why DBCSs seem not to suffer from this bug... .
Guess why. Buffers are usually declared with an even size. Thus, an even number of bytes will always fit.

In a DBCS like Microsoft Windows code page 932 there are also single-byte characters (such as 'a'), but as seen above, I couldn't split that 2-byte value into 2 bytes even though there were odd numbers of 'a' characters.
That is irritating, because they must have taken care of such an issue... but forgot to do the same for UTF-8... .

aGerman wrote: ↑
05 Mar 2019 11:33
It's fun to figure out how the CMD internally works. Unfortunately that doesn't help those who are facing this bug. It's up to Microsoft to fix it. The only thing I can offer is a little tool that works around it.
TeeW.zip

Agreed.

But shouldn't findstr also solve that problem (for files < 2 GiB)?

Code:

type "bug_test.txt" | findstr "^"
findstr "^" "bug_test.txt"

Nevertheless: nice tool, although the 64-bit version seems to buffer (findstr "^"|teew), which is suboptimal in some cases.
(But to be fair, since Win95 I haven't seen a non-buffering tee implementation... I lost it once... and I thought the internet doesn't forget... .)

penpen

#9 Post by **aGerman** » 05 Mar 2019 15:48

penpen wrote: ↑
05 Mar 2019 15:06
That is irritating because they must have cared for such an issue... but forgotten to use that for utf-8... .

I agree. Besides that, code page 932 is an MBCS according to your explanation. DBCSs would always use two bytes. I'm not familiar with 932 in particular, though.

penpen wrote: ↑
05 Mar 2019 15:06
But shouldn't findstr also solve that problem (for files < 2 GiB)?

I didn't even think about findstr to be honest.

penpen wrote: ↑
05 Mar 2019 15:06
seems to buffer

Yes. The utility tries to read 4096 bytes into a buffer at once, plus up to 3 more bytes if the last character is an incomplete UTF-8 multi-byte character. I convert the chunk to UTF-16 for the output. Then I use threads to write the output, which only makes sense if text is written to the standard output and to one or more files. Originally it was not intended to write to the console window only.
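The reading strategy can be sketched in Python as follows. This is not the TeeW source; it only mirrors the idea of reading a fixed-size chunk and then fetching the few bytes that are still missing for the last character. The names `missing_tail_bytes` and `read_utf8_chunks` are invented for the example, and the UTF-16 conversion and threading are left out.

```python
import io


def missing_tail_bytes(chunk: bytes) -> int:
    """How many bytes (0..3) are missing to complete the last character."""
    for back in range(1, min(4, len(chunk)) + 1):
        byte = chunk[-back]
        if byte & 0xC0 != 0x80:  # not a continuation byte: ASCII or lead byte
            needed = 4 if byte >= 0xF0 else 3 if byte >= 0xE0 else 2 if byte >= 0xC0 else 1
            return max(needed - back, 0)
    return 0  # only continuation bytes found: malformed, leave it to the decoder


def read_utf8_chunks(stream, size=4096):
    """Yield decoded text chunks that never end in the middle of a character."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        extra = missing_tail_bytes(chunk)
        if extra:
            chunk += stream.read(extra)  # read the rest of the last character
        yield chunk.decode("utf-8", errors="replace")


# Usage with a deliberately tiny chunk size to force splits inside ü€.
source = io.BytesIO(("ü€" * 5).encode("utf-8"))
chunks = list(read_utf8_chunks(source, size=4))
print(chunks)
```

Reading the missing bytes before converting means every chunk handed to the conversion step is valid on its own, so no state has to be kept between chunks.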

The reason I decided to do it this way is that writing to the console window is terribly slow. To avoid wasting time, I write to the files in parallel. But creating threads for every single character would have been even more expensive, as you can imagine.

Steffen

#10 Post by **penpen** » 05 Mar 2019 18:17

aGerman wrote: ↑
05 Mar 2019 15:48

penpen wrote: ↑
05 Mar 2019 15:06
That is irritating because they must have cared for such an issue... but forgotten to use that for utf-8... .
I agree. Besides of that code page 932 is a MBCS according to your explanation. DBCSs would always use two bytes. Although I'm not familiar with 932 in particular.

I hear a lot that a DBCS always uses 2 bytes, and I don't know where that came from, but I would guess someone confused "described" with "encoded".

Only trust the definition given by the author of the term "DBCS" (Microsoft), which is pretty clear:

Some characters in a DBCS, including the digits and letters used for writing English, have single-byte code values. Other characters, such as Chinese ideographs or Japanese kanji, have double-byte code values.

(See: https://docs.microsoft.com/en-us/window ... acter-sets.)
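The mixed widths are easy to verify with Python's `cp932` codec, used here as a stand-in for the Windows code page:

```python
# ASCII letters take one byte in code page 932, kana and kanji take two.
for ch in ("a", "さ"):
    encoded = ch.encode("cp932")
    print(ch, encoded.hex(" "), len(encoded))
```

So a buffer boundary can fall between the two bytes of a double-byte character in code page 932 as well, for example after an odd number of ASCII characters.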

penpen

#11 Post by **aGerman** » 05 Mar 2019 18:57

penpen wrote: ↑
05 Mar 2019 18:17
i would guess someone confused "described" with "encoded".

Seems so. The definitions differ even between the German and the English Wikipedia articles. However, as I now understand, CP 932 uses single bytes for the ASCII characters, which is the important fact. So as you said, Microsoft just failed to implement proper multi-byte handling for UTF-8. At least on Win10. I can't remember whether I've seen this bug on XP or Win7.

Steffen

#12 Post by **carlos** » 06 Mar 2019 09:43

It also occurs on Windows 7.
Would the solution be to remove the BOM from the UTF-8 text when cmd calls MultiByteToWideChar?

#13 Post by **aGerman** » 06 Mar 2019 10:47

That only avoids the occurrence of the BOM in the console output. It still doesn't fix the corrupted characters though. They need to make sure that always all code units of the last UTF-8 character are read into the buffer before they convert the buffered substring.

Steffen

#14 Post by **penpen** » 06 Mar 2019 10:51

WinXP was even buggier.
In order not to crash the default WinXP console, I first had to change the batch to:

Code:

@echo off &setlocal
(chcp 65001 & type "bug_test.txt"  & chcp 850)

This is (part of) the output I got:

Code:

┬┤ÔòùÔöÉÔö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ ┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
O ├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝òØ├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ Ôö£ÔòØ├ö├®┬╝
Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝Ôö£ÔòØ├ö├®┬╝
Ôö£ ┬╝Ôö£ÔòØ├ö├®┬╝

Finally, I think I found a way to explain the output (the WinXP console applies the UTF-8 encoding twice):

Code:

Textfile glyphs                 : ü€
encoded in utf-8                : C3 BC, E2 82 AC
interpreted as cp 850 codepoints: C3, BC, E2, 82, AC
displayed as cp 850 glyphs      : ├╝Ôé¼
mapped to Unicode codepoints    : U+251C, U+255D, U+D4, U+E9, U+BC
encoded in utf-8                : E2 94 9C, E2 95 9D, C3 94, C3 A9, C2 BC
interpreted as cp 850 codepoints: E2 94 9C, E2 95 9D, C3 94, C3 A9, C2 BC
displayed as cp 850 glyphs      : Ôö£ÔòØ├ö├®┬╝

I think it uses cp 850 because that is the default DOS code page set in the registry, but I'm not sure.
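The chain in the table can be replayed in Python. This is a sketch that assumes Python's `cp850` codec matches the console's code page 850; `mojibake_round` is a name made up for this example.

```python
def mojibake_round(text: str, codepage: str = "cp850") -> str:
    """Encode as UTF-8, then misread those bytes in the given code page."""
    return text.encode("utf-8").decode(codepage)


once = mojibake_round("ü€")    # first misinterpretation
twice = mojibake_round(once)   # the same mistake applied to the result

print(once)   # ├╝Ôé¼
print(twice)  # Ôö£ÔòØ├ö├®┬╝
```

The second result is exactly the group of glyphs that repeats in the WinXP output above.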

If I read the output right, then the bug is applied two times (once for each UTF-8 encoding step) in WinXP... .
(I wonder if you could get the bug applied three times in WinXP when using cp 932 as the default DOS code page... .)

Sidenote:
When redirecting to a file, no error occurs on WinXP either, just like on Win10.

penpen

#15 Post by **aGerman** » 06 Mar 2019 11:04

That means instead of defined behavior (fails always on XP) they "improved" it to undefined behavior (fails only sometimes in a few hundred characters on Win10). Great

Steffen

DosTips.com