Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. The function used on earlier operating systems encodes or decodes lone surrogate halves or mismatched surrogate pairs. Code written in earlier versions of Windows that rely on this behavior to encode random non-text binary data might run into problems. However, code that uses this function on valid UTF-8 strings will behave the same way as on earlier Windows operating systems.
CONVERTCP.exe - Convert text from one code page to another
Moderator: DosItHelp
Re: CONVERTCP.exe - Convert text from one code page to another
Would CONVERTCP's XP support be affected by this note from the documentation of the MultiByteToWideChar function?
Re: CONVERTCP.exe - Convert text from one code page to another
Yes, I know this bug. In other words, WideCharToMultiByte converts to CESU-8 instead of UTF-8 on XP. Fortunately this isn't relevant most of the time. Surrogate pairs are rare in UTF-16 natural-language text, although some CJK characters require them.
Actually, converting from UTF-16 to UTF-8 and vice versa is simple math, though it requires an intermediate conversion to UTF-32 to avoid CESU-8. I have already written the code for that, but I'm still reluctant to build it into CONVERTCP because a hand-rolled conversion would certainly not be as fast as MultiByteToWideChar and WideCharToMultiByte. It would also only affect the conversion from/to UTF-8. I still need the API functions for other charsets.
Steffen
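For readers following along, the "simple math" can be sketched roughly like this. This is not CONVERTCP's actual code; the function names (utf16_decode, utf8_encode, utf16_to_utf8) are made up for illustration, and replacing lone surrogates with U+FFFD is an assumption about error handling. The point is that a surrogate pair is first combined into one code point and only then encoded, which yields 4 UTF-8 bytes rather than the 2 x 3 bytes of CESU-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from n >= 1 available UTF-16 units at src.
   Returns the number of units consumed (1 or 2).
   Assumption: lone or mismatched surrogates are replaced by U+FFFD. */
static size_t utf16_decode(const uint16_t *src, size_t n, uint32_t *cp)
{
    uint16_t w1 = src[0];
    if (w1 >= 0xD800 && w1 <= 0xDBFF && n >= 2 &&
        src[1] >= 0xDC00 && src[1] <= 0xDFFF) {
        /* Valid pair: join the two 10-bit halves, then add 0x10000. */
        *cp = 0x10000u + (((uint32_t)(w1 - 0xD800) << 10) |
                          (uint32_t)(src[1] - 0xDC00));
        return 2;
    }
    *cp = (w1 >= 0xD800 && w1 <= 0xDFFF) ? 0xFFFDu : (uint32_t)w1;
    return 1;
}

/* Encode one code point (<= 0x10FFFF, not a surrogate) as UTF-8.
   Returns the number of bytes written (1 to 4). */
static size_t utf8_encode(uint32_t cp, uint8_t *dst)
{
    if (cp < 0x80) {
        dst[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (uint8_t)(0xC0 | (cp >> 6));
        dst[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (uint8_t)(0xE0 | (cp >> 12));
        dst[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (uint8_t)(0xF0 | (cp >> 18));
    dst[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert n UTF-16 units to UTF-8. dst needs room for 3 * n bytes.
   Returns the number of bytes written. */
size_t utf16_to_utf8(const uint16_t *src, size_t n, uint8_t *dst)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp;
        i += utf16_decode(src + i, n - i, &cp);
        o += utf8_encode(cp, dst + o);
    }
    return o;
}
```

With this sketch the pair D83D DE00 (U+1F600) becomes F0 9F 98 80. A CESU-8 style conversion would encode each surrogate on its own and produce ED A0 BD ED B8 80 instead.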
Re: CONVERTCP.exe - Convert text from one code page to another
Thanks Steffen for the explanation. I had never heard of CESU-8. It would be nice if convertcp had code to prevent this bug from occurring. Otherwise I think the documentation should state that UTF-8 conversion is not 100% reliable on XP, which means XP is not really supported. Real XP support would produce the same UTF-8 conversion whether you run convertcp on XP or on Windows 10.
Re: CONVERTCP.exe - Convert text from one code page to another
OK, I'll implement it in one of the next versions. This requires extensive code profiling beforehand. Otherwise I'm afraid it would destroy the performance of the tool.
Re: CONVERTCP.exe - Convert text from one code page to another
Maybe the XP-specific code could run only on XP, so that performance on the other platforms is not affected.
Re: CONVERTCP.exe - Convert text from one code page to another
Steffen: thanks again for all the updates. If required, you can remove the XP support. I will use the latest version that supports XP (as mentioned, I only need CP852 <-> CP1250 conversion).
And it looked like such a simple project...
Saso
Re: CONVERTCP.exe - Convert text from one code page to another
aGerman wrote: ↑24 Dec 2019 06:21
I received the information that option /l is broken on XP. The update to v7.1 is supposed to fix that. Although I have to wait for feedback since I can't test on XP anymore.
Virustotal scans of version 7.1:
x86: https://www.virustotal.com/gui/file/860 ... /detection
x64: https://www.virustotal.com/gui/file/3b4 ... /detection
Steffen
/L (version 7.1 on XP 32-bit PRO) returns this:
Code | Supported As | Description
Page ID| Input Stream >511 MB |
-------+----------------------+--------------------------------------------------
37 | Yes | 37 (IBM EBCDIC - U.S./Canada)
437 | Yes | 437 (OEM - United States)
500 | Yes | 500 (IBM EBCDIC - International)
737 | Yes | 737 (OEM - Greek 437G)
775 | Yes | 775 (OEM - Baltic)
850 | Yes | 850 (OEM - Multilingual Latin I)
852 | Yes | 852 (OEM - Latin II)
855 | Yes | 855 (OEM - Cyrillic)
857 | Yes | 857 (OEM - Turkish)
860 | Yes | 860 (OEM - Portuguese)
861 | Yes | 861 (OEM - Icelandic)
863 | Yes | 863 (OEM - Canadian French)
865 | Yes | 865 (OEM - Nordic)
866 | Yes | 866 (OEM - Russian)
869 | Yes | 869 (OEM - Modern Greek)
874 | Yes | 874 (ANSI/OEM - Thai)
875 | Yes | 875 (IBM EBCDIC - Modern Greek)
932 | No | 932 (ANSI/OEM - Japanese Shift-JIS)
936 | No | 936 (ANSI/OEM - Simplified Chinese GBK)
949 | No | 949 (ANSI/OEM - Korean)
950 | No | 950 (ANSI/OEM - Traditional Chinese Big5)
1026 | Yes | 1026 (IBM EBCDIC - Turkish (Latin-5))
1200 | Yes | 1200 (UTF-16 Little Endian Byte Order)
1201 | Yes | 1201 (UTF-16 Big Endian Byte Order)
1250 | Yes | 1250 (ANSI - Central Europe)
1251 | Yes | 1251 (ANSI - Cyrillic)
1252 | Yes | 1252 (ANSI - Latin I)
1253 | Yes | 1253 (ANSI - Greek)
1254 | Yes | 1254 (ANSI - Turkish)
1255 | Yes | 1255 (ANSI - Hebrew)
1256 | Yes | 1256 (ANSI - Arabic)
1257 | Yes | 1257 (ANSI - Baltic)
1258 | Yes | 1258 (ANSI/OEM - Viet Nam)
10000 | Yes | 10000 (MAC - Roman)
10006 | Yes | 10006 (MAC - Greek I)
10007 | Yes | 10007 (MAC - Cyrillic)
10010 | Yes | 10010 (MAC - Romania)
10017 | Yes | 10017 (MAC - Ukraine)
10029 | Yes | 10029 (MAC - Latin II)
10079 | Yes | 10079 (MAC - Icelandic)
10081 | Yes | 10081 (MAC - Turkish)
10082 | Yes | 10082 (MAC - Croatia)
12000 | Yes | 12000 (UTF-32 Little Endian Byte Order)
12001 | Yes | 12001 (UTF-32 Big Endian Byte Order)
20127 | Yes | 20127 (US-ASCII)
20261 | No | 20261 (T.61)
20866 | Yes | 20866 (Russian - KOI8)
21866 | Yes | 21866 (Ukrainian - KOI8-U)
28591 | Yes | 28591 (ISO 8859-1 Latin I)
28592 | Yes | 28592 (ISO 8859-2 Central Europe)
28594 | Yes | 28594 (ISO 8859-4 Baltic)
28595 | Yes | 28595 (ISO 8859-5 Cyrillic)
28597 | Yes | 28597 (ISO 8859-7 Greek)
28599 | Yes | 28599 (ISO 8859-9 Latin 5)
28605 | Yes | 28605 (ISO 8859-15 Latin 9)
65000 | No | 65000 (UTF-7)
65001 | Yes | 65001 (UTF-8)
Re: CONVERTCP.exe - Convert text from one code page to another
Thanks Saso! The French user who reported this bug also confirmed that it has been fixed now.
As long as the performance doesn't suffer, I'll try to support XP.
Currently I'm working on Carlos' suggestion to work around the UTF-8 bug of the XP API functions. I got my own U8ToU16 function to take the same time as Microsoft's MultiByteToWideChar. But WideCharToMultiByte is still ~30% faster than my U16ToU8. I guess Microsoft used some ASM magic that I'm not able to beat using C. I will probably end up branching the code depending on the OS version.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
I incorporated custom conversions from UTF-8 to UTF-16 and vice versa because of the buggy API functions on XP. (See the discussion above.) In the end it was not necessary to determine the Windows version and branch the code. The speed of my own functions is now comparable, and if used along with option /v they perform better than the API functions.
Furthermore, passing 0 for the default ANSI code page was broken in v7.0 and v7.1. That's fixed now.
Virustotal scans of version 7.2:
x86: https://www.virustotal.com/gui/file/d3c ... /detection
x64: https://www.virustotal.com/gui/file/f30 ... /detection
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
Wow, really nice work Steffen.
Thanks for the update, now it is a very solid piece of software.
Re: CONVERTCP.exe - Convert text from one code page to another
Wow. That was fast. Thanks!
Saso
Re: CONVERTCP.exe - Convert text from one code page to another
I revised the validation of incoming UTF-8. Not sure if I already caught every invalid byte sequence in the previous version.
Virustotal scans of version 7.3:
x86: https://www.virustotal.com/gui/file/4f0 ... /detection
x64: https://www.virustotal.com/gui/file/f1a ... /detection
Steffen
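To illustrate what "invalid byte sequence" covers here, a strict UTF-8 check might look something like the following. This is only a sketch and not the validation code used in CONVERTCP; the function name utf8_is_valid is hypothetical. It rejects stray continuation bytes, truncated sequences, overlong forms, encoded surrogates (which is what CESU-8 input looks like) and values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n bytes at s are well-formed UTF-8, otherwise 0. */
int utf8_is_valid(const uint8_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = s[i];
        size_t len, k;
        uint32_t cp, min;

        if (b < 0x80) {            /* ASCII, always valid */
            i++;
            continue;
        }
        if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;              /* stray continuation byte or 0xF8..0xFF */
        }
        if (n - i < len)
            return 0;              /* sequence is cut off at the end */
        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;          /* expected a continuation byte */
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min)
            return 0;              /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;              /* encoded surrogate half */
        if (cp > 0x10FFFF)
            return 0;              /* beyond the Unicode range */
        i += len;
    }
    return 1;
}
```

A real converter would probably substitute a replacement character or report the offset rather than just returning 0, but the set of checks would be similar.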
Re: CONVERTCP.exe - Convert text from one code page to another
FYI: the first version (.cpp) has 131 lines (including some 50 lines of comments and help). Version 7.2 (.c) has 1573 lines (the 7.3 source is not available yet).
Saso
Re: CONVERTCP.exe - Convert text from one code page to another
Oh, thanks for the reminder! I must have done something wrong when I uploaded the source file.
FWIW 1596 lines
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
Conversions between UTF-8 and UTF-16 usually require converting to an intermediate UTF-32 code unit (which is the same as the Unicode code point). But ASCII characters already have the same value in both UTF-8 and UTF-16, so they can be converted directly. This improves the performance for Latin-script text, which includes a lot of ASCII, and especially for English, which is essentially ASCII only.
Virustotal scans of version 7.4:
x86: https://www.virustotal.com/gui/file/4bd ... /detection
x64: https://www.virustotal.com/gui/file/ed2 ... /detection
Steffen
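The ASCII shortcut described in this post can be pictured with a small sketch. Again, this is not the v7.4 source; the function name utf8_to_utf16 is made up, and the sketch assumes the input has already been validated as UTF-8. Bytes below 0x80 are copied straight into the UTF-16 buffer, and only the remaining sequences go through the intermediate code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n bytes of UTF-8 to UTF-16. dst needs room for n units.
   Assumption: src was validated beforehand; a sequence that is cut
   off at the end of the buffer simply stops the conversion.
   Returns the number of UTF-16 units written. */
size_t utf8_to_utf16(const uint8_t *src, size_t n, uint16_t *dst)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint8_t b = src[i];
        uint32_t cp;
        size_t len, k;

        if (b < 0x80) {
            /* Fast path: ASCII has the same value in UTF-8 and UTF-16. */
            dst[o++] = b;
            i++;
            continue;
        }

        /* Slow path: rebuild the code point from the multi-byte sequence. */
        if (b < 0xE0)      { len = 2; cp = b & 0x1F; }
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; }
        else               { len = 4; cp = b & 0x07; }
        if (n - i < len)
            break;
        for (k = 1; k < len; k++)
            cp = (cp << 6) | (uint32_t)(src[i + k] & 0x3F);
        i += len;

        if (cp < 0x10000) {
            dst[o++] = (uint16_t)cp;
        } else {
            /* Split into a surrogate pair. */
            cp -= 0x10000;
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return o;
}
```

For mostly-ASCII text the loop rarely leaves the first branch, which is where the speedup would come from.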