CONVERTCP.exe - Convert text from one code page to another

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

BatchGuy
Posts: 8
Joined: 13 Nov 2019 08:35

Re: CONVERTCP.exe - Convert text from one code page to another

#106 Post by BatchGuy » 10 Feb 2020 12:17

Steffen,

does ConvertCP have a built-in feature that tells you which version it is?
Can't find anything of the kind in the tool's syntax, but maybe there's an easy workaround?
The "ConvertCP.exe" file which I occasionally use, could be any version...
It might be easier to check the version of my copy than to download the most recent online version to make sure that I accidentally didn't forget to fetch your latest release!

BG

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

#107 Post by aGerman » 10 Feb 2020 16:25

Either run CONVERTCP /? and find it in the first line of the help message, or look it up in the Details list of the file properties.

Steffen

BatchGuy
Posts: 8
Joined: 13 Nov 2019 08:35

#108 Post by BatchGuy » 10 Feb 2020 17:24

Wow, hard to believe I missed that! :oops:
The 1st line in my DOS box is always "Usage" (which is the 3rd output line for /?), but still...

Thx,

BG

c160704
Posts: 2
Joined: 27 Nov 2014 06:34

#109 Post by c160704 » 21 Jul 2020 01:57

Hi Steffen,

I used your tool to convert an old file that I had made on a previous PC and that was saved in a format my current PC didn't have the code page for. It worked a charm - thank you very much!

I'm not sure what I would be able to do if I couldn't remember or guess the original file's code page though. I was wondering what people use to find that out.

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

#110 Post by aGerman » 21 Jul 2020 11:12

Thanks for your kind feedback!
c160704 wrote:
21 Jul 2020 01:57
I'm not sure what I would be able to do if I couldn't remember or guess the original file's code page though. I was wondering what people use to find that out.
The wiki contains a paragraph about that.
https://sourceforge.net/p/convertcp/wiki/Home/#c4ac
As you can see there is already quite a bit you can do to figure it out. Although if we're talking about single-byte codepages it's for sure a pain to figure it out.
So, my suggestion in such cases is to use a HEX editor. If a character shows up as a wrong glyph in a normal text editor and you know what glyph it should actually be, you can use the HEX editor to get the value of the byte in question. Unfortunately, you then have to manually search the codepages for the one that maps this byte value to the right glyph. I know there are plenty of codepages, but probably only a few of them are reasonable for your environment/language.
Since I don't know which codepage was the culprit in your case I just provide an example in my mother tongue:
If I execute the PAUSE command in a cmd window, the output looks like that:
Drücken Sie eine beliebige Taste . . .
As you can see the third character isn't ASCII. If I redirect the output into a file ...

Code:

pause > test.txt
... and open test.txt in a text editor like notepad then I'm seeing ...
Drcken Sie eine beliebige Taste . . .
Well, of course I already know that this is because the default OEM codepage used in the console window is 850, while the default ANSI codepage used in notepad is 1252 on my computer. However, to guide you through the steps to take if you don't already know that, I'll pretend I'm seeing the content of this file for the first time in my life.

I want to see the "ü" because I know this is the right glyph in the word "Drücken" but it doesn't show up. If I open test.txt in a HEX editor then I get this:
44 72 81 63 6B 65 6E 20 53 69 65 20 65 69 6E 65 20 62 65 6C 69 65 62 69 67 65 20 54 61 73 74 65 20 2E 20 2E 20 2E 20 0D 0A
The third byte should represent the "ü". Its value is 81 (hexadecimal).

What's next? First you should look for codepages that are likely to be used for German language. You can use the list provided by Microsoft
https://docs.microsoft.com/en-us/window ... dentifiers
or you can execute CONVERTCP /L to get the list of installed codepages on your computer.
So in my case I found
850 OEM Multilingual Latin 1; Western European (DOS) and
1252 ANSI Latin 1; Western European (Windows)
being reasonable.

Next step could be to just try them out using CONVERTCP or you might walk through the actual character maps. Let's just do the latter in order to give you an idea of how to find the right codepage by hand.
Taking 1252 first:
https://en.wikipedia.org/wiki/Windows-1 ... racter_set
We are looking for byte 81. Going to row 8_ and column _1 leads us to ... nothing. This means no character is represented by byte 81 in codepage 1252. So, this can't be the codepage we are looking for.
OK, let's try 850 next:
https://en.wikipedia.org/wiki/Code_page ... racter_set
Now we repeat - going to row 8_ and column _1 leads us to "ü".
This indicates that test.txt is CP850 encoded. And indeed this is the case here.
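
The manual table lookup can also be sketched in a few lines of Python. This is only an illustration and not part of CONVERTCP: the candidate list and the helper name try_decode are assumptions. The sketch decodes the first bytes of the hex dump with each candidate codepage and reports which one has a character for every byte. Python's cp1252 codec treats byte 81 as unmapped, in line with the character map linked above.

```python
# First seven bytes from the hex dump of test.txt (the word "Drücken").
raw = bytes.fromhex("44 72 81 63 6B 65 6E")

# Candidate codepages; these two are the ones considered in this post.
candidates = ["cp1252", "cp850"]


def try_decode(data, encoding):
    """Return the decoded text, or None if some byte has no character."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {cp: try_decode(raw, cp) for cp in candidates}
for cp, text in results.items():
    print(cp, "->", text if text is not None else "a byte has no character here")
```

A candidate that yields None can be ruled out; a candidate that decodes cleanly still has to be checked by eye, because many single-byte codepages assign some character to every byte.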

The conclusion is: to avoid this pain, always use an encoding for your text files that fully supports Unicode, such as UTF-8 or UTF-16 :wink:

Steffen

jfl
Posts: 226
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

#111 Post by jfl » 03 Aug 2020 07:21

@aGerman
Hi Steffen,
In my own code page conversion tool conv.exe, I use a heuristic for selecting a default output code page, that works very well in practice:
  • If stdout is a console, then switch it to 16-bits mode, and output UTF-16. (This allows displaying all Unicode characters, whatever the current code page.)
  • Else if stdout is a pipe, then use the current console code page, from GetConsoleOutputCP(). (All Microsoft tools expect that encoding for data passed in pipes.)
  • Else it's a file. Use the default system code page, from GetACP(). (This was the default encoding used by GUI applications like Notepad.)
And for the input code page, the heuristic for selecting a default is:
  • Check if the input is valid UTF-8 or UTF-16. If so, use that. (UTF-8 is becoming a more and more common encoding, and it's easy to recognize reliably.)
  • Else if stdin is a console or a pipe, then use the current console code page, from GetConsoleOutputCP(). (Same reason as for stdout)
  • Else it's a file. Use the default system code page, from GetACP(). (Same reason as for stdout)
This allows all sorts of combinations of commands, maximizing the chances to get the right output by default, without having to specify code pages manually.
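
As a rough illustration, the two decision lists could be written like this. It is a Python sketch, not code from conv.exe: the function names, the "console" / "pipe" / "file" labels and the parameters are hypothetical, and the caller is assumed to have already obtained the codepage numbers (on Windows, from GetConsoleOutputCP() and GetACP()) and the result of any UTF-8 / UTF-16 check.

```python
def default_output_encoding(stdout_kind, console_cp, ansi_cp):
    """Pick a default output encoding following the three output rules.

    stdout_kind is one of the hypothetical labels "console", "pipe", "file".
    """
    if stdout_kind == "console":
        return "utf-16"            # console: write wide characters
    if stdout_kind == "pipe":
        return f"cp{console_cp}"   # pipe: current console codepage
    return f"cp{ansi_cp}"          # file: system (ANSI) codepage


def default_input_encoding(detected_unicode, stdin_kind, console_cp, ansi_cp):
    """Pick a default input encoding following the three input rules.

    detected_unicode is "utf-8", "utf-16" or None, as decided by a
    separate validity check that is not shown here.
    """
    if detected_unicode is not None:
        return detected_unicode
    if stdin_kind in ("console", "pipe"):
        return f"cp{console_cp}"
    return f"cp{ansi_cp}"


print(default_output_encoding("pipe", console_cp=437, ansi_cp=1252))
print(default_input_encoding(None, "file", console_cp=437, ansi_cp=1252))
```

Keeping the decision separate from the Windows API calls makes the rules easy to test without a console.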
For example, these four commands display French text correctly, whereas the type command gets it right in only one case:

Code:

C:\JFL\Temp>chcp
Active code page: 437

C:\JFL\Temp>type test.txt
Jean-Franτois habite α Grenoble

C:\JFL\Temp>type test.txt | conv
Jean-François habite à Grenoble

C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>conv < test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>conv < test.txt | 2clip

C:\JFL\Temp>1clip
Jean-François habite à Grenoble

C:\JFL\Temp>chcp 1252
Active code page: 1252

C:\JFL\Temp>type test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>type test.txt | conv
Jean-François habite à Grenoble

C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>conv < test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>conv < test.txt | 2clip

C:\JFL\Temp>1clip
Jean-François habite à Grenoble

C:\JFL\Temp>chcp 65001
Active code page: 65001

C:\JFL\Temp>type test.txt
Jean-Fran�ois habite � Grenoble

C:\JFL\Temp>type test.txt | conv
Jean-François habite à Grenoble

C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>conv < test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>conv < test.txt | 2clip

C:\JFL\Temp>1clip
Jean-François habite à Grenoble

C:\JFL\Temp>
And (not clearly visible above) the content of the clipboard, when using the 2clip command, is correct in all cases too.
The text file was created in Notepad, and is encoded in the system code page (1252 on my system), which is different from the default console code page (437 on my system).

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

#112 Post by aGerman » 03 Aug 2020 10:59

Hi Jean-François,

It wouldn't have helped c160704 to define defaults since they didn't know the encoding of the incoming file :lol:
However, I also thought a lot about defaults. Thanks for your suggestions :!: Definitely worth considering again.
FWIW it's not that CONVERTCP doesn't support defaults. You can pass 0 for the ACP and 1 for the OEMCP since the Windows conversion functions support them out of the box. But, yes, you still have to explicitly pass them to CONVERTCP.

Let's just go through your suggestions step by step:
jfl wrote:
If stdout is a console, then switch it to 16-bits mode, and output UTF-16. (This allows displaying all Unicode characters, whatever the current code page.)
Good idea. You still have to condition the console window for UTF-16 output and restore the old behavior if you're done. But that's at least reasonable.

jfl wrote:
Else if stdout is a pipe, then use the current console code page, from GetConsoleOutputCP(). (All Microsoft tools expect that encoding for data passed in pipes.)
Yes probably you're right.

jfl wrote:
Else it's a file. Use the default system code page, from GetACP(). (This was the default encoding used by GUI applications like Notepad.)
Oh. I would avoid that like the plague. Even Microsoft realized that ANSI codepages are evil and they changed the behavior of notepad on Win10 to save files in UTF-8 (without BOM) by default, rather than using the default ANSI codepage. Thus, I'd really favor an encoding that fully supports Unicode, preferably UTF-8. I think we shouldn't make the same mistakes that Microsoft made in the past. In the best case conversion tools like ours should become obsolete one day :wink:

jfl wrote:
And for the input code page, the heuristic for selecting a default is:
Check if the input is valid UTF-8 or UTF-16. If so, use that. (UTF-8 is becoming a more and more common encoding, and it's easy to recognize reliably.)
If the input has a BOM then it's easy indeed. Otherwise I have my doubts.
- In case of UTF-8 all ASCII characters still stay the same. It might be easy for any other language than English. But English is ASCII only and it may take thousands of characters before the first non-ASCII character appears.
- In case of UTF-16 you can check for the typical alternately appearing zero bytes. But in CJK languages you'll be out of luck.
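
A minimal sketch of the easy case, BOM detection, in Python. The function name sniff_bom is hypothetical, only the UTF-8 and UTF-16 BOMs discussed here are checked, and the code is not taken from CONVERTCP or conv. The returned names are Python codec names that consume the BOM while decoding.

```python
import codecs

# (BOM bytes, Python codec that strips this BOM while decoding)
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec_name in BOMS:
        if data.startswith(bom):
            return codec_name
    return None


sample = codecs.BOM_UTF8 + "Drücken".encode("utf-8")
print(sniff_bom(sample), sample.decode(sniff_bom(sample)))
print(sniff_bom(b"no BOM here"))
```

A None result only means there is no BOM; the input may still be UTF-8 or UTF-16, which is where the doubts raised above begin.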

Seriously, how do you handle this in your source code :?:

jfl wrote:
Else if stdin is a console or a pipe, then use the current console code page, from GetConsoleOutputCP(). (Same reason as for stdout)
OK, but at least for the pipe you're not doing it. I tested a few minutes ago and it seems you're using the ACP along with conv :?

jfl wrote:
Else it's a file. Use the default system code page, from GetACP(). (Same reason as for stdout)
Way too vague IMHO. Especially since most text editors use UTF-8 by default nowadays. I'm really not sure if I would even consider allowing a default here.


Looking forward to your feedback

Steffen

jfl
Posts: 226
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

#113 Post by jfl » 07 Aug 2020 03:17

Hi Steffen,
aGerman wrote:
You still have to condition the console window for UTF-16 output and restore the old behavior if you're done.
It's the stdout file that I switch to 16-bits mode, not the console. This is done using the C library function _setmode(fileno(stdout), _O_WTEXT).
I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself.
aGerman wrote:
I'd really favor an encoding that fully supports Unicode, preferably UTF-8.
Agreed, I need to change that in my code, at least for Windows 10.
One possible refinement would be to obey Notepad's own default encoding, stored in the registry in HKCU\Software\Microsoft\Notepad\iDefaultEncoding.
aGerman wrote:
If the input has a BOM then it's easy indeed. Otherwise I have my doubts.
Indeed the BOM is a strong hint... But unfortunately it's being abandoned. :-(
Fortunately, a byte stream with bytes from \x80 to \xFF can be reliably validated as UTF-8 or not. The probability of a false positive is not 0, but it's very very low. The more non-ASCII bytes, the lower the probability.
aGerman wrote:
it may take thousands of characters before the first non-ASCII character appears.
Correct. It's a weakness of my conv program, which reads the whole file in memory. So it can't convert huge files that don't fit.
Again we're talking about heuristics for choosing defaults. We know we have to accept a (hopefully small) proportion of errors.
Using a large buffer, and testing the first part of the file, would give a good default in most cases.
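
The idea of validating only the first part of a file can be sketched as follows. This is Python for illustration only; looks_like_utf8 and its three-way result are assumptions, not the logic of conv.c. An incremental decoder is used so that a multi-byte character cut off at the end of the sample is not counted as an error.

```python
import codecs


def looks_like_utf8(sample):
    """Classify a byte sample: True, False, or None if it is pure ASCII."""
    if sample.isascii():
        return None  # no non-ASCII bytes, so nothing can be decided
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a character truncated at the end of the sample
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("Jean-François".encode("utf-8")))
print(looks_like_utf8("Jean-François".encode("cp1252")))
print(looks_like_utf8(b"plain ASCII"))
```

The sample would typically be the first block read from the file in binary mode; the block size is a tuning choice, and a pure-ASCII sample leaves the question open.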
aGerman wrote:
In case of UTF-16 you can check for the typical alternately appearing zero bytes. But in CJK languages you'll be out of luck.
Correct. I have very little experience with CJK languages, but I hope that there are occasional spaces or digits that may help.
Note that the only UTF-16 files I usually have to deal with are Windows own log files, and registry exports. These contain mostly ASCII anyway, even on CJK versions of Windows.
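
The zero-byte check for BOM-less UTF-16 could look roughly like this in Python. The function name and the 0.4 threshold are hypothetical, and this is a guess-level heuristic, not the code used in conv.c. It counts zero bytes at even and odd offsets: mostly-ASCII text in UTF-16 LE has its zeros at odd offsets, UTF-16 BE at even offsets.

```python
def guess_utf16_by_zero_bytes(sample, threshold=0.4):
    """Guess 'utf-16-le', 'utf-16-be' or None from where zero bytes sit."""
    units = len(sample) // 2          # number of complete 16-bit units
    if units == 0:
        return None
    body = sample[: units * 2]
    zeros_even = body[0::2].count(b"\x00")  # zero bytes at even offsets
    zeros_odd = body[1::2].count(b"\x00")   # zero bytes at odd offsets
    if zeros_odd / units >= threshold and zeros_odd > zeros_even:
        return "utf-16-le"
    if zeros_even / units >= threshold and zeros_even > zeros_odd:
        return "utf-16-be"
    return None


print(guess_utf16_by_zero_bytes("Hello, world".encode("utf-16-le")))
print(guess_utf16_by_zero_bytes("Hello, world".encode("utf-16-be")))
print(guess_utf16_by_zero_bytes("日本語のテキスト".encode("utf-16-le")))
```

The last call shows the CJK limitation: that text contains no zero bytes at all, so the heuristic gives no answer unless spaces, digits or line breaks are mixed in.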
aGerman wrote:
how do you handle this in your source code
Well, actually only the BOM detection is implemented in conv.c. The UTF-8 / UTF-16 validation has been on my to-do list for years, but it's never made it to the top. :roll:

Finally note that I did experiment with the COM API IMultiLanguage2::DetectInputCodepage()... But the results were very poor: It's slower, and wrong more often than my current heuristics. :)
aGerman wrote:
OK, but at least for the pipe you're not doing it. I tested a few minutes ago and it seems you're using the ACP along with conv
Yes I do! Notice how the two non-ASCII characters are changed when conv pipes that ANSI file to dump.exe, versus dumping the file directly:

Code:

C:\JFL\Temp>chcp
Active code page: 437

C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble

C:\JFL\Temp>dump test.txt

Offset    00           04           08           0C           0   4    8   C
--------  -----------  -----------  -----------  -----------  -------- --------
00000000  4A 65 61 6E  2D 46 72 61  6E E7 6F 69  73 20 68 61  Jean-Fra n�ois ha
00000010  62 69 74 65  20 E0 20 47  72 65 6E 6F  62 6C 65 0D  bite � G renoble
00000020  0A

C:\JFL\Temp>conv test.txt | dump

Offset    00           04           08           0C           0   4    8   C
--------  -----------  -----------  -----------  -----------  -------- --------
00000000  4A 65 61 6E  2D 46 72 61  6E 87 6F 69  73 20 68 61  Jean-Fra n�ois ha
00000010  62 69 74 65  20 85 20 47  72 65 6E 6F  62 6C 65 0D  bite � G renoble
00000020  0A

C:\JFL\Temp>codepage 437
Code page 437: OEM - United States (SBCS) ASCII-compatible
   80 Ç  90 É  A0 á  B0 ░    C0 └  D0 ╨  E0 α  F0 ≡
   81 ü  91 æ  A1 í  B1 ▒    C1 ┴  D1 ╤  E1 ß  F1 ±
   82 é  92 Æ  A2 ó  B2 ▓    C2 ┬  D2 ╥  E2 Γ  F2 ≥
   83 â  93 ô  A3 ú  B3 │    C3 ├  D3 ╙  E3 π  F3 ≤
   84 ä  94 ö  A4 ñ  B4 ┤    C4 ─  D4 ╘  E4 Σ  F4 ⌠
   85 à  95 ò  A5 Ñ  B5 ╡    C5 ┼  D5 ╒  E5 σ  F5 ⌡
   86 å  96 û  A6 ª  B6 ╢    C6 ╞  D6 ╓  E6 µ  F6 ÷
   87 ç  97 ù  A7 º  B7 ╖    C7 ╟  D7 ╫  E7 τ  F7 ≈
   88 ê  98 ÿ  A8 ¿  B8 ╕    C8 ╚  D8 ╪  E8 Φ  F8 °
   89 ë  99 Ö  A9 ⌐  B9 ╣    C9 ╔  D9 ┘  E9 Θ  F9 ∙
   8A è  9A Ü  AA ¬  BA ║    CA ╩  DA ┌  EA Ω  FA ·
   8B ï  9B ¢  AB ½  BB ╗    CB ╦  DB █  EB δ  FB √
   8C î  9C £  AC ¼  BC ╝    CC ╠  DC ▄  EC ∞  FC ⁿ
   8D ì  9D ¥  AD ¡  BD ╜    CD ═  DD ▌  ED φ  FD ²
   8E Ä  9E ₧  AE «  BE ╛    CE ╬  DE ▐  EE ε  FE ■
   8F Å  9F ƒ  AF »  BF ┐    CF ╧  DF ▀  EF ∩  FF  

C:\JFL\Temp>codepage 1252
Code page 1252: ANSI - Latin I (SBCS) ASCII-compatible
   80 €  90   A0    B0 °    C0 À  D0 Ð  E0 à  F0 ð
   81   91 ‘  A1 ¡  B1 ±    C1 Á  D1 Ñ  E1 á  F1 ñ
   82 ‚  92 ’  A2 ¢  B2 ²    C2 Â  D2 Ò  E2 â  F2 ò
   83 ƒ  93 “  A3 £  B3 ³    C3 Ã  D3 Ó  E3 ã  F3 ó
   84 „  94 ”  A4 ¤  B4 ´    C4 Ä  D4 Ô  E4 ä  F4 ô
   85 …  95 •  A5 ¥  B5 µ    C5 Å  D5 Õ  E5 å  F5 õ
   86 †  96 –  A6 ¦  B6 ¶    C6 Æ  D6 Ö  E6 æ  F6 ö
   87 ‡  97 —  A7 §  B7 ·    C7 Ç  D7 ×  E7 ç  F7 ÷
   88 ˆ  98 ˜  A8 ¨  B8 ¸    C8 È  D8 Ø  E8 è  F8 ø
   89 ‰  99 ™  A9 ©  B9 ¹    C9 É  D9 Ù  E9 é  F9 ù
   8A Š  9A š  AA ª  BA º    CA Ê  DA Ú  EA ê  FA ú
   8B ‹  9B ›  AB «  BB »    CB Ë  DB Û  EB ë  FB û
   8C Œ  9C œ  AC ¬  BC ¼    CC Ì  DC Ü  EC ì  FC ü
   8D   9D   AD ­  BD ½    CD Í  DD Ý  ED í  FD ý
   8E Ž  9E ž  AE ®  BE ¾    CE Î  DE Þ  EE î  FE þ
   8F   9F Ÿ  AF ¯  BF ¿    CF Ï  DF ß  EF ï  FF ÿ

C:\JFL\Temp>
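
The two byte changes visible in the dumps (E7 to 87 and E0 to 85) can be reproduced with Python's built-in codecs. This is only an illustration of the 1252-to-437 mapping shown in the two tables, not of how conv.exe is implemented.

```python
text = "Jean-François habite à Grenoble"

ansi = text.encode("cp1252")  # bytes as stored in the Notepad file
oem = text.encode("cp437")    # bytes after conversion to the console codepage

# List the byte pairs that differ between the two encodings.
changed = [(f"{a:02X}", f"{o:02X}") for a, o in zip(ansi, oem) if a != o]
print(changed)

# Reading the unconverted ANSI bytes as code page 437 gives what "type" printed.
print(ansi.decode("cp437"))
```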
aGerman wrote:
I'm really not sure if I would even consider allowing a default here.
All I can tell is that it works well for me in most cases.

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

#114 Post by penpen » 07 Aug 2020 08:03

jfl wrote:
07 Aug 2020 03:17
You still have to condition the console window for UTF-16 output and restore the old behavior if you're done.
It's the stdout file that I switch to 16-bits mode, not the console. This is done using the C library function _setmode(fileno(stdout), _O_WTEXT).
I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself.
Typically C, C++, ANSI-C and ANSI-C++ don't do that and there is no hint in the documentation for assuming otherwise, see:
https://docs.microsoft.com/en-us/cpp/c- ... ew=vs-2019.

jfl wrote:
07 Aug 2020 03:17
Indeed the BOM is a strong hint... But unfortunately it's being abandoned. :-(
Where have you heard that?

I would say that claim is wrong, see the actual Unicode Version 13.0.0 that still specifies the BOM (chapter 23 subsection "23.8 Specials"):
https://www.unicode.org/versions/Unicode13.0.0/ch23.pdf

jfl wrote:
07 Aug 2020 03:17
Fortunately, a byte stream with bytes from \x80 to \xFF can be reliably validated as UTF-8 or not. The probability of a false positive is not 0, but it's very very low. The more non-ASCII bytes, the lower the probability.
Although bytes xF8-xFF are considered valid in utf-8, they are currently not in use (not enough existing unicode characters).
Therefore currently they are a strong hint, that the document is not utf-8.

Also note that there are multiple codepages for which (most if not all) characters in x80-xFF are valid:
An example for that is codepage 850 where all bytes are valid characters of that codepage.


penpen

jfl
Posts: 226
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

#115 Post by jfl » 07 Aug 2020 09:55

penpen wrote:
07 Aug 2020 08:03
Typically C, C++, ANSI-C and ANSI-C++ don't do that
I know it's very rarely used. It's Microsoft C specific, hence the underscore ahead of the function name _setmode().
But it's ideal for tools wanting to output Unicode into the Windows console, independently of the code page.
Actually I'm pretty sure cmd.exe does something like this internally, because it can display any unicode character in any code page, for example when you do a dir.
penpen wrote:
07 Aug 2020 08:03
jfl wrote:
07 Aug 2020 03:17
Indeed the BOM is a strong hint... But unfortunately it's being abandoned. :-(
Where have you heard that?
It's still in the Unicode standard, but fewer and fewer people seem to be using it.
Actually I should have said "Microsoft tools are abandoning it", and UTF-8 without BOM is the de-facto standard on Unix and the Web:
  • Notepad now defaults to UTF-8 without a BOM. (And they're probably right, because most Unix tools generate UTF-8 text without a BOM.)
  • The MSVC compiler always choked on C files beginning with a BOM.
  • HTML UTF-8 files always specify their encoding with a <meta> tag, never with a BOM.
  • Etc
penpen wrote:
07 Aug 2020 08:03
Although bytes xF8-xFF are considered valid in utf-8, they are currently not in use (not enough existing unicode characters).
Therefore currently they are a strong hint, that the document is not utf-8.
Good idea, I'll use that fact when I implement the UTF-8 detection.

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

#116 Post by aGerman » 07 Aug 2020 13:40

jfl wrote:
It's the stdout file that I switch to 16-bits mode, not the console. This is done using the C library function _setmode(fileno(stdout), _O_WTEXT).
I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself.
Yes, that's what I was talking about. And yes it's a Windows-specific C runtime extension which doesn't make it necessary to restore the old settings. The console output buffer supports wide characters by default. So, after thinking twice, I think I would probably skip this C hack and rely on the console API (WriteConsoleW).

jfl wrote:
Yes I do! Notice how the two non-ASCII characters are changed when conv pipes that ANSI file to dump.exe, versus dumping the file directly
I'm still struggling with it :D
This short batch code is saved in Windows-1252 (which is also my default ACP).

Code:

@echo off &setlocal
chcp

echo Jean-François habite à Grenoble|conv
echo Jean-François habite à Grenoble

pause <nul|conv
pause
Output:
Aktive Codepage: 850.
Jean-François habite à Grenoble
Jean-Franþois habite Ó Grenoble
Drcken Sie eine beliebige Taste . . .
Drücken Sie eine beliebige Taste . . .
So maybe I just still misunderstand it, but obviously conv expects to read ANSI-encoded (1252) text from the pipe.


As to valid UTF-8, refer to Table 3-7 on page 94:
https://www.unicode.org/versions/Unicode6.0.0/ch03.pdf


And for the BOM: Unicode allows a BOM but doesn't require it. This means the application has to be able to handle a BOM appropriately if it gets one. But it must not rely on getting a BOM. And especially for UTF-8 the BOM is rather meaningless. BOM is for Byte Order Mark. But there is no ambiguity in the byte order of UTF-8, whatever endianness a system works with.
However, receiving the BOM is just nice for us, because we know from the very beginning that the stream is UTF-8 encoded. But that's rather an issue on Windows. *nix-like systems work with a tacit agreement that everything is UTF-8 by default. (At least I didn't find anything in the POSIX standard. It only requires the so-called Portable Character Set that ends at the same point as 7-bit ASCII.)


Steffen

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

#117 Post by penpen » 08 Aug 2020 00:34

jfl wrote:
07 Aug 2020 09:55
Actually I'm pretty sure cmd.exe does something like this internally, because it can display any unicode character in any code page, for example when you do a dir.
I'm not sure if you misunderstood my comment.
I meant to say that there is no automatic cleanup.
You seem to have understood that MS uses something like Unicode internally and uses that function to switch (which I would agree with).
In detail, cmd.exe (and Windows in general) uses UCS-2, which pretty much is Unicode version 1.0 without surrogate codepoint pairs; instead both (low and high surrogate codepoints) are treated as custom characters, which don't need to appear in pairs.

aGerman wrote:
I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself.
Yes, that's what I was talking about. And yes it's a Windows-specific C runtime extension which doesn't make it necessary to restore the old settings.
The cmd.exe uses UCS-2 internally, so it is not affected by that setting, but that doesn't mean that cleanup isn't necessary.
It is necessary to restore the old settings (and don't forget to flush the input stream before setting and resetting), because otherwise you may leave other executables that might have called your tools in an undefined state, because they expect their settings to be true all the time, which would be false if you don't reset it.

jfl wrote:
07 Aug 2020 09:55
Actually I should have said "Microsoft tools are abandoning it", and UTF-8 without BOM is the de-facto standard on Unix and the Web:
  • Notepad now defaults to UTF-8 without a BOM. (And they're probably right, because most Unix tools generate UTF-8 text without a BOM.)
  • The MSVC compiler always choked on C files beginning with a BOM.
  • HTML UTF-8 files always specify their encoding with a <meta> tag, never with a BOM.
  • Etc
There are only two reasons for using a BOM:
(1) Unknown Byte Order.
(2) Unknown Character Set.

As long as both are false (which is true for source files which are expected to use UTF-8) you should assume to read the Unicode Character 'ZERO WIDTH NO-BREAK SPACE', which has a different semantics, hence the choke, which is how the Unicode Standard expects Applications to handle that character - therefore the choke is not really a choke, but expected behaviour that is implemented correctly.

Files that are meant to be exchanged between computers have that BOM, because cause (2) is true then.
Therefore *.sln, *.vcproj and *.filters either have the BOM or are XML files (in case they use UTF-8, they should have no BOM).

aGerman wrote:So maybe I just still misunderstand it, but obviously conv expects to read ANSI-encoded (1252) text from the pipe.
You are right on that.
For example you saved the character 'à' in codepage 1252, which results in byte 0xE0.
Then cmd.exe reads it using cp 850, which results in character 'Ó'.
In case it is piped to "conv" it is again byte 0xE0, which is read by "conv" as an 'à',
which is probably caused by using cp 1252 again.

In case of cp 437 you get an 'α' instead of that 'Ó'.
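
The same chain can be checked with Python's codecs. This is just an illustration of the byte-to-glyph mappings described above, with variable names made up for this example; it says nothing about how cmd.exe or conv work internally.

```python
original = "Jean-François habite à Grenoble"

# Bytes as stored in the batch file, which was saved in Windows-1252.
saved = original.encode("cp1252")

seen_with_850 = saved.decode("cp850")    # interpretation under OEM code page 850
seen_with_437 = saved.decode("cp437")    # interpretation under OEM code page 437
seen_with_1252 = saved.decode("cp1252")  # interpretation as ANSI 1252 again

for label, text in (("850", seen_with_850), ("437", seen_with_437), ("1252", seen_with_1252)):
    print(label, text)
```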


penpen

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

#118 Post by aGerman » 08 Aug 2020 04:08

penpen wrote:
08 Aug 2020 00:34
The cmd.exe uses UCS-2 internally, so it is not affected by that setting, but that doesn't mean that cleanup isn't necessary.
It is necessary to restore the old settings (and don't forget to flush the input stream before setting and resetting), because otherwise you may leave other executables that might have called your tools in an undefined state, because they expect their settings to be true all the time, which would be false if you don't reset it.
Our apps communicate directly with conhost which acts as a terminal. The cmd is only the shell that may call them, but any other shell (e.g. explorer) might be the shell to call it. So, cmd has basically nothing to do with it. (It gets involved for pipes and redirections, but that's not the point if we are talking about writing to the console.)
_setmode influences the passed file stream. This is held by our application. Even if stdout is linked to the console output buffer, no configuration on the console is made. Jean-François is right on that. That's the reason why this setting only lasts as long as the process lives, and is only affecting the stream buffer of our applications. As long as we don't try to write multibyte streams and wide streams alternately you neither need to reset the configuration nor do you need to flush the stdout beforehand. However, it doesn't hurt of course :wink:

Steffen

jfl
Posts: 226
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

#119 Post by jfl » 17 Aug 2020 10:29

aGerman wrote:
So maybe I just still misunderstand it, but obviously conv expects to read ANSI-encoded (1252) text from the pipe.
You're right. I was convinced of the contrary, but indeed after looking at the source, this is the way it was. :shock:
I had forgotten about that, but I had changed it that way, because one of the most common cases where I need conv.exe is when some other command outputs something in the wrong code page for the console. I fix it by repeating the command, with "| conv" appended in the end. So defaulting to the current console code page on input made no sense, as if it's correct, you don't need conv.exe.

Anyway, I've now pushed on github an updated version of conv.c, implementing the heuristics for detecting UTF-8 and UTF-16 without BOM. This works quite nicely in my testing.
For UTF-8, it eliminates bytes \xF8-\xFF as suggested by penpen. But it also eliminates \xC0, \xC1, \xF5, \xF6, \xF7 which I just learned were also invalid UTF-8.
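
A sketch of that pre-check in Python (the names are hypothetical and this is not the code from conv.c): the bytes C0, C1 and F5 to FF never occur in well-formed UTF-8, so finding one of them rules UTF-8 out before any full validation is attempted.

```python
# Byte values that cannot appear anywhere in well-formed UTF-8.
NEVER_IN_UTF8 = frozenset([0xC0, 0xC1]) | frozenset(range(0xF5, 0x100))


def has_impossible_utf8_byte(data):
    """True if data contains a byte that well-formed UTF-8 never uses."""
    return any(byte in NEVER_IN_UTF8 for byte in data)


print(has_impossible_utf8_byte("à € 😀".encode("utf-8")))
print(has_impossible_utf8_byte("ÿ".encode("cp1252")))
```

A False result does not prove the data is UTF-8; it only means this quick test found no reason to reject it.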

For the default output type when writing to files, I've chosen to stick to ANSI for Windows versions up to Windows 10 build 18297, and change that default to UTF-8 for Windows 10 build 18298 and later (released in December 2018). This build was the first one where Notepad itself defaulted to using UTF-8 for new files.

I've not yet made a new release of my System Tools Library, but if you want to give it a try, I've put my latest conv.exe build there.

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

#120 Post by aGerman » 17 Aug 2020 13:10

After confirming all browser warnings, Windows Defender still reports the downloaded binary and immediately removes it. So, I wasn't even able to try it out :( However, I had a look into the source file. What still isn't clear to me is after how many bytes you actually stop trying to figure out what encoding you got.

For performance reasons I'll probably not do too much of that in CONVERTCP. Although I'm still considering checking the file type of the output handle and defaulting to wide strings using WriteConsoleW if it is linked to a text device. I just have to run a couple of tests to see how it works if launched within the Windows Terminal.

Steffen
