CONVERTCP.exe - Convert text from one code page to another
Moderator: DosItHelp
Re: CONVERTCP.exe - Convert text from one code page to another
Steffen,
does ConvertCP have a built-in feature that tells me which version it is?
Can't find anything of the kind in the tool's syntax, but maybe there's an easy workaround?
The "ConvertCP.exe" file which I occasionally use, could be any version...
It might be easier to check the version of my copy than to download the most recent online version to make sure that I accidentally didn't forget to fetch your latest release!
BG
Re: CONVERTCP.exe - Convert text from one code page to another
Either run CONVERTCP /? and find it in the first line of the help message, or look it up in the Details list of the file properties.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
Wow, hard to believe I missed that!
The 1st line in my DOS box is always "Usage" (which is the 3rd output line for /?), but still...
Thx,
BG
Re: CONVERTCP.exe - Convert text from one code page to another
Hi Steffen,
I used your tool to convert the code page of an old file I had made from a previous pc that had saved in a format my current pc didn't have the code page for. It worked a charm - thank you very much!
I'm not sure what I would be able to do if I couldn't remember or guess the original file's code page, though. I was wondering what people use to find that out.
Re: CONVERTCP.exe - Convert text from one code page to another
Thanks for your kind feedback!
The wiki contains a paragraph about that.
https://sourceforge.net/p/convertcp/wiki/Home/#c4ac
As you can see there is already quite a bit you can do to figure it out. Although if we're talking about single-byte codepages it's for sure a pain to figure it out.
So, my suggestion in such cases is to use a HEX editor. If a character shows up as a wrong glyph in a normal text editor and you know what glyph it should actually be, you can use the HEX editor to get the value of the byte in question. Unfortunately, you then have to search the codepages manually for the one that maps this byte value to the right glyph. I know there are plenty of codepages, but probably only a few of them are reasonable for your environment/language.
Since I don't know which codepage was the culprit in your case, I'll just provide an example in my mother tongue:
If I execute the PAUSE command in a cmd window, the output looks like that:
Drücken Sie eine beliebige Taste . . .
As you can see the third character isn't ASCII. If I redirect the output into a file ...
Code: Select all
pause > test.txt
... and open test.txt in a text editor like notepad then I'm seeing ...
Drcken Sie eine beliebige Taste . . .
Well, of course I already know that this is because the default OEM codepage used in the console window is 850, while the default ANSI codepage used in notepad is 1252 on my computer. However, to guide you through the steps to go if you don't already know it I'll pretend to see the content of this file the first time in my life.
I want to see the "ü" because I know this is the right glyph in the word "Drücken" but it doesn't show up. If I open test.txt in a HEX editor then I get this:
44 72 81 63 6B 65 6E 20 53 69 65 20 65 69 6E 65 20 62 65 6C 69 65 62 69 67 65 20 54 61 73 74 65 20 2E 20 2E 20 2E 20 0D 0A
Third byte should represent the "ü". Its value is 81.
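If no HEX editor is at hand, the same lookup can be sketched in a few lines of Python. This is only an illustration and not part of CONVERTCP; the helper name non_ascii_bytes is made up here. It takes the bytes from the dump above and lists every byte outside the 7-bit ASCII range, which are the only bytes that differ between single-byte codepages.

```python
from typing import List, Tuple

# Bytes of test.txt exactly as listed in the hex dump above.
HEX_DUMP = (
    "44 72 81 63 6B 65 6E 20 53 69 65 20 65 69 6E 65 20 62 65 6C 69 65 "
    "62 69 67 65 20 54 61 73 74 65 20 2E 20 2E 20 2E 20 0D 0A"
)


def non_ascii_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, value) for every byte outside the 7-bit ASCII range."""
    return [(offset, value) for offset, value in enumerate(data) if value >= 0x80]


data = bytes.fromhex(HEX_DUMP)  # fromhex ignores the spaces between byte pairs
suspects = non_ascii_bytes(data)
for offset, value in suspects:
    print(f"offset {offset}: 0x{value:02X}")
```

For a real file, reading it with open("test.txt", "rb").read() instead of the hard-coded dump should give the same kind of list.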
What's next? First you should look for codepages that are likely to be used for German language. You can use the list provided by Microsoft
https://docs.microsoft.com/en-us/window ... dentifiers
or you can execute CONVERTCP /L to get the list of installed codepages on your computer.
So in my case I found
850 OEM Multilingual Latin 1; Western European (DOS) and
1252 ANSI Latin 1; Western European (Windows)
being reasonable.
Next step could be to just try them out using CONVERTCP or you might walk through the actual character maps. Let's just do the latter in order to give you an idea of how to find the right codepage by hand.
Taking 1252 first:
https://en.wikipedia.org/wiki/Windows-1 ... racter_set
We are looking for byte 81. Going to row 8_ and column _1 leads us to ... nothing. This means no character is represented by byte 81 in codepage 1252. So, this can't be the codepage we are looking for.
OK, let's try 850 next:
https://en.wikipedia.org/wiki/Code_page ... racter_set
Now we repeat - going to row 8_ and column _1 leads us to "ü".
This indicates that test.txt is CP850 encoded. And indeed this is the case here.
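The table walk above can also be sketched in Python, assuming that Python's codec tables cp1252 and cp850 are an acceptable stand-in for the Windows codepages (Windows itself may treat unassigned bytes differently). The helper name decode_byte is hypothetical.

```python
from typing import Optional

CANDIDATES = ["cp1252", "cp850"]  # the two codepages considered above


def decode_byte(value: int, codec: str) -> Optional[str]:
    """Return the character that `codec` assigns to `value`, or None if unassigned."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


# Byte 0x81 is the one that should show up as "ü".
results = {codec: decode_byte(0x81, codec) for codec in CANDIDATES}
for codec, char in results.items():
    print(codec, repr(char))
```

A candidate whose result is the expected glyph is a plausible source codepage; a result of None or a different glyph rules it out.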
The conclusion is: to avoid this pain, always use an encoding for your text files that fully supports Unicode, such as UTF-8 or UTF-16.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
@aGerman
Hi Steffen,
In my own code page conversion tool conv.exe, I use a heuristic for selecting a default output code page, that works very well in practice:
- If stdout is a console, then switch it to 16-bits mode, and output UTF-16. (This allows displaying all Unicode characters, whatever the current code page.)
- Else if stdout is a pipe, then use the current console code page, from GetConsoleOutputCP(). (All Microsoft tools expect that encoding for data passed in pipes.)
- Else it's a file. Use the default system code page, from GetACP(). (This was the default encoding used by GUI applications like Notepad.)
And for the input code page, the heuristic for selecting a default is:
- Check if the input is valid UTF-8 or UTF-16. If so, use that. (UTF-8 is becoming a more and more common encoding, and it's easy to recognize reliably.)
- Else if stdin is a console or a pipe, then use the current console code page, from GetConsoleOutputCP(). (Same reason as for stdout)
- Else it's a file. Use the default system code page, from GetACP(). (Same reason as for stdout)
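The rules listed above can be restated as a small decision function. The following is only a sketch in Python, not the code of conv.exe: the names Stream, default_output_encoding and default_input_encoding are hypothetical, and the console and system code pages are passed in as plain strings instead of being queried with GetConsoleOutputCP() and GetACP().

```python
from enum import Enum
from typing import Optional


class Stream(Enum):
    CONSOLE = "console"
    PIPE = "pipe"
    FILE = "file"


def default_output_encoding(stdout_kind: Stream, console_cp: str, system_cp: str) -> str:
    """Pick the default output encoding from what stdout is connected to."""
    if stdout_kind is Stream.CONSOLE:
        return "utf-16"      # wide output, independent of the code page
    if stdout_kind is Stream.PIPE:
        return console_cp    # what other console tools expect in a pipe
    return system_cp         # a file: the ANSI code page


def default_input_encoding(stdin_kind: Stream, detected_unicode: Optional[str],
                           console_cp: str, system_cp: str) -> str:
    """Pick the default input encoding; a detected UTF-8/UTF-16 wins."""
    if detected_unicode is not None:
        return detected_unicode
    if stdin_kind in (Stream.CONSOLE, Stream.PIPE):
        return console_cp
    return system_cp


print(default_output_encoding(Stream.PIPE, "cp437", "cp1252"))
print(default_input_encoding(Stream.FILE, None, "cp437", "cp1252"))
```

Keeping the decision separate from the Windows API calls makes each rule easy to test on its own.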
For example, these four commands display French text correctly, whereas the type command gets it right in only one case:
Code: Select all
C:\JFL\Temp>chcp
Active code page: 437
C:\JFL\Temp>type test.txt
Jean-Franτois habite α Grenoble
C:\JFL\Temp>type test.txt | conv
Jean-François habite à Grenoble
C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>conv < test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>conv < test.txt | 2clip
C:\JFL\Temp>1clip
Jean-François habite à Grenoble
C:\JFL\Temp>chcp 1252
Active code page: 1252
C:\JFL\Temp>type test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>type test.txt | conv
Jean-François habite à Grenoble
C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>conv < test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>conv < test.txt | 2clip
C:\JFL\Temp>1clip
Jean-François habite à Grenoble
C:\JFL\Temp>chcp 65001
Active code page: 65001
C:\JFL\Temp>type test.txt
Jean-Fran�ois habite � Grenoble
C:\JFL\Temp>type test.txt | conv
Jean-François habite à Grenoble
C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>conv < test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>conv < test.txt | 2clip
C:\JFL\Temp>1clip
Jean-François habite à Grenoble
C:\JFL\Temp>
The text file was created in Notepad, and is encoded in the system code page (1252 on my system), which is different from the default console code page (437 on my system).
Re: CONVERTCP.exe - Convert text from one code page to another
Hi Jean-François,
It wouldn't have helped c160704 to define defaults, since they didn't know the encoding of the incoming file.
However, I have also thought a lot about defaults. Thanks for your suggestions. They are definitely worth considering again.
FWIW, it's not that CONVERTCP doesn't support defaults. You can pass 0 for the ACP and 1 for the OEMCP, since the Windows conversion functions support them out of the box. But, yes, you still have to pass them explicitly to CONVERTCP.
Let's just go through your suggestions step by step:
jfl wrote: "If stdout is a console, then switch it to 16-bits mode, and output UTF-16. (This allows displaying all Unicode characters, whatever the current code page.)"
Good idea. You still have to condition the console window for UTF-16 output and restore the old behavior when you're done. But that's at least reasonable.
jfl wrote: "Else if stdout is a pipe, then use the current console code page, from GetConsoleOutputCP(). (All Microsoft tools expect that encoding for data passed in pipes.)"
Yes, you're probably right.
jfl wrote: "Else it's a file. Use the default system code page, from GetACP(). (This was the default encoding used by GUI applications like Notepad.)"
Oh. I would avoid that like the plague. Even Microsoft realized that ANSI codepages are evil, and they changed the behavior of Notepad on Win10 to save files in UTF-8 (without BOM) by default, rather than using the default ANSI codepage. Thus, I'd really favor an encoding that fully supports Unicode, preferably UTF-8. I think we shouldn't make the same mistakes that Microsoft made in the past. In the best case, conversion tools like ours should become obsolete one day.
jfl wrote: "And for the input code page, the heuristic for selecting a default is: Check if the input is valid UTF-8 or UTF-16. If so, use that. (UTF-8 is becoming a more and more common encoding, and it's easy to recognize reliably.)"
If the input has a BOM then it's easy indeed. Otherwise I have my doubts.
- In case of UTF-8 all ASCII characters still stay the same. It might be easy for any other language than English. But English is ASCII only and it may take thousands of characters before the first non-ASCII character appears.
- In case of UTF-16 you can check for the typical alternately appearing zero bytes. But in CJK languages you'll be out of luck.
Seriously, how do you handle this in your source code?
jfl wrote: "Else if stdin is a console or a pipe, then use the current console code page, from GetConsoleOutputCP(). (Same reason as for stdout)"
OK, but at least for the pipe you're not doing it. I tested a few minutes ago and it seems you're using the ACP along with conv.
jfl wrote: "Else it's a file. Use the default system code page, from GetACP(). (Same reason as for stdout)"
Way too vague IMHO. Especially since most text editors use UTF-8 by default nowadays. I'm really not sure if I would even consider allowing a default here.
Looking forward to your feedback.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
Hi Steffen,
aGerman wrote: "You still have to condition the console window for UTF-16 output and restore the old behavior if you're done."
It's the stdout file that I switch to 16-bits mode, not the console. This is done using the C library function _setmode(fileno(stdout), _O_WTEXT).
I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself.
aGerman wrote: "I'd really favor an encoding that fully supports Unicode, preferably UTF-8."
Agreed, I need to change that in my code, at least for Windows 10.
One possible refinement would be to obey Notepad's own default encoding, stored in the registry in HKCU\Software\Microsoft\Notepad\iDefaultEncoding.
aGerman wrote: "If the input has a BOM then it's easy indeed. Otherwise I have my doubts."
Indeed the BOM is a strong hint... But unfortunately it's being abandoned.
Fortunately, a byte stream with bytes from \x80 to \xFF can be reliably validated as UTF-8 or not. The probability of a false positive is not 0, but it's very very low. The more non-ASCII bytes, the lower the probability.
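As an illustration of that claim, here is a minimal sketch in Python (not the code of conv.exe; the function name looks_like_utf8 is made up). It assumes the whole input is available, so a multi-byte sequence is never cut off at the end of a buffer, and it treats pure ASCII as "no evidence" rather than as a positive result.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data is valid UTF-8 and contains at least one non-ASCII byte."""
    if all(b < 0x80 for b in data):
        return False  # pure ASCII: valid UTF-8, but it proves nothing
    try:
        data.decode("utf-8")  # strict decoding rejects any malformed sequence
    except UnicodeDecodeError:
        return False
    return True


text = "Jean-François habite à Grenoble"
utf8_sample = text.encode("utf-8")
ansi_sample = text.encode("cp1252")

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(ansi_sample))
```

The ANSI sample fails because its byte E7 announces a three-byte sequence but is followed by the ASCII letter "o", which is not a continuation byte.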
aGerman wrote: "it may take thousands of characters before the first non-ASCII character appears."
Correct. It's a weakness of my conv program, which reads the whole file in memory. So it can't convert huge files that don't fit.
Again we're talking about heuristics for choosing defaults. We know we have to accept a (hopefully small) proportion of errors.
Using a large buffer, and testing the first part of the file, would give a good default in most cases.
aGerman wrote: "In case of UTF-16 you can check for the typical alternately appearing zero bytes. But in CJK languages you'll be out of luck."
Correct. I have very little experience with CJK languages, but I hope that there are occasional spaces or digits that may help.
Note that the only UTF-16 files I usually have to deal with are Windows own log files, and registry exports. These contain mostly ASCII anyway, even on CJK versions of Windows.
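The zero-byte check discussed here could look roughly like the Python sketch below. It is an assumption-laden illustration, not the conv.exe implementation: the function name, the 40% threshold and the factor of 10 are arbitrary choices made for the example.

```python
from typing import Optional


def guess_utf16_by_zero_bytes(data: bytes, threshold: float = 0.4) -> Optional[str]:
    """Guess 'utf-16-le', 'utf-16-be' or None from where the zero bytes sit."""
    pairs = len(data) // 2
    if pairs == 0:
        return None
    even_zeros = data[0:pairs * 2:2].count(0)  # offsets 0, 2, 4, ...
    odd_zeros = data[1:pairs * 2:2].count(0)   # offsets 1, 3, 5, ...
    # Mostly-ASCII UTF-16LE: the high (odd-offset) byte of each unit is zero.
    if odd_zeros >= threshold * pairs and even_zeros * 10 < odd_zeros:
        return "utf-16-le"
    # Mostly-ASCII UTF-16BE: the high byte comes first, at even offsets.
    if even_zeros >= threshold * pairs and odd_zeros * 10 < even_zeros:
        return "utf-16-be"
    return None


print(guess_utf16_by_zero_bytes("Hello, world\r\n".encode("utf-16-le")))
print(guess_utf16_by_zero_bytes("Hello, world\r\n".encode("utf-16-be")))
print(guess_utf16_by_zero_bytes("日本語のテキスト".encode("utf-16-le")))
```

The last call shows the CJK problem mentioned above: that sample contains no zero bytes at all, so the heuristic has nothing to work with.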
aGerman wrote: "how do you handle this in your source code"
Well, actually only the BOM detection is implemented in conv.c. The UTF-8 / UTF-16 validation has been on my to-do list for years, but it's never made it to the top.
Finally note that I did experiment with the COM API IMultiLanguage2::DetectInputCodepage()... But the results were very poor: It's slower, and wrong more often than my current heuristics.
aGerman wrote: "OK, but at least for the pipe you're not doing it. I tested a few minutes ago and it seems you're using the ACP along with conv"
Yes I do! Notice how the two non-ASCII characters are changed when conv pipes that ANSI file to dump.exe, versus dumping the file directly:
Code: Select all
C:\JFL\Temp>chcp
Active code page: 437
C:\JFL\Temp>conv test.txt
Jean-François habite à Grenoble
C:\JFL\Temp>dump test.txt
Offset 00 04 08 0C 0 4 8 C
-------- ----------- ----------- ----------- ----------- -------- --------
00000000 4A 65 61 6E 2D 46 72 61 6E E7 6F 69 73 20 68 61 Jean-Fra n�ois ha
00000010 62 69 74 65 20 E0 20 47 72 65 6E 6F 62 6C 65 0D bite � G renoble
00000020 0A
C:\JFL\Temp>conv test.txt | dump
Offset 00 04 08 0C 0 4 8 C
-------- ----------- ----------- ----------- ----------- -------- --------
00000000 4A 65 61 6E 2D 46 72 61 6E 87 6F 69 73 20 68 61 Jean-Fra n�ois ha
00000010 62 69 74 65 20 85 20 47 72 65 6E 6F 62 6C 65 0D bite � G renoble
00000020 0A
C:\JFL\Temp>codepage 437
Code page 437: OEM - United States (SBCS) ASCII-compatible
80 Ç 90 É A0 á B0 ░ C0 └ D0 ╨ E0 α F0 ≡
81 ü 91 æ A1 í B1 ▒ C1 ┴ D1 ╤ E1 ß F1 ±
82 é 92 Æ A2 ó B2 ▓ C2 ┬ D2 ╥ E2 Γ F2 ≥
83 â 93 ô A3 ú B3 │ C3 ├ D3 ╙ E3 π F3 ≤
84 ä 94 ö A4 ñ B4 ┤ C4 ─ D4 ╘ E4 Σ F4 ⌠
85 à 95 ò A5 Ñ B5 ╡ C5 ┼ D5 ╒ E5 σ F5 ⌡
86 å 96 û A6 ª B6 ╢ C6 ╞ D6 ╓ E6 µ F6 ÷
87 ç 97 ù A7 º B7 ╖ C7 ╟ D7 ╫ E7 τ F7 ≈
88 ê 98 ÿ A8 ¿ B8 ╕ C8 ╚ D8 ╪ E8 Φ F8 °
89 ë 99 Ö A9 ⌐ B9 ╣ C9 ╔ D9 ┘ E9 Θ F9 ∙
8A è 9A Ü AA ¬ BA ║ CA ╩ DA ┌ EA Ω FA ·
8B ï 9B ¢ AB ½ BB ╗ CB ╦ DB █ EB δ FB √
8C î 9C £ AC ¼ BC ╝ CC ╠ DC ▄ EC ∞ FC ⁿ
8D ì 9D ¥ AD ¡ BD ╜ CD ═ DD ▌ ED φ FD ²
8E Ä 9E ₧ AE « BE ╛ CE ╬ DE ▐ EE ε FE ■
8F Å 9F ƒ AF » BF ┐ CF ╧ DF ▀ EF ∩ FF
C:\JFL\Temp>codepage 1252
Code page 1252: ANSI - Latin I (SBCS) ASCII-compatible
80 € 90 A0 B0 ° C0 À D0 Ð E0 à F0 ð
81 91 ‘ A1 ¡ B1 ± C1 Á D1 Ñ E1 á F1 ñ
82 ‚ 92 ’ A2 ¢ B2 ² C2 Â D2 Ò E2 â F2 ò
83 ƒ 93 “ A3 £ B3 ³ C3 Ã D3 Ó E3 ã F3 ó
84 „ 94 ” A4 ¤ B4 ´ C4 Ä D4 Ô E4 ä F4 ô
85 … 95 • A5 ¥ B5 µ C5 Å D5 Õ E5 å F5 õ
86 † 96 – A6 ¦ B6 ¶ C6 Æ D6 Ö E6 æ F6 ö
87 ‡ 97 — A7 § B7 · C7 Ç D7 × E7 ç F7 ÷
88 ˆ 98 ˜ A8 ¨ B8 ¸ C8 È D8 Ø E8 è F8 ø
89 ‰ 99 ™ A9 © B9 ¹ C9 É D9 Ù E9 é F9 ù
8A Š 9A š AA ª BA º CA Ê DA Ú EA ê FA ú
8B ‹ 9B › AB « BB » CB Ë DB Û EB ë FB û
8C Œ 9C œ AC ¬ BC ¼ CC Ì DC Ü EC ì FC ü
8D 9D AD BD ½ CD Í DD Ý ED í FD ý
8E Ž 9E ž AE ® BE ¾ CE Î DE Þ EE î FE þ
8F 9F Ÿ AF ¯ BF ¿ CF Ï DF ß EF ï FF ÿ
C:\JFL\Temp>
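The byte changes visible in the two dumps (E7 becomes 87, E0 becomes 85) are what a conversion from code page 1252 to code page 437 produces. A short Python sketch, assuming Python's cp1252 and cp437 codecs match the tables printed above, reproduces them:

```python
text = "Jean-François habite à Grenoble\r\n"

ansi_bytes = text.encode("cp1252")  # the file as saved by Notepad
oem_bytes = text.encode("cp437")    # the same text re-encoded for the console

# List every offset where the two encodings disagree.
changed = [
    (offset, a, b)
    for offset, (a, b) in enumerate(zip(ansi_bytes, oem_bytes))
    if a != b
]
for offset, a, b in changed:
    print(f"offset 0x{offset:02X}: {a:02X} -> {b:02X}")
```

Only the two accented letters differ; every ASCII byte is identical in both code pages.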
aGerman wrote: "I'm really not sure if I would even consider to allow a default here."
All I can tell is that it works well for me in most cases.
Re: CONVERTCP.exe - Convert text from one code page to another
jfl wrote: ↑07 Aug 2020 03:17 "It's the stdout file that I switch to 16-bits mode, not the console. This is done using the C library function _setmode(fileno(stdout), _O_WTEXT). I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself." (in reply to aGerman: "You still have to condition the console window for UTF-16 output and restore the old behavior if you're done.")
Typically C, C++, ANSI-C and ANSI-C++ don't do that and there is no hint in the documentation for assuming otherwise, see:
https://docs.microsoft.com/en-us/cpp/c- ... ew=vs-2019.
jfl wrote: "Indeed the BOM is a strong hint... But unfortunately it's being abandoned."
Where have you heard that?
I would say that claim is wrong; see the current Unicode Version 13.0.0, which still specifies the BOM (chapter 23, subsection "23.8 Specials"):
https://www.unicode.org/versions/Unicode13.0.0/ch23.pdf
jfl wrote: "Fortunately, a byte stream with bytes from \x80 to \xFF can be reliably validated as UTF-8 or not."
Although bytes xF8-xFF are considered valid in UTF-8, they are currently not in use (not enough existing Unicode characters).
Therefore they are currently a strong hint that the document is not UTF-8.
Also note that there are multiple codepages for which most, if not all, bytes in x80-xFF are valid characters.
An example of that is codepage 850, where all bytes are valid characters of that codepage.
penpen
Re: CONVERTCP.exe - Convert text from one code page to another
I know it's very rarely used. It's Microsoft C specific, hence the underscore ahead of the function name _setmode().
But it's ideal for tools wanting to output Unicode into the Windows console, independently of the code page.
Actually I'm pretty sure cmd.exe does something like this internally, because it can display any unicode character in any code page, for example when you do a dir.
It's still in the Unicode standard, but fewer and fewer people seem to be using it.
Actually I should have said "Microsoft tools are abandoning it", and UTF-8 without BOM is the de-facto standard on Unix and the Web:
- Notepad now defaults to UTF-8 without a BOM. (And they're probably right, because most Unix tools generate UTF-8 text without a BOM.)
- The MSVC compiler always choked on C files beginning with a BOM.
- HTML UTF-8 files always specify their encoding with a <meta> tag, never with a BOM.
- Etc
Good idea, I'll use that fact when I implement the UTF-8 detection.
Re: CONVERTCP.exe - Convert text from one code page to another
jfl wrote: "It's the stdout file that I switch to 16-bits mode, not the console. This is done using the C library function _setmode(fileno(stdout), _O_WTEXT). I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself."
Yes, that's what I was talking about. And yes it's a Windows-specific C runtime extension which doesn't make it necessary to restore the old settings. The console output buffer supports wide characters by default. So, after thinking twice, I think I would probably skip this C hack and rely on the console API (WriteConsoleW).
jfl wrote: "Yes I do! Notice how the two non-ASCII characters are changed when conv pipes that ANSI file to dump.exe, versus dumping the file directly"
I'm still struggling with it.
This short batch code is saved in Windows-1252 (which is also my default ACP).
Code: Select all
@echo off &setlocal
chcp
echo Jean-François habite à Grenoble|conv
echo Jean-François habite à Grenoble
pause <nul|conv
pause
Output:
Aktive Codepage: 850.
Jean-François habite à Grenoble
Jean-Franþois habite Ó Grenoble
Drcken Sie eine beliebige Taste . . .
Drücken Sie eine beliebige Taste . . .
So maybe I still just misunderstand it, but obviously conv expects to read ANSI-encoded (1252) text from the pipe.
As to valid UTF-8, refer to Table 3-7 on page 94:
https://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
And for the BOM: Unicode allows a BOM but doesn't require it. This means the application has to be able to handle a BOM appropriately if it gets one. But it must not rely on getting a BOM. And especially for UTF-8 the BOM is rather meaningless. BOM is for Byte Order Mark. But there is no ambiguity in the byte order of UTF-8, whatever endianness a system works with.
However, receiving the BOM is just nice for us, because we know from the very beginning that the stream is UTF-8 encoded. But that's rather an issue on Windows. Unix-like systems work with a silent agreement that everything is UTF-8 by default. (At least I didn't find anything in the POSIX standard. It only requires the so-called Portable Character Set that ends at the same point as 7-bit ASCII.)
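To make "handle a BOM if you get one, but don't rely on it" concrete, here is a small Python sketch. It is an illustration only and covers just the UTF-8 and UTF-16 signatures discussed in this thread; the function name detect_bom is made up.

```python
import codecs
from typing import Optional, Tuple

# Signatures for the encodings discussed in this thread.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


sample = codecs.BOM_UTF8 + "Drücken".encode("utf-8")
encoding, skip = detect_bom(sample)
if encoding is not None:
    print(encoding, sample[skip:].decode(encoding))
```

When no BOM is found the caller still has to fall back to some other heuristic or an explicit code page.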
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
I'm not sure if you misunderstood my comment.
I meant to say that there is no automatic cleanup.
You seem to have understood that MS uses something like Unicode internally and uses that function to switch (which I would agree with).
In detail, cmd.exe (and Windows in general) uses UCS-2, which pretty much is Unicode version 1.0 without surrogate code point pairs; instead, both low and high surrogate code points are treated as custom characters, which don't need to appear in pairs.
aGerman wrote: "Yes, that's what I was talking about. And yes it's a Windows-specific C runtime extension which doesn't make it necessary to restore the old settings." (in reply to jfl: "I suppose that the cleanup is done automatically when the program exits, because I don't do any cleanup myself.")
The cmd.exe uses UCS-2 internally, so it is not affected by that setting, but that doesn't mean that cleanup isn't necessary.
It is necessary to restore the old settings (and don't forget to flush the input stream before setting and resetting), because otherwise you may leave other executables that might have called your tools in an undefined state: they expect their settings to hold all the time, which would no longer be the case if you don't reset them.
jfl wrote: ↑07 Aug 2020 09:55 "Actually I should have said "Microsoft tools are abandoning it", and UTF-8 without BOM is the de-facto standard on Unix and the Web:
- Notepad now defaults to UTF-8 without a BOM. (And they're probably right, because most Unix tools generate UTF-8 text without a BOM.)
- The MSVC compiler always choked on C files beginning with a BOM.
- HTML UTF-8 files always specify their encoding with a <meta> tag, never with a BOM.
- Etc"
There are only two reasons for using a BOM:
(1) Unknown Byte Order.
(2) Unknown Character Set.
As long as neither applies (as is the case for source files, which are expected to use UTF-8), you should assume that you are reading the Unicode character 'ZERO WIDTH NO-BREAK SPACE', which has different semantics. Hence the choke, which is how the Unicode Standard expects applications to handle that character. So the choke is not really a choke, but expected behaviour that is implemented correctly.
Files that are meant to be exchanged between computers have that BOM, because reason (2) applies then.
Therefore *.sln, *.vcproj and *.filters files either have the BOM or are XML files (in case they use UTF-8, they should have no BOM).
aGerman wrote: "So maybe I just still missunderstand it, but obviously conv expects to read ANSI-encoded (1252) text from the pipe."
You are right on that.
For example, you saved the character 'à' in codepage 1252, which results in byte 0xE0.
Then cmd.exe reads it using cp 850, which results in character 'Ó'.
In case it is piped to "conv" it is again byte 0xE0, which "conv" reads as an 'à',
which is probably caused by using cp 1252 again.
In case of cp 437 you get an 'α' instead of that 'Ó'.
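The three readings of the same bytes can be reproduced with a few lines of Python, assuming Python's codecs cp1252, cp850 and cp437 match the Windows codepages for these characters. This is only an illustration of the effect, not of how cmd.exe or conv work internally.

```python
# The line as stored in the batch file, i.e. encoded in Windows-1252.
raw = "Jean-François habite à Grenoble".encode("cp1252")

# Interpret the very same bytes under three different codepages.
views = {cp: raw.decode(cp) for cp in ("cp1252", "cp850", "cp437")}
for cp, shown in views.items():
    print(f"{cp:7} {shown}")
```

The bytes never change; only the table used to turn them into glyphs does.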
penpen
Re: CONVERTCP.exe - Convert text from one code page to another
penpen wrote: ↑08 Aug 2020 00:34 "The cmd.exe uses UCS-2 internally, so it is not affected by that setting, but that doesn't mean that cleanup isn't necessary. It is necessary to restore the old settings (and don't forget to flush the input stream before setting and resetting), because otherwise you may leave other executables that might have called your tools in an undefined state, because they expect their settings to be true all the time, which would be false if you don't reset it."
Our apps communicate directly with conhost, which acts as a terminal. The cmd is only the shell that may call them, but any other shell (e.g. explorer) might be the shell to call it. So, cmd has basically nothing to do with it. (It gets involved for pipes and redirections, but that's not the point if we are talking about writing to the console.)
_setmode influences the passed file stream. This is held by our application. Even if stdout is linked to the console output buffer, no configuration on the console is made. Jean-François is right on that. That's the reason why this setting only lasts as long as the process lives, and only affects the stream buffer of our applications. As long as we don't try to write multibyte streams and wide streams alternately, you neither need to reset the configuration nor do you need to flush stdout beforehand. However, it doesn't hurt of course.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
aGerman wrote: "So maybe I just still missunderstand it, but obviously conv expects to read ANSI-encoded (1252) text from the pipe."
You're right. I was convinced of the contrary, but indeed after looking at the source, this is the way it was.
I had forgotten about that, but I had changed it that way because one of the most common cases where I need conv.exe is when some other command outputs something in the wrong code page for the console. I fix it by repeating the command, with "| conv" appended at the end. So defaulting to the current console code page on input made no sense: if it's correct, you don't need conv.exe.
Anyway, I've now pushed on github an updated version of conv.c, implementing the heuristics for detecting UTF-8 and UTF-16 without BOM. This works quite nicely in my testing.
For UTF-8, it eliminates bytes \xF8-\xFF as suggested by penpen. But it also eliminates \xC0, \xC1, \xF5, \xF6, \xF7 which I just learned were also invalid UTF-8.
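The list of byte values that can never occur in UTF-8 can be double-checked by brute force. The Python sketch below is not taken from conv.c; it simply encodes every Unicode code point (skipping the surrogates, which cannot be encoded) and reports the byte values that were never produced. It takes a moment to run because it walks the whole code space.

```python
def bytes_never_in_utf8() -> list:
    """Return the byte values that appear in no valid UTF-8 encoding."""
    used = set()
    for code_point in range(0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(code_point).encode("utf-8"))
    return sorted(set(range(256)) - used)


never = bytes_never_in_utf8()
print(" ".join(f"{b:02X}" for b in never))
# prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
```

Any of these 13 byte values in the input is therefore enough to rule out UTF-8 immediately.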
For the default output type when writing to files, I've chosen to stick to ANSI for Windows versions up to Windows 10 build 18297, and to change that default to UTF-8 for Windows 10 build 18298 and later (released in December 2018). This build was the first one where Notepad itself defaulted to using UTF-8 for new files.
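Expressed as code, that rule is a single comparison. The sketch below is in Python with hypothetical names; the build number is passed in as a parameter, since how conv.c actually queries the Windows build is not shown in this thread.

```python
# First Windows 10 build in which Notepad defaults to UTF-8, per the post above.
NOTEPAD_UTF8_BUILD = 18298


def default_file_encoding(windows_build: int, system_cp: str = "cp1252") -> str:
    """Pick the default encoding for file output from the Windows build number."""
    if windows_build >= NOTEPAD_UTF8_BUILD:
        return "utf-8"
    return system_cp  # older systems: stay with the ANSI code page


print(default_file_encoding(18297))
print(default_file_encoding(18298))
```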
I've not yet made a new release of my System Tools Library, but if you want to give it a try, I've put my latest conv.exe build there.
Re: CONVERTCP.exe - Convert text from one code page to another
After confirming all browser warnings, Windows Defender still reports the downloaded binary and immediately removes it. So, I wasn't even able to try it out. However, I had a look into the source file. What still isn't clear to me: after how many bytes do you actually stop trying to figure out what encoding you got?
For performance reasons I'll probably not do too much of that in CONVERTCP. Although I'm still considering checking the file type of the output handle and defaulting to wide strings using WriteConsoleW if it is linked to a text device. I just have to make a couple of tests to see how it works if launched within the Windows Terminal.
Steffen