CONVERTCP.exe - Convert text from one code page to another
Moderator: DosItHelp
Re: CONVERTCP.exe - Convert text from one code page to another
Please have a look at the list of code pages:
https://msdn.microsoft.com/en-us/library/dd317756.aspx
There are already code pages like 037, 500, 1026, 1047, 1140-1149. If you still have some EBCDIC data you may do some tests.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
More or less by accident I found a serious bug that could occur while reading UTF-8. Fixed in v1.3.2.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
UTF-16 big endian is supported with version 1.4.0 (something that batch can't handle natively). Use code page ID 1201.
You can also specify the source and destination files directly using the options /i and /o. Redirections still work, of course.
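For readers wondering what distinguishes UTF-16 big endian from the little endian variant: each 16-bit code unit is stored with its two bytes in the opposite order. The following is only a minimal sketch of that idea in portable C. It is not taken from the CONVERTCP source, and the function name `swap_utf16_bytes` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Swap the two bytes of every UTF-16 code unit in place.
   This turns UTF-16BE into UTF-16LE (and vice versa).
   len is the buffer size in bytes and must be even. */
void swap_utf16_bytes(unsigned char *buf, size_t len)
{
    assert(len % 2 == 0);
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

With the tool itself none of this is needed. Following the pattern of the command shown later in this thread, a call would presumably look like `convertcp 1201 65001 /i "utf16be.txt" /o "utf8.txt"` (the file names are placeholders).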
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
I did a little code profiling over the weekend. The outcome is that threading the conversion isn't as important as I expected. It makes more sense to separate reading from writing on the file system because these are the slow operations. I changed the behavior so that writing is done in a parallel thread while the next chunk of data is being read. Surprisingly, I got the best performance when both converting and writing run together in one thread.
To cut it short: the performance increase is insignificant, but it exists. Thus, I'd like to share it as version 1.4.1.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
Thanks for the update.
Saso
Re: CONVERTCP.exe - Convert text from one code page to another
Great tool. I reduced the executable size to 8 KB. PM sent.
Re: CONVERTCP.exe - Convert text from one code page to another
Thank you Carlos!
I will definitely try some of the compiler options in order to reduce the size of the tool. Unfortunately, the tool you sent me was immediately removed by Avira (free antivirus). There are some good reasons why my tool has a few extra KB. I'll explain them via PM.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
I managed to add Carlos' size improvements. See the comments on the DECREASE_SIZE_GCC macro in the source code. That way the size of the utility was reduced by half (without a noticeable performance increase, though).
In order to preserve cross-compiler support I added a few preprocessor directives for retrieving the arguments UTF-16-encoded.
Since I don't have any experience with this kind of size optimization yet, I would like you to report if the new version causes false positives in your antivirus software.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
After testing at virustotal, the executables uploaded with version 1.4.2 do not cause any findings. I hope this holds true in the real world, too.
Steffen
https://www.virustotal.com/en/file/a8d6 ... 485797283/
https://www.virustotal.com/en/file/7562 ... 485797365/
Re: CONVERTCP.exe - Convert text from one code page to another
Version 1.4.3 adds the ability to append to an existing file using option /a. See the initial post.
Again I checked the executables on virustotal. No false positives detected.
https://www.virustotal.com/en/file/53c0 ... 486055380/
https://www.virustotal.com/en/file/f552 ... 486055433/
As always - the updated file can be found in the initial post of this thread.
Now I'm out of ideas (and am tired of reading the source code repeatedly). I'll archive it and leave it alone.
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
Quoted from there:
viewtopic.php?f=3&t=7703&p=51312#p51310
penpen wrote: I have tested your CONVERTCP utility, and read the source code:
I saw no error, but I noticed that your tool does more than just converting between code pages: it also approximates characters that are not in the target code page (which is not that bad, because cmd.exe does the same, but I would mention it somewhere).
For example, I created a file "string.txt" with this content (I hope it is not corrupted), encoded using UTF-8:
Code: Select all
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩ
If you convert it to code page 850 you get:
Code: Select all
AaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIi
The recommended behaviour I know of for such cases is to use the REPLACEMENT CHARACTER, a question mark, a square, or a question mark in a square.
This is by design and actually wanted behavior.
1) https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx
Code: Select all
int WideCharToMultiByte(
  _In_      UINT    CodePage,
  _In_      DWORD   dwFlags,
  _In_      LPCWSTR lpWideCharStr,
  _In_      int     cchWideChar,
  _Out_opt_ LPSTR   lpMultiByteStr,
  _In_      int     cbMultiByte,
  _In_opt_  LPCSTR  lpDefaultChar,
  _Out_opt_ LPBOOL  lpUsedDefaultChar
);
...
lpDefaultChar [in, optional]
...
For the CP_UTF7 and CP_UTF8 settings for CodePage, this parameter must be set to NULL. Otherwise, the function fails with ERROR_INVALID_PARAMETER.
lpUsedDefaultChar [out, optional]
...
For the CP_UTF7 and CP_UTF8 settings for CodePage, this parameter must be set to NULL. Otherwise, the function fails with ERROR_INVALID_PARAMETER.
...
That means at least for UTF-7 and UTF-8 I'm not even able to define a default character.
I noted this behavior in my first reply to Dave:
viewtopic.php?f=3&t=7570#p50285
2) The reason why I don't even want to work around it is that the utility was requested by miskox. He told me via email:
I 'patched' original .exe to make another .exe version with NOCSZ (that is NOČŠŽ) which replaces ČŠŽĐĆ characters with ordinary CZSDC - depending on the input code page.
That's why I called it "wanted behavior".
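To make the difference between the two behaviors discussed here more concrete, the following is a minimal sketch in portable C. It deliberately does not call WideCharToMultiByte. The table, the function name `encode_char` and its parameters are hypothetical and only illustrate the principle: either a character missing from the target code page is approximated by a similar-looking one (best fit), or it is replaced by a default character such as a question mark.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical best-fit table: a handful of Latin Extended-A letters and
   the ASCII letters they resemble. Real code pages use far larger tables. */
struct best_fit { uint32_t code_point; char approx; };

static const struct best_fit BEST_FIT[] = {
    { 0x0100, 'A' }, { 0x0101, 'a' },   /* Ā ā */
    { 0x010C, 'C' }, { 0x010D, 'c' },   /* Č č */
    { 0x0110, 'D' }, { 0x0111, 'd' },   /* Đ đ */
};

/* Map one Unicode code point to a single byte of an ASCII-only target.
   With allow_best_fit set, known letters are approximated;
   everything else that is not ASCII becomes default_char. */
char encode_char(uint32_t code_point, int allow_best_fit, char default_char)
{
    if (code_point < 0x80)
        return (char)code_point;            /* plain ASCII passes through */
    if (allow_best_fit) {
        for (size_t i = 0; i < sizeof BEST_FIT / sizeof BEST_FIT[0]; ++i)
            if (BEST_FIT[i].code_point == code_point)
                return BEST_FIT[i].approx;  /* approximated character */
    }
    return default_char;                    /* e.g. '?' */
}
```

With `allow_best_fit` set, Č (U+010C) comes out as C, which corresponds to the NOCSZ behavior described above. Without it, the same character comes out as the default character.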
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
I was asked to add another option in order to automatically replace the original file content with the converted content. I won't do so.
The utility was designed to convert big files. That means it doesn't read the whole content into memory before it begins with the conversion in order to avoid running out of RAM space and to be able to read and convert data in parallel threads. Concurrent access to the same file could cause data losses, especially if the converted data is bigger than the data read.
Of course I could let the tool automatically write to a temporary file and replace the original file after the conversion was finished. But as soon as the temporary file and the original file are saved on different volumes this would cause a physical copying of data which wastes time and resources.
Thus, I would rather keep it in your hands. Moving a file on the same logical drive only changes the file system entry; the data itself is not copied. Example:
Code: Select all
convertcp 1 65001 /b /i "test.txt" /o "test.txt.temp~"
if not errorlevel 1 move /y "test.txt.temp~" "test.txt"
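The chunk-wise processing described above can be sketched in portable C as follows. This is not the CONVERTCP source. The names `latin1_chunk_to_utf8`, `convert_stream` and `CHUNK_SIZE` are made up, and ISO-8859-1 to UTF-8 merely stands in for an arbitrary conversion because it is simple enough to write by hand. It also shows why the output can be bigger than the input: every byte from 0x80 upwards turns into two bytes.

```c
#include <stddef.h>
#include <stdio.h>

#define CHUNK_SIZE 4096

/* Convert n ISO-8859-1 bytes to UTF-8. dst must hold at least 2 * n bytes.
   Returns the number of bytes written, which exceeds n whenever the
   input contains bytes >= 0x80. */
size_t latin1_chunk_to_utf8(const unsigned char *src, size_t n,
                            unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        if (src[i] < 0x80) {
            dst[out++] = src[i];
        } else {
            dst[out++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[out++] = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    return out;
}

/* Stream "in" to "out" one chunk at a time, so memory use stays constant
   regardless of the file size. Returns 0 on success, -1 on an I/O error. */
int convert_stream(FILE *in, FILE *out)
{
    unsigned char src[CHUNK_SIZE];
    unsigned char dst[2 * CHUNK_SIZE];
    size_t n;

    while ((n = fread(src, 1, sizeof src, in)) > 0) {
        size_t m = latin1_chunk_to_utf8(src, n, dst);
        if (fwrite(dst, 1, m, out) != m)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

If `in` and `out` referred to the same file, the growing output would overwrite input that has not been read yet. That is the data loss scenario mentioned above and the reason for writing to a separate file first.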
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
I didn't like having only a link to the list of Code Page Identifiers in the help message. That's why I added option /l, which displays a list of the code pages installed on your computer, together with information on how they can be used as input code page (see section "additional information" of the initial post) and their description.
Virustotal didn't find any false positives for version 1.4.4.
x86: https://www.virustotal.com/en/file/33108943bf6f8575a49873c44d0eef7ce30ffdd4af7f8564f6c2f8339171581c/analysis/
x64: https://www.virustotal.com/en/file/961bf49a7e624709742cde83ae5739f8e1f949a6e08e0e1a9f29e1f075afa9a4/analysis/
Steffen
Re: CONVERTCP.exe - Convert text from one code page to another
Great tool Steffen.
I have another option for you: the new JREPL.BAT version 7 features (currently v7.4) can also be used to transform a text file from one encoding to another. I believe it is more restrictive about which character sets can be used because it only supports your machine's native code page, plus UTF-16LE, plus code pages that have valid internet character set names. EDIT - Actually it is not that bad. Here is a page that lists code pages along with their internet (.NET) names. Most of the code pages have a valid name.
Here is an example that transforms 1252 to UTF-8:
Code: Select all
jrepl "^" "" /f "source.txt|Windows-1252" /o "destination.txt|UTF-8"
But JREPL has a significant advantage in that you can provide custom transformations for source characters that do not exist in the target character set. This could satisfy Saso's "custom character set" request. This is probably easiest to accomplish by using the JREPL /T option.
One thing that is pretty cool is that with the /X option, you can specify a character using the \xnn escape sequence, where nn is the hexadecimal byte code for the relevant character set. Within a search string it uses the input character set. Within a replacement string it uses the output character set. The \xnn sequence only works properly if the character set is a single byte character set.
With the /T "FILE" option, you can place all your search terms in one file, one per line, and all your replacement (transform) terms in a 2nd file. This helps prevent out of control command line lengths. Another cool feature is you can specify that the search file matches the input character set, and the replacement file matches the output character set.
There is no need for the transformations to involve just single characters. One input character can be transformed into multiple output characters, and vice versa.
Here is an example of what a custom transformation could look like (without specifying the actual custom transformations):
Code: Select all
jrepl "1252to1250find.txt|Windows-1252" "1252to1250repl.txt|Windows-1250" /x /t file /f "source.txt|Windows-1252" /o "destination.txt|Windows-1250"
Dave Benham
Re: CONVERTCP.exe - Convert text from one code page to another
Dave
I'm quite interested in JREPL.BAT, as you know. Using ADO streams was a huge improvement. As I understand it, it's also a good alternative to CONVERTCP.
Of course everything has pros and cons. What I really like is that JREPL doesn't need any 3rd party tools. That's something CONVERTCP can't compete with. To compensate for this deficiency a little I used C and WinAPI (which runs natively and isn't dependent on .NET or Java), I provided the source code (to enable people to read or edit the source and compile the tool themselves) and I added a program flow chart (because an executable is like a black box where you can't see how it works).
On the other hand, the main scopes of JREPL and CONVERTCP are quite different. That's why JREPL is able (and designed) to do customized replacements while CONVERTCP can't. But it's also why CONVERTCP is so much faster for big files: 307 s for JREPL vs. 9 s for CONVERTCP for 360 MB of text, Windows-1252 to UTF-8, in my tests, because it efficiently converts and writes in parallel threads. Converting big files was one of Saso's original requirements.
I don't want to make a fuss about CONVERTCP. Initially it was a gift to Saso, who suggested publishing it. So why not? It seems to be helpful for some people. Version 1.4.4 has been downloaded ~120 times so far. That's approx. once a day and thus maybe 10 times more than I would have ever expected, but it isn't comparable to what JREPL gets. In the end there is no "nec plus ultra". The users have to decide which tool meets their needs best. For that reason I really appreciate that you left some notes and a link to JREPL.BAT in this thread. This will give the users more opportunities to find the right tool for their tasks.
Steffen
This tangential topic about JREPL continues at JREPL.BAT, ADO Streams and big files