CONVERTCP.exe - Convert text from one code page to another

miskox
Posts: 630
Joined: 28 Jun 2010 03:46

Re: CONVERTCP.exe - Convert text from one code page to another

#76 Post by miskox » 22 Mar 2019 02:42

What was at first a 'simple' problem for me (converting .txt files between CP852 and CP1250; see viewtopic.php?p=50289#p50289) is now an ongoing project.

Steffen once wrote:
"I don't expect to get bug reports because the utility will not be found and used that often."
And this thread has now been read 27,000 times!

Also, dated 06 Dec 2016 14:05:
"Then I'll leave it as it is unless somebody finds a bug or has a request to add another feature ..."
and, dated 02 Feb 2017 20:21 (version 1.4.3):
"I'll archive it and leave it alone"
And now we are at version 6.1!

Steffen: thanks again.

Saso

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

#77 Post by aGerman » 22 Mar 2019 10:44

Yes, and now I have to grapple all the time to keep the thing up to date and running. You're the culprit, Saso :evil: (just kidding :lol:)
Seriously, when I started developing this utility I didn't expect that there would be no end in sight. But meanwhile it has become something like my baby. It's still fun to work on, and I still learn something new every time. And as long as a few people have a use for it, that's motivation enough to continue. Curiously there is no on-board tool for Windows like iconv for *nixoid systems.
So, thank you for having the idea :)

Steffen

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

#78 Post by Squashman » 23 Mar 2019 07:59

aGerman wrote (22 Mar 2019 10:44):
"Curiously there is no on-board tool for Windows like iconv for *nixoid systems."
I wonder if it comes with the Linux subsystem for Windows 10? I have yet to install and try it.

#79 Post by aGerman » 23 Mar 2019 09:57

Squashman wrote (23 Mar 2019 07:59):
"I wonder if it comes with the Linux subsystem for Windows 10?"
Yes, of course. But WSL isn't available for Win10 x86, and since iconv (along with the other Linux tools) is a native ELF file, you can't just execute it from the Windows command line. You always need the Linux shell of your installed distribution involved.

Steffen

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

#80 Post by penpen » 23 Mar 2019 10:53

To be honest, I never needed such a tool:
Most of the time it was sufficient to be able to convert from utf-16le to all installed codepages.
(So I never tried to convert from codepage to utf-16le, so I never checked if that was possible.)

utf-16le -> any installed codepage:

Code:

@echo off
:: Required files:
::   "bom.utf-16le.txt"   contains two BOMs and nothing else
::   "test.utf-16le.txt"  contains any text; must start with a UTF-16 LE BOM

:: UTF-8, with and without a BOM
chcp 65001
>"test.utf-8.bom.txt" type "bom.utf-16le.txt" "test.utf-16le.txt"
>"test.utf-8.txt" type "test.utf-16le.txt"

:: UTF-7
chcp 65000
>"test.utf-7.txt" type "test.utf-16le.txt"

:: OEM code page 850
chcp 850
>"test.cp850.txt" type "test.utf-16le.txt"

penpen

#81 Post by aGerman » 23 Mar 2019 12:39

In post #3 I already addressed this possibility, penpen. ADO streams, as used in Dave's JREPL.BAT, are also a good alternative for converting the text encoding. I absolutely share your opinion that you don't need any 3rd-party tool whenever you can use what the operating system already provides.
penpen wrote:
"(So I never tried to convert from codepage to utf-16le, so I never checked if that was possible.)"
Think of CMD /u /c.

It's rather the multi-threaded processing in CONVERTCP that makes it quite useful if you have to convert big files. Furthermore, you can convert UTF-16 BE and UTF-32 LE/BE, where the combination of CHCP and TYPE isn't applicable anymore. And TYPE still causes problems with UTF-8 because character boundaries are not respected.
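
To illustrate the character-boundary problem: when UTF-8 text is processed in fixed-size chunks, a multi-byte sequence can end up split across two chunks, and converting each chunk on its own then damages that character. The following is a minimal C sketch, not taken from CONVERTCP's source; the function name `utf8_complete_prefix` is made up for illustration, and the input is assumed to be otherwise well-formed UTF-8. It reports how many bytes of a chunk can be converted safely.

```c
#include <stddef.h>

/* Returns how many leading bytes of buf[0..len) can be converted without
   cutting a UTF-8 sequence in half. Any bytes beyond that count belong to an
   incomplete trailing sequence and should be carried over to the next chunk.
   Malformed input is passed through unchanged (the function returns len). */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 3 continuation bytes (bit pattern 10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0)
        return len; /* no lead byte in reach; leave it to the converter */

    unsigned char lead = buf[i - 1];
    size_t need;
    if (lead < 0x80)                need = 1; /* ASCII */
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return len;                          /* not a valid lead byte */

    /* back + 1 bytes of the last sequence are present in this chunk. */
    return (back + 1 < need) ? i - 1 : len;
}
```

A caller would convert only the returned number of bytes and prepend the remainder to the next chunk it reads.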

Steffen

#82 Post by aGerman » 23 Apr 2019 14:07

In C, the runtime's internal parser for command line arguments, as well as library functions made for this purpose, may return partially quoted paths incorrectly. These parsers treat a backslash in front of a quotation mark as an escape character that preserves the quotation mark as a literal character. This behavior causes errors, though, because on Windows the backslash is used as the separator in paths. E.g. if you pass
C:\"my folder"\file.ext foo
you might have expected to get
C:\my folder\file.ext
as first argument and
foo
as second argument.
But instead, C sees
C:"my
as first argument and
folder\file.ext foo
as second argument.

CONVERTCP has no use for literal quotes in any of the arguments passed. To overcome faulty path specifications I implemented my own command line parser in version 6.2. It still uses quotation marks to preserve spaces and tab characters in a quoted substring, but it removes all quotation marks from the passed arguments and keeps backslashes as literal characters.
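
The rules described above can be sketched in a few lines of C. This is only an illustration of the stated behavior (quotes group spaces and tabs, every quote is dropped, backslashes are ordinary characters) and not the parser from the CONVERTCP source; the name `split_args` and its signature are assumptions.

```c
#include <stddef.h>

/* Splits `line` in place into arguments and stores pointers to them in argv.
   - Spaces and tabs separate arguments unless they are inside quotes.
   - Quotation marks only toggle the quoted state and are removed.
   - Backslashes are copied literally; there is no escape handling.
   Returns the number of arguments stored (at most max_args). */
size_t split_args(char *line, char **argv, size_t max_args)
{
    size_t argc = 0;
    char *src = line;

    while (argc < max_args) {
        /* Skip separators between arguments. */
        while (*src == ' ' || *src == '\t')
            ++src;
        if (*src == '\0')
            break;

        char *dst = src;   /* output never gets longer than the input */
        int quoted = 0;
        argv[argc++] = dst;

        while (*src != '\0' && (quoted || (*src != ' ' && *src != '\t'))) {
            if (*src == '"')
                quoted = !quoted;  /* drop the quote itself */
            else
                *dst++ = *src;     /* includes backslashes */
            ++src;
        }
        if (*src != '\0')
            ++src;                 /* step over the separator first */
        *dst = '\0';               /* then terminate the argument */
    }
    return argc;
}
```

With the command line from the example, `C:\"my folder"\file.ext foo`, this sketch yields `C:\my folder\file.ext` and `foo`.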


Virustotal scans of version 6.2:
x86: https://www.virustotal.com/gui/file/b14 ... /detection
x64: https://www.virustotal.com/gui/file/c5d ... /detection

Steffen

smrutibora
Posts: 1
Joined: 25 Apr 2019 04:50

#83 Post by smrutibora » 26 Apr 2019 02:40

So the low-order ASCII code values stay the same, but the high-order values vary from code page to code page. I can see how some code pages may share some characters in common, yet their high-order code values may be different. So your utility can do the necessary translation for characters in common. But what happens to the other characters that are not shared?

And are there typically enough high-order characters in common to make the utility worthwhile?

I should think there would be a number of code pages with no non-ASCII overlap at all, so I can't see how the utility could be useful in those cases.

At first, I wondered how the utility works - how would it know all the right mappings? But I took a look at the source and see that it converts the text to UTF-16, and then converts back to a different single-byte character set. I suppose it is the same underlying routines that cmd.exe uses to convert extended ASCII text to and from UTF-16.
Last edited by aGerman on 27 Apr 2019 07:18, edited 2 times in total.
Reason: Moderator note: later added advertising link removed - you get banned from the forum

#84 Post by aGerman » 26 Apr 2019 09:11

smrutibora wrote (26 Apr 2019 02:40):
"So the low-order ASCII code values stay the same, but the high-order values vary from code page to code page. I can see how some code pages may share some characters in common, yet their high-order code values may be different. So your utility can do the necessary translation for characters in common. But what happens to the other characters that are not shared?"
No utility can convert characters that are not shared between the code pages involved. Please read the initial post. The paragraph beginning with "The support of code pages is restricted ..." explains the behavior.
smrutibora wrote (26 Apr 2019 02:40):
"And are there typically enough high-order characters in common to make the utility worthwhile?

I should think there would be a number of code pages with no non-ASCII overlap at all, so I can't see how the utility could be useful in those cases."
Single-byte code pages exist mainly for historical reasons. It's a poor concept and doomed to failure. Unfortunately it's still widespread on Windows. Convert your text to an encoding that fully supports Unicode, such as UTF-8 or UTF-16. That way it will be readable in every environment, regardless of local settings.
smrutibora wrote (26 Apr 2019 02:40):
"how would it know all the right mappings?"
They are already stored in the *.NLS files in folder System32.
smrutibora wrote (26 Apr 2019 02:40):
"I suppose it is the same underlying routines that cmd.exe uses to convert extended ASCII text to and from UTF-16."
Correct. The conversions of UTF-16 BE and UTF-32 LE and BE are my own extensions, though. There are no Windows API functions for this purpose.
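
To give an idea of what such extensions involve, here is a small C sketch under my own assumptions; it is not CONVERTCP's code, and both function names are made up. UTF-16 BE differs from UTF-16 LE only in byte order, and a UTF-32 code point maps to one or two UTF-16 code units.

```c
#include <stddef.h>
#include <stdint.h>

/* Swaps the two bytes of every UTF-16 code unit in place. Applied to
   UTF-16 BE data this yields UTF-16 LE, and vice versa. `len` is the size
   in bytes; an odd trailing byte is left untouched. */
void utf16_swap_byte_order(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}

/* Encodes one code point (a UTF-32 value in host byte order) as UTF-16.
   Returns the number of code units written to out (1 or 2), or 0 if the
   value is not a valid Unicode scalar value. */
size_t utf32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range or surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

Once the data is in UTF-16 LE, the regular Windows conversion path can take over.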

Steffen

#85 Post by aGerman » 10 Jun 2019 06:06

Most of the current command line utilities don't support Virtual Terminal processing yet. In that case ANSI escape sequences are not interpreted to control the console output; instead, their textual representation gets printed to the screen. Example using an old version:

Code:

>nul chcp 65001
echo +ABsAWw-93;42m+JYgliCWIJZMlkyWTJZIlkiWSJZElkSWR-   +ABsAWw-0m|convertcp "UTF-7" "UTF-8"

[Screenshot: old_behavior.png]
Even though I expect that VT processing will be rarely used along with CONVERTCP, it won't hurt to enable it now that Windows 10 provides this possibility.
Same example code using CONVERTCP v. 6.3:
[Screenshot: virtual_terminal_processing_v6.3.png]
Virtual Terminal processing affects the output to the console window only. Thus, CONVERTCP has to print to the window directly, i.e. without option /o and without any redirection of the standard output stream. It's supported on Windows version 10.0.10586 onwards if the new console host is used.
The behavior on older Windows versions, as well as for writing to files and for redirections, remains the same as before.

Virustotal scans of version 6.3:
x86: https://www.virustotal.com/gui/file/94d ... /detection
x64: https://www.virustotal.com/gui/file/036 ... /detection

Steffen

#86 Post by aGerman » 18 Aug 2019 05:51

I found that growing the allocated memory with realloc() in the conversion routine may lead to the data being copied to another memory location. This is unnecessary since the old content is outdated at this point.
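
The general idea can be sketched as follows. realloc() has to preserve the old content, so it copies the data whenever the block cannot grow in place; if the content is no longer needed, a fresh allocation avoids that copy. This is a hedged illustration only, not the actual change in CONVERTCP; the helper name `ensure_capacity_discard` is hypothetical.

```c
#include <stdlib.h>

/* Makes sure *buf can hold at least `needed` bytes.
   The previous content is NOT preserved when the buffer has to grow, which
   is the point: nothing gets copied. Returns 1 on success and 0 on
   allocation failure (in which case the old buffer is still valid). */
int ensure_capacity_discard(void **buf, size_t *capacity, size_t needed)
{
    if (needed <= *capacity)
        return 1;              /* large enough already */

    void *fresh = malloc(needed);
    if (fresh == NULL)
        return 0;

    free(*buf);                /* old content is outdated, just drop it */
    *buf = fresh;
    *capacity = needed;
    return 1;
}
```

Allocating before freeing keeps the old buffer usable on failure at the cost of a higher peak memory use; freeing first would be the alternative trade-off.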
Additionally CONVERTCP v6.4 supports BOM processing of GB-18030 encoded streams.
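
For readers unfamiliar with BOM processing, here is a minimal C sketch of BOM detection that includes the GB-18030 signature (84 31 95 33). The enum and function names are my own and do not come from the CONVERTCP source. Note that the UTF-32 LE signature has to be checked before UTF-16 LE because it starts with the same two bytes.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE,
    BOM_UTF32LE, BOM_UTF32BE, BOM_GB18030
} bom_kind;

/* Inspects the first bytes of buf and reports which BOM, if any, is present.
   *bom_len receives the BOM size in bytes (0 if there is none). */
bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    static const struct {
        bom_kind kind;
        size_t len;
        unsigned char bytes[4];
    } table[] = {
        /* 4-byte signatures first, so UTF-32 LE wins over UTF-16 LE */
        { BOM_UTF32LE, 4, { 0xFF, 0xFE, 0x00, 0x00 } },
        { BOM_UTF32BE, 4, { 0x00, 0x00, 0xFE, 0xFF } },
        { BOM_GB18030, 4, { 0x84, 0x31, 0x95, 0x33 } },
        { BOM_UTF8,    3, { 0xEF, 0xBB, 0xBF, 0x00 } },
        { BOM_UTF16LE, 2, { 0xFF, 0xFE, 0x00, 0x00 } },
        { BOM_UTF16BE, 2, { 0xFE, 0xFF, 0x00, 0x00 } },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if (len >= table[i].len &&
            memcmp(buf, table[i].bytes, table[i].len) == 0) {
            *bom_len = table[i].len;
            return table[i].kind;
        }
    }
    *bom_len = 0;
    return BOM_NONE;
}
```

One known ambiguity remains: a UTF-16 LE stream whose first character is U+0000 looks like UTF-32 LE to this check.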


Virustotal scans of version 6.4:
x86: https://www.virustotal.com/gui/file/31e ... /detection
x64: https://www.virustotal.com/gui/file/326 ... /detection

Steffen

#87 Post by aGerman » 23 Dec 2019 16:10

Version 7 pretty much supports what Saso suggested in post #13. While he later referenced EBCDIC (which already worked using CONVERTCP), I recently received an inquiry to support an old Bulgarian DOS charset called "MIK". It neither has a code page ID nor is it installed on Windows. As long as such charsets are single-byte charsets, you may write a custom text file containing the character map, which you can pass in place of CP_In or CP_out. The format specification of those files can be found in the readme.md.
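
Conceptually, a single-byte character map is a 256-entry lookup table from byte values to Unicode code points. The C sketch below shows that idea only. It does not read CONVERTCP's charmap file format (see the readme.md for that), and the table covers just part of MIK: ASCII plus the Cyrillic letters, which to my knowledge occupy 0x80 to 0xBF in alphabetical order. The remaining entries are deliberately left as U+FFFD, and all names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills a lookup table for a PARTIAL MIK mapping (illustration only):
   0x00-0x7F -> ASCII, 0x80-0xBF -> U+0410..U+044F (Cyrillic A..ya),
   0xC0-0xFF -> U+FFFD because this sketch does not fill them in. */
void build_partial_mik_table(uint16_t table[256])
{
    for (int b = 0x00; b < 0x80; ++b)
        table[b] = (uint16_t)b;
    for (int b = 0x80; b < 0xC0; ++b)
        table[b] = (uint16_t)(0x0410 + (b - 0x80));
    for (int b = 0xC0; b < 0x100; ++b)
        table[b] = 0xFFFD;
}

/* Encodes one BMP code point as UTF-8; returns the bytes written (1..3).
   Assumes cp is not a surrogate value. */
static size_t put_utf8(uint16_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Converts len single-byte characters to UTF-8 using the lookup table.
   `out` must provide room for 3 * len bytes. Returns the bytes written. */
size_t charmap_to_utf8(const uint16_t table[256],
                       const unsigned char *in, size_t len,
                       unsigned char *out)
{
    size_t written = 0;
    for (size_t i = 0; i < len; ++i)
        written += put_utf8(table[in[i]], out + written);
    return written;
}
```

The reverse direction needs the inverse mapping, for example a search through the same table.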

Virustotal scans of version 7.0:
x86: https://www.virustotal.com/gui/file/6de ... /detection
x64: https://www.virustotal.com/gui/file/65e ... /detection

Steffen

#88 Post by aGerman » 24 Dec 2019 06:21

I received the information that option /l is broken on XP. The update to v7.1 is supposed to fix that, although I have to wait for feedback since I can't test on XP anymore.

Virustotal scans of version 7.1:
x86: https://www.virustotal.com/gui/file/860 ... /detection
x64: https://www.virustotal.com/gui/file/3b4 ... /detection

Steffen

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

#89 Post by carlos » 24 Dec 2019 06:41

Hello Steffen.
Nice tool. I always update my archive with the latest version, but maybe you could keep the source code under version control on GitHub or GitLab?
It's nice to see the history and which lines you changed for an update, and it would be easier for me to update by simply pulling the latest changes.

Also, about XP: is it really necessary to continue supporting it?
If I remember correctly, XP doesn't fully implement some parts of Unicode. Could a conversion using UTF-8 give different results on XP than on later versions?
Does CONVERTCP always depend on the code page files (.nls files) installed on the host Windows machine?

#90 Post by aGerman » 24 Dec 2019 08:36

Hi Carlos.

Looking at the statistics makes me wonder if you were the only one who ever downloaded the source code :lol:
Seriously, the reason for moving the project to SourceForge was that discussions about the C source would be rather off topic at DosTips. But apparently nobody really cares about the source code. The majority of people just download the binaries and leave the source alone. Thus, I don't get much feedback about the source over there (as I don't get much feedback at all). GitHub is nice for developing in a team, but given the little interest in the source I don't expect any contributors. I like built-in features like the Wiki in SourceForge projects. I'm not familiar with web development, and I'm neither able nor willing to write a bunch of HTML only for some further explanation of the tool.
tl;dr SourceForge was a good choice for sharing CONVERTCP imho.

Basically I agree with you about the support of XP. But this project has its history. It was explicitly developed to work on XP. Saso is the father of CONVERTCP: he once asked me to write it, and he still works on XP. Also, have a look at the recent feedback I got: one person asked for support of a vintage charset and another complained about the broken option on XP. So, as long as the tool remains compatible with the most recent Windows versions, I'll still consider updating it accordingly.

You're right about the dependencies. But that's nothing I want to change. I could have used libraries like ICU, but I'm afraid that would destroy the performance and just bloat the tool. Beginning with v7 you at least have the possibility to incorporate custom single-byte charsets. When I thought about the format of the charmap file I already considered the possibility of extending it to multi-byte charsets if there is any feature request in the future.

Steffen
