CONVERTCP.exe - Convert text from one code page to another

#1 Post by **aGerman** » 24 Nov 2016 17:44

This command line utility is a code page converter. It supports charsets such as single-byte code pages, UTF-8, UTF-16 LE/BE, UTF-32 LE/BE, and EBCDIC. It is also designed to process big files. It should work on Windows XP and later (tested on XP, Windows 7, Windows 8.1, Windows 10, and Windows 11). It's a free and open source tool.

A few days ago miskox asked me to rewrite an old 16 bit tool that he uses so that it also runs on 64 bit Windows. The tool converts text from one single-byte code page to another. I bet the native English speakers among you are wondering what such a tool is even good for. The answer is that the CMD console and Windows applications use different code pages in which non-ASCII characters have different code points. Thus, characters like Ü, É, Š, and the like show up as different/wrong characters.
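
To make the problem concrete, here is a minimal Python sketch (not part of CONVERTCP). It assumes Python's `cp850` and `cp1252` codecs as stand-ins for a typical Western European OEM and ANSI code page; the variable names are only illustrative.

```python
# The same byte means different characters in different code pages.
raw = "Ü".encode("cp850")        # the byte a console program would emit for Ü

as_oem = raw.decode("cp850")     # read back with the OEM code page
as_ansi = raw.decode("cp1252")   # read back with the ANSI code page

# ascii() keeps the output printable regardless of the terminal encoding.
print(raw.hex(), ascii(as_oem), ascii(as_ansi))
```

The byte 0x9A is Ü in code page 850, but a Windows application that assumes code page 1252 shows it as š.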

Steffen

Updated to version 8.4 on May 29th, 2022.
The download of CONVERTCP is available on SourceForge. (right click and open in a new tab)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage of convertcp.exe

Code: Select all

CONVERTCP v.8.4. Converts a stream of characters to another code page.

Usage:
CONVERTCP CP_In CP_Out [/i "infile.txt"] [/o "outfile.txt"] [/v] [/f] [/b|/a]
CONVERTCP /?|/l

CP_In     Code Page Identifier of the input stream
          A solitary question mark passed as CP_In causes CONVERTCP
           to try guessing the encoding.
           A preceding question mark to the identifier makes CP_In
           a preferred Code Page. If reasonable, CONVERTCP will use
           this preference rather than the guessed encoding.
           NOTE Guessing is error-prone. Don't rely on CONVERTCP
           being able to determine the correct encoding.
CP_Out    Code Page Identifier of the output stream
 To get a list of supported Code Page Identifiers use option /l
 Alternatively you can use 0 for the ANSI Code Page
  and 1 for the OEM Code Page of your system default settings.
 Instead of the Code Page Identifier you may pass the related
  MIME type, or the name of a custom *.sbcs file.
 CP_Out isn't used if the text is printed to a console or terminal
  window. It defaults to UTF-16 to get advanced character support.

/i        Introduces the source file
/o        Introduces the destination file
           (the content of an existing file will be truncated
           unless option /a was passed)
           Pass a solitary "-" (without option /a) to overwrite the
           original file specified along option /i.
 Redirections to or from CONVERTCP can be used instead of /i and /o

/v        Verify that all characters have been converted without
           using the replacement character or approximated ASCII
           characters
           Only in this case CONVERTCP returns a zero value
           NOTE Option /v is supported on Windows Vista and later
/f        Flush the stream buffer before CONVERTCP terminates
           in case the new file shall be accessed immediately
/b        Add the Byte Order Mark to the output stream
           (will be ignored if CP_Out was not one of
           65001, 1200, 1201, 12000, 12001, or 54936)
/a        Append the output stream to the destination file
           (always use the same CP_Out)
 Do not combine options /b and /a

/?        Display this help message
/l        Display a list of supported Code Page Identifiers
           installed on this computer

infile    Path of a text file whose content shall be converted
outfile   Path of a text file where the converted stream
           shall be written
 Input file and output file must not be the same, unless a minus
  sign is specified along with option /o.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Additional information:

To get a list of Code Page Identifiers along with a short description either use option /l or see
https://msdn.microsoft.com/en-us/library/dd317756.aspx
Furthermore I uploaded a table with aliases (such as MIME names, IATA numbers, names used on different operating systems or programming languages, etc.) to look up the related code page IDs
https://sourceforge.net/projects/conver ... 20Aliases/
You can also pass MIME types rather than Code Page Identifiers (e.g. UTF-8 instead of 65001).
Custom single-byte charsets are supported, too. For more information see
https://sourceforge.net/projects/conver ... 0charsets/

The support of code pages is restricted ...
a) by the characters shared by both code pages. If a character that was read has no equivalent, the implementations of the API functions used decide whether to
- either convert it to the approximated ASCII character (e.g. Š to S)
- or replace it with a default character (usually a question mark)
b) by the maximum number of bytes used to represent a character. The second column of the table printed by option /l indicates whether or not a code page can be used by CONVERTCP for input streams greater than 511MB (all listed code pages can be used for output streams regardless of their size).
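
The two fallback behaviours in a) can be sketched in Python. This is only an illustration and not how the Windows API decides: the approximation here uses Unicode NFKD decomposition as a stand-in technique, and the function names `encode_replace`, `encode_approximate` and `converts_cleanly` are hypothetical. The last function mimics the idea behind option /v (report whether everything converted without fallbacks).

```python
import unicodedata

def encode_replace(text, codec):
    """Replace every character missing from the target code page with '?'."""
    return text.encode(codec, errors="replace")

def encode_approximate(text, codec):
    """Fall back to the base letter (e.g. S for Š) where the target lacks a character."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # Decompose and drop combining marks: 'Š' -> 'S' + caron -> 'S'
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            out += base.encode(codec, errors="replace")
    return bytes(out)

def converts_cleanly(text, codec):
    """True only if every character has an exact equivalent in the target code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

print(encode_replace("Škoda", "cp850"), encode_approximate("Škoda", "cp850"))
```

Code page 850 has no Š, so the first call yields `?koda` and the second `Skoda`, while the same text converts cleanly to code page 1250.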

The utility was written in C/WinAPI. Besides the exe files (which are 32 bit and 64 bit MinGW/GCC release builds), the source code is included in the project. The program flow chart is for those who want to understand how the program works (even though it's simplified and incomplete). All files are under the MIT license (brief).

CONVERTCP doesn't need any installation but if you frequently use it for your daily work you may copy it to the Windows command utilities:
On 32 bit Windows

Copy the 32 bit convertcp.exe out of the x86 subfolder to the System32 directory (usually C:\Windows\System32).

On 64 bit Windows

Copy the 64 bit convertcp.exe out of the x64 subfolder to the System32 directory (usually C:\Windows\System32).
Copy the 32 bit convertcp.exe out of the x86 subfolder to the SysWOW64 directory (usually C:\Windows\SysWOW64).

This way you can use CONVERTCP without having the executable in the same folder along with your script.

About Byte Order Marks (BOMs):
CONVERTCP can add a BOM to UTF-8, UTF-16, UTF-32 and GB 18030 encoded output streams. A BOM always has to be the first byte sequence in a file. The reading program may use it to recognize Unicode encoded file contents. See https://en.wikipedia.org/wiki/Byte_order_mark. Some rules of thumb for when to add or omit BOMs:

Add the BOM to text files that are intended to be read in text editors on Windows.
Omit the BOM in markup text (such as HTML or XML) where the encoding is specified in the markup or where it defaults to be recognized as UTF-8.
Omit the BOM for files that are intended to be shared with other operating systems like Unix or Linux.
Never use a BOM for text that you append to an existing file.
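
For readers who have never looked at a BOM, the following Python sketch shows the byte sequences involved and how a reading program might sniff them. It is an illustration only; `encode_with_bom` and `sniff_bom` are hypothetical helper names, and GB 18030 is left out for brevity.

```python
import codecs

# BOM byte sequences for the Unicode encodings, taken from the standard library.
BOMS = {
    "utf-8": codecs.BOM_UTF8,          # EF BB BF
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
    "utf-32-le": codecs.BOM_UTF32_LE,  # FF FE 00 00
    "utf-32-be": codecs.BOM_UTF32_BE,  # 00 00 FE FF
}

def encode_with_bom(text, encoding, add_bom=True):
    """Encode text and optionally prepend the matching BOM (roughly what /b does)."""
    payload = text.encode(encoding)
    return (BOMS[encoding] if add_bom else b"") + payload

def sniff_bom(data):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    # UTF-32 is checked first because its LE BOM begins with the UTF-16 LE BOM.
    for name in ("utf-32-le", "utf-32-be", "utf-8", "utf-16-le", "utf-16-be"):
        if data.startswith(BOMS[name]):
            return name
    return None

print(encode_with_bom("A", "utf-8").hex(), sniff_bom(encode_with_bom("A", "utf-16-le")))
```

Because the BOM is only recognized at the very start of the data, a second BOM written by an appended chunk would end up as a stray character in the middle of the file. That is the reason for the last rule above.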

Feedback is always welcome.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Examples

Convert the output of a command and save it in a text file.
(The output of FINDSTR /? will be converted from the default OEM code page to UTF-16 LE with BOM prepended. The converted stream will be saved in "commands.txt".)

Code: Select all

findstr /? | convertcp 1 1200 /b /o "commands.txt"

Convert the content of a text file and save it to another text file.
(The content of "commands.txt" will be converted from UTF-16 LE to the default ANSI code page and saved in "commands2.txt")

Code: Select all

convertcp 1200 0 /i "commands.txt" /o "commands2.txt"

Convert the content of a text file from guessed encoding and save it to another text file.
(The encoding of "commands.txt" is guessed, its content converted to UTF-8 and saved in "commands3.txt")
Command line:

Code: Select all

convertcp ? 65001 /i "commands.txt" /o "commands3.txt"

Convert the content of a text file and output it to the console window.
(The content of "commands2.txt" will be converted from the default ANSI code page to the default OEM code page and displayed.)

Code: Select all

convertcp 0 1 /i "commands2.txt"

Append to an existing file.
(The output of FIND /? will be converted from the default OEM code page to UTF-16 LE. The converted stream will be appended to "commands.txt".)

Code: Select all

find /? | convertcp 1 1200 /a /o "commands.txt"

Create a file with a Byte Order Mark only.
(NUL is redirected to CONVERTCP. Thus, the input stream is empty. The input code page ID is meaningless. Because the output code page ID is for UTF-8 and option /b was passed only the UTF-8 BOM will be written to the file. This might be useful if you want to append text to the file in multiple steps afterwards.)

Code: Select all

<nul convertcp 0 65001 /b /o "bom.txt"

List the installed code pages.
(Process the outputted list of CONVERTCP /L in a FOR /F loop in order to write the values comma-separated)

Code: Select all

for /f "skip=3 tokens=1,3,4*" %%i in ('convertcp /l') do echo "%%i","%%j","%%l"

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Release notes:
2022/05/29 - v8.4.0/1.0 fix broken identification of names
2021/10/18 - v8.3.0/1.0 improve performance of UTF-8 identification, shrink the binary size
2021/07/25 - v8.2.0/1.0 add support for overwriting of the original file
2021/07/02 - v8.1.0/1.0 bugfix for UTF-7 detection, print meaningful error messages, minor optimizations
2021/06/23 - v8.0.0/1.0 support guessing the charset of the incoming stream
2020/08/18 - v7.5.0/1.0 make the output default to UTF-16 if printed to a text device (console or terminal) to get advanced character support
2020/01/18 - v7.4.0/1.0 improve speed for UTF-8 <--> UTF-16 conversion of ASCII characters
2019/12/28 - v7.3.0/1.0 revision of UTF-8 validation
2019/12/26 - v7.2.0/1.0 override incorrect conversion from and to UTF-8 on XP, bugfix for broken ID 0 for the default ANSI code page
2019/12/24 - v7.1.0/1.0 supposed fix for broken option /l on XP
2019/12/23 - v7.0.0/1.0 add support for custom single-byte charsets
2019/08/18 - v6.4.0/1.0 improvement for memory management, BOM processing for GB-18030 enabled
2019/06/10 - v6.3.0/1.0 Virtual Terminal processing for Windows 10 enabled
2019/04/23 - v6.2.0/1.0 bugfix for faulty parsing of partially quoted paths
2019/03/19 - v6.1.0/1.0 bugfix for BOM processing
2019/03/17 - v6.0.0/1.0 added option /v to verify the conversion
2019/03/04 - v5.2.0/1.0 bugfix for redirected UTF-8 streams
2018/06/14 - v5.1.0/1.0 file size optimization
2018/05/12 - v5.0.0/1.0 added support for MIME names
2018/04/29 - v4.3.0/1.0 bugfix for memory leak of conversion to UTF-32 without threading
2018/04/27 - v4.2.0/1.0 removed option /n, code pages are assessed for an automatic decision if threading will be applied
2018/04/26 - v4.1.0/1.0 bugfix for wrong maximum of options
2018/04/26 - v4.0.0/1.0 added option /n for "no threading" to overcome the 1 MB limit of certain code pages
2018/04/20 - v3.1.0/1.0 thread-waiting moved
2018/04/20 - v3.0.0/1.0 added option /f to force the flushing of the file buffer before CONVERTCP terminates
2018/04/18 - v2.2.0/1.0 bugfix for unexpected output
2018/04/11 - v2.1.0/1.0 bugfix for unexpected output caused by still buffered stream content (finally fixed in v2.2)
2018/02/01 - v2.0.0/1.0 UTF-32 LE/BE support added, bugfix for reading UTF-16 BE
2017/12/30 - v1.5.0/1.0 bugfix for file names with leading dash
2017/05/27 - v1.4.4.0/1 added option /l to print a list of installed code pages
2017/02/02 - v1.4.3.0/1 added option /a for appending to an existing file
2017/01/29 - v1.4.2.0/1 reduced the size of the binary files by half (kudos to carlos)
2017/01/23 - v1.4.1.0/1 minor performance improvement
2016/12/28 - v1.4.0.0/1 UTF-16 BE support added, options /i and /o added
2016/12/09 - v1.3.2.0/1 fixed bug in conversion from UTF-8
2016/12/08 - v1.3.1.0/1 ambiguous code fixed, minor optimizations, source code tidied
2016/12/05 - v1.3.0.0/1 UTF-16 LE support added
2016/12/03 - v1.2.0.0/1 UTF-8 support added, fixed misleading error message if the input stream has a size of exact multiples of 4 MB
2016/11/28 - v1.1.4.0/1 minor optimizations, source code tidied, 64bit utility added
2016/11/25 - v1.1.3.0 fixed possible deadlock caused by unsignaled threads
2016/11/24 - v1.1.2.0 fixed possible memory leak if reallocations fail
2016/11/24 - v1.1.1.0 moved to C, multithreaded conversion added
unpublished - first versions using C++ vector containers, without multithreading

#2 Post by **dbenham** » 25 Nov 2016 01:23

I'm a bit confused as to how this works, and/or how useful it could be.

So the low order ASCII code values remain the same, but the high order values vary from code page to code page. I can see how some code pages may share some characters in common, but their high order code values might be different. So your utility can do the necessary translation for characters in common. But what happens to the other characters that are not shared?

And are there frequently enough high order characters in common to make the utility worthwhile?

I should think there would be a number of code pages with no non-ASCII overlap at all, so I can't see how the utility could be useful in those cases.

At first I wondered how the utility works - how could it know all the correct mappings? But I looked at the source and see that it converts the text to UTF-16, and then converts back to a different single byte character set. I suppose it is the same underlying routines that cmd.exe uses to convert extended ASCII text to and from UTF-16.
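
The pivot idea described above can be sketched in a few lines of Python. This is an illustration of the principle rather than the actual C/WinAPI code: the intermediate UTF-16 step is spelled out explicitly, and the function name `convert_via_utf16` is hypothetical.

```python
def convert_via_utf16(data, cp_in, cp_out):
    """Convert bytes between two code pages using UTF-16 as the pivot encoding."""
    utf16 = data.decode(cp_in).encode("utf-16-le")   # step 1: source code page -> UTF-16
    text = utf16.decode("utf-16-le")
    return text.encode(cp_out, errors="replace")     # step 2: UTF-16 -> target code page

# Ü is 0x9A in code page 850 and 0xDC in code page 1252.
print(convert_via_utf16(b"\x9a", "cp850", "cp1252").hex())
```

With a pivot, each code page only needs a mapping to and from Unicode, so no table is required for every possible pair of code pages.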

Dave Benham

#3 Post by **aGerman** » 25 Nov 2016 03:03

I absolutely understand your concerns, Dave, and I know it's pretty difficult to see the benefit as long as you don't have to deal with languages that constantly use characters outside the ASCII range. E.g. see the output of PAUSE /? on my PC:

Hält die Ausführung einer Batchdatei an und zeigt folgende Meldung an:
Drücken Sie eine beliebige Taste . . .

I agree that you can't convert between codepages like 1251 and 1252 because there is no overlap in the extended ASCII range. The default OEM code page and the default ANSI code page on the same system will certainly share most of the characters. That's the reason why you can pass 1 and 0 instead of the code page IDs.
If a character has no equivalent, the implementation of the API functions used decides whether it
- either converts it to the base character (e.g. Š to S)
- or replaces it with a question mark
Of course one can use a combination of TYPE, CMD /U, and CHCP to convert text to UTF-16 and back to another code page. As mentioned above I wrote the utility on behalf of miskox who already converted files with hundreds of MB of text. It seems to be useful for at least some people :lol:
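
How much two code pages overlap in the upper half can be checked with a short Python sketch. It assumes Python's codec tables as stand-ins for the Windows code pages (they may differ in details), and the helper names `high_chars` and `shared_high_chars` are hypothetical.

```python
def high_chars(codec):
    """Return the set of characters a single-byte codec assigns to bytes 0x80..0xFF."""
    chars = set()
    for value in range(0x80, 0x100):
        try:
            chars.add(bytes([value]).decode(codec))
        except UnicodeDecodeError:
            pass  # this byte is unassigned in the code page
    return chars

def shared_high_chars(codec_a, codec_b):
    """Characters outside ASCII that both code pages can represent."""
    return high_chars(codec_a) & high_chars(codec_b)

oem_ansi = shared_high_chars("cp850", "cp1252")    # OEM vs ANSI on a Western European system
cyr_latin = shared_high_chars("cp1251", "cp1252")  # Cyrillic vs Latin ANSI code pages

print(len(oem_ansi), len(cyr_latin))
```

In this sketch Ü is in the first set but not in the second. The second set is not entirely empty, but it holds punctuation and symbols such as « rather than the accented letters of either alphabet.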

Steffen

#4 Post by **miskox** » 25 Nov 2016 13:20

Again I must say Thank you! to aGerman for providing this program.

As he mentioned, I had a very old 16-bit MS-DOS exe which does not work on x64. I received the source code from the author (written in Turbo Pascal). aGerman said that it is easier to write a program from scratch than to try to relink it.

Back in the old days we in former Yugoslavia had 3 (yes, three!) different ways of displaying our characters that are special to our alphabet: ČŠŽ and also ĆĐ in Croatia, Serbia...

See this translation table:

First I had to use character [ to display letter Š - fonts were patched to support this. After that 852 (OEM) and 1250 (ANSI) were introduced.

If I have a file a.txt with this letter Š (first letter is DEC 230, second character is DEC 138)

Code: Select all

1250 852
Š       Ő

And I do

Code: Select all

type a.txt

I see the letter Š on the right as it should be, but the letter Š on the left is not displayed correctly. If you edit this file with NOTEPAD, the letter on the left is correct but not the letter on the right.

If I have a .txt file with CP1250 character (for example Š) in it and try to find a letter (also Š) in command prompt window I will not succeed because these characters have different values in a code page table.
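
The byte values mentioned above can be reproduced with a small Python sketch. It assumes Python's `cp1250` and `cp852` codecs match the Windows code pages for these characters; the variable names are only illustrative.

```python
letter = "Š"

ansi = letter.encode("cp1250")   # byte used by Windows (ANSI) applications
oem = letter.encode("cp852")     # byte used by the console (OEM)
print(ansi[0], oem[0])           # decimal values of the two bytes

# Reading each byte with the "wrong" code page gives a different character.
ansi_seen_as_oem = ansi.decode("cp852")
oem_seen_as_ansi = oem.decode("cp1250")

# A search fails when file and search string use different code pages.
file_content = "Šibenik".encode("cp1250")
found = letter.encode("cp852") in file_content
```

The ANSI byte 138 shows up as Ő in the console, the OEM byte 230 shows up as ć in NOTEPAD, and `found` is False because the search byte never occurs in the file.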

Saso

#5 Post by **aGerman** » 28 Nov 2016 01:11

New release with additional 64bit utility.

Steffen

#6 Post by **jfl** » 01 Dec 2016 10:43

dbenham wrote:I'm a bit confused as to how this works, and/or how useful it could be.

+1 on aGerman answer:
As soon as you start working with non-English documents, you'll quickly encounter some with illegible characters. This is because they are in the wrong encoding for your version of Windows.
Regularly facing that same problem, I also developed my own encoding conversion tool long ago: it's called conv.exe, and it is available in my system tools library at https://github.com/JFLarvoire/SysToolsLib/releases.

Steffen, Saso,
Mine also has options for converting to and from UTF-8, which is the source of the most common encoding errors I encounter nowadays.
You might also be interested in the 1clip.exe, 2clip.exe, and 12.bat tools, which let you use command-line tools (yours or mine) to convert data directly inside GUI apps.

#7 Post by **aGerman** » 01 Dec 2016 12:57

Thanks jfl

I already thought about adding UTF-8 support. The conversion to UTF-8 is quite simple. Actually it already works, except that the BOM is not prepended, although that can be fixed easily.
However, converting in the opposite direction is much more complicated. The input stream is read in chunks of 1 MB in order to be able to process big files *. The conversion will fail if a chunk ends in the middle of a multibyte sequence of a UTF-8 stream. Currently I don't have any good idea how to solve that issue.
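
For readers facing the same chunk boundary problem, one common approach is to carry the incomplete tail of a chunk over to the next one. The Python sketch below shows this with the standard library's incremental decoder. It is not necessarily how CONVERTCP solves it, and the tiny 2-byte chunks and the name `decode_chunks` are assumptions for the demonstration.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode a stream chunk by chunk, buffering incomplete multibyte sequences."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Bytes of an unfinished sequence are kept inside the decoder
        # and completed when the next chunk arrives.
        yield decoder.decode(chunk)
    # final=True raises an error if the stream ends in the middle of a sequence.
    yield decoder.decode(b"", final=True)

data = "aÜb€c".encode("utf-8")                          # 8 bytes, Ü takes 2 and € takes 3
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

result = "".join(decode_chunks(chunks))

try:
    chunks[0].decode("utf-8")        # naive per-chunk decoding
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False                 # the first chunk ends inside the sequence for Ü
```

Decoding each chunk on its own fails at the first boundary, while the incremental decoder reassembles the original text.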

Steffen

* That's where your conv.exe utility doesn't seem to work anymore. I tested with a file of only 256 MB where it ends up with a deadlock.

#8 Post by **aGerman** » 03 Dec 2016 19:23

I found a way to handle UTF-8. Pass 65001 as code page ID.
The UTF-8 Byte Order Mark will be prepended to the output stream if you pass /b as third argument.

Steffen

#9 Post by **aGerman** » 05 Dec 2016 06:11

I changed the I/O from C to WinAPI in order to have UTF-16 little endian supported also. Pass 1200 as code page ID.

Steffen

#10 Post by **miskox** » 06 Dec 2016 03:19

Thank you, Steffen! New release almost daily. Great!

Saso

#11 Post by **aGerman** » 06 Dec 2016 06:05

I try to work on it as long as it's fresh. I don't expect to get bug reports because the utility will not be found and used that often. Thus, finding uncertain code and optimizations remains my own task. It would take me an hour to understand my own code after half a year of not looking at it if I don't do it now.

I think in a few days I will upload one last minor release for the moment. After adding UTF-16 support there is no need to change the code that much. I'll try to find some ambiguous or uncertain code, do some minor optimizations, remove redundant code etc. Then I'll leave it as it is unless somebody finds a bug or has a request to add another feature ...

Steffen

#12 Post by **aGerman** » 08 Dec 2016 05:27

As already announced ...
Corrected ambiguous code for BOM removal
Moved BOM removal into a function in order to remove redundant code
Removed unnecessary memory reallocations
Replaced multiplications/divisions by two with faster bitwise shifting

Steffen

#13 Post by **miskox** » 08 Dec 2016 08:18

aGerman wrote:...Then I'll leave it as it is unless somebody finds a bug or has a request to add another feature ...

Maybe just an idea (probably not needed at the moment):

Add support for custom code page(s).

Code: Select all

convertcp.exe my_private_CP1 my_private_CP2 <file_in.txt >file_out.txt

and there you have a translation table between these two private tables:

Code: Select all

0x00 from CP1 translates into 0x12 in CP2
0x01 ---> 0x11
.
.
.

Thanks for everything.
Saso

#14 Post by **aGerman** » 08 Dec 2016 14:11

Saso

What you suggest is rather something like low-level cryptography and actually not the purpose of this utility. It doesn't make much sense to convert 0x00 to some other byte in a plain text file. All single-byte code pages have the same code points in the ASCII range (up to 0x7F).
If you want to have your own translation, then it should begin with 0x80 and end with 0xFF for the bytes read. Each of them having an associated other byte. Thus, you would need only one table (instead of two) with 128 pairs of values. I'm not sure if that was what you meant.
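
A single table of this kind could look like the following Python sketch. It is only an illustration of the idea and says nothing about the *.sbcs file format that CONVERTCP later introduced; `build_table` and the two example pairs are hypothetical.

```python
def build_table(pairs):
    """Build a 256-byte translation table that leaves the ASCII range untouched."""
    table = bytearray(range(256))          # identity mapping as the starting point
    for src, dst in pairs.items():
        if not 0x80 <= src <= 0xFF:
            raise ValueError("only bytes 0x80..0xFF may be remapped")
        table[src] = dst
    return bytes(table)

# Example pairs: Š and š from code page 1250 to code page 852.
# A complete table would list up to 128 such pairs.
pairs = {0x8A: 0xE6, 0x9A: 0xE7}
table = build_table(pairs)

converted = "Šš abc".encode("cp1250").translate(table)
print(converted.hex())
```

Bytes without an entry pass through unchanged, so a real table would have to cover every byte from 0x80 to 0xFF that differs between the two charsets.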

Steffen

#15 Post by **miskox** » 09 Dec 2016 01:44

@aGerman:

A translation from EBCDIC to ASCII, which I had to use in the past, was my initial thought. I did not check whether the current WinAPI can do this. So if this is not supported by the API, then we can call it a 'custom' translation table.

As I said: this was just an idea - the question is if it is really needed.

Thanks.

Saso
