A few days ago miskox asked me to rewrite an old 16-bit tool that he uses so that it also runs on 64-bit Windows. The tool converts text from one single-byte code page to another. I bet the native English speakers among you are wondering what such a tool is even good for. The answer is that the CMD console and Windows applications use different code pages, in which non-ASCII characters have different code points. As a result, characters like Ü, É, Š, and the like show up as different, wrong characters when text moves from one to the other.
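To make the mismatch concrete, here is a small Python sketch (Python is used only for illustration; CONVERTCP itself is written in C). It assumes code page 852 as the console (OEM) code page and 1250 as the Windows (ANSI) code page, which is a typical Central European pairing. The variable names are made up for this example.

```python
# Illustration only: the same characters have different byte values
# in an OEM code page (assumed: 852) and an ANSI code page (assumed: 1250).
text = "ÜŠ"

oem_bytes = text.encode("cp852")    # what a console program would write
ansi_bytes = text.encode("cp1250")  # what a Windows application would write

# A Windows application that reads the console bytes as ANSI
# sees different characters instead of the original ones.
misread = oem_bytes.decode("cp1250")

# Converting properly means decoding with the source code page
# and encoding with the target code page.
converted = oem_bytes.decode("cp852").encode("cp1250")
```

Here `oem_bytes` is `9A E6`, `ansi_bytes` is `DC 8A`, and `misread` comes out as "šć" rather than "ÜŠ". This decode-then-encode step is the kind of repair a code page converter performs.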
Steffen
Updated to version 8.4 on May 29th, 2022.
The download of CONVERTCP is available on SourceForge. (right click and open in a new tab)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage of convertcp.exe
CONVERTCP v.8.4. Converts a stream of characters to another code page.

Usage:
  CONVERTCP CP_In CP_Out [/i "infile.txt"] [/o "outfile.txt"] [/v] [/f] [/b|/a]
  CONVERTCP /?|/l

  CP_In    Code Page Identifier of the input stream
           A solitary question mark passed as CP_In causes CONVERTCP
           to try guessing the encoding.
           A preceding question mark to the identifier makes CP_In
           a preferred Code Page. If reasonable, CONVERTCP will use
           this preference rather than the guessed encoding.
           NOTE Guessing is error-prone. Don't rely on CONVERTCP
                being able to determine the correct encoding.
  CP_Out   Code Page Identifier of the output stream

  To get a list of supported Code Page Identifiers use option /l
  Alternatively you can use 0 for the ANSI Code Page
  and 1 for the OEM Code Page of your system default settings.
  Instead of the Code Page Identifier you may pass the related
  MIME type, or the name of a custom *.sbcs file.
  CP_Out isn't used if the text is printed to a console or terminal
  window. It defaults to UTF-16 to get advanced character support.

  /i       Introduces the source file
  /o       Introduces the destination file
           (the content of an existing file will be truncated
           unless option /a was passed)
           Pass a solitary "-" (without option /a) to overwrite the
           original file specified along option /i.
  Redirections to or from CONVERTCP can be used instead of /i and /o

  /v       Verify that all characters have been converted without
           using the replacement character or approximated ASCII
           characters
           Only in this case CONVERTCP returns a zero value
           NOTE Option /v is supported on Windows Vista and later
  /f       Flush the stream buffer before CONVERTCP terminates
           in case the new file shall be accessed immediately
  /b       Add the Byte Order Mark to the output stream
           (will be ignored if CP_Out was not one of
           65001, 1200, 1201, 12000, 12001, or 54936)
  /a       Append the output stream to the destination file
           (always use the same CP_Out)
  Do not combine options /b and /a

  /?       Display this help message
  /l       Display a list of supported Code Page Identifiers
           installed on this computer

  infile   Path of a text file whose content shall be converted
  outfile  Path of a text file where the converted stream
           shall be written
  Input file and output file must not be the same, unless a minus
  sign is specified along with option /o.
Additional information:
To get a list of Code Page Identifiers along with a short description either use option /l or see
https://msdn.microsoft.com/en-us/library/dd317756.aspx
Furthermore I uploaded a table with aliases (such as MIME names, IATA numbers, names used on different operating systems or programming languages, etc.) to look up the related code page IDs
https://sourceforge.net/projects/conver ... 20Aliases/
You can also pass MIME types rather than Code Page Identifiers (e.g. UTF-8 instead of 65001).
Custom single-byte charsets are supported, too. For more information see
https://sourceforge.net/projects/conver ... 0charsets/
The support of code pages is restricted ...
a) by the characters that both code pages have in common. If a character that was read has no equivalent in the target code page, the implementation of the underlying API functions decides whether to
- either convert it to an approximated ASCII character (e.g. Š to S)
- or replace it with a default character (usually a question mark)
b) by the maximum number of bytes used to represent a character. The second column of the table printed by option /l indicates whether CONVERTCP can use a code page for input streams greater than 511 MB (all listed code pages can be used for output streams regardless of their size).
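The two outcomes described under a) can be imitated in a few lines of Python. This is only an analogy under stated assumptions: the Windows API functions use their own best-fit tables, whereas this sketch approximates by stripping diacritics via Unicode decomposition. The helper names `is_lossless` and `approximate` are hypothetical. A strict check like `is_lossless` is roughly the kind of information option /v reports.

```python
import unicodedata

def is_lossless(text, codec):
    """Return True if every character of text exists in the target codec."""
    try:
        text.encode(codec)  # strict mode raises on unmappable characters
        return True
    except UnicodeEncodeError:
        return False

def approximate(text, codec):
    """Encode text; fall back to a base letter (Š -> S) or to '?'.

    Assumption: diacritic stripping stands in for the best-fit
    mapping that the real API functions apply.
    """
    out = []
    for ch in text:
        if is_lossless(ch, codec):
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        if base and is_lossless(base, codec):
            out.append(base)   # approximated character
        else:
            out.append("?")    # default replacement character
    return "".join(out).encode(codec)
```

For example, code page 437 contains Ü but not Š or €, so `approximate("ÜŠ€", "cp437")` keeps Ü, turns Š into S and € into a question mark, while `is_lossless("ÜŠ€", "cp437")` is false.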
The utility was written in C using the WinAPI. Besides the exe files (32-bit and 64-bit MinGW/GCC release builds), the source code is included in the project. The program flow chart is meant for those who want to understand how the program works (even though it is simplified and incomplete). All files are under the MIT license (brief).
CONVERTCP doesn't need any installation, but if you use it frequently in your daily work you may copy it to where the Windows command utilities live:
On 32 bit Windows
- Copy the 32 bit convertcp.exe out of the x86 subfolder to the System32 directory (usually C:\Windows\System32).
On 64 bit Windows
- Copy the 64 bit convertcp.exe out of the x64 subfolder to the System32 directory (usually C:\Windows\System32).
- Copy the 32 bit convertcp.exe out of the x86 subfolder to the SysWOW64 directory (usually C:\Windows\SysWOW64).
About Byte Order Marks (BOMs):
CONVERTCP can add a BOM to UTF-8, UTF-16 and UTF-32 encoded output streams. A BOM always has to be the first byte sequence in a file. The reading program may use it to recognize Unicode-encoded file contents. See https://en.wikipedia.org/wiki/Byte_order_mark. Some rules of thumb for when to add or omit a BOM:
- Add the BOM to text files that are intended to be read in text editors on Windows.
- Omit the BOM in markup text (such as HTML or XML) where the encoding is specified in the markup or where it defaults to be recognized as UTF-8.
- Omit the BOM for files that are intended to be shared with other operating systems like Unix or Linux.
- Never use a BOM for text that you append to an existing file.
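As a rough illustration of these rules, the following Python sketch lists the BOM byte sequences for the code page IDs that option /b accepts and only prepends one when nothing is being appended. The function name `frame_output` and its parameters are hypothetical; this is not how CONVERTCP is implemented.

```python
# Byte Order Marks keyed by Windows code page identifier.
BOMS = {
    65001: b"\xef\xbb\xbf",      # UTF-8
    1200:  b"\xff\xfe",          # UTF-16 LE
    1201:  b"\xfe\xff",          # UTF-16 BE
    12000: b"\xff\xfe\x00\x00",  # UTF-32 LE
    12001: b"\x00\x00\xfe\xff",  # UTF-32 BE
    54936: b"\x84\x31\x95\x33",  # GB 18030
}

def frame_output(payload, cp_out, add_bom=False, appending=False):
    """Return the bytes to write for an already encoded payload.

    The BOM is only added for code pages that have one, and never
    when the payload is appended to an existing file, because a BOM
    must be the first byte sequence in a file.
    """
    if add_bom and not appending:
        return BOMS.get(cp_out, b"") + payload
    return payload
```

With this sketch, `frame_output(b"abc", 65001, add_bom=True)` starts with `EF BB BF`, while the same call with `appending=True` or with a single-byte code page such as 1252 returns the payload unchanged.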
Feedback is always welcome.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Examples
Convert the output of a command and save it in a text file.
(The output of FINDSTR /? will be converted from the default OEM code page to UTF-16 LE with BOM prepended. The converted stream will be saved in "commands.txt".)
findstr /? | convertcp 1 1200 /b /o "commands.txt"
(The content of "commands.txt" will be converted from UTF-16 LE to the default ANSI code page and saved in "commands2.txt")
convertcp 1200 0 /i "commands.txt" /o "commands2.txt"
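For readers who want to see what such a file conversion involves, here is a minimal Python sketch that converts a stream chunk by chunk. It assumes "utf-16-le" as the counterpart of code page 1200 and "cp1250" as the ANSI code page (the real ANSI code page depends on the system settings). The function name `convert_stream` is made up. Incremental codecs are used so that a character split across two chunks is still decoded correctly. This is not a description of CONVERTCP's internals.

```python
import codecs
import io

def convert_stream(src, dst, enc_in, enc_out, chunk_size=4096):
    """Read bytes from src, re-encode them, and write them to dst."""
    decoder = codecs.getincrementaldecoder(enc_in)(errors="replace")
    encoder = codecs.getincrementalencoder(enc_out)(errors="replace")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers incomplete byte sequences until the
        # next chunk arrives.
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush whatever is still buffered in the decoder and encoder.
    dst.write(encoder.encode(decoder.decode(b"", final=True), final=True))

# Usage with in-memory streams; an odd chunk size deliberately
# splits the two-byte UTF-16 code units.
source = io.BytesIO("Übung Š".encode("utf-16-le"))
target = io.BytesIO()
convert_stream(source, target, "utf-16-le", "cp1250", chunk_size=3)
```

After the call, `target.getvalue()` holds the same text encoded in code page 1250.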
(The encoding of "commands.txt" is guessed, its content converted to UTF-8 and saved in "commands3.txt")
convertcp ? 65001 /i "commands.txt" /o "commands3.txt"
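How guessing might work in principle can be sketched as follows. This is a deliberately naive heuristic written for illustration, not CONVERTCP's actual detection logic: it checks for a BOM, then tests whether the bytes are valid UTF-8, and otherwise falls back to a preferred encoding (comparable to passing a question mark followed by an identifier as CP_In). The function name, the Python codec names, and the "cp1252" default are assumptions.

```python
import codecs

# UTF-32 LE must be tested before UTF-16 LE because its BOM
# starts with the same two bytes (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def guess_encoding(data, preferred=None):
    """Guess the encoding of data; error-prone by nature."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: use the preference, else an assumed default.
        return preferred or "cp1252"
    # Pure ASCII is valid in almost every code page, so a stated
    # preference is as reasonable as UTF-8 here.
    if preferred and data.isascii():
        return preferred
    return "utf-8"
```

A file written with option /b, like "commands.txt" above, is the easy case because the BOM settles the question. Without a BOM, single-byte code pages cannot be told apart reliably, which is why the help text warns not to rely on guessing.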
(The content of "commands2.txt" will be converted from the default ANSI code page to the default OEM code page and displayed.)
convertcp 0 1 /i "commands2.txt"
(The output of FIND /? will be converted from the default OEM code page to UTF-16 LE. The converted stream will be appended to "commands.txt".)
find /? | convertcp 1 1200 /a /o "commands.txt"
(NUL is redirected to CONVERTCP. Thus, the input stream is empty. The input code page ID is meaningless. Because the output code page ID is for UTF-8 and option /b was passed only the UTF-8 BOM will be written to the file. This might be useful if you want to append text to the file in multiple steps afterwards.)
<nul convertcp 0 65001 /b /o "bom.txt"
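(Process the list printed by CONVERTCP /L in a FOR /F loop in order to write the values comma-separated. The doubled percent signs are for use in a batch file; at the command prompt use single percent signs.)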
for /f "skip=3 tokens=1,3,4*" %%i in ('convertcp /l') do echo "%%i","%%j","%%l"
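The token assignment in that loop is easy to misread, so here is a rough Python emulation of what "tokens=1,3,4*" does with the default delimiters (space and tab). The function name `for_f_tokens` is hypothetical, the emulation ignores FOR /F details such as the eol character, and the sample line in the usage note is made up rather than real /l output.

```python
import re

# One token: optional leading delimiters followed by non-delimiters.
_TOKEN = re.compile(r"[ \t]*([^ \t]+)")

def for_f_tokens(line, wanted=(1, 3, 4)):
    """Return the wanted tokens plus the rest of the line (the '*')."""
    picked = []
    pos = 0
    for index in range(1, max(wanted) + 1):
        match = _TOKEN.match(line, pos)
        if match is None:
            break
        if index in wanted:
            picked.append(match.group(1))
        pos = match.end()
    # The asterisk assigns everything after the last parsed token,
    # without further splitting, to one more variable.
    rest = line[pos:].lstrip(" \t")
    if rest:
        picked.append(rest)
    return picked
```

For a made-up line such as "A B C D E F G" this yields A, C, D and "E F G", which correspond to %%i, %%j, %%k and %%l. The ECHO in the loop above therefore prints tokens 1 and 3 and the remainder of the line, and leaves out token 4 (%%k).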
Release notes:
2022/05/29 - v8.4.0/1.0 fix broken identification of names
2021/10/18 - v8.3.0/1.0 improve performance of UTF-8 identification, shrink the binary size
2021/07/25 - v8.2.0/1.0 add support for overwriting of the original file
2021/07/02 - v8.1.0/1.0 bugfix for UTF-7 detection, print meaningful error messages, minor optimizations
2021/06/23 - v8.0.0/1.0 support guessing the charset of the incoming stream
2020/08/18 - v7.5.0/1.0 make the output default to UTF-16 if printed to a text device (console or terminal) to get advanced character support
2020/01/18 - v7.4.0/1.0 improve speed for UTF-8 <--> UTF-16 conversion of ASCII characters
2019/12/28 - v7.3.0/1.0 revision of UTF-8 validation
2019/12/26 - v7.2.0/1.0 override incorrect conversion from and to UTF-8 on XP, bugfix for broken ID 0 for the default ANSI code page
2019/12/24 - v7.1.0/1.0 supposed fix for broken option /l on XP
2019/12/23 - v7.0.0/1.0 add support for custom single-byte charsets
2019/08/18 - v6.4.0/1.0 improvement for memory management, BOM processing for GB-18030 enabled
2019/06/10 - v6.3.0/1.0 Virtual Terminal processing for Windows 10 enabled
2019/04/23 - v6.2.0/1.0 bugfix for faulty parsing of partially quoted paths
2019/03/19 - v6.1.0/1.0 bugfix for BOM processing
2019/03/17 - v6.0.0/1.0 added option /v to verify the conversion
2019/03/04 - v5.2.0/1.0 bugfix for redirected UTF-8 streams
2018/06/14 - v5.1.0/1.0 file size optimization
2018/05/12 - v5.0.0/1.0 added support for MIME names
2018/04/29 - v4.3.0/1.0 bugfix for memory leak of conversion to UTF-32 without threading
2018/04/27 - v4.2.0/1.0 removed option /n, code pages are assessed for an automatic decision if threading will be applied
2018/04/26 - v4.1.0/1.0 bugfix for wrong maximum of options
2018/04/26 - v4.0.0/1.0 added option /n for "no threading" to overcome the 1 MB limit of certain code pages
2018/04/20 - v3.1.0/1.0 thread-waiting moved
2018/04/20 - v3.0.0/1.0 added option /f to force the flushing of the file buffer before CONVERTCP terminates
2018/04/18 - v2.2.0/1.0 bugfix for unexpected output
2018/04/11 - v2.1.0/1.0 bugfix for unexpected output caused by still buffered stream content (finally fixed in v2.2)
2018/02/01 - v2.0.0/1.0 UTF-32 LE/BE support added, bugfix for reading UTF-16 BE
2017/12/30 - v1.5.0/1.0 bugfix for file names with leading dash
2017/05/27 - v1.4.4.0/1 added option /l to print a list of installed code pages
2017/02/02 - v1.4.3.0/1 added option /a for appending to an existing file
2017/01/29 - v1.4.2.0/1 reduced the size of the binary files by half (kudos to carlos)
2017/01/23 - v1.4.1.0/1 minor performance improvement
2016/12/28 - v1.4.0.0/1 UTF-16 BE support added, options /i and /o added
2016/12/09 - v1.3.2.0/1 fixed bug in conversion from UTF-8
2016/12/08 - v1.3.1.0/1 ambiguous code fixed, minor optimizations, source code tidied
2016/12/05 - v1.3.0.0/1 UTF-16 LE support added
2016/12/03 - v1.2.0.0/1 UTF-8 support added, fixed misleading error message if the input stream has a size of exact multiples of 4 MB
2016/11/28 - v1.1.4.0/1 minor optimizations, source code tidied, 64bit utility added
2016/11/25 - v1.1.3.0 fixed possible deadlock caused by unsignaled threads
2016/11/24 - v1.1.2.0 fixed possible memory leak if reallocations fail
2016/11/24 - v1.1.1.0 moved to C, multithreaded conversion added
unpublished - first versions using C++ vector containers, without multithreading