UTF-8 To Unicode

#1 Post by **mauro012345** » 18 May 2012 01:53

Hi!

Is there a snippet similar to this one:

http://www.dostips.com/?t=Snippets.AnsiToUnicode

that converts a file from UTF-8 to Unicode text format?
I have to do this for thousands of files, so I need a command I can call from a script.

Thanks!

#2 Post by **aGerman** » 18 May 2012 09:06

Batch has horrible Unicode support, and it doesn't support UTF-8. If you need to convert between those character encodings, you could try VBScript.
See http://www.robvanderwoude.com/vbstech_files_utf8.php
PM me if you need help with it.

Regards
aGerman

#3 Post by **Liviu** » 18 May 2012 14:27

Save the following to a batch file.

Code:

@echo off
setlocal disabledelayedexpansion

:: save original codepage
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"

:: write utf-16le BOM
chcp 1252 >nul
rem replace with 'cmd /a /c (set ..' if called at 'cmd /u' prompt
(set /p =ÿþ) <nul >%2 2>nul
chcp %cp% >nul

:: convert utf-8 to utf-16le
rem all on one line since batch parsing fails while active codepage is utf-8
chcp 65001 >nul & cmd /u /c type %1 >>%2 & chcp %cp% >nul

Call it with the source UTF-8 encoded file as the 1st argument and the destination filename as the 2nd argument; the output is saved as UTF-16LE (including the leading BOM). It was tested under my XP SP3; note, however, that this is just a minimal snippet with no error checking.
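
For readers who have Python available, the same conversion can be sketched outside of batch. This is not part of the snippet above, just an illustrative alternative; the function name `utf8_to_utf16le` is made up for this example. It decodes with `utf-8-sig`, which accepts input with or without a UTF-8 BOM, and always writes exactly one UTF-16LE BOM.

```python
import codecs
from pathlib import Path


def utf8_to_utf16le(src, dst):
    """Convert one UTF-8 text file to UTF-16LE with a single leading BOM.

    Hypothetical helper for illustration; not part of the batch snippet.
    """
    data = Path(src).read_bytes()
    # "utf-8-sig" drops a leading UTF-8 BOM if present and otherwise
    # behaves like plain "utf-8". Invalid input raises UnicodeDecodeError.
    text = data.decode("utf-8-sig")
    # "utf-16-le" does not emit a BOM on its own, so prepend it explicitly.
    Path(dst).write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
```

For a whole folder, something like `for p in Path("in").glob("*.txt"): utf8_to_utf16le(p, Path("out") / p.name)` should do, where `in` and `out` are placeholder directory names and `out` is assumed to exist already.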

Liviu

#4 Post by **aGerman** » 18 May 2012 16:39

I wasn't aware that TYPE would return a usable output. Great, Liviu!

Regards
aGerman

#5 Post by **Squashman** » 18 May 2012 19:04

I tried just the original code, and it seemed to do OK when I told my file-viewing software to display the output file as UTF-16LE, but there were a few unreadable characters at the beginning.

#6 Post by **Liviu** » 18 May 2012 20:00

aGerman wrote: I wasn't aware that TYPE would return a usable output.

TYPE is indeed surprisingly well behaved for a builtin command ;-)

Using combinations of chcp, the half-baked 65001 codepage support, and 'cmd /u', one can use TYPE to convert text files between codepages, or between 8-bit and Unicode encodings.

Squashman wrote: ...but there were a few unreadable characters at the beginning.

Maybe your input file had a UTF-8 BOM (neither required nor recommended), which TYPE doesn't like. Or maybe your viewer did not skip over the UTF-16LE BOM (both required and recommended). Or maybe you just had some characters in the test file that the viewer font does not cover.

Liviu

#7 Post by **Squashman** » 18 May 2012 21:19

I was just testing with plain old American English. Just 3 sentences. I used Notepad to save it as UTF-8 and then ran the code. I was going to post a screenshot but didn't have time earlier.

#8 Post by **Liviu** » 19 May 2012 09:25

Squashman wrote: I used Notepad to save it as UTF-8

Notepad does indeed write a BOM to the UTF-8 file. When redirecting the output to a file, "type" converts the UTF-8 BOM to a UTF-16LE BOM. Since the original code forces a UTF-16LE BOM itself, the end result would be a UTF-16LE file mistakenly starting with two BOM sequences (0xFF 0xFE 0xFF 0xFE).

If you remove the ":: write utf-16le BOM" section from the original code, the conversion will work for UTF-8 files with an embedded BOM.
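
If a batch of files has already been converted with the BOM section left in, the doubled BOM described above (0xFF 0xFE 0xFF 0xFE) can be detected and collapsed after the fact. Below is a minimal Python sketch, not from the original snippet; `strip_duplicate_bom` is a hypothetical helper name, and it assumes the text itself never legitimately starts with U+FEFF.

```python
import codecs

BOM = codecs.BOM_UTF16_LE  # the two bytes 0xFF 0xFE


def strip_duplicate_bom(data: bytes) -> bytes:
    """Collapse repeated leading UTF-16LE BOMs down to a single one.

    Hypothetical repair helper; data without a doubled BOM is returned
    unchanged.
    """
    while data.startswith(BOM + BOM):
        data = data[len(BOM):]
    return data
```

It works on the raw bytes of a file, so a repair pass would read each file in binary mode, call the helper, and write the result back only if it changed.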

Liviu

#9 Post by **mauro012345** » 21 May 2012 03:39

Let's try this!
