UTF-8 To Unicode

Posted: 18 May 2012 01:53
by mauro012345
Hi! :)
Is there a snippet similar to this one:

http://www.dostips.com/?t=Snippets.AnsiToUnicode

to convert a file from UTF-8 to Unicode (UTF-16LE) text format?
I have to do that with thousands of files, so I need a command I can call from a script 8)
Thanks!

Re: UTF-8 To Unicode

Posted: 18 May 2012 09:06
by aGerman
Batch has horrible Unicode support, and it doesn't support UTF-8. If you need to convert between those character encodings, you could try VBScript.
See http://www.robvanderwoude.com/vbstech_files_utf8.php
PM me if you need help with it.

Regards
aGerman

Re: UTF-8 To Unicode

Posted: 18 May 2012 14:27
by Liviu
Save the following to a batch file.

@echo off
setlocal disabledelayedexpansion

:: save original codepage
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"

:: write utf-16le BOM
chcp 1252 >nul
rem replace with 'cmd /a /c (set ..' if called at 'cmd /u' prompt
(set /p =ÿþ) <nul >%2 2>nul
chcp %cp% >nul

:: convert utf-8 to utf-16le
rem all on one line since batch parsing fails while active codepage is utf-8
chcp 65001 >nul & cmd /u /c type %1 >>%2 & chcp %cp% >nul

Call it with the source UTF-8 encoded file as the 1st argument and the destination filename as the 2nd argument; the output is saved as UTF-16LE (including the leading BOM). It was tested to work under my XP SP3; note, however, that this is just a minimal snippet with no error checking.
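
For cross-checking the batch output against an independent converter, here is a minimal Python sketch of the same conversion. It is not part of the batch approach above; the function names are made up for illustration. Unlike the batch snippet, it assumes you want a UTF-8 BOM in the input dropped, so the output always starts with exactly one UTF-16LE BOM.

```python
import codecs
from pathlib import Path


def utf8_bytes_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16LE with one leading BOM."""
    # 'utf-8-sig' decodes plain UTF-8 and also drops one leading UTF-8 BOM
    # if the input has it (assumption: the input BOM is not wanted).
    text = data.decode("utf-8-sig")
    # 'utf-16-le' does not emit a BOM by itself, so it is prepended here.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def convert_file(src: str, dst: str) -> None:
    """Read src as UTF-8 and write dst as UTF-16LE (hypothetical helper)."""
    # Working on raw bytes avoids any newline translation, so CRLF
    # line endings pass through unchanged.
    Path(dst).write_bytes(utf8_bytes_to_utf16le(Path(src).read_bytes()))
```

Like the batch version, this has no error handling: invalid UTF-8 in the input raises UnicodeDecodeError instead of producing a partial file.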

Liviu

Re: UTF-8 To Unicode

Posted: 18 May 2012 16:39
by aGerman
I wasn't aware that TYPE would return a usable output. Great, Liviu!

Regards
aGerman

Re: UTF-8 To Unicode

Posted: 18 May 2012 19:04
by Squashman
I tried just the original code and it seemed to do ok when I told my file viewing software to display the output file using UTF-16LE but there were a few unreadable characters at the beginning.

Re: UTF-8 To Unicode

Posted: 18 May 2012 20:00
by Liviu
aGerman wrote: I wasn't aware that TYPE would return a usable output.
TYPE is indeed surprisingly well behaved for a builtin command ;-) Using combinations of chcp, the half-baked 65001 codepage support, and 'cmd /u', one can use TYPE to convert text files between codepages, or between 8-bit and Unicode encodings.
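
To make "converting between codepages" concrete at the byte level, here is a small Python sketch. It only illustrates what such a conversion does to the bytes; it says nothing about how TYPE works internally, and the function name is made up for the example.

```python
def transcode(data: bytes, src_encoding: str, dst_encoding: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(src_encoding).encode(dst_encoding)


# "é" is byte 0xE9 in Windows-1252 and byte 0x82 in OEM codepage 850.
assert transcode(b"\xe9", "cp1252", "cp850") == b"\x82"

# The same character takes two bytes in UTF-8 and two bytes in UTF-16LE.
assert transcode(b"\xe9", "cp1252", "utf-8") == b"\xc3\xa9"
assert transcode(b"\xe9", "cp1252", "utf-16-le") == b"\xe9\x00"
```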

Squashman wrote: ...but there were a few unreadable characters at the beginning.
Maybe your input file had a UTF-8 BOM (neither required nor recommended), which TYPE doesn't like. Or maybe your viewer did not skip over the UTF-16LE BOM (both required and recommended). Or maybe you just had some characters in the test file that the viewer font does not cover.
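
One way to tell these cases apart is to look at the first few bytes of both files. A minimal Python sketch, assuming only the three BOMs relevant here need to be recognized (the names are made up for the example):

```python
import codecs

# Signatures to look for; the labels are chosen for this sketch.
KNOWN_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
)


def detect_bom(head: bytes):
    """Return the label of the BOM that head starts with, or None."""
    for signature, name in KNOWN_BOMS:
        if head.startswith(signature):
            return name
    return None


def detect_file_bom(path: str):
    """Check only the first bytes of the file at path (hypothetical helper)."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4))
```

If the source file reports "UTF-8", it carries a BOM; if the output reports "UTF-16LE" and the viewer still shows junk at the start, the remaining suspects are the viewer or the font.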

Liviu

Re: UTF-8 To Unicode

Posted: 18 May 2012 21:19
by Squashman
I was just testing with plain old American English. Just 3 sentences. I used Notepad to save it as UTF-8 and then ran the code. I was going to post a screenshot but didn't have time earlier.

Re: UTF-8 To Unicode

Posted: 19 May 2012 09:25
by Liviu
Squashman wrote: I used Notepad to save it as UTF-8
Notepad does indeed write a BOM to the UTF-8 file. When redirecting the output to a file, "type" converts the UTF-8 BOM to a UTF-16LE BOM. Since the original code forces a UTF-16LE BOM itself, the end result would be a UTF-16LE file mistakenly starting with two BOM sequences (0xFF 0xFE 0xFF 0xFE).

If you remove the ":: write utf-16le BOM" section from the original code, the conversion will work for UTF-8 files with an embedded BOM.
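
The double BOM can be reproduced at the byte level with a short Python sketch. This models only the encoding arithmetic, not TYPE itself, and the sample text is made up:

```python
import codecs

# What Notepad writes for "Hello" saved as UTF-8: a BOM, then the text.
notepad_utf8 = codecs.BOM_UTF8 + "Hello".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as the character U+FEFF.
text = notepad_utf8.decode("utf-8")
assert text[0] == "\ufeff"

# Forcing a BOM in front of the converted text gives two of them.
doubled = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
assert doubled[:4] == b"\xff\xfe\xff\xfe"

# Without the forced BOM, the converted U+FEFF already serves as the
# single UTF-16LE BOM, followed directly by the text.
single = text.encode("utf-16-le")
assert single[:4] == b"\xff\xfeH\x00"
```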

Liviu

Re: UTF-8 To Unicode

Posted: 21 May 2012 03:39
by mauro012345
Let's try this!