UTF-8 To Unicode
Posted: 18 May 2012 01:53
by mauro012345
Hi!
Is there a snippet similar to this one:
http://www.dostips.com/?t=Snippets.AnsiToUnicode
that translates a file from UTF-8 to Unicode txt format?
I have to do that with thousands of files, so I need a command I can call from a script.
Thanks!
Re: UTF-8 To Unicode
Posted: 18 May 2012 09:06
by aGerman
Batch has horrible Unicode support and it doesn't support UTF-8. If you need to convert between those character encodings, you could try VBScript.
See
http://www.robvanderwoude.com/vbstech_files_utf8.php
PM me if you need help with it.
Regards
aGerman
Re: UTF-8 To Unicode
Posted: 18 May 2012 14:27
by Liviu
Save the following to a batch file.
Code:
@echo off
setlocal disabledelayedexpansion
:: save original codepage
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"
:: write utf-16le BOM
chcp 1252 >nul
rem replace with 'cmd /a /c (set ..' if called at 'cmd /u' prompt
(set /p =ÿþ) <nul >%2 2>nul
chcp %cp% >nul
:: convert utf-8 to utf-16le
rem all on one line since batch parsing fails while active codepage is utf-8
chcp 65001 >nul & cmd /u /c type %1 >>%2 & chcp %cp% >nul
Call it with the source UTF-8 encoded file as the 1st argument and the destination filename as the 2nd argument; the output is saved as UTF-16LE (including the leading BOM). It was tested to work under my XP SP3; note, however, that this is just a minimal snippet with no error checking.
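For a whole folder of files, a loop along the following lines might be a starting point. It is only a sketch: it assumes the script above was saved as utf8to16.cmd (a made-up name) in the current folder or somewhere on the path, that the sources are the *.txt files in the current folder, and that the results should go to a subfolder named converted (also made up).

```batch
@echo off
setlocal disabledelayedexpansion
rem hypothetical names: utf8to16.cmd is the script above, "converted" is the output folder
if not exist "converted\" mkdir "converted"
rem convert every .txt file in the current folder, keeping name and extension
for %%f in (*.txt) do call utf8to16.cmd "%%f" "converted\%%~nxf"
```

The quotes are passed through as part of %1 and %2, so filenames containing spaces should survive. Files of the same name already present in the output folder are overwritten.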
Liviu
Re: UTF-8 To Unicode
Posted: 18 May 2012 16:39
by aGerman
I wasn't aware that TYPE would return a usable output. Great, Liviu!
Regards
aGerman
Re: UTF-8 To Unicode
Posted: 18 May 2012 19:04
by Squashman
I tried just the original code and it seemed to do OK when I told my file viewing software to display the output file as UTF-16LE, but there were a few unreadable characters at the beginning.
Re: UTF-8 To Unicode
Posted: 18 May 2012 20:00
by Liviu
aGerman wrote: I wasn't aware that TYPE would return a usable output.
TYPE is indeed surprisingly well behaved for a builtin command.
Using combinations of chcp, the half-baked 65001 codepage support, and 'cmd /u', one can use TYPE to convert text files between codepages, or between 8-bit and Unicode encodings.
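As an illustration of the 8-bit-to-Unicode case, the script posted earlier could be generalized by taking the source codepage as a third argument. This is a sketch that has not been verified across Windows versions. The third argument and the name tounicode.cmd are additions made here for illustration, and the batch file itself is assumed to be saved as ANSI (codepage 1252) so that the two BOM characters come out as bytes 0xFF 0xFE.

```batch
@echo off
setlocal disabledelayedexpansion
rem usage (hypothetical): tounicode.cmd source destination codepage
rem e.g. tounicode.cmd in.txt out.txt 850
:: save original codepage
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"
:: write utf-16le BOM
chcp 1252 >nul
(set /p =ÿþ) <nul >%2 2>nul
chcp %cp% >nul
:: read the source in the codepage given as 3rd argument, let 'cmd /u' emit utf-16le
chcp %3 >nul & cmd /u /c type %1 >>%2 & chcp %cp% >nul
```

Passing 65001 as the codepage should reproduce the UTF-8 case from the original script.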
Squashman wrote: ...but there were a few unreadable characters at the beginning.
Maybe your input file had a UTF-8 BOM (neither required nor recommended), which TYPE doesn't like. Or maybe your viewer did not skip over the UTF-16LE BOM (both required and recommended). Or maybe you just had some characters in the test file that the viewer font does not cover.
Liviu
Re: UTF-8 To Unicode
Posted: 18 May 2012 21:19
by Squashman
I was just testing with plain old American English. Just 3 sentences. I used Notepad to save it as UTF-8 and then ran the code. I was going to post a screenshot but didn't have time earlier.
Re: UTF-8 To Unicode
Posted: 19 May 2012 09:25
by Liviu
Squashman wrote: I used notepad to save it as UTF-8
Notepad does indeed write a BOM to the UTF-8 file. When redirecting the output to a file, "type" converts the UTF-8 BOM to a UTF-16LE BOM. Since the original code forces a UTF-16LE BOM itself, the end result would be a UTF-16LE file mistakenly starting with two BOM sequences (0xFF 0xFE 0xFF 0xFE).
If you remove the ":: write utf-16le BOM" section from the original code, the conversion will work for UTF-8 files with an embedded BOM.
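With that section removed, the script would come down to something like the sketch below. One change beyond the removal is assumed here: the output uses plain redirection instead of append, since without the BOM-writing step nothing truncates the destination first, and appending would add to a file that already exists. The same caveats as before apply (no error checking), and it relies on the input really starting with a UTF-8 BOM.

```batch
@echo off
setlocal disabledelayedexpansion
:: save original codepage
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"
:: convert utf-8 (with BOM) to utf-16le, the BOM is carried over by "type"
rem plain redirection, not append, so that an existing destination file is replaced
rem all on one line since batch parsing fails while active codepage is utf-8
chcp 65001 >nul & cmd /u /c type %1 >%2 & chcp %cp% >nul
```

For input without a BOM, the original version with the BOM-writing section is still the one to use.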
Liviu
Re: UTF-8 To Unicode
Posted: 21 May 2012 03:39
by mauro012345
Let's try this!