
[Little challenge]Converting text file with unknown encoding into ANSI

Posted: 05 Nov 2016 01:55
by BoQsc
Information about the system the file and scripts are tested on:

Code: Select all

 INFO.BAT version 1.3
--------------------------------------------------------------------------------
Windows version        :  Microsoft Windows [Version 10.0.14393]
Product name           :  Windows 10 Home, 64 bit
Performance indicators :  Processor Cores: 6      Visible RAM: 8386100 kilobytes

Date/Time format       :  (yy/mm/dd)  2016-11-06  12:11:41,44
__APPDIR__             :  C:\WINDOWS\system32\
ComSpec                :  C:\WINDOWS\system32\cmd.exe
PathExt                :  .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
Extensions             :  system: Enabled   user: Enabled
Delayed expansion      :  system: Disabled  user: Disabled
Locale name            :  lt-LT       Code Pages: OEM  775    ANSI 1257
DIR  format            :  2016-11-05  16:07     1 342 177 280 pagefile.sys
Permissions            :  Elevated Admin=No, Admin group=Yes

                          Missing from the tool collection:  debug


Thread Description:
BoQsc wrote: I wasted a lot of time today trying to figure this out, but I couldn't convert this text file to ANSI without losing its content.
________________________________________________
Merged_registry.reg
[Download]

This is a file encoded in an unknown text format.
________________________________________________
P.S. I need this to be done in a batch script.



I'm hoping someone can lend me a hand; I've worn myself out on this.

Re: [Little challenge]Converting text file with unknown encoding into ANSI

Posted: 05 Nov 2016 06:34
by LotPings
Looks to me like UCS-2 LE without a BOM.

Code: Select all

> dumphex /l100 merged_registry.reg
DumpHex Version 1.0.1
Copyright (c) 2003 Robert Bachmann

00000000h: 57 00 69 00 6E 00 64 00 6F 00 77 00 73 00 20 00 W.i.n.d.o.w.s. .
00000010h: 52 00 65 00 67 00 69 00 73 00 74 00 72 00 79 00 R.e.g.i.s.t.r.y.
00000020h: 20 00 45 00 64 00 69 00 74 00 6F 00 72 00 20 00  .E.d.i.t.o.r. .
00000030h: 56 00 65 00 72 00 73 00 69 00 6F 00 6E 00 20 00 V.e.r.s.i.o.n. .
00000040h: 35 00 2E 00 30 00 30 00 0D 00 0A 00 0D 00 0A 00 5...0.0.........
00000050h: 5B 00 48 00 4B 00 45 00 59 00 5F 00 43 00 4C 00 [.H.K.E.Y._.C.L.
00000060h: 41 00 53 00 53 00 45 00 53 00 5F 00 52 00 4F 00 A.S.S.E.S._.R.O.
00000070h: 4F 00 54 00 5C 00 62 00 61 00 74 00 66 00 69 00 O.T.\.b.a.t.f.i.
00000080h: 6C 00 65 00 5C 00 73 00 68 00 65 00 6C 00 6C 00 l.e.\.s.h.e.l.l.
00000090h: 5C 00 70 00 72 00 69 00 6E 00 74 00 5D 00 0D 00 \.p.r.i.n.t.]...
000000A0h: 0A 00 0D 00 0A 00 5B 00 48 00 4B 00 45 00 59 00 ......[.H.K.E.Y.
000000B0h: 5F 00 43 00 4C 00 41 00 53 00 53 00 45 00 53 00 _.C.L.A.S.S.E.S.
000000C0h: 5F 00 52 00 4F 00 4F 00 54 00 5C 00 62 00 61 00 _.R.O.O.T.\.b.a.
000000D0h: 74 00 66 00 69 00 6C 00 65 00 5C 00 73 00 68 00 t.f.i.l.e.\.s.h.
000000E0h: 65 00 6C 00 6C 00 5C 00 70 00 72 00 69 00 6E 00 e.l.l.\.p.r.i.n.
000000F0h: 74 00 5C 00 63 00 6F 00 6D 00 6D 00 61 00 6E 00 t.\.c.o.m.m.a.n.


If I open it in Notepad++ and convert it to ANSI, I get this:

Code: Select all

> dumphex /l100 merged_registry_.reg
DumpHex Version 1.0.1
Copyright (c) 2003 Robert Bachmann

00000000h: 57 69 6E 64 6F 77 73 20 52 65 67 69 73 74 72 79 Windows Registry
00000010h: 20 45 64 69 74 6F 72 20 56 65 72 73 69 6F 6E 20  Editor Version
00000020h: 35 2E 30 30 0D 0A 0D 0A 5B 48 4B 45 59 5F 43 4C 5.00....[HKEY_CL
00000030h: 41 53 53 45 53 5F 52 4F 4F 54 5C 62 61 74 66 69 ASSES_ROOT\batfi
00000040h: 6C 65 5C 73 68 65 6C 6C 5C 70 72 69 6E 74 5D 0D le\shell\print].
00000050h: 0A 0D 0A 5B 48 4B 45 59 5F 43 4C 41 53 53 45 53 ...[HKEY_CLASSES
00000060h: 5F 52 4F 4F 54 5C 62 61 74 66 69 6C 65 5C 73 68 _ROOT\batfile\sh
00000070h: 65 6C 6C 5C 70 72 69 6E 74 5C 63 6F 6D 6D 61 6E ell\print\comman
00000080h: 64 5D 0D 0A 40 3D 68 65 78 28 32 29 3A 32 35 2C d]..@=hex(2):25,
00000090h: 30 30 2C 35 33 2C 30 30 2C 37 39 2C 30 30 2C 37 00,53,00,79,00,7
000000A0h: 33 2C 30 30 2C 37 34 2C 30 30 2C 36 35 2C 30 30 3,00,74,00,65,00
000000B0h: 2C 36 64 2C 30 30 2C 35 32 2C 30 30 2C 36 66 2C ,6d,00,52,00,6f,
000000C0h: 30 30 2C 36 66 2C 30 30 2C 37 34 2C 30 30 2C 32 00,6f,00,74,00,2
000000D0h: 35 2C 5C 0D 0A 20 20 30 30 2C 35 63 2C 30 30 2C 5,\..  00,5c,00,
000000E0h: 35 33 2C 30 30 2C 37 39 2C 30 30 2C 37 33 2C 30 53,00,79,00,73,0
000000F0h: 30 2C 37 34 2C 30 30 2C 36 35 2C 30 30 2C 36 64 0,74,00,65,00,6d

Re: [Little challenge]Converting text file with unknown encoding into ANSI

Posted: 05 Nov 2016 06:40
by aGerman
You can't convert everything to ANSI because if you have e.g. something like š (latin s with caron) and ю (cyrillic yu) in the same file you won't find a code page with a single byte representation for both of them.
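
As a small illustration of that limitation (a sketch in Python, not part of the batch solution; the helper name and the list of candidate code pages are assumptions chosen for the example), one can test which single-byte Windows code pages are able to encode a given string:

```python
def code_pages_covering(text, code_pages):
    """Return the code pages from the list that can encode every character of text."""
    covering = []
    for cp in code_pages:
        try:
            text.encode(cp)
        except UnicodeEncodeError:
            continue  # at least one character has no single-byte representation
        covering.append(cp)
    return covering


candidates = ["cp1250", "cp1251", "cp1252", "cp1257"]
print(code_pages_covering("š", candidates))   # ['cp1250', 'cp1252', 'cp1257']
print(code_pages_covering("ю", candidates))   # ['cp1251']
print(code_pages_covering("šю", candidates))  # []
```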

Your sample file doesn't have an unknown encoding. It's UTF-16 LE (or its UCS-2 subset) without a byte order mark (BOM). Prepend the BOM if you want to convert it with the TYPE command.

Code: Select all

@echo off &setlocal
set "file=merged_registry.reg"

:: save the current codepage
for /f "tokens=2 delims=:" %%i in ('chcp') do set /a oemcp=%%~ni

:: switch to Windows-1252 codepage
>nul chcp 1252

:: create the Byte Order Mark (UTF-16 little endian)
<nul >"%file%~" set /p "=ÿþ"

:: switch back to your default
>nul chcp %oemcp%

:: append the file content
copy /b "%file%~" + /b "%file%" /b

:: overwrite the old file using TYPE command
>"%file%" type "%file%~"

:: delete temporary file
del "%file%~"

This should work if the script is encoded in Windows-1252. If you have trouble please run info.bat
viewtopic.php?f=3&t=7420&p=49133#p49133
and post the output of that script.

Steffen
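
For readers who want to see what the "prepend the BOM, then convert" idea does at the byte level, here is a minimal Python sketch (not batch, and not a replacement for the script above). The function name is hypothetical, and the default target code page 1257 is an assumption taken from the ANSI code page listed in the INFO.BAT output of the first post.

```python
import codecs


def utf16le_to_ansi(data: bytes, ansi_codepage: str = "cp1257") -> bytes:
    """Prepend the FF FE BOM if it is missing, then transcode to a single-byte code page."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        data = codecs.BOM_UTF16_LE + data
    # The generic "utf-16" codec reads the BOM, selects little endian, and drops the BOM.
    text = data.decode("utf-16")
    # Strict encoding raises UnicodeEncodeError instead of silently losing characters.
    return text.encode(ansi_codepage)


raw = "Windows Registry Editor Version 5.00\r\n".encode("utf-16-le")  # no BOM
print(utf16le_to_ansi(raw))  # b'Windows Registry Editor Version 5.00\r\n'
```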

Re: [Little challenge]Converting text file with unknown encoding into ANSI

Posted: 06 Nov 2016 04:17
by BoQsc
aGerman wrote:It's UTF-16 LE (or its UCS-2 subset)
I'm leaning towards UCS-2: merged_registry.reg is combined from multiple UCS-2 Little Endian encoded .reg files. But I'm not sure whether merged_registry.reg itself is also UCS-2 Little Endian.

Here is a sample script showing how the registry files were combined into one registry file:

Code: Select all

@ECHO OFF

set "merged_output=merged_registry.reg"

::Creates new text file and makes it ASCII
cmd /u /c echo. 2>%merged_output%

::Plus sign (+) at the end of copy command is not a mistake.
copy %merged_output% + batfile.reg + cmdfile.reg + fonfile.reg +


[Download sample Reg files for testing]



aGerman wrote:If you have trouble please run info.bat
viewtopic.php?f=3&t=7420&p=49133#p49133
and post the output of that script.

Steffen
Info.bat output: moved to the first post of the thread.



aGerman wrote:This should work if the script is encoded in Windows-1252.
Script output:
[Image: screenshot of the script output]

Re: [Little challenge]Converting text file with unknown encoding into ANSI

Posted: 06 Nov 2016 06:55
by aGerman
BoQsc wrote: ::Creates new text file and makes it ASCII

That's entirely wrong. Run CMD /? and see what the /u option is for.

I figured it would get a little more complicated in your case. We need a more generic way to create the BOM for your Lithuanian environment.

Code: Select all

@echo off &setlocal
set "merged_output=merged_registry.reg"

::Creates a new text file with BOM 0xFF 0xFE
>"t.tmp" type nul
>"%merged_output%" type nul
for %%i in (255 254) do (
  >nul makecab /d compress=off /d reserveperfoldersize=%%i /d reserveperdatablocksize=26 "t.tmp" "temp.tmp"
  type "temp.tmp" | ((for /l %%j in (1 1 38) do pause)>nul&findstr "^">"t.tmp")
  >nul copy /y "t.tmp" /a "temp.tmp" /b
  >nul copy /y "%merged_output%" /b + "temp.tmp" /b
)

:: delete temporary files
del "t.tmp" "temp.tmp"

::Plus sign (+) at the end of copy command is not a mistake.
copy %merged_output% + batfile.reg + cmdfile.reg + fonfile.reg +


:: ***** The file is now UCS-2 encoded with a BOM. Proceed only if you still need ASCII (which I doubt!) *****

:: write a temporary ASCII file using TYPE command
>"%merged_output%~" type "%merged_output%"

:: overwrite the old file
move /y "%merged_output%~" "%merged_output%"


Whether you actually need the conversion to ASCII at the end of the code depends on what you want to do with the newly created "merged_registry.reg". I'm also wondering whether you really need a .reg file at all. What about the REG command in batch? What you're trying to do here seems far too complicated.

Steffen
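
The intent of the merge step (one BOM at the start, followed by the content of each file) can be sketched in Python as follows. This is an illustration only, not the batch approach above; the function name is hypothetical, and stripping a per-file BOM is an assumption added for robustness, since the thread does not say whether the individual .reg files carry one.

```python
import codecs


def merge_utf16le(parts):
    """Concatenate UTF-16 LE byte strings into one stream with exactly one leading BOM."""
    merged = bytearray(codecs.BOM_UTF16_LE)
    for part in parts:
        if part.startswith(codecs.BOM_UTF16_LE):
            part = part[len(codecs.BOM_UTF16_LE):]  # drop the per-file BOM (assumed possible)
        merged += part
    return bytes(merged)


a = "[HKEY_CLASSES_ROOT\\batfile]\r\n".encode("utf-16-le")                         # without BOM
b = codecs.BOM_UTF16_LE + "[HKEY_CLASSES_ROOT\\cmdfile]\r\n".encode("utf-16-le")  # with BOM
merged = merge_utf16le([a, b])
print(merged[:2].hex())         # fffe
print(merged.decode("utf-16"))  # both keys, each on its own line
```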

Re: [Little challenge]Converting text file with unknown encoding into ANSI

Posted: 10 Nov 2016 07:00
by jfl
FYI, I've written a command-line tool for converting data from any encoding to any other. And it also has options for adding or removing BOMs.
The tool is called conv.exe, and it's available in the SysTools.zip package here: https://github.com/JFLarvoire/SysToolsLib/releases
The C sources are there too, if you're interested.
I use it often, for example for converting emails encoded in UTF-8 that some webmail readers display as (unreadable) ANSI.
Another common use case is transferring data between GUI and command-line apps: GUI apps use the Windows system code page (1252 on French systems), which cannot be changed; command-line apps use the console's "current code page" (850 by default on French systems), which can be changed with the CHCP command. conv.exe defaults to converting from the system CP to the current CP. So you can transfer data from a GUI app to a DOS app by copying it to the clipboard, then running this in a cmd console: 1clip.exe | conv.exe | myapp.exe
In your case, to convert a file in place from UTF-16 (CP 1200) to ANSI (CP 1252), you'd run:

Code: Select all

conv 1200 1252 myfile -same -bak

Run (conv -?) to display a short help screen with a list of all options.
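
For comparison, an in-place conversion with a backup copy can be sketched in a few lines of Python. This is not how conv.exe is implemented; the function name, the ".bak" suffix, and the file name "myfile.reg" are assumptions made for the example. Python's "utf-16-le" and "cp1252" codecs correspond to Windows code pages 1200 and 1252.

```python
import os
import shutil
import tempfile


def convert_in_place(path, src_encoding="utf-16-le", dst_encoding="cp1252"):
    """Transcode a file in place, keeping the original bytes in path + '.bak'."""
    with open(path, "rb") as f:
        text = f.read().decode(src_encoding)
    if text.startswith("\ufeff"):
        text = text[1:]  # drop a leading BOM so it is not encoded into the output
    # Encode before touching the file: strict mode raises UnicodeEncodeError
    # if a character has no representation in the target code page.
    encoded = text.encode(dst_encoding)
    shutil.copy2(path, path + ".bak")
    with open(path, "wb") as f:
        f.write(encoded)


# Usage on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.reg")
    with open(path, "wb") as f:
        f.write("Windows Registry Editor Version 5.00\r\n".encode("utf-16-le"))
    convert_in_place(path)
    with open(path, "rb") as f:
        print(f.read())  # b'Windows Registry Editor Version 5.00\r\n'
```

Encoding before writing means a file containing characters outside the target code page is left untouched instead of being truncated or partially converted.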