Input unicode from a text file. (With gawk? ghostmachine4?)


orange_batch · #1 Post by **orange_batch** » 08 Sep 2010 05:12

The problem:
This process must be automated, so pasting manually as in that other recent thread won't suffice. I'm writing a script that I want to support reading unicode paths from a text file. I want all unicode to be reproduced the same as when pasting a character into DOS, where it appears as a square (with or without a marking inside), which is like ANSI with unicode where such exists. DOS has no problem dealing with unicode characters; they just can't be displayed, and this is fine.

Type will not reproduce unicode from a text file properly, regardless of its encoding. For cannot process type's junky unicode output, which makes it impossible to stick in a variable. For has no problem with pasted unicode, as mentioned above. (FYI to avoid wasting time, this isn't a code page (chcp) problem.)

So far:
I open cmd.exe /u so I can redirect > a unicode path to a text file. This encodes it as UTF-16LE. Let's say I have the path containing the Japanese hiragana "ki". The path and UTF-16LE-encoded text file look like this:
"C:\Folder き"

I need a utility to read the text file and reproduce this string as-is. I have a feeling gawk might do it, but I don't have experience with gawk. I'm going to mess with it and see what I can do though...

I found this vbscript? I can't get it to work though, plus I need it to accept the path of a text file as an argument and read from it. (Second half, Unicode to ASCII?)

http://www.xtremedotnettalk.com/showpos ... ostcount=7

ghostmachine4 · #2 Post by **ghostmachine4** » 08 Sep 2010 05:57

orange_batch wrote: I need a utility to read the text file and reproduce this string as-is. I have a feeling gawk might do it, but I don't have experience with gawk. I'm going to mess with it and see what I can do though...

I would suggest you choose a programming language with better unicode support such as Python (or Perl), also for their abundance of modules available that can help with your everyday tasks. I am not sure about unicode on vbscript so if you choose the native approach, you will have to dig around a little.

Here's a read on how Python deals with Unicode. (similarly for Perl, see its Docs)
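
A minimal sketch of what the Python route could look like, assuming the text file was written by redirecting from cmd /u (so it is UTF-16LE, possibly without a byte order mark) and holds one path per line. The function name `read_paths` is hypothetical and not from any post in this thread.

```python
from pathlib import Path


def read_paths(list_file):
    """Return the non-empty lines of a UTF-16LE text file as str objects."""
    raw = Path(list_file).read_bytes()
    text = raw.decode("utf-16-le")
    # Drop a byte order mark if the writer added one (assumption: it may be absent).
    text = text.lstrip("\ufeff")
    # splitlines() copes with CRLF line endings; surrounding quotes are removed.
    return [line.strip().strip('"') for line in text.splitlines() if line.strip()]
```

The returned strings keep characters such as き intact and can be passed to file APIs directly. Whether they print correctly in a console window still depends on the console font and code page.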

orange_batch · #3 Post by **orange_batch** » 08 Sep 2010 06:18

I really need this to work for DOS. Can gawk do it?

ghostmachine4 · #4 Post by **ghostmachine4** » 08 Sep 2010 06:42

orange_batch wrote: I really need this to work for DOS. Can gawk do it?

Yes, it should, although I don't work with unicode much. But see its documentation. It says "As of version 3.1.5, its multibyte aware. .... "

orange_batch · #5 Post by **orange_batch** » 08 Sep 2010 06:46

I'll give it a shot, thanks. Any further help would be appreciated.

I really don't know what the heck I'm doing with gawk. :(

jeb · #6 Post by **jeb** » 09 Sep 2010 06:35

This works on my computer with unicode in the filename, and in the file content.

Code: Select all

@echo off
setlocal
for %%a in (cfi*.*) DO (
  echo ---------- %%a -----------
  more "%%a"
  echo ---------------------
)

orange_batch · #7 Post by **orange_batch** » 09 Sep 2010 11:10

Scratch that: it must be read from a text file, jeb, but this might be a key to a solution. If unicode characters always come out irregular, the string could be manipulated to replace each unicode character with ?, then for /d (for folder names) could be run on it to retrieve the unicode... :!:

This only matches unicode length though, hmm. Good enough...?

Or, duhhhhh! Rather, forget the whole cmd /u thing and it outputs the ? in place of unicode for us. Then we can retrieve the unicode through this "improper" solution and detect if it returns more than one result or not... and if so, have the user choose between which folders during run-time, and proceed with the unicode from there.

One of my concerns was for /r "path". It does NOT work with ? in place of unicode. But, we can pushd "path" where ? is in place of unicode, and run for /r without the path, then popd.

I pasted this around the other forums where I asked this question:

jeb on DosTips.com made me realize a fair enough workaround:

First of all, I'm only dealing with folder names, but this can be applied to files as well.

Forget the whole cmd /u thing and it outputs the ? in place of unicode as usual. ? matches any single character in a path, including unicode.

Retrieve the unicode in a folder name:
for /d %%x in ("C:\Folder ?") do set folder="%%x"

(or retrieve the unicode in a file name:)
for %%x in ("C:\file ?") do set file="%%x"

There are two problems however:

1. "C:\Folder ?" will match both "C:\Folder き" and "C:\Folder こ", etc.

2. for /r "C:\Folder ?"... does not work. Other things might not work either, without retrieving the unicode.
Solution: pushd "C:\Folder ?" then for /r with no "path" (processes current directory) then popd.

For the first problem, we can detect whether it returns more than one result. If it does, have the user choose between the folders at run-time. This would have to verify each folder containing ? from the root folder to the most descendant folder. Parent folders can be filtered further by their descendants, unless the descendant folders exist in both parents, or something to that effect. I'm going to figure it out and work on a script for this.
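
The ambiguity check described above can be sketched as follows. This is an illustration in Python, not the batch script being planned here. The helper names and the sample folder list are hypothetical, and treating every non-ASCII character as ? is a simplifying assumption (the real substitution depends on the active code page).

```python
from fnmatch import fnmatchcase


def to_wildcard(name):
    """Replace every non-ASCII character with ? (simplified model of the lossy conversion)."""
    return "".join(ch if ord(ch) < 128 else "?" for ch in name)


def candidates(pattern, names):
    """Return the names matched by the ? pattern; more than one means ambiguity."""
    return [n for n in names if fnmatchcase(n, pattern)]


# Hypothetical directory listing used only for illustration.
names = ["Folder き", "Folder こ", "Folder A", "Other"]
pattern = to_wildcard("Folder き")
matches = candidates(pattern, names)

if len(matches) > 1:
    print("Ambiguous, ask the user to choose:", matches)
```

Note that `fnmatchcase` also gives `*` and `[...]` special meaning, and cmd's own wildcard rules may differ in edge cases, so this only models the single-character ? behaviour discussed in this post.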

It's the best that can be done. If anyone who comes across this can manage a solution to the original problem, it would still be desired.

jeb · #8 Post by **jeb** » 09 Sep 2010 13:30

Interestingly, I didn't use the ? for that purpose. I only used it because on my system your unicode char is always displayed as ?; I have never seen a square :!:

I tested a bit, both as file content and as filenames

My filelist.txt
CFile ╬.txt
CFile ¾.txt
CFile き.txt

The dir command shows, independent of cmd or cmd /u
CFile +.txt
CFile ¾.txt
CFile ?.txt

If I try this

Code: Select all

for /F "usebackq tokens=*" %%f in ("fileList.txt") DO echo File=%%f

I got no output.

This is better

Code: Select all

for /F "usebackq tokens=*" %%f in (`more fileList.txt`) DO (
  echo File=%%f
  echo %%f >> test.txt
)

File=CFile +.txt
File=CFile ¾.txt
File=CFile ?.txt

But it shows the problem: it seems that I always get a ? if it is a Chinese symbol.
I suppose it depends on my German system.

I can always display a file with "more", like my filelist.txt or test.txt
With "type" I can display filelist.txt, but not always test.txt :?:

It works only with cmd, not with cmd /u

I suppose it is not the best idea to solve a unicode problem with DOS batch; perhaps vbscript could be a better solution (did I really say this? Me, the pure DOS man :roll: )

orange_batch · #9 Post by **orange_batch** » 09 Sep 2010 15:22

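Those other characters are more like extended ANSI/ASCII, which the code pages can represent. True unicode characters like the Japanese き (not Chinese) have no equivalent in the code page, so they appear as a square or ?. (Byte count is not the difference: in UTF-16LE each of these characters is 2 bytes, and in UTF-8 き is 3 bytes.) So those characters are proper and technically irrelevant. It is good to know though.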
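
For reference, the sizes and code page coverage of these characters can be checked with a short Python sketch. The helper names are hypothetical, and cp850 is only an assumption about what a Western European console would use.

```python
def encoded_size(ch, encoding):
    """Number of bytes the character occupies in the given encoding."""
    return len(ch.encode(encoding))


def fits_code_page(ch, code_page):
    """True if the character can be represented in the given code page."""
    try:
        ch.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


for ch in "╬¾き":
    print(
        ch,
        "UTF-16LE bytes:", encoded_size(ch, "utf-16-le"),
        "UTF-8 bytes:", encoded_size(ch, "utf-8"),
        "in cp850:", fits_code_page(ch, "cp850"),
    )
```

All three characters take 2 bytes in UTF-16LE. What separates き from the other two in this check is that cp850 has no slot for it.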

Now, time to see what I can do about this workaround!

Here are more Japanese characters you can test, if you want to see what I mean:

こ = ko
し = shi
つ = tsu
え = e

Btw this is what they look like:

jeb · #10 Post by **jeb** » 09 Sep 2010 16:06

Sorry, my japanese is nearly as bad as my chinese

I tested it with your characters and also with ΩЯ٢٠١٠ (Greek omega, Cyrillic Ya, Arabic digits 2010)
I always get ?; only for the Greek omega did I get an "O"

orange_batch · #11 Post by **orange_batch** » 09 Sep 2010 16:12

Yes, but they are recognized as different, no?

if ki==ki (echo same) else echo not same

if ki==ko (echo same) else echo not same

Both would look like if ?==?...

Anyhow the solution I mentioned and am just about to start working on is to address any of DOS's confusion with multiple results from the replacement ?s. Hmmmmm also I'm getting kind of excited about AutoIt, it seems much better than PowerShell even.

jeb · #12 Post by **jeb** » 10 Sep 2010 11:09

If I try this

Code: Select all

if "し"=="こ" (
  echo same
) ELSE (
  echo Not same - unicode works
)

It depends on how I save the file.
As ANSI, I get "same"
As Unicode, I get "Not same - unicode works"
And if I type or more the file I get

Code: Select all

if "Òüù"=="Òüô" (

I have batch problems with simple German characters like äöü (which exist in some code pages), so how should it handle "real" unicode characters?
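
The two results above can be reproduced outside of cmd with a small Python sketch. It rests on assumptions about the setup that the thread does not state: that the "Unicode" save was UTF-8, that the console showed it in code page 850, and that the ANSI save used cp1252. The function names are hypothetical.

```python
def as_seen_in_console(text, saved_as="utf-8", console="cp850"):
    """Decode the saved bytes the way a console using another code page would."""
    return text.encode(saved_as).decode(console)


def saved_as_ansi(text, ansi="cp1252"):
    """Encode to an ANSI code page, replacing unrepresentable characters with ?."""
    return text.encode(ansi, errors="replace")


print(as_seen_in_console("し"), as_seen_in_console("こ"))
print(saved_as_ansi("し") == saved_as_ansi("こ"))
```

Under these assumptions the UTF-8 bytes of し and こ differ only in their last byte, so the comparison sees two different strings, while the ANSI save reduces both characters to the same ? byte and the comparison reports "same".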

orange_batch · #13 Post by **orange_batch** » 11 Sep 2010 04:16

Ah sorry, maybe it's my fault for not explaining well again. I meant to simply type/paste that test code into Command Prompt, which keeps the unicode intact.

I'm not trying to run unicode batch scripts, or necessarily read unicode text files (which is what I was trying to solve). It's dealing with unicode folder and file names, so I think the problem you're describing isn't really necessary to figure out.

There is some information here about some forms of displaying unicode, but it's irrelevant to my problem.

Anyhow, I'm almost done my script for handling unicode folders and files. I figured out a difficult part of the solution in my sleep again. :roll:
