Input unicode from a text file. (With gawk? ghostmachine4?)
Moderator: DosItHelp
-
- Expert
- Posts: 442
- Joined: 01 Aug 2010 17:13
- Location: Canadian Pacific
- Contact:
Input unicode from a text file. (With gawk? ghostmachine4?)
The problem:
This process must be automated, so pasting manually as in that other recent thread won't suffice. I'm writing a script that I want to support reading unicode paths from a text file. I want all unicode to be reproduced the same as when pasting a character into DOS, where it appears as a square (with or without a marking inside), which is like ANSI with unicode where such exists. DOS has no problem dealing with unicode characters; they just can't be displayed, and this is fine.
Type will not reproduce unicode from a text file properly, regardless of its encoding. For cannot process type's junky unicode output, which makes it impossible to stick in a variable. For has no problem with pasted unicode, as aforementioned. (FYI, to avoid wasting time: this isn't a code page (chcp) problem.)
So far:
I open cmd.exe /u so I can redirect > a unicode path to a text file. This encodes it as UTF-16LE. Let's say I have the path containing the Japanese hiragana "ki". The path and UTF-16LE-encoded text file look like this:
"C:\Folder き"
I need a utility to read the text file and reproduce this string as-is. I have a feeling gawk might do it, but I don't have experience with gawk. I'm going to mess with it and see what I can do though...
I found this vbscript? I can't get it to work though, plus I need it to accept the path of a text file as an argument and read from it. (Second half, Unicode to ASCII?)
http://www.xtremedotnettalk.com/showpos ... ostcount=7
-
- Posts: 319
- Joined: 12 May 2006 01:13
Re: Input unicode from a text file. (With gawk? ghostmachine
orange_batch wrote:I need a utility to read the text file and reproduce this string as-is. I have a feeling gawk might do it, but I don't have experience with gawk. I'm going to mess with it and see what I can do though...
I would suggest you choose a programming language with better unicode support such as Python (or Perl), also for their abundance of modules available that can help with your everyday tasks. I am not sure about unicode on vbscript so if you choose the native approach, you will have to dig around a little.
Here's a read on how Python deals with Unicode. (similarly for Perl, see its Docs)
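As a minimal sketch of that suggestion: assuming the text file was written by redirecting from cmd.exe /u (UTF-16LE, typically without a BOM) and holds one quoted path per line, Python can read it back as shown here. The function name read_paths and the sample file name are hypothetical, chosen only for illustration.

```python
import os
import tempfile

def read_paths(list_file, encoding="utf-16-le"):
    """Return the paths stored one per line in a UTF-16LE text file."""
    # newline="" keeps the raw CRLF line ends; splitlines() handles them below.
    with open(list_file, "r", encoding=encoding, newline="") as handle:
        text = handle.read()
    # Drop a BOM if an editor added one, then strip the surrounding quotes.
    text = text.lstrip("\ufeff")
    return [line.strip().strip('"') for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    # Simulate a redirected file: UTF-16LE bytes with a CRLF line end (assumption).
    sample = os.path.join(tempfile.mkdtemp(), "paths.txt")
    with open(sample, "wb") as handle:
        handle.write('"C:\\Folder き"\r\n'.encode("utf-16-le"))
    for path in read_paths(sample):
        print(ascii(path))  # ascii() avoids console encoding trouble when printing
```

This only shows the reading side; handing the resulting string back to a batch script runs into the same console limitations discussed in the rest of this thread.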
-
- Expert
- Posts: 442
- Joined: 01 Aug 2010 17:13
- Location: Canadian Pacific
- Contact:
Re: Input unicode from a text file. (With gawk? ghostmachine
I really need this to work for DOS. Can gawk do it?
-
- Posts: 319
- Joined: 12 May 2006 01:13
Re: Input unicode from a text file. (With gawk? ghostmachine
orange_batch wrote:I really need this to work for DOS. Can gawk do it?
Yes, it should, although I don't work with unicode much. But see its documentation. It says "As of version 3.1.5, it's multibyte aware. .... "
-
- Expert
- Posts: 442
- Joined: 01 Aug 2010 17:13
- Location: Canadian Pacific
- Contact:
Re: Input unicode from a text file. (With gawk? ghostmachine
I'll give it a shot, thanks. Any further help would be appreciated.
I really don't know what the heck I'm doing with gawk. :(
Re: Input unicode from a text file. (With gawk? ghostmachine
This works on my computer with unicode in the filename, and in the file content.
Code: Select all
@echo off
setlocal
for %%a in (cfi*.*) DO (
echo ---------- %%a -----------
more "%%a"
echo ---------------------
)
-
- Expert
- Posts: 442
- Joined: 01 Aug 2010 17:13
- Location: Canadian Pacific
- Contact:
Re: Input unicode from a text file. (With gawk? ghostmachine
scratch: It must be read from a text file, jeb, but this might be a key to a solution. If unicode characters are always reproduced irregularly, the string could be manipulated to replace unicode with ?, then run through for /d (for folder names), and that could be used to retrieve the unicode... This only matches the unicode's length though, hmm. Good enough...?
Or, duhhhhh! Rather, forget the whole cmd /u thing and it outputs the ? in place of unicode for us. Then we can retrieve the unicode through this "improper" solution and detect if it returns more than one result or not... and if so, have the user choose between which folders during run-time, and proceed with the unicode from there.
One of my concerns was for /r "path". It does NOT work with ? in place of unicode. But, we can pushd "path" where ? is in place of unicode, and run for /r without the path, then popd.
I pasted this around the other forums where I asked this question:
jeb on DosTips.com made me realize a fair enough workaround:
First of all, I'm only dealing with folder names, but this can be applied to files as well.
Forget the whole cmd /u thing and it outputs the ? in place of unicode as usual. ? matches any single character in a path, including unicode.
Retrieve the unicode in a folder name:
for /d %%x in ("C:\Folder ?") do set folder="%%x"
(or retrieve the unicode in a file name:)
for %%x in ("C:\file ?") do set file="%%x"
There are two problems however:
1. "C:\Folder ?" will match both "C:\Folder き" and "C:\Folder こ", etc.
2. for /r "C:\Folder ?"... does not work. Other things might not work either, without retrieving the unicode.
Solution: pushd "C:\Folder ?" then for /r with no "path" (processes current directory) then popd.
For the first problem, we can detect if it returns more than one result or not. If it does, have the user choose between which folders during run-time. This would have to verify each folder containing ? from the root folder to the most descendant folder. Parent folders can be filtered further by their descendants, unless the descendant folders exist in both parents, or something to that effect. I'm going to figure it out and work on a script for this.
It's the best that can be done. If anyone who comes across this can manage a solution to the original problem, it would still be desired.
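The ambiguity check described above can be modelled in a few lines. This is only a sketch of the logic in Python, not the batch script itself: the helper names resolve and pick are hypothetical, and fnmatchcase only approximates cmd's wildcard rules (it also treats * and [ ] as special, and knows nothing about 8.3 short names).

```python
from fnmatch import fnmatchcase

def resolve(pattern, names):
    """Return the names matching a pattern in which ? stands for one character."""
    return [name for name in names if fnmatchcase(name, pattern)]

def pick(pattern, names):
    """Return the single match, or raise if the pattern is ambiguous or unmatched."""
    matches = resolve(pattern, names)
    if len(matches) != 1:
        # This is the point where a script would ask the user to choose.
        raise LookupError(f"{len(matches)} matches for {pattern!r}")
    return matches[0]

if __name__ == "__main__":
    folders = ["Folder き", "Folder こ", "Other"]
    print(len(resolve("Folder ?", folders)))  # count of candidate folders
    print(pick("Othe?", folders))
```

When resolve returns more than one name, the candidates are exactly the list a script would present to the user at run-time.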
Last edited by orange_batch on 09 Sep 2010 15:40, edited 1 time in total.
Re: Input unicode from a text file. (With gawk? ghostmachine
The interesting thing is that I didn't use the ? for this purpose. I only used it because on my system your unicode char is always displayed as ?; I have never seen a square.
I tested a bit, both as file content and as file names.
My filelist.txt
CFile ╬.txt
CFile ¾.txt
CFile き.txt
The dir command shows, independent of cmd or cmd /u
CFile +.txt
CFile ¾.txt
CFile ?.txt
If I try this
Code: Select all
for /F "usebackq tokens=*" %%f in ("fileList.txt") DO echo File=%%f
I got no output.
This is better
Code: Select all
for /F "usebackq tokens=*" %%f in (`more fileList.txt`) DO (
echo File=%%f
echo %%f >> test.txt
)
File=CFile +.txt
File=CFile ¾.txt
File=CFile ?.txt
But it shows the problem: it seems that I always get a ? if it is a Chinese symbol.
I suppose it depends on my German system.
I can always display a file with "more", like my filelist.txt or test.txt.
With "type" I can display filelist.txt, but not always test.txt.
It works only with cmd, not with cmd /u.
I suppose it is not the best idea to solve a unicode problem with DOS batch; perhaps vbscript could be a better solution (did I really say this? Me - the pure DOS man )
-
- Expert
- Posts: 442
- Joined: 01 Aug 2010 17:13
- Location: Canadian Pacific
- Contact:
Re: Input unicode from a text file. (With gawk? ghostmachine
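Those other characters are more like extended ANSI/ASCII: they exist in some code pages, so DOS can show or approximate them. True unicode characters like the Japanese き (not Chinese ) have no such equivalent and appear as a square or ?. (It isn't really about byte count: in UTF-16LE both kinds take 2 bytes, and き takes 3 bytes in UTF-8.) So those characters are proper and technically irrelevant. It is good to know though.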
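A quick way to check how each of the test characters is stored is to encode it and count the bytes. The sketch below is in Python rather than batch; the helper name describe is hypothetical, and cp850 is an assumption about the console code page on a German system.

```python
def describe(char, console_codepage="cp850"):
    """Summarise how one character is stored in common encodings."""
    try:
        char.encode(console_codepage)
        in_codepage = True
    except UnicodeEncodeError:
        in_codepage = False  # no equivalent in that code page
    return {
        "code point": f"U+{ord(char):04X}",
        "utf-16-le bytes": len(char.encode("utf-16-le")),
        "utf-8 bytes": len(char.encode("utf-8")),
        "in code page": in_codepage,
    }

if __name__ == "__main__":
    for char in "╬¾き":
        print(ascii(char), describe(char))
```

Under these assumptions ╬ and ¾ both have a slot in the code page while き does not, which matches the difference in how they are displayed.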
Now, time to see what I can do about this workaround!
Here are more Japanese characters you can test, if you want to see what I mean:
こ = ko
し = shi
つ = tsu
え = e
Btw this is what they look like:
Re: Input unicode from a text file. (With gawk? ghostmachine
Sorry, my Japanese is nearly as bad as my Chinese
I tested it with your characters and also with ΩЯ٢٠١٠ (Greek omega, Cyrillic Ya, Arabic numerals 2010)
I always get ?; only for the Greek omega did I get an "O"
-
- Expert
- Posts: 442
- Joined: 01 Aug 2010 17:13
- Location: Canadian Pacific
- Contact:
Re: Input unicode from a text file. (With gawk? ghostmachine
Yes, but they are recognized as different, no?
if ki==ki (echo same) else echo not same
if ki==ko (echo same) else echo not same
Both would look like if ?==?...
Anyhow the solution I mentioned and am just about to start working on is to address any of DOS's confusion with multiple results from the replacement ?s. Hmmmmm also I'm getting kind of excited about AutoIt, it seems much better than PowerShell even.
Re: Input unicode from a text file. (With gawk? ghostmachine
If I try this
Code: Select all
if "し"=="こ" (
echo same
) ELSE (
echo Not same - unicode works
)
It depends on how I save the file.
As ANSI, I got "same"
As Unicode, I got "Not same - unicode works"
And if I type or more the file I got
Code: Select all
if "Òüù"=="Òüô" (
I have batch problems with simple German characters like äöü (which exist in some codepages), so how should it handle "real" unicode characters?
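The two results above can be reproduced outside of batch. This is a sketch under assumptions: that "Unicode" here meant saving as UTF-8, that the console uses code page 850, and that the ANSI save used cp1252 with unsupported characters replaced by ?. The helper names are hypothetical.

```python
def as_seen_by_console(text, saved_as="utf-8", console_codepage="cp850"):
    """Decode the saved bytes the way a console using that code page would."""
    return text.encode(saved_as).decode(console_codepage)

def saved_as_ansi(text, ansi_codepage="cp1252"):
    """Encode to an ANSI code page, replacing unsupported characters with ?."""
    return text.encode(ansi_codepage, errors="replace")

if __name__ == "__main__":
    # The UTF-8 bytes of the two kana differ only in the last byte,
    # so the comparison sees different strings ("Not same").
    print(as_seen_by_console("し") == "Òüù")
    print(as_seen_by_console("こ") == "Òüô")
    # Saved as ANSI, both collapse to the same single ? ("same").
    print(saved_as_ansi("し") == saved_as_ansi("こ"))
```

If those assumptions hold, the comparison never sees the kana themselves: it compares three differing bytes in one case and two identical question marks in the other.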
-
- Expert
- Posts: 442
- Joined: 01 Aug 2010 17:13
- Location: Canadian Pacific
- Contact:
Re: Input unicode from a text file. (With gawk? ghostmachine
Ah sorry, maybe it's my fault for not explaining well again. I meant to simply type/paste that test code into Command Prompt, which keeps the unicode intact.
I'm not trying to run unicode batch scripts, or necessarily read unicode text files (which is what I was trying to solve). It's dealing with unicode folder and file names, so I think the problem you're describing isn't really necessary to figure out.
There is some information here about some forms of displaying unicode, but it's irrelevant to my problem.
Anyhow, I'm almost done my script for handling unicode folders and files. I figured out a difficult part of the solution in my sleep again.