JREPL.BAT v8.6 - regex text processor with support for text highlighting and alternate character sets

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#316 Post by dbenham » 25 Aug 2017 06:08

eugenemon wrote:I have tried both codes and now I understand the structure for the capture groups. However, the output file I received shows 3 blank lines when I opened it with notepad++. The file size is now 4 bytes compared to previously (when I did it wrongly) 0kb. I guess that's an improvement. :?

I've tested the code using the example input file you provided, and it worked perfectly. Either the input format differs from what you have posted, or else your source files are in Unicode. JREPL only supports ASCII format (though this solution should work with UTF-8)

eugenemon wrote:Also, I would like to consolidate all output to one file. I read from a previous post that I can use the following to append the result to the same file.
>> output.csv


Sure, you can remove the /O parameter and replace it with redirection. If you do, you may want to introduce a blank line between each output to make the result easier to read.

I would use:

Code: Select all

set "end=output.WriteLine(head+'\r\n'+data+'\r\n')"


Dave Benham

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#317 Post by dbenham » 25 Aug 2017 13:58

Here is JREPL.BAT Version 6.8
JREPL6.8.zip
Also downloaded 119 times from the main release page in 2 weeks
(19.38 KiB) Downloaded 830 times

There are 3 new features:

1) New \c caret escape sequence available with the /X option

JREPL typically requires CALL when used within a batch script. The CALL command has an unfortunate side effect whereby it doubles all quoted caret literals (call echo "^" becomes echo "^^"). This is not an issue when using ^ as a line beginning anchor, because regular expressions "^Beginning of line" and "^^Beginning of line" are functionally equivalent. But it is an issue if ^ is used to negate a character set, or as a string literal.

call jrepl "[^ ]" ... is intended to match anything that is not a space, but instead it matches anything that is not a space or a caret.

Prior to v6.8, there were only two solutions:
- Put the find and replace strings in variables and use the /V option, which may not be convenient
- Or else use the \x5E escape sequence, which is difficult to remember.

Now with v6.8 you can add the /X option and use \c.

Code: Select all

@echo off
setlocal
set "str=A B C ^ 1 2 3"
:: Desired operation: substitute . for all characters except space.
 
:: Caret doubling issue demonstration:
call jrepl "[^ ]" "." /s str
 
:: Solution with /X option and \c escape sequence:
call jrepl "[\c ]" "." /x /s str
--OUTPUT--

Code: Select all

. . . ^ . . .
. . . . . . .
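
The difference between the two results comes purely from the regular expression, so it can be reproduced outside JREPL. The following is a minimal sketch in plain JavaScript (not JREPL itself), assuming only that CALL has turned the quoted `[^ ]` into `[^^ ]` before the regex is built:

```javascript
// Plain JavaScript illustration (not JREPL): how a doubled caret changes
// the meaning of a negated character set.
var str = "A B C ^ 1 2 3";

// What CALL turns "[^ ]" into: a negated set excluding caret AND space,
// so the literal caret in the input is left untouched.
var doubledCaret = str.replace(/[^^ ]/g, ".");

// What was intended: a negated set excluding only space.
var intended = str.replace(/[^ ]/g, ".");

console.log(doubledCaret); // . . . ^ . . .
console.log(intended);     // . . . . . . .
```

The first result matches the "caret doubling issue" line of the output above, and the second matches the \c solution.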


2) New /APP option for use with /O causes output to be appended to any pre-existing output file, rather than overwriting it.

jrepl ... /O outFile /APP is equivalent to jrepl ... >>outFile

However, the /APP option does provide one new function in that it can also be used with /F inFile /O - /APP, which allows new content (the result) to be appended to the original source file.


3) New openOutput() JScript function allows the destination of output to change midstream.

Code: Select all

      openOutput( fileName [,appendBoolean] )

               Open a new TextStream object for writing and assign it to the
               output variable. If appendBoolean is truthy, then open the file
               for appending.

               If fileName is falsey, then assign output to stdout.

               All subsequent output will be written to the new destination.

               Any prior output file is automatically closed.

For example, this StackOverflow question wanted to split an input file into multiple output files, breaking at each HD record. The name of each output file is extracted from the HD line, and any lines prior to the first HD line are discarded.

source.txt

Code: Select all

File Date Source Target
HD|out1.txt|Field 2|Field 3
ITEM 1|Other fields 1
ITEM 2|Other fields 2
HD|out2.txt|Field A|Field B
ITEM A|Other fields A
ITEM B|Other fields B
ITEM C|Other fields C

Code: Select all

jrepl "^HD\|([^|]+)" "openOutput($1);$txt=$0" /jq /f "source.txt" >nul
--RESULT--
out1.txt

Code: Select all

HD|out1.txt|Field 2|Field 3
ITEM 1|Other fields 1
ITEM 2|Other fields 2
out2.txt

Code: Select all

HD|out2.txt|Field A|Field B
ITEM A|Other fields A
ITEM B|Other fields B
ITEM C|Other fields C
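
To make the control flow of that one-liner easier to follow, here is a rough sketch of the same splitting logic in plain JavaScript. It is only an illustration: `splitByHeader` is a hypothetical helper, and the output is collected in memory rather than written through openOutput().

```javascript
// Hypothetical helper: mimics the effect of the JREPL command on an array
// of lines, collecting each "file" in memory instead of calling openOutput().
function splitByHeader(lines) {
  var headerRe = /^HD\|([^|]+)/;  // same regex as the JREPL Find string
  var files = {};                 // output name -> array of lines
  var current = null;             // null stands in for stdout sent to nul
  for (var i = 0; i < lines.length; i++) {
    var m = headerRe.exec(lines[i]);
    if (m) {
      current = m[1];             // corresponds to openOutput($1)
      files[current] = [];        // no append flag: start the file afresh
    }
    if (current !== null) {
      files[current].push(lines[i]);
    }
    // Lines seen before the first HD record are simply dropped.
  }
  return files;
}

var source = [
  "File Date Source Target",
  "HD|out1.txt|Field 2|Field 3",
  "ITEM 1|Other fields 1",
  "ITEM 2|Other fields 2",
  "HD|out2.txt|Field A|Field B",
  "ITEM A|Other fields A",
  "ITEM B|Other fields B",
  "ITEM C|Other fields C"
];
var result = splitByHeader(source);
console.log(result["out1.txt"].join("\n"));
console.log(result["out2.txt"].join("\n"));
```

The header line itself lands in the new destination because the switch happens before the line is written, which is why each output file above starts with its HD record.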


Dave Benham

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

JREPL.BAT v7.0 - regex text processor now with Unicode and XRegExp support

#318 Post by dbenham » 07 Sep 2017 21:46

Here is version 7.3 - A major new release with Unicode and XRegExp support
JREPL7.3.zip
Version 7.3 was downloaded 23 times from the main release page over 2 days while it was the current version.
(22.46 KiB) Downloaded 837 times
Bugged v7.1 was downloaded 172 times in 15 days

Rather than write a new summary of the changes, I will post the relevant built in help text to catalog the enhancements.

Summary of changes

Code: Select all

>jrepl /?history

    2017-09-23 v7.3: Fixed /O - support for ADO input.
    2017-09-23 v7.2: Improved documentation of new 7.0 features.
                     Bug fix - /T FILE ADO support was broken
    2017-09-08 v7.1: Bug fix - v7.0 failed if Find or Replace contained )
    2017-09-08 v7.0: Added /XREG and /TFLAG for XRegExp regex support.
                     Added /UTF for UTF-16LE support.
                     Added /X support for the \u{N} unicode escape sequence.
                     Added |CharSet syntax for file names to allow reading
                     and writing via ADO with a specified character set.
                     Exposed the fso FileSystemObject to user JScript.
                     Augmented openOutput for Unicode and ADO support.
... <truncated>


Native UTF-16 Little Endian (UTF-16LE) support for input and output

Code: Select all

>jrepl /?/utf

      /UTF - All input and output encodings are Unicode UTF-16 Little
            Endian (UTF-16LE). This includes stdin and stdout. The only
            exceptions are /JLIB and /XREG files, which are still read
            as ASCII.

            The \xFF\xFE BOM is optional for input.

            Output files will automatically have the \xFF\xFE BOM inserted.
            But stdout will not have the BOM.

            Extended ASCII escape sequences (\x80 - \xFF) should not be used
            with /UTF combined with /X.

            Regular expression support of Unicode can be improved by using
            the /XREG option.

            Variable values are no longer written to temporary files when
            /X is used if /UTF is also used.

            Unfortunately, /UTF is incompatible with /RTN.
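
For readers unfamiliar with the byte layout the help text refers to, the sketch below shows what UTF-16LE with the \xFF\xFE BOM looks like at the byte level. It is plain JavaScript and does not reflect how JREPL or CSCRIPT actually write files; `toUtf16le` is a hypothetical helper.

```javascript
// Hypothetical helper: encode a JavaScript string as UTF-16LE bytes,
// optionally prefixed with the FF FE byte order mark.
function toUtf16le(str, withBom) {
  var bytes = withBom ? [0xFF, 0xFE] : [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);  // one UTF-16 code unit
    bytes.push(unit & 0xFF);       // low byte first (little endian)
    bytes.push(unit >> 8);         // then the high byte
  }
  return bytes;
}

// As in an output file: BOM present.
var fileBytes = toUtf16le("A\u20AC", true);    // FF FE 41 00 AC 20
// As on stdout: same encoding, no BOM.
var stdoutBytes = toUtf16le("A\u20AC", false); // 41 00 AC 20

console.log(fileBytes.join(","));
console.log(stdoutBytes.join(","));
```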


Read and write files using virtually any character set (including UTF-8) via ADO
A list of valid character set names and their corresponding code page can be found at https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

Code: Select all

>jrepl /?/f & jrepl /?/o & jrepl /?/t

      /F InFile[|CharSet]

            Input is read from file InFile instead of stdin.

        ->  If |CharSet (internet character set name) is appended to InFile,
        ->  then the file is opened via ADO using the specified CharSet value.
        ->  JREPL still recognizes both \n and \r\n as input line terminators
        ->  when using ADO. Both ADO and the CharSet must be available on the
        ->  local system.


      /O OutFile[|CharSet]

            Output is written to file OutFile instead of stdout. Any existing
            OutFile is overwritten unless the /APP option is also used.

        ->  If |CharSet (internet character set name) is appended to OutFile,
        ->  then the file is opened via ADO using the specified CharSet value.
        ->  The output line terminator still defaults to \r\n when using ADO,
        ->  and may be changed to \n with the /U option. Both ADO and the
        ->  CharSet must be available on the local system.

           If /F InFile is also used, then an OutFile value of "-" overwrites
           the original InFile with the output, preserving the character set.
           The output is first written to a temporary file with the same path
           and name, with .new appended. Upon completion, the temp file is
           moved to replace the InFile. It is not valid to use "-|CharSet"


      /T DelimiterChar
      /T FILE

            The /T option is very similar to the Oracle Translate() function,
            or the unix tr command, or the sed y command.

            The Search represents a set of search expressions, and Replace
            is a like sized set of replacement expressions. Expressions are
            delimited by DelimiterChar (a single character). If DelimiterChar
            is an empty string, then each character is treated as its own
            expression. The /L option is implicitly set if DelimiterChar is
            empty. Escape sequences are interpreted after the search and
            replace strings are split into expressions, so escape sequences
            cannot be used without a delimiter.

            An alternate syntax is to specify the word FILE instead of a
            DelimiterChar, in which case the Search and Replace parameters
            specify files that contain the search and replace expressions,
        ->  one expression per line. Each file can be opened via ADO if
        ->  |CharSet (internet character set name) is appended to the file
            name. Note that the /V option does not apply to Search and Replace
            if /T FILE is used.
           
            ... <truncated>
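
As a rough picture of the simplest /T case (empty DelimiterChar, so every character is its own literal expression), here is a tr-style sketch in plain JavaScript. `translateChars` is a hypothetical helper and covers only single-character, literal translation; it does not attempt the regex, delimiter, or FILE forms described above.

```javascript
// Hypothetical helper: tr-style translation where the i-th character of
// `search` is replaced by the i-th character of `replace`.
function translateChars(text, search, replace) {
  if (search.length !== replace.length) {
    throw new Error("search and replace must be the same length");
  }
  var map = {};
  for (var i = 0; i < search.length; i++) {
    map[search.charAt(i)] = replace.charAt(i);
  }
  var out = "";
  for (var j = 0; j < text.length; j++) {
    var c = text.charAt(j);
    // Characters without a mapping pass through unchanged.
    out += (map[c] !== undefined) ? map[c] : c;
  }
  return out;
}

var translated = translateChars("hello", "el", "ip");
console.log(translated); // hippo
```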


Add Unicode support to objects available to user supplied JScript

Code: Select all

>jrepl /?jscript

  The following global JScript variables/objects/functions are available for
  use in JScript code associated with the /Jxxx options.
   ... <truncated>

      input  - The TextStream object from which input is read.
               This may be stdin or a file.

               If the file was opened by ADO with |CharSet, then input is
               an object that partially emulates a TextStream object, with
               a private ADO Stream doing the actual work. The following
               public members are available to the ADO object:

                  Property           Method
                  -------------      -----------------------------------
                  AtEndOfStream      Read
                                     ReadLine
                                     SkipLine
                                     Write
                                     WriteLine
                                     Close

      output - The TextStream object to which the output is written.
               This may be stdout or a file. ... <truncated>

               If the file was opened by ADO with |CharSet, then output is
               an object that partially emulates a TextStream object (see the
               input object).

      openOutput( fileName[|CharSet] [,appendBoolean [,utfBoolean]] )

               Open a new TextStream object for writing and assign it to the
               output variable. If appendBoolean is truthy, then open the file
               for appending.

               If |CharSet is appended to the fileName, then open the file
               using ADO and the specified internet character set name. The
               output variable will be set to an object that partially
               emulates a TextStream object (see the input object).

               If utfBoolean is truthy, then output is encoded as unicode
               (UTF-16LE). The unicode file will automatically have the BOM
               unless opened for appending. The utfBoolean argument is ignored
               if |CharSet is also specified.

               If fileName is falsey, then output is written to stdout.

               All subsequent output will be written to the new destination.

               Any prior output file is automatically closed.

... <truncated>


New escape sequence \u{N} for access to any Unicode code point, including "Astral" (supplemental) planes

Code: Select all

>jrepl /?/x

      /X  - ... <truncated>

            Also enables extended substitution pattern syntax with support
            for the following escape sequences within the Replace string:
            ... <truncated>

            \u{N}  -  Any Unicode code point where N is 1 to 6 hex digits
           
            Also enables the \q, \c, and \u{N} escape sequences for the Search
            string. The other escape sequences are already standard for a
            regular expression Search string.

            ... <truncated>

            When using \xnn with /X, JREPL assumes your machine defaults to
            Windows-1252, which is generally true for Western Europe and North
            and South America. If your machine doesn't use Windows-1252, then
            you should not use \xnn with values above 7F unless you force
            input and output to use Windows-1252 via /F "inFile|Windows-1252"
            and /O "outFile|Windows-1252" (or /O -).

            Note that without the /X option, \xnn within a regex search string
            maps to unicode code points.           
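
Because JScript strings are sequences of UTF-16 code units, a code point above FFFF (an "Astral" code point) has to be stored as a surrogate pair. The sketch below shows that standard conversion in plain JavaScript. It is not taken from the JREPL source; `codePointToString` is a hypothetical helper.

```javascript
// Hypothetical helper: turn a Unicode code point (0 to 10FFFF) into a
// JavaScript string, using a surrogate pair for astral code points.
function codePointToString(cp) {
  if (cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("code point out of range");
  }
  if (cp <= 0xFFFF) {
    return String.fromCharCode(cp);      // fits in one UTF-16 code unit
  }
  var offset = cp - 0x10000;
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return String.fromCharCode(high, low);
}

var euro = codePointToString(0x20AC);    // what \u{20AC} denotes
var grin = codePointToString(0x1F600);   // what \u{1F600} denotes

console.log(euro.length); // 1
console.log(grin.length); // 2
```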


Enhanced regular expression syntax via XRegExp (xregexp.com)

Code: Select all

>jrepl /?/xreg & jrepl /?/tflag

      /XREG FileList

            Adds support for XRegExp by loading the xregexp files specified
            in FileList before any /JLIB code is loaded. Multiple files are
            delimited by forward slashes (/). If FileList is simply a dot,
            then substitute the value of environment variable XREGEXP for
            the FileList.

            The simplest option is to load "xregexp-all.js", but this
            includes all available XRegExp options and addons, some of which
            are unlikely to be useful to JREPL. Alternatively you can load
            only the specific modules you need, but they must be loaded in the
            correct order.

            Once the XRegExp module(s) are loaded, all user supplied regular
            expressions are created using the XRegExp constructor rather than
            the standard RegExp constructor. Also, XRegExp.install('natives')
            is executed so that many standard regular expression methods are
            overridden by XRegExp methods.

            /XREG requires XRegExp version 2.0.0 or 3.x.x. JREPL will not
            support version 4.x.x (when it is released) because v4.x.x
            is scheduled to drop support for XRegExp.install('natives').

            One of the key features of XRegExp is that it extends the JScript
            regular expression syntax to support named capture groups, as in
            (?<name>anyCapturedExpression). Named groups can be referenced
            in Replace strings as ${name}, and in Replace JScript code as
            $0.name

            The /T option is no longer limited to 99 capture groups when
            /XREG is used. However, /T replace expressions must reference a
            captured group by name if the capture index is 100 or above.

            Every /T search expression is automatically given a capture group
            name of Tn, where n is the 0 based index of the /T expression.

            XRegExp also adds support for non-standard mode flags:
                n - Explicit capture
                s - Dot matches all
                x - Free spacing and line comments
                A - Astral
            These flags can generally be applied by using (?flags) syntax
            at the beginning of any regex. This is true for /P, /INC, /EXC,
            and most Find regular expressions. The one exception is /T doesn't
            support (?flags) at the beginning of the Find string. The /TFLAG
            option should be used to specify XRegExp flags for use with /T.

            XRegExp also improves regular expression support for Unicode via
            \p{Category}, \p{Script}, \p{InBlock}, \p{Property} escape
            sequences, as well as the negated forms \P{...} and \p{^...}.
            Note that example usage on xregexp.com shows use of doubled back
            slashes like \\p{...}. But JREPL automatically does the doubling
            for you, so you should use \p{...} instead.

            See xregexp.com for more information about the capabilities of
            XRegExp, and for links to download XRegExp.


      /TFLAG Flags

            Used to specify XRegExp non-standard mode flags for use with /T.
            /TFLAG is ignored unless both /T and /XREG are used.
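
To give a feel for what named capture groups buy you, here is a small sketch using the native named groups of modern JavaScript engines (ES2018). This is only an analogue: it does not load XRegExp, and the native replacement syntax is `$<name>`, whereas the help text above says XRegExp under JREPL uses ${name} in Replace strings and $0.name in Replace JScript code.

```javascript
// Native ES2018 named groups, used here as a stand-in for XRegExp's
// (?<name>...) support. Requires a modern engine, not classic JScript.
var isoDate = "2017-09-08";
var dateRe = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// Reference the groups by name in a replacement string.
var byString = isoDate.replace(dateRe, "$<day>/$<month>/$<year>");

// Reference the groups by name from code.
var match = dateRe.exec(isoDate);
var byCode = match.groups.day + "/" + match.groups.month + "/" + match.groups.year;

console.log(byString); // 08/09/2017
console.log(byCode);   // 08/09/2017
```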


The enhancements were implemented in a fairly surgical manner, but they have a profound impact on the entire functioning of the utility. I have only done limited testing, so I won't be surprised if I have introduced some bugs.

I encourage everyone to try out the new features, and please report any problems that you find.

Dave Benham

eugenemon
Posts: 3
Joined: 24 Aug 2017 02:09

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#319 Post by eugenemon » 07 Sep 2017 23:51

dbenham wrote:I've tested the code using the example input file you provided, and it worked perfectly. Either the input format differs from what you have posted, or else your source files are in Unicode. JREPL only supports ASCII format (though this solution should work with UTF-8)


Omg, it's amazing! It worked. Sorry for the late reply. I output all txt files into ASCII encoding and JREPL read them perfectly. I am finally able to get the right data out from the mess. Thank you so much for this.

dbenham wrote:Here is version 7.0 - A major new release with Unicode and XRegExp support


Does this mean unicode can now be recognized?

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#320 Post by dbenham » 08 Sep 2017 02:59

eugenemon wrote:
dbenham wrote:Here is version 7.0 - A major new release with Unicode and XRegExp support


Does this mean unicode can now be recognized?

Yes it does, as long as you know the encoding and use the correct options. :D
Go up two posts for details.

If your source file is encoded as UTF-16LE, then simply add the /UTF option to your command. JREPL will use the native CSCRIPT ability to read and write UTF-16 little endian format.

If your source file uses some other encoding, then you must use the "fileName|CharSet" ADO syntax for both the input and output. But this only works if your machine has ADO as well as the correct character set installed. Assuming your source uses UTF-16BE encoding, you would use /F "%%F|UTF-16BE" /O "%%~nF.csv|UTF-16BE".

When using ADO, the output encoding can differ from the input. For example, both /F "%%F|UTF-16BE" /O "%%~nF.csv|UTF-16LE" and /F "%%F|UTF-16BE" /O "%%~nF.csv" /UTF would read UTF-16BE and write UTF-16LE. Other likely output formats would be UTF-8 or US-ASCII. But bear in mind that US-ASCII will corrupt the output if your source has any non-ASCII characters.


Dave Benham

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#321 Post by dbenham » 08 Sep 2017 08:33

Ugh - I had a stupid nasty bug that caused v7.0 to fail if the Find or Replace strings contained )
I think it would also fail if there was a poison character :evil: :oops:

I've fixed the bug and updated my v7.0 post to v7.1.


Dave Benham

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#322 Post by aGerman » 08 Sep 2017 16:56

Great job, Dave!

I don't try to fully understand your code. Anyway, there is a section that draws my attention beginning at line 1958

Code: Select all

       case 'x80': return '\u20AC';
Assuming that byte 0x80 is the Euro currency symbol actually means that you expect the incoming encoding to be extended Latin such as Windows-1250, -1252 and the like. There are a lot of code pages (in fact the majority) where 0x80 is a completely different character (like Windows-1251, OEM code pages, etc.). As I said - I don't understand your code and maybe you're aware of it ...

Steffen

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#323 Post by dbenham » 08 Sep 2017 17:31

All of those 'xNN' translations in the decode function have been there since the beginning to support processing of extended ASCII. They have nothing to do with the Unicode support.

CSCRIPT always interprets extended ASCII bytes the same way (perhaps Windows-1250, but not sure), regardless of what the cmd.exe active code page is.

When CSCRIPT reads a 0x80 byte while in ASCII mode, it is interpreted as unicode code point 20AC. But a '\x80' escape sequence would be unicode '0080', a totally different character. So I need to translate the '\x80' escape sequence to get the expected extended ASCII result. When the character is written back out as ASCII, CSCRIPT translates back into a 0x80 byte.

That is why my documentation states you should not use the '\xNN' escape sequence when working with unicode input/output. You should use \uNNNN instead, (or \u{N...} with /X). Though technically, my extended ASCII translation only occurs when you use the /X option. So you could safely use '\xNN' with Unicode as long as you don't use the /X option.

In summary:

Unicode input/output - only use \xNN without /X
extended ASCII input/output - only use \xNN with /X


Dave Benham

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#324 Post by aGerman » 09 Sep 2017 11:45

It's quite difficult for me to explain. I'll try again.
If you say "extended ASCII" you obviously refer to a certain ANSI code page.

For you and for me the default ANSI code page is Windows-1252. That means if we write an € (Euro symbol) in Notepad and save it ANSI-encoded then Byte 0x80 was written to the file. The Unicode code point for the Euro symbol is u20AC. So far everything seems to be okay with the line that I quoted above.

A Serbian writes character Ђ (capital letter dje) in Notepad and saves it. His ANSI setting defaults to Windows-1251 and thus, the same Byte 0x80 was written to his file as for our Euro symbol in Windows-1252. But now your hard-coded conversion to u20AC is wrong because the code point for Ђ is u0402.

I just hope you understand my point, Dave. I believe a hard-coded conversion is ambiguous and the only reason why I write this comment is because your tool is great and it would be a pity if some features can't be properly used in other environments.
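
The ambiguity can be demonstrated directly. The sketch below decodes the same single byte under the two code pages mentioned above. It assumes a JavaScript runtime that provides `TextDecoder` with the legacy "windows-1251" and "windows-1252" encodings (for example a current Node.js build with full ICU); it has nothing to do with how CSCRIPT itself decodes input.

```javascript
// Same byte, two code pages. Assumes TextDecoder with legacy encodings
// is available in the runtime (not available in classic JScript).
var byte80 = new Uint8Array([0x80]);

var as1252 = new TextDecoder("windows-1252").decode(byte80); // Euro sign
var as1251 = new TextDecoder("windows-1251").decode(byte80); // Cyrillic Dje

// Format the first code unit of a string as U+XXXX.
function toHex(ch) {
  return "U+" + ("0000" + ch.charCodeAt(0).toString(16).toUpperCase()).slice(-4);
}

console.log(toHex(as1252)); // U+20AC
console.log(toHex(as1251)); // U+0402
```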

Steffen

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#325 Post by dbenham » 09 Sep 2017 12:20

aGerman wrote:A Serbian writes character Ђ (capital letter dje) in Notepad and saves it. His ANSI setting defaults to Windows-1251 and thus, the same Byte 0x80 was written to his file as for our Euro symbol in Windows-1252. But now your hard-coded conversion to u20AC is wrong because the code point for Ђ is u0402.

I just hope you understand my point, Dave. I believe a hard-coded conversion is ambiguous and the only reason why I write this comment is because your tool is great and it would be a pity if some features can't be properly used in other environments.

I absolutely understand your point. Now I'll try to do a better job explaining why my code is structured the way it is.

JScript (CSCRIPT) stores all strings internally as UTF-16LE. When a CSCRIPT ASCII TextStream reads a 0x80 byte, it always treats it as code point u20AC, regardless of what code page cmd.exe is using. Somewhere on StackOverflow I saw a post claiming VBScript (CSCRIPT in general) uses Windows-1250. So JScript interprets the code incorrectly. But that is OK, as long as the same translation is used when the character is written back out (which it does). Now it will not look correct if JREPL writes directly to the console, but if saved as a file, then it will indeed be a 0x80 byte.

So if a user specifies a character using a \x80 escape sequence, then I need to interpret that escape sequence the same way that CSCRIPT interprets a 0x80 byte.

Hope that helps.


Dave

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#326 Post by aGerman » 09 Sep 2017 12:47

dbenham wrote:When a CSCRIPT ASCII TextStream reads a 0x80 byte, it always treats it as code point u20AC ... Somewhere on StackOverflow I saw a post claiming VBScript (CSCRIPT in general) uses Windows-1250.

If this is true for all environments then I would agree. In that case you do it the right way for the bug in CSCRIPT :lol: But I'm still afraid that CSCRIPT determines the default ANSI code page of your computer settings (which has nothing to do with the default or temporary OEM code page in CMD).

Thanks for your explanation!

Steffen

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#327 Post by dbenham » 09 Sep 2017 14:58

aGerman wrote:But I'm still afraid that CSCRIPT determines the default ANSI code page of your computer settings (which has nothing to do with the default or temporary OEM code page in CMD).
Ooh, that is an interesting idea that would indeed cause problems. :!: :(

I've had a hard time finding definitive info about how CSCRIPT interprets extended ASCII (ANSI).

Is that a theory on your part? Or is that something you have experience with, or have you read (semi) official documentation?

If you are right, then I don't know what I can do. Even if I knew what code page was actually being used, it wouldn't do me any good. I would need to know how each byte code is being interpreted by CSCRIPT - the mapping from byte code to unicode. And for variable or multi-byte character sets like Japanese, that could be particularly troublesome, especially if the source code has invalid character sequences. Invalid character sequences can definitely occur with binary data if it is interpreted as a text stream (which is what JREPL does)

I must admit that my statement claiming "always..." can't possibly be true, because I did get one PM from a user with a Japanese locale, using code page 932, that said JREPL was corrupting binary data when using the \xnn escape sequences. The find replace was supposed to preserve the total length, but the output was shortened. Your theory could explain why it failed for that person.

Hopefully the new /UTF option and/or the ADO capability will alleviate any problems that people might have. Users can specify exactly what character set they are using, and stick to the \unnnn or \u{N} escape sequences.

For editing binary data, it should be possible to use ADO with the exact character set that matches my decode function. I think it is time to get a definitive answer on the code page that my code is using.


Dave Benham

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#328 Post by aGerman » 09 Sep 2017 15:32

dbenham wrote:Is that a theory on your part? Or is that something you have experience with, or have you read (semi) official documentation?

At the moment it's only a theory based on my experiences with Windows scripts. I think it could be proved though. You read that CSCRIPT always uses Windows-1250 but your default is Windows-1252. You only have to find the characters that differ and run some tests against them.

And yes, I also think that the ADO streams should be able to solve the problem even if it could be more difficult. Most likely you have to iterate through the byte array of a binary stream to replace bytes.

Steffen

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#329 Post by aGerman » 09 Sep 2017 16:30

It's really hard to find something about how CSCRIPT handles character encoding. There is a hint in a post of Eric Lippert (who was one of the developers working on the WSH at that time).
https://blogs.msdn.microsoft.com/ericlippert/2004/02/11/unicode-output-and-the-windows-script-host/
He wrote that they use Windows API functions such as MultiByteToWideChar and WideCharToMultiByte which was not surprising for me (I use the same as core functions in my CONVERTCP utility). More interesting was that he mentioned to use the CP_OEMCP constant (or even CP_ACP as the first commenter assumed). Those are macros defined with 1 for CP_OEMCP and 0 for CP_ACP. They are no real code page identifiers. The system default settings are used instead.

Steffen

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT v7.1 - regex text processor now with Unicode and XRegExp support

#330 Post by dbenham » 09 Sep 2017 22:51

Mystery solved :D
CSCRIPT on my machine definitely interprets extended ASCII as Windows-1252.

And if you use ADO with Windows-1252 for both input and output, plus the /X and /M options, then you can safely manipulate binary files using any \xnn escape sequence, regardless what default code page your machine uses.

I tested by first creating a 256-byte binary file called bytes.txt, containing bytes 0x00 through 0xFF.

I then used JREPL to read the file using the default encoding, and wrote the result out as UTF-16BE via ADO.

Code: Select all

jrepl "^" "" /m /f "bytes.txt" /o "bytes-default.txt|UTF-16BE"

Next I did nearly the same thing, except this time I explicitly read the input as Windows-1252 via ADO

Code: Select all

jrepl "^" "" /m /f "bytes.txt|Windows-1252" /o "bytes-1252.txt|UTF-16BE"

I used FC to look for differences, and the two outputs were identical :D

The only byte codes that do not translate to the same unicode code point are byte codes in the range 0x80 through 0x9F.
Here are the unicode code points that map to that range:

Code: Select all

      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
8  20AC 0081 201A 0192 201E 2026 2020 2021 02C6 2030 0160 2039 0152 008D 017D 008F
9  0090 2018 2019 201C 201D 2022 2013 2014 02DC 2122 0161 203A 0153 009D 017E 0178

I looked at the decode escape sequence translations that JREPL uses with the /X option, and they match the results above.
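
For illustration, the mapping in the table above can be written as a small lookup. The sketch below is not the actual JREPL decode function. `byteToCodePoint` and `decodeHexEscape` are hypothetical names, and only the 32 code points are taken from the table.

```javascript
// Code points for bytes 0x80..0x9F, copied from the table above.
var HIGH_1252 = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// Hypothetical helper: map a byte value to the code point it represents
// in Windows-1252. Outside 0x80..0x9F the byte value is the code point.
function byteToCodePoint(b) {
  if (b < 0 || b > 0xFF) {
    throw new RangeError("not a byte: " + b);
  }
  return (b >= 0x80 && b <= 0x9F) ? HIGH_1252[b - 0x80] : b;
}

// Hypothetical stand-in for translating a "\xNN" escape the same way
// a 0xNN byte would be read from a Windows-1252 text stream.
function decodeHexEscape(escapeText) {
  var m = /^\\x([0-9A-Fa-f]{2})$/.exec(escapeText);
  if (!m) {
    throw new Error("not a \\xNN escape: " + escapeText);
  }
  return String.fromCharCode(byteToCodePoint(parseInt(m[1], 16)));
}

console.log(decodeHexEscape("\\x80") === "\u20AC"); // true
console.log(decodeHexEscape("\\x41") === "A");      // true
```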

Thanks for your help Steffen.


Dave
