JREPL.BAT v8.6 - regex text processor with support for text highlighting and alternate character sets


#1 Post by **dbenham** » 14 Nov 2014 14:52

Here is the current version of JREPL.BAT as of 2020-07-30. Read the subsequent posts for examples of usage and to follow the development history.

JREPL8.6.zip: (31.48 KiB) Downloaded 18355 times

Full documentation is available from the command line via JREPL /?, or JREPL /?? for paged help.
The entire documentation is also listed at the top of the code, after the history.
Help is also available for specific topics - use JREPL /?HELP for more info.

Unfortunately the code has gotten too big to post on this site as viewable text. More than half the code is documentation.

Index of releases
v8.6 2020-07-30: Extended /K /R and /MATCH syntax to support counting matches or rejects instead of printing them. Also added the counter JScript global variable for user JScript.
v8.5 2020-02-29: Added the /EOL option and the eol JScript global variable for user JScript
v8.4 2019-09-13: Bug fix for /RTN - use UTF-8 and CHCP 65001 to preserve Unicode when storing the result in the environment variable. Also fixed v8.0 bug that broke /K,/R with /INC,/EXC.
v8.1 2019-05-19: Add /VT to enable ANSI escape sequences without having to modify the registry.
v8.0 2019-05-15: Add /H /HON /HOFF and /HU options for highlighting replaced or matched text. Fixed behavior of /OFF when combined with /P.
v7.15 2018-10-20: Extend /INC and /EXC to support 'String' literals
v7.14 2018-10-15: Bug fix to allow user defined variables str and/or obj, and to hide xbytes from user code.
v7.13 2018-07-18: Added ability to create XBYTES.DAT via ADO in case CERTUTIL is missing, fixed minor bug related to XBYTES.DAT that was sometimes leaving behind one or more copies of XBYTES.HEX, and fixed major bug with /INC and /EXC regular expression blocks.
v7.11 2018-03-26: Added functionality to overwrite input file as ADO UTF without BOM (added support for /O "-|UTF-?|NB").
v7.10 2018-03-14: Now can block BOM in ADO output files by appending |NB to |CharSet in the /O option and OpenOutput() function.
v7.9 2017-11-23: Allow escape sequences with /T "", New /PREPL option to improve /P functionality, plus two minor bug fixes
v7.8 2017-11-13: Emulate FINDSTR /G; Split /X into /XSEQ and /XFILE; New \x{nn-mm} escape sequence
v7.7 2017-10-24: Fixed broken Microsoft links; Allow /O "-|CharSet"; fix decode() bug - make CharSet arg optional
v7.6 2017-10-08: New /?CHARSET/[Query] list character sets help option, New /?CHARSET and /?XREGEXP web page help options, and minor bug fix to output.WriteLine()
v7.4 2017-09-25: Modified /X \xnn extended ASCII escape sequence to support any single byte character set, not just Windows-1252
v7.3 2017-09-23: Added ability to select character set for input and output, including Unicode. Also added support for XRegExp enhanced regular expressions
v6.8 2017-08-25: Added \c escape sequence for /X, added /APP option, and added openOutput() function.
v6.7 2017-04-09: Corrected /OFF, /EXC, and /INC documentation, plus spelling fixes.
v6.6 2016-12-23: New /RTN option to store result in a variable. Improved support for extended ASCII. Fixed documentation error introduced by version 6.0
v6.4 2016-11-01: Improved performance by dynamically building optimized main loop based on chosen options
v6.2 2016-10-13: Added /K /R /MATCH /P /PFLAG /JQ /JMATCHQ. Improved /INC /EXC. Improved Performance.
v5.2 2016-09-27: Added /T FILE option
v5.0 2016-09-18: Added /U option
v4.5 2016-08-03: Added /D option
v4.4 2016-08-02: Bug fix for /C when last line missing \n
v4.3 2016-07-30: Added rpad() and improved lpad()
v4.2 2016-06-24: Improved /?Options help
v4.1 2016-06-23: Added help for single option/topic as well as /T examples
v4.0 2016-06-19: Added /INC and /EXC
v3.8 2016-03-27: Bug fix - hide some additional internal variables
v3.7 2016-01-14: Bug fix for \xnn and \unnnn in regex with /X
v3.6 2015-07-15: Added /?? paged help option
v3.5 2015-06-12: /T bug fix for $n and $nn when not /J or /JMATCH
v3.4 2015-01-22: "Hide" internal /TEST variable (instead of TEST)
v3.3 2014-12-24: Added /JLIB plus some bug fixes
v3.0 2014-11-23: Added /JBEGLN and /JENDLN
v2.2 2014-11-21: Added /T option
v1.0 2014-11-14: Initial release

Dave Benham

#2 Post by **dbenham** » 14 Nov 2014 15:26

JREPL.BAT is a powerful, general purpose, command line, regular expression text processor for ASCII data. It is a hybrid JScript/batch script that should run on any Windows machine from XP onward.

Here is a trivial example that will substitute "blue" for every occurrence of the word "red" within a file. Changes are made directly to the file:

Code: Select all

jrepl "\bred\b" "blue" /f test.txt /o -
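As a rough illustration of what that search pattern does, here is a minimal JavaScript sketch (JREPL is a hybrid JScript/batch script, and JScript regular expressions share this basic syntax). The helper name `replaceWord` is hypothetical and not part of JREPL; the sketch only mimics the substitution, not the file handling done by /f and /o.

```javascript
// Hypothetical helper, not part of JREPL: replace every whole-word "red" with "blue".
function replaceWord(text) {
  // \b matches a word boundary, so "red" inside "redo" or "bored" is left alone.
  // The g flag replaces every occurrence, not just the first one.
  return text.replace(/\bred\b/g, "blue");
}

// "Red" is untouched because the regex is case sensitive.
console.log(replaceWord("a red door, a bored redo, Red"));
```

The word boundaries are what keep the substitution from touching longer words that merely contain "red".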

If the command is used in a batch script, then CALL must be used so that the calling script can continue after JREPL finishes.

Full documentation can be accessed by using the following:

Code: Select all

jrepl /?

JREPL.BAT is a direct descendant of REPL.BAT. A new name was needed because the calling syntax is not backward compatible.

JREPL.BAT uses the same options as REPL.BAT, except instead of concatenating all the options into one string, each option must be listed separately with a slash prefix.

Besides having different calling syntax, JREPL version 1.0 offers the following enhancements over REPL.BAT version 6:

The input file may be specified using /F "file". No need to pipe or redirect the input file.
The output may be sent to a file using /O "file". No need to redirect the output.
/O "-" will replace the original file with the result. The output is first written to a temporary file, and then it is MOVEd to replace the original.
/JMATCH discards all non-matching text, and each match's replacement value is written on a new line. The replacement value is expressed as JScript code.
/JBEG "code" and /JEND "code" run initialization and termination JScript code supplied by the user. User defined global variables can be declared and initialized within the /JBEG code. Then the /J or /JMATCH Replacement code can update the variables upon each match. Finally, the /JEND code can write summarized information upon completion.
/N "width" prefixes each line of output with the corresponding line number from the input. The line numbers may be zero padded to a minimum width.
/OFF "width" prefixes each line of /JMATCH output with the offset within the source line where the match occurred. The offsets may be zero padded to a minimum width.
Expanded the number of predefined global variables/methods/objects that may be used by user supplied JScript code.
All global variables/methods/objects used by the JREPL script itself are hidden behind a single opaque global object named "_g". User supplied code can create identifiers without fear of corrupting the JREPL script as long as the _g object is avoided.
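The begin / per-match / end flow described for /JBEG, /J and /JEND can be sketched in plain JavaScript. This is only an emulation under assumptions: the function `processText` and its parameters are hypothetical names invented for the illustration, and JREPL's real internals are not shown in this thread.

```javascript
// Hypothetical emulation of the begin / per-match / end flow; not JREPL's actual code.
function processText(text, search, begin, onMatch, end) {
  var state = begin();                       // plays the role of the /JBEG code
  var output = text.replace(search, function (match) {
    return onMatch(match, state);            // plays the role of the /J replacement code
  });
  return output + end(state);                // plays the role of the /JEND code
}

var result = processText(
  "red fish, red car\nblue sky, red sun\n",
  /\bred\b/g,
  function () { return { count: 0 }; },                           // initialize a counter
  function (match, state) { state.count++; return match.toUpperCase(); },
  function (state) { return "Matches: " + state.count + "\n"; }   // write a summary
);
console.log(result);
```

The point of the pattern is that state created once at the start is visible to every replacement call and to the final summary step.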

Please let me know if you find any bugs - I've done rudimentary testing, but I do not yet have a regression test plan, and the code has grown substantially since the original version of REPL.BAT. I may have unknowingly broken code that worked previously.

JREPL1.0.zip: (5.79 KiB) Downloaded 9002 times

Dave Benham

#3 Post by **foxidrive** » 14 Nov 2014 17:02

Me find redundant terms: run user supplied initialization and termination JSCRIPT code supplied by the user.

I like the concept of Jrepl but I'm unclear on this point - does it do everything that repl.bat did, except with extra features?

Forgive me for asking, but I've had gastric problems for over 2 months and had less than 4 hours sleep for many of those nights - so my concentration to study complex stuff is sorely lacking. I'm also not very clued up with JScript.

#4 Post by **dbenham** » 14 Nov 2014 17:15

Yes, it does everything that REPL.BAT did, plus more.

Dave Benham

#5 Post by **foxidrive** » 14 Nov 2014 18:41

I assume then that the repl syntax will all work the same way, in Jrepl ?

There is a thread over at computerhope http://www.computerhope.com/forum/index ... #msg919079
where I have attempted to solve the problem using repl, and I think it might be a good question for Jrepl to solve, if you can take the terms from my answer and wrap it in a Jrepl line.

It could be a good example for those of us that need functional examples to learn to use these new tools.

#6 Post by **bars143** » 14 Nov 2014 19:30

Hi,

Can you give me a script that extracts the text between HTML tags, especially the <title> tag?
FINDSTR fails when a <title> tag sits on a single line together with a long run of CSS of up to 41 KB. The FINDSTR error is "...line 2 is too long string..."

Here is an example HTML page (only two lines):

Code: Select all

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Comments2</title><meta name="referrer" content="default" id="meta_referrer" /><noscript><meta http-equiv="X-Frame-Options" content="DENY" /></noscript><style type="text/css">/*<![CDATA[*/.f{background:#dcdee3;}.b .m{background:#f6f7f8;border:0;font-size:small;margin:0;}.m .bh{font-size:x-small;}.m .bf{font-size:x-small;}.bm .dh{font-size:13px;}.bm .dd .dk a,.bm .dd .dk abbr{white-space:nowrap;}.n{margin:0 6px 6px;padding:6px;}.b .n .n{border-color:#e9eaed;margin:6px 0 0;}.bj{display:inline-block;}.bg{margin-top:5px;}.r,.s{margin:5px 0;}.n a,.n a:visited{color:#2b55ad;}.n a:hover,.n a:focus{background:#2b55ad;color:#fff;}.m .n{font-size:small;padding:4px;}.m .bg{margin-top:6px;}.m .r,.m .s{margin:6px 0;}.b .o{background:white;margin:0;}.x{border:1px solid #ccc;margin:5px 0;word-wrap:break-word;}.t{display:block;}.b .t,.b .t:visited{color:#3e4350;}.b .t:focus,.b .t:hover,.b .t:focus .x,.b .t:hover .x{background:#3b5998;color:#fff;}.b a,.b a:visited{color:#3b5998;text-decoration:none;}.b .bk,.b .bk:visited{color:#6d84b4;}.b a:focus,.b a:hover,.b .bk:focus,.b .bk:hover{background-color:#3b5998;color:#fff;}.v{background:#f6f7f8;}.cl{background:#fff;}.b .cz{padding:0;}.b .cp{padding:2px;}.b .w{padding:4px;}.b .y{border:0;border-collapse:collapse;margin:0;padding:0;width:100%;}.b .y tbody,.b .z>tr>td,.b .z>tbody>tr>td,.b .y td.z{vertical-align:top;}.b .bs>tr>td,.b .bs>tbody>tr>td,.b .y td.bs{vertical-align:middle;}.b .y td{padding:0;}.b .y td.cp{padding:2px;}.b .y td.w{padding:4px;}.b .bb{width:100%;}.k{border:0;display:inline-block;vertical-align:top;}i.k

I cut the last line short because it is too big to paste here; the full file is 41 KB.

Did you see "<title>Comments2</title>" in the second (and last) line? After the <title> tags there is "/*<![CDATA[*", so I can use "[" as a delimiter, write the output to a file, and then run FINDSTR again to extract "Comments2". That is how I can solve it, but it is more work and takes more time, as I have thousands of webpages that are not named after their page titles. Not all HTML files have such a long line, though. Some <title> tags are separated from the long line of CSS, like this:

Code: Select all

<!DOCTYPE html>

<html dir="ltr" xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head><link rel="canonical" href="http://msdn.microsoft.com/en-us/library/aa664628(v=vs.71).aspx" />
        <title>1.1 Getting started (C#)</title>
        
        
<meta name="DCS.dcsuri" content="/en-us/library/aa664628(d=default,l=en-us,v=vs.71).aspx" />

<meta name="NormalizedUrl" content="http://msdn.microsoft.com/en-us/library/aa664628(d=default,l=en-us,v=vs.71).aspx" />

The <title> tags above are easy to extract.

And some .mht webpages have the title split across two lines, separated by an equals sign "=", like this:

Code: Select all

From: <Saved by Mozilla 5.0 (Windows)>
Subject: DosTips.com - View topic - Another way to create a line feed
 variable
Date: Thu, 13 Nov 2014 17:24:51 +0800
MIME-Version: 1.0
Content-Type: multipart/related;
   type="text/html";
   boundary="----=_NextPart_000_0000_EA983E61.0B94984F"
X-MAF-Information: Produced By MAF V3.0.3

This is a multi-part message in MIME format.

------=_NextPart_000_0000_EA983E61.0B94984F
Content-Type: text/html;
   charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Content-Location: http://www.dostips.com/forum/viewtopic.php?f=3&t=4439

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.=
w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns=3D"http://www.w3.=
org/1999/xhtml" dir=3D"ltr" xml:lang=3D"en-gb" lang=3D"en-gb"><head>
<meta http-equiv=3D"content-type" content=3D"text/html; charset=3DUTF-8">

<meta http-equiv=3D"content-type" content=3D"text/html; charset=3DUTF-8">
<meta http-equiv=3D"content-language" content=3D"en-gb">
<meta http-equiv=3D"content-style-type" content=3D"text/css">
<meta http-equiv=3D"imagetoolbar" content=3D"no">
<meta name=3D"resource-type" content=3D"document">
<meta name=3D"distribution" content=3D"global">
<meta name=3D"copyright" content=3D"2000, 2002, 2005, 2007 phpBB Group">
<meta name=3D"keywords" content=3D"">
<meta name=3D"description" content=3D"">

<title>DosTips.com - View topic - Another way to create a line feed variabl=
e</title>

<link rel=3D"stylesheet" href=3D"urn:snapshot-9C7CAE89:http://www.dostips.c=
om/forum/styles/avalon/theme/stylesheet.css" type=3D"text/css">
<!--[if IE]>
<link rel=3D"stylesheet" type=3D"text/css" href=3D"./styles/avalon/theme/ie=
7.css" />
<![endif]-->

As shown above, the <title> element spans two lines:

<title>DosTips.com - View topic - Another way to create a line feed variabl=
e</title>

Other .mht files break the line at a different position, for example:

Code: Select all

<title>DosTips.com - View topic - Another way to create a line feed var=
iable</title>

Code: Select all

<title>DosTips.com - View topic - Another way to create a line feed variable</tit=
le>

Code: Select all

<title>DosTips.com - View topic - Another way to create a line feed variable</=
title>

I tried this code, but had no luck with a single long line:

type infile.htm |repl ".*<title>(.*)</title>.*" "$1" >outfile.txt

Can REPL.BAT or JREPL.BAT solve the problems below?

1) Extracting the text between HTML tags from one very long line?
2) Extracting the text between HTML tags (in some .mht webpages) where a long page title is split across two lines by an equals sign (=)?

I need answers to the above, and I am still studying the so-called "regex code".

Anyway, thanks.

Bars

windows xp sp3 32bit user

#7 Post by **dbenham** » 14 Nov 2014 23:50

@foxidrive - I posted solutions at ComputerHope. I also edited the top post on this thread to briefly describe the difference in syntax between JREPL and REPL. It shouldn't take long to get used to the new syntax.

@bars143 - This is really easy (and fast) with JREPL.BAT!

Code: Select all

type test.txt | jrepl "=?\r?\n" "" /m | jrepl "<title>(.*?)</title>" "$1" /jmatch /m

First I strip out all carriage returns and linefeeds. I also strip out = if it precedes the end of a line. Then I simply search for the title tag, capturing the contents. The /JMATCH option preserves only matched text.
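The two pipeline stages can be mirrored in a short JavaScript sketch. The helper `extractTitles` is a hypothetical name used only for illustration; it applies the same two regular expressions to a string in memory instead of piping text through JREPL twice.

```javascript
// Hypothetical helper mirroring the two-step pipeline; not JREPL's actual code.
function extractTitles(text) {
  // Step 1: remove every line break, plus a "=" directly in front of it
  // (the soft line break seen in the .mht samples).
  var joined = text.replace(/=?\r?\n/g, "");
  // Step 2: collect the captured contents of each <title>...</title> pair.
  // The "/" is escaped only because this is a JavaScript regex literal.
  var titles = [];
  var re = /<title>(.*?)<\/title>/g;
  var m;
  while ((m = re.exec(joined)) !== null) {
    titles.push(m[1]);
  }
  return titles;
}

var mht = "<head>\r\n<title>Another way to create a line feed variabl=\r\ne</title>\r\n";
console.log(extractTitles(mht));
```

Because the line breaks are removed first, it does not matter whether the break falls inside the title text or inside the closing tag itself.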

Dave Benham

#8 Post by **bars143** » 15 Nov 2014 01:55

dbenham wrote:@foxidrive - I posted solutions at ComputerHope. I also edited the top post on this thread to briefly describe the difference in syntax between JREPL and REPL. It shouldn't take long to get used to the new syntax.

@bars143 - This is really easy (and fast) with JREPL.BAT
Code: Select all
type test.txt | jrepl "=?\r?\n" "" /m | jrepl "<title>(.*?)</title>" "$1" /jmatch /m
First I strip out all carriage returns and linefeeds. I also strip out = if it precedes the end of a line. Then I simply search for the title tag, capturing the contents. The /JMATCH option preserves only matched text.

Dave Benham

Thanks very much!!

Dave,

It works perfectly on the .mht file and on the .htm file (the one with the single long line).

But I already had a working combination of FIND and REPL.BAT that I was testing on some 15 webpages in one go.
It still needs "[" as a delimiter on the long line. FINDSTR does not work, as it keeps giving the error message
"...line 2 is too long strings...", and it does not work on .mht files with an equals sign "=".

Here is what I made:

Code: Select all

@echo off

setlocal enabledelayedexpansion

rem "xxxx4" is the folder that contains the webpage files to be renamed

set "PD=%~dp0xxxx4\"
dir "%PD%" /b >webpagelist.txt
set "WB=webpagelist.txt"
set count=0
echo %WB%

for /f "delims=" %%a in (' type "!WB!" ') do (
   set "FILE=%%a"
   set "PFILE=!PD!!FILE!"
   set /a count+=1

   rem Cut the long line at the first "[" so that only the title remains.
   rem FINDSTR does not work here, so FIND is used instead.
   for /f "delims=[" %%b in (' type "!PFILE!" ^|find "<title>" ^|repl ".*<title>(.*)</title>.*" "$1" ') do (
      set "c=%%b"
      ren "!PD!!FILE!" "!c!.htm"
   )

   rem If the rename failed because the title is a duplicate, append the counter.
   if exist "!PD!!FILE!" ren "!PD!!FILE!" "!c!_!count!.htm"
)
rem The code above does not work on .mht files, where "=" splits the title across two lines.
pause

I will try your script soon on the remaining webpage files.

Bars

#9 Post by **aGerman** » 15 Nov 2014 06:21

Bars

Use regular expressions on markup languages (like HTML, XML, and the like) very carefully. They may fail! Enjoy reading that legendary post at SO:
The <center> cannot hold it is too late.
It is better to use the Document Object Model (DOM). JScript is suitable for that task as well. Maybe if I find the time I'll write a command line tool to work around XML issues.

Dave

Great work! I already used repl.bat very successfully...

Regards
aGerman

#10 Post by **bars143** » 15 Nov 2014 11:47

aGerman, thanks for the link. Regards back to you.

Dave, I have a problem: some webpage titles contain one or more of these characters: "|" "?" ":" "\" "/" "=" "<" ">"

I can delete the characters "\" "|" ":" "?" "=" by escaping them with a leading "\", e.g. "\?" "\:" etc.

But I cannot delete the following, and do not know how: "/" "<" ">"

Here is my code:

Code: Select all

type test.txt | jrepl "[\|\?\:\=\\]" "" /m | jrepl "<title>(.*?)</title>" "$1" /jmatch /m >output.txt

Could you give me a script that removes "/" "<" ">"? Maybe you can add more characters that I do not know about.
If you can, then I could instead rename "/" to [slash], "<" to [lessthan], and ">" to [greaterthan] for use as part of the filename.
I always use the webpage title as the basis for my search queries.

Thanks,

Bars

#11 Post by **dbenham** » 15 Nov 2014 13:10

bars143 wrote:Dave, I have a problem: some webpage titles contain one or more of these characters: "|" "?" ":" "\" "/" "=" "<" ">"

I can delete the characters "\" "|" ":" "?" "=" by escaping them with a leading "\", e.g. "\?" "\:" etc.

But I cannot delete the following, and do not know how: "/" "<" ">"

Those chars are needed to identify the <title> </title> tags, so they must be deleted after you extract the title, not before.

Code: Select all

type test.txt | jrepl "=?\r?\n" "" /m | jrepl "<title>(.*?)</title>" "$1" /jmatch /m | jrepl "[|?:/\\=<>]" ""
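Why the order matters can be seen in a small JavaScript sketch. The helper `stripChars` is a hypothetical name; it applies the same character class as the last stage of the pipeline to a string in memory.

```javascript
// Hypothetical helper: delete the unwanted characters from a string.
function stripChars(text) {
  // Inside the character class only the backslash needs escaping for the regex;
  // the forward slash is escaped as well because this is a JavaScript regex literal.
  return text.replace(/[|?:\/\\=<>]/g, "");
}

// Applied to an already extracted title, only the unwanted characters disappear.
console.log(stripChars("Q: a/b | c?"));

// Applied before extraction, the tags themselves are destroyed,
// so a later search for <title>...</title> can no longer match.
console.log(stripChars("<title>a/b</title>"));
```

This is the reason the cleanup stage sits at the end of the pipeline rather than at the start.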

Dave Benham

#12 Post by **foxidrive** » 15 Nov 2014 22:09

Dave, I was wondering if you would consider a transliteration switch like GnuSed has in the Y switch.

y/abc/xyz replaces - a with x - and b with y - and c with z.

Do you think that is a worthwhile function to include?
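For readers unfamiliar with transliteration, the behaviour being asked about can be sketched in JavaScript. The function `transliterate` is a hypothetical illustration of the sed-style y/abc/xyz/ mapping and says nothing about how, or whether, JREPL implements it.

```javascript
// Hypothetical helper emulating a y/source/dest/ style mapping; not a JREPL option.
function transliterate(text, source, dest) {
  if (source.length !== dest.length) {
    throw new Error("source and dest must have the same length");
  }
  var result = "";
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var pos = source.indexOf(ch);
    // Characters not listed in source pass through unchanged.
    result += pos === -1 ? ch : dest.charAt(pos);
  }
  return result;
}

console.log(transliterate("a cab", "abc", "xyz"));
```

Each character is mapped independently in a single pass, so a replaced character is never replaced a second time.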

#13 Post by **bars143** » 15 Nov 2014 22:30

dbenham wrote:
bars143 wrote:Dave, I have a problem: some webpage titles contain one or more of these characters: "|" "?" ":" "\" "/" "=" "<" ">"

I can delete the characters "\" "|" ":" "?" "=" by escaping them with a leading "\", e.g. "\?" "\:" etc.

But I cannot delete the following, and do not know how: "/" "<" ">"

Those chars are needed to identify the <title> </title> tags, so they must be deleted after you extract the title, not before.
Code: Select all
type test.txt | jrepl "=?\r?\n" "" /m | jrepl "<title>(.*?)</title>" "$1" /jmatch /m | jrepl "[|?:/\\=<>]" ""
Dave Benham

Oh, that's a simple tweak. I never thought of that: those characters should be deleted last.
And only the backslash needs another backslash inside the square brackets: [|?:/\\=<>*] (I added the asterisk since it is not allowed in a filename). But I don't know how to handle the double quote symbol.

Edited on Nov 16 2014:

Now I found it: it is \q, the escape sequence for a double quote.

Anyway, thanks Dave for the quick reply.

Bars

#14 Post by **brinda** » 17 Nov 2014 20:56

Dave,

Thank you for JREPL.

Would it be possible to use JREPL to remove duplicate lines (case sensitive) while leaving the lines in their original order and stripping blank lines as well?

From your link: http://stackoverflow.com/questions/11689689/batch-to-remove-duplicate-rows-from-text-file

Example below:

Code: Select all

@echo off
setlocal disableDelayedExpansion
set "file=%~1"
set "line=%file%.line"
set "deduped=%file%.deduped"
::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove
>"%deduped%" (
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%file%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    >"%line%" (echo !ln:\=\\!)
    >nul findstr /xlg:"%line%" "%deduped%" || (echo !ln!)
    endlocal
  )
)
>nul move /y "%deduped%" "%file%"
2>nul del "%line%"

If yes, could you please show how to do it? Thanks.

#15 Post by **dbenham** » 17 Nov 2014 21:22

@brinda - I don't see how JREPL can help with that, especially if the lines are not sorted.
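For reference, the task itself (drop blank lines and case-sensitive duplicates, keep the first occurrence of each line in its original order) can be sketched in plain JavaScript, independent of JREPL. The helper `dedupeLines` is a hypothetical name, and the sketch assumes a modern JavaScript runtime with `Set` (for example Node.js) and input small enough to hold in memory.

```javascript
// Hypothetical helper, independent of JREPL: order-preserving, case-sensitive de-duplication.
function dedupeLines(text) {
  var seen = new Set();
  var kept = [];
  var lines = text.split(/\r?\n/);
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    if (line === "") continue;      // strip blank lines
    if (seen.has(line)) continue;   // skip lines that already appeared
    seen.add(line);
    kept.push(line);                // first occurrence keeps its original position
  }
  return kept;
}

console.log(dedupeLines("b\na\n\nB\nb\na\n"));
```

No sorting is needed, because the set of already seen lines is enough to tell whether a line is a repeat.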
