JREPL.BAT v8.6 - regex text processor with support for text highlighting and alternate character sets

Grahack
Posts: 3
Joined: 25 May 2015 00:19

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#76 Post by Grahack » 08 Jun 2015 03:46

jeb wrote:...
After the first parser run in the batch file there is exactly one <LF> between "First line" and "second line".
This would produce two lines, but when this is transferred to the child cmd.exe (by the pipe), the line is parsed again.
But then the linefeed is "raw", without the necessary escaping, so it simply removes the rest of the line.
This results in the output of only "First line".


Indeed. I asked on StackOverflow in the meantime. See here:
http://stackoverflow.com/questions/3061 ... 4#30616034
Thanks

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#77 Post by Squashman » 10 Jun 2015 15:55

Hi Guys, sorry I have not been around much. Super busy with work and family.

Today I got some really horrible data from a client and I think I can fix it with JREPL.

You know how people sometimes hard-code a linefeed in an Excel cell, and when they export the file the linefeed comes out too, even though each record still ends with a carriage return and line feed? Well, I somehow got the worst of both worlds. The client has a CRLF in the middle of a field in a CSV file they sent us, so a lot of the records are wrapping.

The records look like this. Typing out \CR\LF so you know what it really is.

Code: Select all

Monrovia,012345678,2015-06-02,SO-0654321,Sales Order,395.00,1015 ARCADIA AVE\CR\LF
APT 5,ARCADIA,CA,91007\CR\LF

What it should look like of course is this.

Code: Select all

Monrovia,012345678,2015-06-02,SO-0654321,Sales Order,395.00,1015 ARCADIA AVE APT 5,ARCADIA,CA,91007\CR\LF


I think the common denominator here is that there should only be a CRLF if it is preceded by a 5-digit zipcode. It seems that all the bad records have some type of apartment or unit designation. So is there any way to strip the CRLF when it is not preceded by 5 digits?

My other thought was to count the number of tokens in each line. If there aren't 10 tokens, hold the line in a temporary variable and append the next line to it. Either way it will probably require one of Dave's hybrid scripts in case one of the tokens is empty.
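
To make the token-counting idea concrete, here is a minimal Python sketch (not batch, and not part of JREPL). The function name `rejoin_wrapped` and the `expected_fields` parameter are made up for illustration. It assumes every comma is a column delimiter and that a stray break never falls inside the last column.

```python
def rejoin_wrapped(lines, expected_fields=10):
    """Merge physical lines until each logical record has enough fields."""
    records = []
    pending = ""
    for line in lines:
        # Join a continuation line to the held text with a single space.
        pending = f"{pending} {line}" if pending else line
        # split(",") keeps empty fields, so ",," still counts as two columns.
        if len(pending.split(",")) >= expected_fields:
            records.append(pending)
            pending = ""
    if pending:
        records.append(pending)  # incomplete trailing record, kept as-is
    return records


sample = [
    "Monrovia,012345678,2015-06-02,SO-0654321,Sales Order,395.00,1015 ARCADIA AVE",
    "APT 5,ARCADIA,CA,91007",
]
for record in rejoin_wrapped(sample):
    print(record)
```

Because empty fields are counted by position, a blank token does not throw the count off, which is the main weakness of a plain FOR /F token count.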

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#78 Post by dbenham » 10 Jun 2015 16:31

Sure thing Squashman - easy peasy :)

Of course the /M option is needed to search for \CR\LF. Because /M loads the entire input into memory, this can't work on multi-gigabyte files.

I use the /T option to establish two search/replace pairs, with the left pair taking precedence.

The first looks for comma, followed by 5 digits, followed by \CR\LF (\CR being optional), and replaces the string with itself.
The second looks for \CR\LF (\CR being optional), and replaces it with a space.

I believe I have a bug :(
The following should work, but it does not substitute the $1 value properly. Instead it treats $1 as a literal.

Code: Select all

jrepl ",\d{5}\r?\n|\r?\n" "$1| " /m /t "|" /f test.csv /o -

But the following form using the /J option works great :D

Code: Select all

jrepl ",\d{5}\r?\n|\r?\n" "$1|' '" /j /m /t "|" /f test.csv /o -
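
For anyone who wants to check the logic of the two search/replace pairs outside JREPL, here is a rough Python equivalent. It is only a sketch: the /T behaviour is imitated with one alternation and a callback, and the name `fix_breaks` is made up.

```python
import re

# First alternative: a line break that follows ",<5 digits>" is a real record end.
# Second alternative: any other line break is a stray break inside a field.
PATTERN = re.compile(r"(,\d{5}\r?\n)|\r?\n")


def fix_breaks(text):
    # Keep the first alternative unchanged; turn every other break into a space.
    return PATTERN.sub(lambda m: m.group(1) if m.group(1) else " ", text)


broken = (
    "Monrovia,012345678,2015-06-02,SO-0654321,Sales Order,395.00,1015 ARCADIA AVE\r\n"
    "APT 5,ARCADIA,CA,91007\r\n"
)
print(repr(fix_breaks(broken)))
```

As in the JREPL command, the order of the alternatives matters: the zipcode case has to be tried first so that it wins over the generic line break.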


Dave Benham

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#79 Post by foxidrive » 11 Jun 2015 03:50

This does depend on the data obviously, but assuming that the norty CRLF doesn't have a number after the CRLF then this should work too.

Just to be clearer, it's checking for a non-number both (before AND after) the CRLF

Code: Select all

jrepl "(\D)\r\n(\D)" "$1 $2" /m /x /f "file.csv"
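
The same non-digit check can be tried quickly in Python. This is just a sketch with made-up sample data, and `join_non_digit_breaks` is a hypothetical name. It also shows the case the approach cannot fix: a continuation line that starts with a digit.

```python
import re


def join_non_digit_breaks(text):
    # Join only when a non-digit sits on both sides of the CRLF.
    return re.sub(r"(\D)\r\n(\D)", r"\1 \2", text)


fixed = join_non_digit_breaks(
    "1015 ARCADIA AVE\r\nAPT 5,ARCADIA,CA,91007\r\nMonrovia,0123\r\n"
)
print(repr(fixed))

# A continuation that begins with a digit is left split.
missed = join_non_digit_breaks("MAIN ST SUITE\r\n12,BOSTON,MA,02108\r\n")
print(repr(missed))
```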

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#80 Post by Squashman » 11 Jun 2015 12:28

I ended up having to go with the 80/20 rule. The code I used fixed the majority of the wrapping records, but I still had to manually fix a few dozen records because I just could not get it to match the expression. The data was really bad so I doubt we could have come up with a regular expression that would have fixed them all but getting the majority really helped out a lot.

Just tweaked Dave's code a bit. I put in a check for a two-character state code before the zipcode. But I also had to tweak the zipcode logic, because sometimes the zipcode was listed without its leading zero. Why the USPS decided to use leading zeros for the zipcode is beyond me. We all know how Excel likes to drop leading zeros and trailing decimal zeros.

Code: Select all

jrepl ",[a-z][a-z],\d{4,}\r?\n|\r?\n" "$1|' '" /i /j /m /t "|" /f file.txt /o -
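
A Python sketch of the tweaked rule, for checking it against sample data. The Boston record is invented to show a zipcode that lost its leading zero, and `fix_breaks` is a made-up name.

```python
import re

# A real record end is ",<2-letter state>,<4 or more digits>" followed by a break.
# IGNORECASE plays the role of the /i option.
RECORD_END = re.compile(r"(,[a-z]{2},\d{4,}\r?\n)|\r?\n", re.IGNORECASE)


def fix_breaks(text):
    # Keep genuine record ends, replace every other break with a space.
    return RECORD_END.sub(lambda m: m.group(1) or " ", text)


data = (
    "1015 ARCADIA AVE\r\nAPT 5,ARCADIA,CA,91007\r\n"
    "12 MAIN ST,BOSTON,ma,2108\r\n"
)
print(repr(fix_breaks(data)))
```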

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#81 Post by dbenham » 12 Jun 2015 07:42

I have fixed the /T bug such that $n and $nn substitution now works properly without the /J option. :-)

So for Squashman, either of the following will now work:

Code: Select all

jrepl ",[a-z][a-z],\d{4,}\r?\n|\r?\n" "$1| " /i /m /t "|" /f file.txt /o -
or

Code: Select all

jrepl ",[a-z][a-z],\d{4,}\r?\n|\r?\n" "$&| " /i /m /t "|" /f file.txt /o -


Here is version 3.5
JREPL3.5.zip
(8.7 KiB) Downloaded 1294 times


Dave Benham
Last edited by dbenham on 12 Jun 2015 10:49, edited 1 time in total.

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#82 Post by foxidrive » 12 Jun 2015 09:16

I have a V 3.4 here already, Dave.

Code: Select all

::JREPL.BAT version 3.4
::
::  Release History:
::    2015-01-22 v3.4: Bug fix - Use /TEST instead of TEST as a variable name
::                     within the option parser so that it is unlikely to
::                     collide with a user defined variable name.
::    2014-12-24 v3.3: Bug fix for when /JMATCH is combined with /M or /S
::    2014-12-09 v3.2: Bug fix for /T without /JMATCH - fixed dynamic repl func
::                     Added GOTO at top for improved startup performance
::    2014-11-25 v3.1: Added /JLIB option
::                     Exception handler reports when regex is bad
::                     Fix /X bug with extended ASCII
::    2014-11-23 v3.0: Added /JBEGLN and /JENDLN options
::                     Added skip, quit, and lpad() global variables/functions
::                     Exception handler reports when error in user code
::    2014-11-21 v2.2: Bug fix for /T with /L option.
::    2014-11-20 v2.1: Bug fix for /T option when match is an empty string
::    2014-11-17 v2.0: Added /T (translate) and /C (count input lines) options
::    2014-11-14 v1.0: Initial release
::

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#83 Post by dbenham » 12 Jun 2015 10:55

Thanks foxi :!:

I made the change on a computer with an outdated version. :evil:

I've merged the two versions and edited my prior post to version 3.5.


Dave Benham

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#84 Post by Squashman » 12 Jun 2015 15:46

Is it possible to match 9 commas to basically say there are 10 fields and then a CRLF? I am not sure of the syntax because you could have fields that are blank, so there would be a double comma. But if the asterisk matches the preceding character and a period matches any single character, how do you create an expression that says there may or may not be data between the commas?

Code: Select all

jrepl ".*,.*,.*,.*,.*,.*,.*,.*,.*,.*\r?\n|\r?\n" "$1|' '" /i /j /m /t "|" /f file.txt /o -


Or could you create a grouping and use a repetition factor on that grouping? Again, not sure of the syntax.

Code: Select all

jrepl "(.*,){9}.*\r?\n|\r?\n" "$1|' '" /i /j /m /t "|" /f file.txt /o -

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#85 Post by foxidrive » 13 Jun 2015 03:05

Use this construct instead of .* and it will match each comma.

By rights this should show you lines that do not have at least 10 fields.

Code: Select all

jrepl ".*?,.*?,.*?,.*?,.*?,.*?,.*?,.*?,.*?,.*" "" /f file.txt 

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#86 Post by dbenham » 13 Jun 2015 07:43

Yes, the ? makes the * non-greedy. That command finds lines with at least 9 commas and turns them into blank lines. The search regex can be dramatically simplified, and the command modified to preserve only lines that have at least 9 commas:

Code: Select all

jrepl "(.*?,){9}.*" "$0" /jmatch /f test.csv
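
The same filter can be sketched in Python to see how the lazy `.*?` copes with empty fields. The name `complete_lines` is made up, and `fullmatch` stands in for what /jmatch does here, since the pattern covers the whole line whenever it matches at all.

```python
import re

# Nine "anything, then a comma" groups, then the rest of the line.
AT_LEAST_TEN_FIELDS = re.compile(r"(.*?,){9}.*")


def complete_lines(lines):
    # Keep only lines that contain at least 9 commas (10 or more fields).
    return [line for line in lines if AT_LEAST_TEN_FIELDS.fullmatch(line)]


lines = [
    "a,b,c,d,e,f,g,h,i,j",  # 10 fields
    "," * 9,                # 10 empty fields
    "a,b,c",                # too short
]
print(complete_lines(lines))
```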

But that doesn't really help solve the problem if you plan on using just a single jrepl pass. Whenever a logical line is split into two, then the first \LF must be turned into a space, but the last one should not, yet the line with the valid \LF did not have 9 commas :!:

The solution is to use 3 jrepl passes. The first pass identifies "valid" \LF and doubles them. The second pass converts \CR\LF that do not precede \LF into a space, and the final pass converts \LF\LF back into \LF. I use [\s\S] instead of a dot so that it can match \LF.

Code: Select all

jrepl "([\s\S]*?,){9}.*\n" "$&\n" /x /m /f test.csv | jrepl "\r?\n\n|\r?\n" "$&| " /m /t "|" | jrepl "\n\n" "\n" /x /m /o test.csv.new
move /y test.csv.new test.csv >nul
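
The three passes can be followed step by step in this Python sketch. It mirrors the regular expressions above but is not a replacement for the JREPL pipeline, and the function name `three_pass` is made up.

```python
import re


def three_pass(text):
    # Pass 1: after 9 commas, the next LF is taken as a real record end.
    # Mark it by doubling the LF.
    marked = re.sub(r"(?:[\s\S]*?,){9}.*\n", lambda m: m.group(0) + "\n", text)
    # Pass 2: keep the marked breaks, turn every other break into a space.
    joined = re.sub(r"(\r?\n\n)|\r?\n", lambda m: m.group(1) or " ", marked)
    # Pass 3: collapse the doubled LF back into a single LF.
    return joined.replace("\n\n", "\n")


wrapped = (
    "Monrovia,012345678,2015-06-02,SO-0654321,Sales Order,395.00,1015 ARCADIA AVE\r\n"
    "APT 5,ARCADIA,CA,91007\r\n"
    "A,B,C,D,E,F,G,H,I,J\r\n"
)
print(repr(three_pass(wrapped)))
```

Like the original, this assumes the input contains no blank lines, because a pre-existing \LF\LF would be indistinguishable from a marker.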

I put the word "valid" in quotes because it is impossible to know for sure which \LF is valid if unwanted \LF can occur in the first column, and can also occur in the last column. :(
The above assumes the first occurring \LF after 9 commas is valid, so it does not allow \LF in the last column.

Using JScript with the /J and /JBEG options, it is possible to do everything in one pass :)
I keep a running count of which column I am in, as identified by commas. If I find a (\CR)\LF while in the 10th column, then I reset the count to 1 and preserve it, otherwise I replace it with a space. I seriously abuse the condition?expression:expression construct with an ugly hack, but it works :wink:

Code: Select all

jrepl ",|\r?\n" "field+=1;$0|field==10?((field=1)==1?$0:''):' '" /j /jbeg "var field=1" /t "|" /m /f test.csv /o -
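
Written out without the ternary hack, the column counter looks roughly like this in Python. It is a sketch of the same idea, with made-up names `one_pass` and `columns`, not a translation of JREPL internals.

```python
import re


def one_pass(text, columns=10):
    field = 1  # column currently being scanned

    def handle(match):
        nonlocal field
        token = match.group(0)
        if token == ",":
            field += 1        # a comma moves on to the next column
            return token
        if field == columns:  # a break in the last column ends the record
            field = 1
            return token
        return " "            # a break anywhere else is a stray; join the lines

    return re.sub(r",|\r?\n", handle, text)


wrapped = (
    "Monrovia,012345678,2015-06-02,SO-0654321,Sales Order,395.00,1015 ARCADIA AVE\r\n"
    "APT 5,ARCADIA,CA,91007\r\n"
    "A,B,C,D,E,F,G,H,I,J\r\n"
)
print(repr(one_pass(wrapped)))
```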


One final note - All the discussion so far assumes that all commas are column delimiters. But the CSV format allows for comma literals within values if the field is quoted. In reality, a field can also contain \LF if it is quoted, though not many parsers support that feature. The other complicating feature is quote literals are doubled within a quoted field. I don't think I would use JREPL to support these features with this problem, and I certainly wouldn't attempt it with pure batch. I would probably write a custom JScript script.
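
To illustrate those three CSV features (quoted commas, quoted line breaks, doubled quotes), here is a small Python sketch using the standard `csv` module. It only applies to properly quoted input, which the file in this thread is not. The sample data and the name `flatten_quoted_breaks` are invented.

```python
import csv
import io
import re


def flatten_quoted_breaks(raw):
    # newline="" leaves line endings untouched so the csv module can see
    # breaks that sit inside quoted fields.
    reader = csv.reader(io.StringIO(raw, newline=""))
    # Replace any line break inside a field with a single space.
    return [[re.sub(r"\r?\n", " ", field) for field in row] for row in reader]


raw = (
    "id,address,note\r\n"
    '1,"1015 ARCADIA AVE\r\nAPT 5","said ""hi"", left"\r\n'
)
for row in flatten_quoted_breaks(raw):
    print(row)
```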

I've written the hybrid parseCSV.bat that allows FOR /F to parse nearly any properly formatted CSV. But I don't see how it helps with this problem given that the CSV is not valid. However, the JScript within might be a good starting point in developing a custom JScript solution to this problem.


Dave Benham

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#87 Post by Squashman » 17 Jun 2015 22:00

Thanks for your help guys. Have not tried the last examples as I have not been at work. But since I have been off of work, it has given me some time to play around with some fun stuff in my free time.

My oldest son races mountain bikes in the Wisconsin Off Road Series (WORS).

I started working on a batch file to scrape his team's results from the website using WGET.

The output is pretty well text aligned, but when a person has a DNF (Did not finish) or an NR (Not rated) because they never started the race, the output gets misaligned for those racers because they do not have a time for the race. Basically the text is off by 11 spaces. So I am wondering: is there a way to get the text lined up again with JREPL?

Currently they have only raced in 4 races. After each race I will run this script.

Code: Select all

@echo off
TITLE Get WORS Results

SET /P "RACENUM=What race number do you want to retrieve? : "

FOR %%G IN (junior citizen_m citizen_f sport_m sport_f comp_m elite) DO (
   FOR /F "delims=" %%H IN ('wget -O - -q http://www.wors.org/results/2015/%RACENUM%/overall/%%G.htm ^|findstr /I /C:"Broken Spoke"') DO (
      >>ALL_Race_Results.txt echo %%H %RACENUM% %%G
   )
)


So if you see the output of race one you will see what I am talking about.

Code: Select all

    6        1 F 11-11   6060   Willow Kapitz          Manitowoc  WI              11     28:15.4      3:32.8  Broken Spoke Racing            1 junior
    6        3 M 13-13   6043   Kaleb Moore            Green Bay  WI              13     22:30.2      2:09.9  Broken Spoke                   1 junior
   42        1 M  4- 8   6034   Thaddeus Sahs          Sturgeon Bay  WI           8      31:29.5     11:09.2  Broken Spoke Bike Studio       1 junior
   54        4 M  4- 8   6061   Ash Kapitz             Manitowoc  WI               7     39:07.4     18:47.1  Broken Spoke Racing            1 junior
   19        2 M 45-49   4152   David Peck             Green Bay  WI              47     51:06.0      5:14.4  Broken Spoke                   1 citizen_m
   96       15 M 45-49   4055   Dean Scharie           Manitowoc  WI              46   1:01:42.7     15:51.0  Broken Spoke                   1 citizen_m
   21        1 F 25-34   5066   Desiree Schmidt        Green Bay  WI              28   1:08:53.8     16:04.6  Broken Spoke                   1 citizen_f
   23        5 F 45-54   5065   Deb Neuville           Sturgeon Bay  WI           54   1:10:34.0     17:44.8  Broken Spoke                   1 citizen_f
    3        3 M 30-34   2084   Zachariah Radey        Manitowoc  WI              31   1:09:36.9      2:04.6  Broken Spoke Cycling Team      1 sport_m
    4        4 M 30-34   2357   Daniel Hebert          De Pere  WI                31   1:09:41.6      2:09.3  Broken Spoke Studio            1 sport_m
    9        1 M 35-39   2063   Greg Halverson         Sheboygan Falls  WI        39   1:12:26.2      4:53.9  Broken Spoke Cycling           1 sport_m
   22        3 M 15-16   2077   Ethan Halverson        Sheboygan Falls  WI        16   1:14:35.5      7:03.2  Broken Spoke Cycling           1 sport_m
   42        4 M 25-29   2316   David Rossow           Green Bay  WI              28   1:17:56.3     10:24.0  The Broken Spoke Bike Studio,  1 sport_m
   48        7 M 35-39   2326   Jeremy Rennie          Green Bay  WI              37   1:18:31.5     10:59.2  Broken Spoke                   1 sport_m
  103       19 M 45-49   2078   Randal Sahs            Sturgeon Bay  WI           46   1:24:01.6     16:29.3  Broken Spoke bike Studio       1 sport_m
   15        4 F 45-54   3011   Toni House             Oneida  WI                 53   1:36:25.7     17:26.9  Broken Spoke                   1 sport_f
   23        5 F 25-34   3058   Jennifer Uttendorfer   Madison  WI                25   1:51:59.8     33:01.0  Broken Spoke                   1 sport_f
   11        1 M 35-39   1112   Eric Stanke            Green Bay  WI              39   1:42:00.7      6:16.6  Broken Spoke Cycling Team      1 comp_m
   21        1 M 50-54   1160   Michael Jones          Fish Creek  WI             54   1:45:24.3      9:40.3  Broken Spoke                   1 comp_m
   23        7 M 30-34   1024   George Kapitz          Manitowoc  WI              33   1:45:27.6      9:43.5  Broken Spoke Racing            1 comp_m
   47        7 M 40-44   1074   Scott Trierweiler      Marshfield  WI             41   1:52:36.9     16:52.9  Broken Spoke cycles            1 comp_m
  DNF          M 45-49   1180   Michael Anderson       Green Bay  WI              47               Broken Spoke                   1 comp_m
   NR          M 30-34   1037   Brandon Teske          Madison  WI                32               Broken Spoke                   1 comp_m


Is there any way to get the text to all line up when the racer has a DNF or NR using JREPL?

Then eventually I will change the output to a CSV so that I can import it into Excel.

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#88 Post by dbenham » 17 Jun 2015 22:53

Pretty rudimentary regex is all that is needed.

Code: Select all

jrepl "^(  DNF|   NR).{80}" "$&           " /f test.txt
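
The effect is easy to check in Python with a synthetic row. This is a sketch only: the row below is built to the widths described in the thread rather than copied from the race results, and `realign` is a made-up name.

```python
import re

PAD = " " * 11  # width of the missing time columns, per the question above


def realign(line):
    # Keep the 5-character DNF/NR marker plus the next 80 characters,
    # then insert the padding so the team name lands in its usual column.
    return re.sub(r"^(  DNF|   NR).{80}", lambda m: m.group(0) + PAD, line)


head = "M 45-49   1180   Michael Anderson".ljust(80)
dnf_row = "  DNF" + head + "Broken Spoke"
timed_row = "   11" + head + PAD + "Broken Spoke"

# After realignment the team name starts in the same column as on a timed row.
print(realign(dnf_row).index("Broken Spoke") == timed_row.index("Broken Spoke"))
```

Rows that do not start with the DNF or NR marker do not match the pattern and pass through unchanged.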


Dave Benham

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#89 Post by Squashman » 17 Jun 2015 23:34

Cool beans. That makes complete sense. Regular Expressions have not always been my forte. If I use them enough I remember how to use them. But if I have not used them in a while I forget how to use them.

Of course this now makes for a very very very long line in my batch file.

Code: Select all

@echo off
TITLE Get WORS Results

SET /P "RACENUM=What race number do you want to retrieve? : "

FOR %%G IN (junior citizen_m citizen_f sport_m sport_f comp_m elite) DO (
   FOR /F "delims=" %%H IN ('wget -O - -q http://www.wors.org/results/2015/%RACENUM%/overall/%%G.htm ^|findstr /I /C:"Broken Spoke" ^|jrepl "^(  DNF|   NR).{80}" "$&           "') DO (
      >>ALL_Race_Results.txt echo %%H %RACENUM% %%G
   )
)

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: JREPL.BAT - regex text processor - successor to REPL.BAT

#90 Post by foxidrive » 18 Jun 2015 05:27

I didn't read to the end of the page - and my effort would make it even longer... but there ya go:

Code: Select all

jrepl "(^[ ]*?)(DNF|NR)(.{90})(.*)" "$1$2$3           $4" /f
