Massaging Textual Data

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Massaging Textual Data

#1 Post by Samir » 30 Oct 2015 18:07

I have a text file with the following data:

Code: Select all

Unl-Self             669.878  1419.48
Ul-Plus-Self          42.943    97.43
Supreme-Self         128.648   349.79
Diesel-Self           19.100    44.30
Automotive                 0     0.00
Beverage                   0     0.00
Candy                      5     5.21
Coffee                     0     0.00
Coke Product               1     0.99
Cigarettes                 0     0.00
Energy Drink               4     9.18
Fountain                   2     2.18
Grocery                    2     1.58
Cash Back Fe               0     0.00
Gatorade                   0     0.00
Arizona                    0     0.00
General Merc               0     0.00
Newspapers                 3     2.50
Pepsi Produc               1     0.79
Beer                       0     0.00
Tobacco                    0     0.00
HBA                        5     3.25
Novelty                    0     0.00
Ice Bags                   0     0.00
Electric Cig               0     0.00
Peanuts                    0     0.00
Phone Charge               0     0.00
HOT DOG ETC.               0     0.00
Fuel Correct               0     0.00
Phone card                 0     0.00
Rip It Energ               3     2.97
Unknown                    0     0.00
Gift Card                  0     0.00
This file continues on with other stuff I'm not interested in, but only after a blank line which can be used to designate the end of important data.

Each one of the items in column 1 corresponds to another moniker. For example, 'Ice Bags' is 'Non-Fuel:Ice'.

Lines that have a zero value in both the second and third columns can be left out of the final file.

What I want is this entire file changed to this format (essentially transposed) with the monikers for each item. Example:

Code: Select all

Fuel:87
669.878
1419.48
Fuel:89
42.943
97.43
Fuel:93
128.648
349.79
Fuel:Diesel
19.100
44.30
Non-Fuel:Candy
5
5.21
Non-Fuel:Coke
1
0.99
Non-Fuel:Energy Drink
4
9.18
Non-Fuel:Fountain
2
2.18
Non-Fuel:Grocery
2
1.58
Non-Fuel:Newspapers
3
2.50
Non-Fuel:Pepsi
1
0.79
Non-Fuel:HBA
5
3.25
Non-Fuel:Rip It
3
2.97
I've got a couple of ideas on how to do this and have spent a good part of the day thinking on how to make it not loop so much--loop once through each line, loop for each moniker change, etc.

A couple of other helpful facts. The number of lines to process does not change, nor does the position or name of each moniker. It could change, but a bunch of other things would break too, so it's not a real problem to use hardcoding if it makes the code easier to follow.

Can you help me make this lean and mean? Thank you in advance!

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Massaging Textual Data

#2 Post by aGerman » 31 Oct 2015 05:34

Something like that may work for you:

Code: Select all

@echo off &setlocal
set "txt_in=test.txt"
set "txt_out=test2.txt"

set "end="
for /f "delims=:" %%i in ('findstr /n "^$" "%txt_in%"') do if not defined end set /a "end=%%i-1"
if not defined end for /f %%i in ('type "%txt_in%"^|find /c /v ""') do set "end=%%i"

setlocal EnableDelayedExpansion
<"!txt_in!" >"!txt_out!" (
  for /l %%i in (1 1 %end%) do (
    set /p "ln="
    for /f %%j in ("!ln:~12,16!") do set "amount=%%j"
    if "!amount!" neq "0" (
      call :processProduct !ln:~,12!
      for /f %%j in ("!ln:~28,9!") do set "price=%%j"
      echo !product!
      echo !amount!
      echo !price!
    )
  )
)

exit /b

:processProduct
set "product=%*"
if "%product%"=="Unl-Self" (
  set "product=Fuel:87"
) else if "%product%"=="Ul-Plus-Self" (
  set "product=Fuel:89"
) else if "%product%"=="Supreme-Self" (
  set "product=Fuel:93"
) else if "%product%"=="Diesel-Self" (
  set "product=Fuel:Diesel"
) else set "product=Non-Fuel:%product%"
exit /b

Regards
aGerman

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Re: Massaging Textual Data

#3 Post by Samir » 31 Oct 2015 10:21

That's the logic I was thinking of too, but the if else loop has to be processed on each line. :( It's varying the amount of time taken with the last product type taking the longest, but it still has to loop each time.

So an idea that occurred to me before sleeping last night was that if each product item is in a fixed location and a fixed moniker, why do I even want to validate it with the source data? With that, a loop can be done just to take each line, get the amount and price and simply write in the moniker for each line based on a line number or something. But I haven't been able to think of a way to do that without looping again. Thoughts?

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Massaging Textual Data

#4 Post by aGerman » 31 Oct 2015 10:40

The bottleneck is not the "if-else" cascade but the "call :label". The latter is needed to remove trailing spaces though. And - no - it isn't processed for each line because the check for "amount not 0" is placed beforehand.

Samir wrote:
With that, a loop can be done just to take each line, get the amount and price and simply write in the moniker for each line based on a line number or something.

What about lines with an amount of 0? As far as I understood, you wanted to remove these.

Regards
aGerman
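
If the "call :label" overhead mentioned above ever matters, one possible alternative is to trim the trailing spaces inline with delayed expansion. This is only a sketch, not part of the code posted in this thread: the sample line is copied from the data in post #1, and the 12-character column width is taken from the substring offsets used in the first script.

```batch
@echo off & setlocal EnableDelayedExpansion
rem Sketch only: the sample line is copied from the data in post #1.
set "ln=Candy                      5     5.21"

rem Take the 12-character name column.
set "product=!ln:~0,12!"

rem Strip at most one trailing space per pass; 12 passes cover the whole column.
for /l %%n in (1 1 12) do if "!product:~-1!"==" " set "product=!product:~0,-1!"

rem The brackets make any leftover spaces visible.
echo [!product!]
```

Whether this is actually faster than the call depends on the system, so it would be worth timing both variants on the real file.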

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Massaging Textual Data

#5 Post by aGerman » 31 Oct 2015 11:17

OK, I guess that could be what you have in mind

Code: Select all

@echo off &setlocal
set "txt_in=test.txt"
set "txt_out=test2.txt"

setlocal EnableDelayedExpansion
<"!txt_in!" >"!txt_out!" (
  for %%i in (
    "Fuel:87"
    "Fuel:89"
    "Fuel:93"
    "Fuel:Diesel"
    "Non-Fuel:Automotive"
    "Non-Fuel:Beverage"
    "Non-Fuel:Candy"
    "Non-Fuel:Coffee"
    "Non-Fuel:Coke Product"
    "Non-Fuel:Cigarettes"
    "Non-Fuel:Energy Drink"
    "Non-Fuel:Fountain"
    "Non-Fuel:Grocery"
    "Non-Fuel:Cash Back Fe"
    "Non-Fuel:Gatorade"
    "Non-Fuel:Arizona"
    "Non-Fuel:General Merc"
    "Non-Fuel:Newspapers"
    "Non-Fuel:Pepsi Produc"
    "Non-Fuel:Beer"
    "Non-Fuel:Tobacco"
    "Non-Fuel:HBA"
    "Non-Fuel:Novelty"
    "Non-Fuel:Ice Bags"
    "Non-Fuel:Electric Cig"
    "Non-Fuel:Peanuts"
    "Non-Fuel:Phone Charge"
    "Non-Fuel:HOT DOG ETC."
    "Non-Fuel:Fuel Correct"
    "Non-Fuel:Phone card"
    "Non-Fuel:Rip It Energ"
    "Non-Fuel:Unknown"
    "Non-Fuel:Gift Card"
  ) do (
    set "ln=" &set /p "ln="
    if defined ln for /f "tokens=1,2" %%j in ("!ln:~12!") do (
      if "%%j" neq "0" (
        echo %%~i
        echo %%j
        echo %%k
      )
    )
  )
)

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Re: Massaging Textual Data

#6 Post by Samir » 31 Oct 2015 11:19

aGerman wrote:The bottleneck is not the "if-else" cascade but the "call :label". The latter is needed to remove trailing spaces though. And - no - it isn't processed for each line because the check for "amount not 0" is placed beforehand.

With that, a loop can be done just to take each line, get the amount and price and simply write in the moniker for each line based on a line number or something.

What about lines with an amount of 0? As far as I understood, you wanted to remove these.

Regards
aGerman
Ahh, thank you for the clarification. Yes, 0 amount and 0.00 price lines can be ignored.

I'll try yours as is and see what I get. The switch statements will probably have to be expanded as most of the monikers are completely different from their values in the original text file.

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Re: Massaging Textual Data

#7 Post by Samir » 31 Oct 2015 11:20

aGerman wrote:OK, I guess that could be what you have in mind

Yes! I knew there had to be a way to do this reducing the amount of looping! I didn't think of putting the monikers inside a for like that. I'll try this out and post back. :D

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Massaging Textual Data

#8 Post by aGerman » 31 Oct 2015 12:46

Bear in mind that the first code is much more robust. Every item will be part of the output as long as the quantity isn't 0. The product names for non-fuel are taken from the original list.
For the second code you have to make sure that the number, the order, and the product names of the items never change. The monikers always point to the same line numbers. It would be pretty hard to become aware of misassignments.
Even if the second code is faster, it might be better to accept a few seconds of wasted time instead :wink:

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Re: Massaging Textual Data

#9 Post by Samir » 31 Oct 2015 12:55

aGerman wrote:Bear in mind that the first code is much more robust. Every item will be part of the output as long as the quantity isn't 0. The product names for non-fuel are taken from the original list.
For the second code you have to make sure that the number, the order, and the product names never change. The monikers always point to the same line numbers. It would be pretty hard to become aware of misassignments.
Even if the second code is faster, it might be better to accept a few seconds of wasted time instead :wink:
I agree. Although since any changes in the number, order, or product names break code elsewhere, and I'm the only one who can change them, I think it will be safe to use your second solution. 8)

(Even in the first code if the products change, it will break it too because the monikers won't be correct, so either way I'd have to change the code.)

I'm about to test your second solution and see how it works. 8)

Aacini
Expert
Posts: 1914
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Massaging Textual Data

#10 Post by Aacini » 31 Oct 2015 13:15

As a general rule, the direct access to an array element is much faster than a FOR loop with individual tests, and is less affected by a large number of elements. In the program below the moniker array is hard-coded, but it would be more convenient to store it in a text file.

The most difficult part of this program was to separate the product, units and amount fields, because the product may be comprised of several words.

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Define the "moniker" array
for %%a in ( "Unl-Self=Fuel:87"  "Ul-Plus-Self=Fuel:89"  "Supreme-Self=Fuel:93"  "Diesel-Self=Fuel:Diesel"
             "Candy=Non-Fuel:Candy"  "Coke Product=Non-Fuel:Coke"  "Energy Drink=Non-Fuel:Energy Drink"
             "Fountain=Non-Fuel:Fountain"  "Grocery=Non-Fuel:Grocery"  "Newspapers=Non-Fuel:Newspapers"
             "Pepsi Produc=Non-Fuel:Pepsi"  "HBA=Non-Fuel:HBA"  "Rip It Energ=Non-Fuel:Rip It") do (
   for /F "tokens=1,2 delims==" %%b in (%%a) do (
      set "moniker[%%b]=%%c"
   )
)

for /F "delims=" %%a in (input.txt) do (
   set "product="
   set "units="
   set "amount="
   for %%b in (%%a) do (
      set "product=!product! !units!"
      set "units=!amount!"
      set "amount=%%b"
   )
   if "!units!" neq "0" (
      for /F "delims=" %%p in ("!product:~3!") do echo !moniker[%%p]!
      echo !units!
      echo !amount!
   )
)

Antonio

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Re: Massaging Textual Data

#11 Post by Samir » 31 Oct 2015 13:24

It's worked quite well except for one odd thing, which seems to happen when there are two lines in a row with a 0 amount. It doesn't seem like this should be an issue because the loop test is quite simple.

Here's the code I'm using and the test data file:

Code: Select all

@echo off &setlocal
set "txt_in=TOTLSOLD"
set "txt_out=TOTLTEST.TXT"

setlocal EnableDelayedExpansion
<"!txt_in!" >"!txt_out!" (
  for %%i in (
    "Fuel Sales:87"
    "Fuel Sales:89"
    "Fuel Sales:93"
    "Fuel Sales:Diesel"
    "Non-Fuel Sales:Automotive"
    "Non-Fuel Sales:Beverage"
    "Non-Fuel Sales:Candy"
    "Non-Fuel Sales:Coffee"
    "Non-Fuel Sales:Coke Prod"
    "Non-Fuel Sales:Cigarettes"
    "Non-Fuel Sales:Energy Drink"
    "Non-Fuel Sales:Fountain"
    "Non-Fuel Sales:Grocery"
    "Non-Fuel Sales:Cash Back Fe"
    "Non-Fuel Sales:Gatorade"
    "Non-Fuel Sales:Arizona"
    "Non-Fuel Sales:Newspapers"
    "Non-Fuel Sales:Pepsi Prod"
    "Non-Fuel Sales:Beer"
    "Non-Fuel Sales:Tobacco"
    "Non-Fuel Sales:HBA"
    "Non-Fuel Sales:Novelty"
    "Non-Fuel Sales:Ice"
    "Non-Fuel Sales:Elec Cig"
    "Non-Fuel Sales:Peanuts"
    "Non-Fuel Sales:Phone Charge"
    "Non-Fuel Sales:Hot Dog Etc"
    "Non-Fuel Sales:Fuel Correct"
    "Non-Fuel Sales:Phone Card"
    "Non-Fuel Sales:Rip It Energ"
    "Non-Fuel Sales"
    "Non-Fuel Sales:Gift Card"
  ) do (
    set "ln=" &set /p "ln="
    if defined ln for /f "tokens=1,2" %%j in ("!ln:~12!") do (
      if "%%j" neq "0" (
        echo %%~i
        echo %%j
        echo %%k
      )
    )
  )
)
Test data file TOTLSOLD:

Code: Select all

Unl-Self            1752.757  4555.44
Ul-Plus-Self          84.941   234.36
Supreme-Self         232.519   711.28
Diesel-Self           19.281    53.00
Automotive                 0     0.00
Beverage                  43    73.77
Candy                     51    38.27
Coffee                     2     3.38
Coke Product              16    26.14
Cigarettes                 9    48.45
Energy Drink              22    51.04
Fountain                   7     5.95
Grocery                   24    21.36
Cash Back Fe               0     0.00
Gatorade                  10    17.40
Arizona                    0     0.00
General Merc               0     0.00
Newspapers                 3     2.50
Pepsi Produc              26    39.44
Beer                       0     0.00
Tobacco                    7    14.50
HBA                       10     7.10
Novelty                    1     1.39
Ice Bags                  11    24.09
Electric Cig               0     0.00
Peanuts                    0     0.00
Phone Charge               0     0.00
HOT DOG ETC.               0     0.00
Fuel Correct               0     0.00
Phone card                 0     0.00
Rip It Energ               0     0.00
Unknown                    0     0.00
Gift Card                  0     0.00

NON-FUEL PRODUCTS
Description     Price      Qty   Total
MORE LINES
MORE LINES
MORE LINES
Results in TOTLTEST.TXT:

Code: Select all

Fuel Sales:87
1752.757
4555.44
Fuel Sales:89
84.941
234.36
Fuel Sales:93
232.519
711.28
Fuel Sales:Diesel
19.281
53.00
Non-Fuel Sales:Beverage
43
73.77
Non-Fuel Sales:Candy
51
38.27
Non-Fuel Sales:Coffee
2
3.38
Non-Fuel Sales:Coke Prod
16
26.14
Non-Fuel Sales:Cigarettes
9
48.45
Non-Fuel Sales:Energy Drink
22
51.04
Non-Fuel Sales:Fountain
7
5.95
Non-Fuel Sales:Grocery
24
21.36
Non-Fuel Sales:Gatorade
10
17.40
Non-Fuel Sales:Pepsi Prod
3
2.50
Non-Fuel Sales:Beer
26
39.44
Non-Fuel Sales:HBA
7
14.50
Non-Fuel Sales:Novelty
10
7.10
Non-Fuel Sales:Ice
1
1.39
Non-Fuel Sales:Elec Cig
11
24.09
Newspapers is missing and the monikers are shifted as a result.

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Re: Massaging Textual Data

#12 Post by Samir » 31 Oct 2015 13:32

Aacini wrote:As a general rule, the direct access to an array element is much faster than a FOR loop with individual tests, and is less affected by a large number of elements. In the program below the moniker array is hard-coded, but it would be more convenient to store it in a text file.

The most difficult part of this program was to separate the product, units and amount fields, because the product may be comprised of several words.

Very creative solution!

An array was the other data structure I originally considered for storing the product/moniker table and then somehow retrieving entries from it. But updating the table if anything changes is still a chore--and it's a chore no matter what format we store the table in, since it would have to be updated either way. That's why I chose to just hard-code it.

(Anytime a product would have to be updated, so would the moniker. In hard-coding, just the moniker would have to be updated. This also gives some flexibility as a small product change like Ice to Ice Bags would not require that the script be updated so long as it still corresponded with the same moniker.)

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Massaging Textual Data

#13 Post by aGerman » 31 Oct 2015 13:41

one odd thing

... is that General Merc is missing in the monikers :wink:
That's exactly why I told you that it will be hard to become aware of those faults.
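
A mismatch like this can be caught at run time rather than by reading the output. The following is only a sketch under stated assumptions: the file names in %TEMP%, the tiny sample input, and the deliberately shortened moniker list are all made up for illustration. It keeps the same read loop as the second script, then reads one more line after the last moniker. That line should be the blank separator; if it is not, there are more data lines than monikers.

```batch
@echo off & setlocal EnableDelayedExpansion
rem Sketch only. The file names and the shortened moniker list are
rem illustrative assumptions, not taken from the real setup.
set "txt_in=%TEMP%\demo_in.txt"
set "txt_out=%TEMP%\demo_out.txt"

rem Build a small sample: three data lines, a blank line, then other text.
> "%txt_in%" (
  echo Unl-Self             669.878  1419.48
  echo Candy                      5     5.21
  echo Gift Card                  0     0.00
  echo(
  echo NON-FUEL PRODUCTS
)

set "problem="
< "%txt_in%" > "%txt_out%" (
  rem Only two monikers for three data lines, to provoke the warning.
  for %%i in ("Fuel:87" "Non-Fuel:Candy") do (
    set "ln=" & set /p "ln="
    if not defined ln set "problem=fewer data lines than monikers"
    if defined ln for /f "tokens=1,2" %%j in ("!ln:~12!") do (
      if "%%j" neq "0" (
        echo %%~i
        echo %%j
        echo %%k
      )
    )
  )
  rem The line after the last moniker should be the blank separator.
  set "ln=" & set /p "ln="
  if defined ln set "problem=more data lines than monikers"
)
if defined problem echo Warning: %problem%
```

With the sample above, the third data line is still waiting after both monikers are used, so the script should report that there are more data lines than monikers. This only checks the count; a moniker swapped with its neighbour would still go unnoticed.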

Aacini
Expert
Posts: 1914
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Massaging Textual Data

#14 Post by Aacini » 31 Oct 2015 13:47

Samir wrote:An array was the other data structure I originally considered for storing the product/moniker table and then somehow retrieving entries from it. But updating the table if anything changes is still a chore--and it's a chore no matter what format we store the table in, since it would have to be updated either way. That's why I chose to just hard-code it.

(Anytime a product would have to be updated, so would the moniker. In hard-coding, just the moniker would have to be updated. This also gives some flexibility as a small product change like Ice to Ice Bags would not require that the script be updated so long as it still corresponded with the same moniker.)

I am afraid I don't understand your point. If the table needs to be updated it requires just one change: if it is in the program, change the program; if the table is in a data file, just change the data file. For example, this could be the table:

Code: Select all

Unl-Self=Fuel:87
Ul-Plus-Self=Fuel:89
Supreme-Self=Fuel:93
Diesel-Self=Fuel:Diesel
Candy=Non-Fuel:Candy
Coke Product=Non-Fuel:Coke
etc...

... and this table could be loaded into "moniker" array this way:

Code: Select all

for /F "tokens=1,2 delims==" %%b in (monikerTable.txt) do (
   set "moniker[%%b]=%%c"
)

This method has the advantage that you may use the table for other things; for example, you may sort it alphabetically, or add prices to it, etc...

Antonio
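
One thing to watch with an array lookup like this is a product that has no entry in the table: echoing an empty variable prints "ECHO is off." instead of a moniker. A minimal sketch of a lookup with a fallback follows. The two table entries are copied from the table above; the third product name and the "Non-Fuel:" fallback prefix are assumptions made for the example.

```batch
@echo off & setlocal EnableDelayedExpansion
rem Sketch only: two entries copied from the table in this thread.
set "moniker[Unl-Self]=Fuel:87"
set "moniker[Coke Product]=Non-Fuel:Coke"

rem "Beverage" has no entry, so it exercises the fallback.
for %%p in ("Unl-Self" "Coke Product" "Beverage") do (
  rem An undefined array element expands to nothing, leaving name undefined.
  set "name=!moniker[%%~p]!"
  rem Assumed fallback: prefix the original product name.
  if not defined name set "name=Non-Fuel:%%~p"
  echo !name!
)
```

Copying the element into a plain variable first avoids testing a variable name that contains spaces directly with "if defined".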

Samir
Posts: 384
Joined: 16 Jul 2013 12:00
Location: HSV

Re: Massaging Textual Data

#15 Post by Samir » 31 Oct 2015 14:19

aGerman wrote:
one odd thing

... is that General Merc is missing in the monikers :wink:
That's exactly why I told you that it will be hard to become aware of those faults.
Yeah, I just found the same thing. :lol:

It's never easy to set up something like this, and can be difficult when things change, but the idea is to set this in stone and run with it. The product descriptions haven't changed for over a year so it should be good.

I'm going to continue testing. 8)
