Hello,
I need your help to extract the rows that are missing their match. This is the problem:
In a large ordered list of records, the records normally come in pairs of variable length, as in the example below. The second record of a pair must have the same name as the first (xx_xx_xx), followed by an underscore and a suffix that always starts with X1.
FILEA_FILEB_FILEC.PDF
FILEA_FILEB_FILEC_X1SOME.PDF
I need to extract all records that do not have this match. Here is a test file:
file.txt:
ABCDEFG_RST1058_2021111M.pdf
CDEFGH_CJO0023_2021112M.pdf
CDEFGHO_QBC5638_2021121C.pdf
CDEFGHO_QBC5638_2021121C_X1I1234567.pdf
FGHOP_XYA7662_2022011C.pdf
FGHOP_XYA7662_2022011C_X1I23456785.pdf
EFGHOPQ_CJ21234_2021121CLKJH.pdf
EFGHOPQ_CJ21234_2021121CLKJH_X1I3456789.pdf
EFGHOPQ_CJ21234_2021121M_X1I4567890.pdf
FGHOPXR_CJU3971_2021120M.pdf
Results missing X1 record:
ABCDEFG_RST1058_2021111M.pdf
CDEFGH_CJO0023_2021112M.pdf
FGHOPXR_CJU3971_2021120M.pdf
Result only X1 record
EFGHOPQ_CJ21234_2021121M_X1I4567890.pdf
Thank you very much in advance
Dario
Extract missing record
Re: Extract missing record
I really think you could attempt this one on your own. One solution would be to do the following:
1) Use a FOR /F command to read the text file.
2) Use a second FOR /F command to split the base file name into multiple FOR variable tokens.
3) If token 4 is not blank, test whether tokens 1_2_3.pdf is in the file. If not, echo the file name.
4) If token 4 is blank, test whether tokens 1_2_3_X1*.pdf is in the file. If not, echo the file name.
This is basically a similar concept to your last question. The only difference being you are reading a text file instead of parsing the DIR command.
viewtopic.php?f=3&t=10571&p=67827#p67827
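The four steps above could be sketched roughly as follows. This is only one possible reading of the pseudo code, not the poster's own script: it assumes the list is in file.txt in the current directory, it uses findstr for the "is in the file" tests, and the "Missing X1 record" / "Only X1 record" labels are made up for illustration.

```batch
@echo off
setlocal
rem Step 1: read file.txt line by line
for /F "usebackq delims=" %%f in ("file.txt") do (
  rem Step 2: split the base name, without extension, at each underscore
  for /F "tokens=1-4 delims=_" %%a in ("%%~nf") do (
    if "%%d"=="" (
      rem Step 4: token 4 is blank, so this is a base record. Look for a line starting with its X1 name
      findstr /B /C:"%%a_%%b_%%c_X1" "file.txt" >nul || echo Missing X1 record: %%f
    ) else (
      rem Step 3: token 4 is not blank, so this is an X1 record. Look for the exact base record line
      findstr /X /C:"%%a_%%b_%%c%%~xf" "file.txt" >nul || echo Only X1 record: %%f
    )
  )
)
```

Note that this runs findstr over the whole file once for every line, so it is fine for a small test file but will be very slow on a list with millions of records.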
Re: Extract missing record
Yes, you are right. The problem is that with millions of records it takes too long, so I decided to do a dir of the directory first. I'll try to work on it and post the solution if I manage to solve the problem.
Re: Extract missing record
With that big of a file you are going to see slow processing with the FOR /F command reading the file as well. The entire file is read into memory before it is parsed. Same with the FOR /F parsing the DIR command. The DIR command has to finish before the FOR /F begins parsing the output.
Your BEST bet is to use a basic FOR command to read the directory, then use a FOR /F to split up the file name. So take my original pseudo code and just do a standard FOR command first.
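As a rough illustration of that combination, the sketch below only classifies each file and does not do the matching yet. It assumes the PDF files are in the current directory; the "base record" and "X1 record" labels are just for display.

```batch
@echo off
rem A plain FOR walks the directory one file at a time
for %%f in (*.pdf) do (
  rem FOR /F splits the base name, without extension, at each underscore
  for /F "tokens=1-4 delims=_" %%a in ("%%~nf") do (
    if "%%d"=="" (
      echo base record: %%f
    ) else (
      echo X1 record:   %%f  [suffix %%d]
    )
  )
)
```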
Re: Extract missing record
The solution to this problem has a subtle trick!
I think this is the fastest method to solve this problem:
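@echo off
setlocal EnableDelayedExpansion

echo Results missing X1 record:
rem "last" holds the name of a base record that is still waiting for its X1 partner
set "last="
(for %%f in (*.pdf) do (
   rem Split the name at underscores and dots, so token 4 is pdf for a base record
   for /F "tokens=1-4 delims=_." %%a in ("%%f") do (
      if not defined last (
         if "%%d" equ "pdf" (
            rem Base record: remember it until the next name is seen
            set "last=%%~Nf"
         ) else (
            rem X1 record with no base record before it
            >&2 echo %%f
         )
      ) else (
         if "%%a_%%b_%%c" equ "!last!" (
            rem This is the X1 partner of the pending base record
            set "last="
         ) else (
            rem The pending base record has no X1 partner
            echo !last!.pdf
            if "%%d" equ "pdf" (
               set "last=%%~Nf"
            ) else (
               >&2 echo %%f
               set "last="
            )
         )
      )
   )
)) 2> onlyX1.txt
rem The last name in the list may be a base record that is still pending
if defined last echo %last%.pdf

echo/
echo Result only X1 record:
type onlyX1.txt
del onlyX1.txt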
Output:
Results missing X1 record:
ABCDEFG_RST1058_2021111M.pdf
CDEFGH_CJO0023_2021112M.pdf
FGHOPXR_CJU3971_2021120M.pdf
Result only X1 record:
EFGHOPQ_CJ21234_2021121M_X1I4567890.pdf
Antonio
Re: Extract missing record
Thank you Antonio, it works fine. The problem is that the PDF directory is on a slow Windows share, and it takes a long time to read 1 million PDFs, so I prefer to run dir /b > file.txt first and then work on file.txt.
Re: Extract missing record
As I said in my previous post, a base FOR command should be faster than reading a million line file into memory. The FOR command works on one file at a time. The FOR /F has to read the entire file into memory before it can begin working on it.
Regardless of whether you understand the code, you should easily be able to change Antonio's code to read the file. You would only have to change one line of code.
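The post does not say which line is meant. My assumption is that it is the outer FOR, replaced by a FOR /F that reads file.txt. Under that assumption the adapted script might look like the sketch below. It expects file.txt to exist already (for example, created with dir /b *.pdf > file.txt) and to keep each base record directly before its X1 partner, as in the test file.

```batch
@echo off
setlocal EnableDelayedExpansion

echo Results missing X1 record:
set "last="
rem Assumed one-line change: read the names from file.txt instead of the directory
(for /F "usebackq delims=" %%f in ("file.txt") do (
   for /F "tokens=1-4 delims=_." %%a in ("%%f") do (
      if not defined last (
         if "%%d" equ "pdf" (
            set "last=%%~Nf"
         ) else (
            >&2 echo %%f
         )
      ) else (
         if "%%a_%%b_%%c" equ "!last!" (
            set "last="
         ) else (
            echo !last!.pdf
            if "%%d" equ "pdf" (
               set "last=%%~Nf"
            ) else (
               >&2 echo %%f
               set "last="
            )
         )
      )
   )
)) 2> onlyX1.txt
if defined last echo %last%.pdf

echo/
echo Result only X1 record:
type onlyX1.txt
del onlyX1.txt
```

The %%~Nf modifier only strips the extension from the text of the line, so it still works when the names come from a list and the files themselves are not in the current directory.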