Find files, concat, rename, convert, store in new folder

Posted: 17 Oct 2014 18:02
by dchall8
Hello all. I'm brand new to this forum. I've been around since DOS but abandoned it for Windows once they finally got that working (circa 2007 :) ), so I'm a real batch novice.

Several years ago I enlisted help from a forum like this one to write a batch file. The batch worked like a charm for what it does. Now I need it to do something slightly different. What I start with is a CD containing TIF files which are named for the document, with the page number as the extension. These are scanned documents named, for example, 01424.001, 01424.002, etc., depending on how many pages there are. The batch file goes to the CD, collects all the TIF page files for one document, concatenates the collection into one file inside a prepared destination folder, converts that file to PDF, and stores the new PDF file in the destination folder. Then it moves to the next document until it reaches the end of the files on the CD. When it is finished converting, it deletes the concatenated TIF files.

Now I have changed jobs. In my new situation I still need the original batch file for CD conversions; however, I need something slightly different, too. The people at the new job have been storing all these .001, .002... files in folders for about 10 years. They have a fairly good filing system by numbered volumes, but the volumes are buried inside other volumes. So to get to volume 645 I have to open volume 640-649 first. Then the files I need to convert are inside volume 645. There are almost 800,000 of the TIF files residing inside over 1,000 volume folders.

What I would like to have is a batch file (or something) that will convert this mass of files all at once. Something that will peek into the nest of folders, find the *.001 TIF files inside each folder, run the concatenation and conversion, and then store the new PDF file inside a new folder with the name of the original folder as part of the new name for the PDF. So the new PDF files that came from folder named 645 will be named something like "Volume 645 page 1424.pdf" and they will be located inside a folder named "645 Converted." And it would be incredible if the conversion routine would increment from folder to folder to folder running the conversions. This will be a one-time conversion, unless I change jobs again.

Here is the batch file I'm using now. The subroutines called tiffcp and tiff2pdf, along with other support routines, reside inside the c:\bin folder.
----------------------

@echo off & setlocal EnableExtensions ENableDelayedExpansion
set oldpath=%PATH%
set PATH=%PATH%;c:\bin;

:: Source location for the original files (SRC) on CD

set SRC=d:\



:: Destination location for the PDF files (DST)
set DST=c:\Destination\



:: Commands are only echoed until %DBG% is set to nothing
set DBG=ECHO/

::set "DBG="
pushd %SRC%
for /F "tokens=*" %%A in ('dir /B/A-D/ONE "*.001"') do (
set "PG="
for /F "tokens=*" %%B in (
'dir /B/A-D/ONE "%%~nA.0*"') do set PG=!PG! %%~nxB

::Concatenate the files into the Destination folder with a .TIF extension
tiffcp -c lzw !PG! %DST%%%~nA.TIF

::Convert the TIF file to a PDF
tiff2pdf -o %DST%%%~nA.PDF %DST%%%~nA.TIF

::Delete the TIF files
DEL %DST%%%~nA.TIF
)
POPD

::Reset the path to the original path
set PATH=%oldpath%

------------------------
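
A side note on the DBG variable: the script sets it but never uses it. A minimal sketch of how such a dry-run prefix is usually wired up, assuming the intent was to put %DBG% in front of each command that changes something. The temp file name here is made up for the illustration.

```batch
@echo off
setlocal

rem Dry-run switch: while DBG holds "echo", prefixed commands are only printed.
set "DBG=echo"
rem To run the commands for real, clear the prefix instead:
rem set "DBG="

rem Hypothetical temp file, created only for this illustration.
set "TMPTIF=%TEMP%\dbg_demo_01424.TIF"
type nul > "%TMPTIF%"

rem With DBG=echo this prints the DEL command instead of deleting the file.
%DBG% del "%TMPTIF%"

if exist "%TMPTIF%" (set "RESULT=kept") else (set "RESULT=deleted")
echo Result: the file was %RESULT%.

rem Clean up the demo file either way.
del "%TMPTIF%" 2>nul
```

With the prefix set, the DEL line is only displayed and the file survives until the cleanup at the end. With `set "DBG="` the same line would delete it.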

Re: Find files, concat, rename, convert, store in new folder

Posted: 17 Oct 2014 18:36
by foxidrive
dchall8 wrote:What I would like to have is a batch file (or something) that will convert this mass of files all at once. Something that will peek into the nest of folders, find the *.001 TIF files inside each folder, run the concatenation and conversion, and then store the new PDF file inside a new folder with the name of the original folder as part of the new name for the PDF. So the new PDF files that came from folder named 645 will be named something like "Volume 645 page 1424.pdf" and they will be located inside a folder named "645 Converted."


It should be doable but details are needed.

will be named something like "Volume 645 page 1424.pdf"


What is the exact naming scheme, and where will the temp files be processed (the %temp% folder?)
Does any set of files have more than 99 files? The script only matches extensions up to .099.

This also assumes every extension has three numerals.
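
To illustrate the point about .099: the original script collects pages with the pattern "name.0*", which skips any page whose extension starts with 1 or higher. A small sketch, assuming a scratch folder under %TEMP% and three empty stand-in files, compares that pattern with "name.*":

```batch
@echo off
setlocal

rem Scratch folder with three empty stand-in page files (names are illustrative).
set "WORK=%TEMP%\wildcard_demo_%RANDOM%"
md "%WORK%"
pushd "%WORK%"
for %%E in (001 099 100) do type nul > "01424.%%E"

rem Count the matches for each pattern.
set "NARROW=0"
set "WIDE=0"
for /f "delims=" %%F in ('dir /b /a-d "01424.0*"') do set /a NARROW+=1
for /f "delims=" %%F in ('dir /b /a-d "01424.*"') do set /a WIDE+=1

echo Pattern "01424.0*" matched %NARROW% file(s); pattern "01424.*" matched %WIDE% file(s).

popd
rd /s /q "%WORK%"
```

The narrow pattern should miss 01424.100, so a document with 100 or more pages would lose its tail. The wide pattern picks up every page, but it also picks up anything else that shares the base name, such as a leftover 01424.TIF.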

"645 Converted."


Where are these folders being created? Do you need to reproduce the original folder structure or do you want the new folders to all be created in a single folder?

Re: Find files, concat, rename, convert, store in new folder

Posted: 18 Oct 2014 00:00
by dchall8
foxidrive wrote:
dchall8 wrote:What I would like to have is a batch file (or something) that will convert this mass of files all at once. Something that will peek into the nest of folders, find the *.001 TIF files inside each folder, run the concatenation and conversion, and then store the new PDF file inside a new folder with the name of the original folder as part of the new name for the PDF. So the new PDF files that came from folder named 645 will be named something like "Volume 645 page 1424.pdf" and they will be located inside a folder named "645 Converted."


It should be doable but details are needed.

will be named something like "Volume 645 page 1424.pdf"


What is the exact naming scheme, where will the temp files be processed (%temp% folder?)


The scheme would be Volume [name of original volume] page [name of original file].pdf. Underscores could be used to fill the gaps if necessary.
Where will the files be processed? Does it make a difference? It must have because the original writer chose to process them in the destination folder.

and does any set of files have more than 99 files, because the script is only using filenames up to .099

This also assumes every extension has three numerals.

Excellent catch! I never noticed that before. Every extension only has three numerals. The average file size is 4 pages. I have seen many files over 100 pages, but not in the counties where I am working. I once saw a file just over 1,000 pages but only once, and that was for a **censored** on several thousand properties pooled together. I've seen thousands of these files and only 1 over 1,000 pages. The county I'm working in now would not have a reason for such a document.
(why would the word m o r t g a g e be censored??)
"645 Converted."

Where are these folders being created? Do you need to reproduce the original folder structure or do you want the new folders to all be created in a single folder?


For now they are going to be created on my hard drive. I usually use the C root to make it easier. I will move them to a server for general access once they are qual checked. I suppose it is not critical to recreate the original folder structure, but the naming convention already in use makes it easy to find these files. Why? What do you have in mind? Maybe I'm missing something that could simplify this.

Re: Find files, concat, rename, convert, store in new folder

Posted: 18 Oct 2014 04:23
by foxidrive
dchall8 wrote:will be named something like "Volume 645 page 1424.pdf"

The scheme would be Volume [name of original volume] page [name of original file].pdf. Underscores could be used to fill the gaps if necessary.
Where will the files be processed? Does it make a difference?

What is this volume that you are referring to? The volume label of the hard drive?

The location of processing isn't really important but you mentioned that the original script used a specific location.
Every extension only has three numerals. I have seen many files over 100 pages, but not in the counties where I am working.
(why would the word m o r t g a g e be censored??)


Ok. Dunno why that word is censored - maybe it's another word when it's not English.

For now they are going to be created on my hard drive. I usually use the C root to make it easier. I will move them to a server for general access once they are qual checked. I suppose it is not critical to recreate the original folder structure, but the naming convention already in use makes it easy to find these files. Why? What do you have in mind? Maybe I'm missing something that could simplify this.


I'm asking the questions so I know how to re-write the script, hopefully without the need to change it multiple times because some details are unclear.

The PDF files can all go into a single folder in c:\ so that makes it a simpler script - just the source of the volume name is unclear.

Oh, also - the files are stored on a hard drive and not a cdrom, correct? Is there enough free space to do the temporary work on the same hard drive?

Re: Find files, concat, rename, convert, store in new folder

Posted: 18 Oct 2014 22:27
by dchall8
foxidrive wrote:
dchall8 wrote:will be named something like "Volume 645 page 1424.pdf"

The scheme would be Volume [name of original volume] page [name of original file].pdf. Underscores could be used to fill the gaps if necessary.
Where will the files be processed? Does it make a difference?

What is this volume that you are referring to? The volume label of the hard drive?


The images are pages in a collection of books. Each book is referred to simply as volume 1, volume 2, etc. What I need to do is be able to find the PDF image for book volume 645 page 593 (for example). That particular page will be the first page of a multi-page document. For example, they would be located in a folder named 645. If the document was 4 pages long, the image files would have the names 00593.001, 00593.002, 00593.003, and 00593.004. These images would represent pages 593, 594, 595, and 596 in the actual book. The next document would begin at volume 645 page 597.
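
In batch terms, both pieces of the new name can be read straight off the path of the first page file. A minimal sketch, assuming a made-up path (C:\Deeds\640-649\645\00593.001) that does not need to exist for the FOR modifiers to work:

```batch
@echo off
setlocal

rem Hypothetical full path to the first page of a document.
set "FIRSTPAGE=C:\Deeds\640-649\645\00593.001"

rem The ~n modifier gives the file name without its extension: the starting page.
rem The ~dp modifier gives the folder path ending in a backslash. Adding "."
rem to it lets ~nx on the inner loop variable return the last folder name.
for %%F in ("%FIRSTPAGE%") do (
    set "PAGE=%%~nF"
    for %%D in ("%%~dpF.") do set "VOL=%%~nxD"
)

set "OUTNAME=Volume %VOL% page %PAGE%.pdf"
echo %OUTNAME%
```

For the sample path this should print "Volume 645 page 00593.pdf". The leading zeros come from the file name and are kept as they are.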

For now they are going to be created on my hard drive. I usually use the C root to make it easier. I will move them to a server for general access once they are qual checked. I suppose it is not critical to recreate the original folder structure, but the naming convention already in use makes it easy to find these files. Why? What do you have in mind? Maybe I'm missing something that could simplify this.


I'm asking the questions so I know how to re-write the script, hopefully without the need to change it multiple times because some details are unclear.


Sure. Please ask the questions. I thought maybe you had an insight that I was missing.

The PDF files can all go into a single folder in c:\ so that makes it a simpler script - just the source of the volume name is unclear.

Oh, also - the files are stored on a hard drive and not a cdrom, correct? Is there enough free space to do the temporary work on the same hard drive?


The source files we're talking about on this project are on my C drive. It's a 1TB drive with about 600GB free as of now. There should be plenty of room left even after the conversion. If this is an issue, I can move all but about 200GB off my HD.

The folders are named as follows. Folder 120, 130, 140, 150, etc. up to about 700 or so. Inside folder 120 would be 10 folders named 120, 121, 122, 123, 124, 125, 126, 127, 128, and 129. All the page images from book volume 121 would be inside the folder named 121. Each book has over 900 pages but darned few have more than 1,000 pages.

These documents all have a dual numbering system. They have the volume and page numbers and a document number. If the originator of the documents had used the document number consistently when scanning the images, I would be taking a different approach. Within the past few years they have been consistent using the doc number. I have converted 25,000 images into about 5,000 PDF files in one huge folder. I have a cross reference database so I can convert the PDF file names from the document number into a Volume/page format. We humans seem to read volume and page titles more easily than a 7-digit document number. Once the files are renamed (using Bulk File Renamer), they can be put into folders with numbers (700, 701, 702, etc.). For the folders numbered 120 through 699 the image files were not named consistently. Sometimes the images are named with the page number and sometimes with the document number. If I were to dump all the files into one folder for processing, I would retain the page numbers but lose the book volume name information. Actually I would not be able to do that because you can't have files with the same name inside a folder. But as long as the page number-named documents come from, say, folder named 133 and go into a new folder named 133 Converted, then I can find them and read them as PDFs.

I'm glad you think this is doable. I think it is, too. The concept is simple but I'm not familiar enough with the batch coding to handle the file structure. The conversion part is written, perhaps with a few slight tweaks as you noted earlier. I believe this new process can use that conversion routine. This new one should
1. Enter the file structure for example 120>120 and create a new folder named 120 Converted
2. Retrieve all the files named 00001.0XX and collect them.
3. Concatenate and convert the tif files into a PDF and leave it inside the new folder (120 Converted)
4. Find the remaining documents inside 120>120 and repeat steps 2 and 3 until all the documents are converted.
5. Delete any temp files.
6. Then move to folder 120>121 and repeat steps 1-5.
7. Repeat steps 1-6 up to 120>129
8. Then move to folder 130 and repeat steps 1-7
9. Then repeat all those steps for the folders up to 699.

If I was writing this in FORTRAN IV (my last programming class in 1973), that's how I would do it.
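
For what it's worth, the loop structure in that list maps fairly directly onto nested FOR /D loops in batch. A rough sketch, assuming a small scratch tree under %TEMP% in place of the real archive. It only prints what it would convert and does not call tiffcp or tiff2pdf.

```batch
@echo off
setlocal

rem Scratch tree standing in for the real archive (all names are illustrative).
set "SRC=%TEMP%\walk_demo_%RANDOM%"
set "DST=%SRC%_out"
md "%SRC%\120-129\120"
md "%SRC%\120-129\121"
md "%SRC%\130-139\130"
type nul > "%SRC%\120-129\120\00001.001"
type nul > "%SRC%\120-129\120\00001.002"
type nul > "%SRC%\120-129\121\00007.001"

rem Outer loop: range folders. Inner loop: one folder per book volume.
set "DOCS=0"
for /d %%R in ("%SRC%\*") do for /d %%V in ("%%R\*") do (
    if exist "%%V\*.001" (
        md "%DST%\%%~nxV Converted" 2>nul
        for %%A in ("%%V\*.001") do (
            set /a DOCS+=1
            echo Would convert %%~nxV\%%~nA.* into "%%~nxV Converted\Volume %%~nxV page %%~nA.pdf"
        )
    )
)
echo Found %DOCS% document(s).

rem Remove the scratch folders again.
rd /s /q "%SRC%"
rd /s /q "%DST%"
```

Each *.001 file marks the start of one document, so counting them counts documents. Volume folders without any .001 file, like the empty 130 here, are skipped and get no "Converted" folder.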

Re: Find files, concat, rename, convert, store in new folder

Posted: 19 Oct 2014 04:26
by foxidrive
What is this volume that you are referring to? The volume label of the hard drive?


The images are pages in a collection of books. Each book is referred to simply as volume 1, volume 2, etc.


Where do you see this volume name?

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 07:34
by dchall8
I see the volume names in the folder hierarchy. Here are three screen shots drilling down into the folders. Maybe this will help.

This first one shows how they collected the folders for the volumes numbered 150 through 539. There are more numbered just like that as you scroll down.
Image


This next one shows what it looks like inside the folder named 150-159. The folders inside are numbered by the book volumes 150 through 159.
Image


This one shows the individual document pages. These are tif files and will open with any image viewer if you change the extension to .tif.
Image

That last shot is unusual in that the document name/number, 3666, has so many pages. It is 41 pages long, so the next document, 3667 begins just below the bottom of that particular window. It is also unusual in that there are more than 1,000 pages in that volume.

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 07:53
by Squashman
So the volumes are just the folder names. Could have just said that in your first post. It wasn't very clear.

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 10:38
by ShadowThief
dchall8 wrote:So to get to volume 645 I have to open volume 640-649 first. Then the files I need to convert are inside volume 645.

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 15:07
by dchall8
A picture is worth 1,000 words. :D

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 15:19
by foxidrive
ShadowThief wrote:
dchall8 wrote:So to get to volume 645 I have to open volume 640-649 first. Then the files I need to convert are inside volume 645.


The problem for me was that he used the term folder and then used a different term volume, and continued talking about volumes.

The people at the new job have been storing all these .001, .002... files in folders for about 10 years. They have a fairly good filing system by numbered volumes

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 15:31
by ShadowThief
I saw "scanned documents organized into volumes" and my brain immediately went to comic books, which are also organized in almost the exact same way that's described here, so I happened to understand the problem purely by accident.

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 15:49
by foxidrive
Test this - change the SRC variable to a test location with a few folders and it should pause after creating each PDF for you to check.
If it creates a couple OK then they should all work and you can remove the pause.

One possible problem: if a single set of page number files has more than around 900 files, the TIF and PDF creation is likely to fail for that set, because the whole list of filenames has to fit in one variable and on one command line, and the maximum length is 8191 characters.
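
A back-of-the-envelope check of that ceiling, assuming five-digit names such as 00593.001 as described earlier in the thread. Each name is added to the list with a leading space, so this sketch measures one entry and divides 8191 by its length:

```batch
@echo off
setlocal EnableDelayedExpansion

rem One entry as it is appended to the PG list: a space plus "name.ext".
set "ENTRY= 00593.001"

rem Count the characters in ENTRY one at a time.
set "LEN=0"
:count
if not "!ENTRY:~%LEN%,1!"=="" (
    set /a LEN+=1
    goto count
)

rem 8191 is the length limit mentioned above.
set /a MAXPAGES=8191 / LEN
echo Each entry is %LEN% characters, so roughly %MAXPAGES% entries fit in one list.
```

With 10 characters per entry this comes to 819 pages. The real ceiling is a little lower, since the tiffcp command and the output file name share the same line. Shorter base names leave room for a few more pages.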

EDIT: There were a few edits - the page number you've shown in the image is 4 digits long and this uses 4 digits for each page number.
If the page numbers exceed 9999 then it will be an issue. Should the page number be the full 8 digits long?
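
Batch substring syntax has two forms that are easy to mix up when trimming a name down to four digits: ~4 drops the first four characters, while ~-4 keeps the last four. A short sketch on two sample names (the eight-digit one is hypothetical):

```batch
@echo off
setlocal

rem Five-digit base name as in 00593.001; the eight-digit one is hypothetical.
set "SHORT=00593"
set "LONG=00003666"

rem ~4 skips the first four characters and keeps whatever is left.
set "SKIP4=%SHORT:~4%"
set "LONGSKIP4=%LONG:~4%"

rem ~-4 keeps the last four characters regardless of the name's length.
set "LAST4=%SHORT:~-4%"
set "LONGLAST4=%LONG:~-4%"

echo %SHORT% with ~4 gives "%SKIP4%", with ~-4 gives "%LAST4%"
echo %LONG% with ~4 gives "%LONGSKIP4%", with ~-4 gives "%LONGLAST4%"
```

The two forms agree only when the name is exactly eight characters long. On a five-digit name, ~4 leaves a single digit ("3") while ~-4 gives "0593".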

Code:

@echo off & setlocal EnableExtensions EnableDelayedExpansion
set "oldpath=%PATH%"
set "PATH=%PATH%;c:\bin;"

:: Source location for the original files
set "SRC=%userprofile%\documents\Orginal Deeds"

:: Destination location for the PDF files (DST)
set "DST=c:\Destination"
md "%dst%" 2>nul

:: Commands are only echoed until %DBG% is set to nothing - this is broken
:: (DBG is never used below, so it has no effect)
set DBG=ECHO/
::set "DBG="

pushd "%SRC%"

:: Visit every folder below SRC; the folder name (%%~nxz) is the volume number
for /f "delims=" %%z in ('dir /b /s /ad') do (
   pushd "%%z"

   if exist *.001 (
      for /F "delims=" %%A in ('dir /B/A-D/ONE "*.001"') do (
         rem Page number: the last 4 characters of the file name
         set "pagenum=%%~nA"
         set "pagenum=!pagenum:~-4!"

         rem Collect every page of this document, sorted by name and extension
         set "PG="
         for /F "delims=" %%B in ('dir /B/A-D/ONE "%%~nA.*"') do set PG=!PG! %%~nxB

         rem Concatenate the pages into a temporary TIF in the current folder
         tiffcp -c lzw !PG! %%~nA.TIF

         rem Convert the TIF file to a PDF in the destination folder
         tiff2pdf -o "%dst%\Volume %%~nxz page !pagenum!.pdf" %%~nA.TIF

         rem Delete the temporary TIF file
         DEL %%~nA.TIF

         rem Pause after each PDF so it can be checked; remove once it works
         pause
      )
   )
   popd
)
POPD

:: Reset the path to the original path
set "PATH=%oldpath%"


Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 15:50
by Squashman
Well the old saying is three heads are better than one. Unless we are talking Greek mythology, in which case it would be bad.

Re: Find files, concat, rename, convert, store in new folder

Posted: 20 Oct 2014 15:57
by foxidrive
Squashman wrote:Well the old saying is three heads are better than one. Unless we are talking Greek mythology which in that case it would be bad.


hehe

ShadowThief wrote:I saw "scanned documents organized into volumes" and my brain immediately went to comic books, which are also organized in almost the exact same way that's described here, so I happened to understand the problem purely by accident.


I learned to read with comics. ;) It was a great way to get kids to read.