
Re: Surprising DIR results when using wildcard

Posted: 28 Aug 2012 07:06
by WimSKW
[mod edit] Split Topic Surprising DIR results when using wildcard

Now that we know where this behaviour comes from, is there a way to work around it?
I have a folder with "*.XML" and "*.XML_Bak" files. I wanted to loop over all "*.XML" files like this:

Code:

for %a in ("*.XML") DO ECHO %a
but it processed the ".XML_Bak" files also. Sure enough, the short 8.3 filename of the "*.XML_Bak" files also ended in ".XML".
How can I restrict the for loop to only treat the "*.XML" files, preferably without adding an extra test like IF /I {%~xa}=={.XML}?
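One way to confirm which short names are responsible is DIR's /X switch, which adds a column showing the 8.3 alias of each file. A minimal sketch, assuming it is run from inside the folder in question:

```bat
rem List everything the wildcard matches, with the 8.3 short name
rem shown in an extra column before each long name.
dir /x *.xml
```

Any "*.XML_Bak" file that shows up should have an alias whose extension is truncated to ".XML", which is what the wildcard ends up matching.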

Thanks.
-=Wim=-

Re: Surprising DIR results when using wildcard

Posted: 28 Aug 2012 12:09
by dbenham
Use FINDSTR regex to filter the results

Code:

for /f "eol=: delims=" %F in ('dir /b /a-d *.xml ^| findstr /rix ".*\.xml"') do @echo %F


Dave Benham

Re: Surprising DIR results when using wildcard

Posted: 28 Aug 2012 18:36
by Liviu
WimSKW wrote:I have a folder with "*.XML" and "*.XML_Bak" files. [...] How can I restrict the for loop to only treat the "*.XML" files, preferably without adding an extra test like IF /I {%~xa}=={.XML}?

Actually, adding that extra IF would be my recommended solution.

Code:

for %F in (*.xml) do @if /i "%~xF"==".xml" echo "%~F"

Dave's alternative would work in almost all cases but, like any code using pipes (dir | findstr), it could fail on filenames containing characters outside the current codepage, while the plain "for" loop handles those fine.

Couple more semi-related notes:

- it is possible to disable short 8.3 names creation on NTFS drives you have control over (http://support.microsoft.com/kb/121007); however, AFAIK this does not retroactively remove already existing 8.3 aliases;

- if you have a file named just ".xml" then by the formal rules, that's a name of ".xml" and no extension; however, by some historical accident I guess, both "dir *." and "dir *.xml" would list that file as matching, and the "for" loop wrongly parses it as an empty name with ".xml" extension; if that's a concern in your use case, you'd also need to check for a non-empty %~nF.
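The last note can be sketched as a second test chained onto the extension check. This is only an illustration for the interactive prompt; inside a batch file each % would need to be doubled.

```bat
rem Keep only files whose extension is exactly .xml (case-insensitive)
rem and whose base name is not empty, which skips a file named just ".xml".
for %F in (*.xml) do @if /i "%~xF"==".xml" if not "%~nF"=="" echo "%~F"
```

Chaining the two IFs without parentheses keeps it a one-liner; ECHO runs only when both conditions hold.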

Liviu

Re: Surprising DIR results when using wildcard

Posted: 28 Aug 2012 18:55
by Ed Dyreen
Liviu wrote:by some historical accident I guess, both "dir *." and "dir *.xml" would list that file as matching
I thought the asterisk here meant
* Repeat: zero or more instances of any character
So both "dir *." and "dir *.xml" should list ".xml" as matching.

Re: Surprising DIR results when using wildcard

Posted: 28 Aug 2012 20:46
by Liviu
Ed Dyreen wrote:
Liviu wrote:by some historical accident I guess, both "dir *." and "dir *.xml" would list that file as matching
I thought the asterisk here meant
* Repeat: zero or more instances of any character
So both "dir *." and "dir *.xml" should list ".xml" as matching.

With that interpretation, the first "dir *." shouldn't match since the "*." pattern ends with a "." while ".xml" does not. With any interpretation (that I can think of) one of the two patterns should _not_ match since the "." dot is an extension separator, and "xml" could conceivably be considered part of the name, or the extension, but not _both_.

The official wildcarding rules no longer seem to be easy to find nowadays, since MS retired a lot of their legacy (in their words "obsolete") online documentation. My recollection is that a filename of ".xml" was supposed to be parsed as a name of ".xml" and no extension (as opposed to an empty name with ".xml" extension). This is indirectly supported by http://msdn.microsoft.com/en-us/library/aa365247.aspx
Naming Files, Paths, and Namespaces wrote:All file systems follow the same general naming conventions for an individual file: a base file name and an optional extension, separated by a period.
which implies that the file name cannot be empty, and that it's the extension which is optional. However, like many things in Windows, this is not followed consistently. For example, running "start .txt" at the command prompt will attempt to open the file ".txt" using the program associated with files of ".txt" extension.

Liviu

Re: Surprising DIR results when using wildcard

Posted: 29 Aug 2012 05:47
by WimSKW
I created a few test files and tried it...

Code:

D:\>dir d:\test\*.xml
 Volume in drive D is Data
 Volume Serial Number is C6F9-6F22

 Directory of d:\test

29/08/2012  09:45                68 FILE_20120802_yoab_001.xml_bak
29/08/2012  09:46                75 FILE_20120801_apgh_005.Xml_BAK
29/08/2012  09:46                75 FILE_20120801_apgh_005.Xml
29/08/2012  09:45                43 FILE_20120801_stsi_001.XML
29/08/2012  09:45                68 FILE_20120802_yoab_001.xml
29/08/2012  09:47               170 FILE_20120810_cvia_167.Xml
               6 File(s)            499 bytes
               0 Dir(s)   2.169.847.808 bytes free

D:\>
...but it doesn't seem to work. The files with "xml_bak" extension still show.


dbenham wrote:Use FINDSTR regex to filter the results

Code:

for /f "eol=: delims=" %F in ('dir /b /a-d *.xml ^| findstr /rix ".*\.xml"') do @echo %F
Dave's solution *does* work. Here's the output:

Code:

D:\test>for /f "eol=: delims=" %F in ('dir /b /a-d *.xml ^| findstr /rix ".*\.xml"') do @echo %F
FILE_20120801_apgh_005.Xml
FILE_20120801_stsi_001.XML
FILE_20120802_yoab_001.xml
FILE_20120810_cvia_167.Xml

D:\test>
However, I want to avoid the "dir" command because the directory contains almost 100K files and new files are continuously being added. I have a feeling that "dir" is going to be slow in this case. I will definitely keep the FINDSTR regex solution in the back of my mind. Maybe I can let the FOR loop (wrongly) spit out the "*.xml_bak" files and do the check afterwards in the processing part.


Liviu wrote:Actually, adding that extra IF would be my recommended solution.
I've started playing with the extra IF and it works. But as you mentioned, not in a general way. In fact in my script I created a variable (SET strFileFilter=*.XML) which I use in the FOR loop. This variable will need to change when I execute the script on another directory. Take "*.docx" for instance. The original problem would never occur here, so the extra IF is not necessary in this case. Mmm... creating generic scripts is tough :-)
(About the ".XML" file, that was a typo. I was referring to files with a ".XML" extension.)


While trying to figure this all out, I noticed that when I'm a little more specific in the strFileFilter variable it works. It seems that if you can construct the filter in such a way that it probably never matches any 8.3 filename, the problem does not appear.
A filter of "FIL*.XML" apparently is enough. Being curious about why this is the case, I found the explanation in the Overview part of http://en.wikipedia.org/wiki/8.3_filename.
My lesson learned: when filtering on files with an extension of 3 characters or less, chances are that you will run into conflicts with the 8.3 filenames, so be as specific as possible in the filename part. If this is not possible, add an extra check such as an IF or a FINDSTR regex.

Thanks guys!
-=Wim=-

EDIT: It seems that a filter of "FIL*.XML" is not enough, at least not for the first 4 files. As mentioned in the above wikipedia article, the first 4 files have a short filename that contains the first 6 characters from the long filename, then a ~ and a digit, followed by a period and the first 3 characters of the extension. The filter should at least contain the first 7 characters of the long filename to avoid conflicts with the short filename. In this case I could use "FILE_????????_*.xml". Annoying problem.... :-(

Re: Surprising DIR results when using wildcard

Posted: 29 Aug 2012 10:02
by Liviu
WimSKW wrote:Dave's solution *does* work.

It does work, with the caveat that characters outside the current codepage may trip it. For example (to see the characters as copied below set the command prompt font to something like Lucida Console):

Code:

C:\tmp>chcp
Active code page: 437

C:\tmp>dir /b *.xml
©©©.xml
‹€›.xml

C:\tmp>for %F in (*.xml) do @if /i "%~xF"==".xml" echo "%~F"
"©©©.xml"
"‹€›.xml"

C:\tmp>for /f "eol=: delims=" %F in ('dir /b /a-d *.xml ^| findstr /rix ".*\.xml"') do @echo %F
ccc.xml
<?>.xml

C:\tmp>
Note that piping through findstr returns the wrong name for the first file, and a name both wrong and illegal for the second.

WimSKW wrote:I've started playing with the extra IF and it works. But as you mentioned, not in a general way. In fact in my script I created a variable (SET strFileFilter=*.XML) which I use in the FOR loop. This variable will need to change when I execute the script on another directory. Take "*.docx" for instance. The original problem would never occur here, so the extra IF is not necessary in this case. Mmm... creating generic scripts is tough :-)

That's correct. The only recourse would be to write two loops, one plain and one with the extra IF, then check whether the extension is longer than 3 characters and decide which loop to use. Of course, both loops could call the same :subroutine to process each file, so the duplicated code would be small.
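A rough sketch of that two-loop idea as a batch file follows. It assumes the extension is kept in its own variable (strExt, a hypothetical name) so that its length can be tested, since pulling the extension back out of a wildcard filter is awkward. strFileFilter is the variable from the quoted post, and :process is a placeholder subroutine.

```bat
@echo off
setlocal
rem Hypothetical setup: the extension is stored separately, dot included.
set "strExt=.xml"
set "strFileFilter=*%strExt%"

rem %strExt:~4% is empty when the extension, dot included, is at most
rem 4 characters long, i.e. short enough to collide with 8.3 aliases.
if "%strExt:~4%"=="" (
    rem Short extension: verify the real extension of every match.
    for %%F in (%strFileFilter%) do if /i "%%~xF"=="%strExt%" call :process "%%~F"
) else (
    rem Longer extension: the plain loop is used as-is.
    for %%F in (%strFileFilter%) do call :process "%%~F"
)
exit /b

:process
rem Placeholder for the real per-file work; %1 is the quoted file name.
echo Processing %1
exit /b
```

With strExt set to ".docx" the second branch runs and the extra IF is skipped; with ".xml" every match is checked before :process is called.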

Liviu