Randomly selected subset of lines from a text file

d360991
Posts: 8
Joined: 18 Mar 2012 13:12

#1 Post by d360991 » 25 Nov 2012 22:46

Folks,

I've done a lot of research and cannot work out how to write a simple batch file that selects a random subset of the lines in a text file and echoes them to the screen. I am thinking of the following steps:

Code:

For each line in a text file
{
   Generate random number
   If random number < threshold, echo line
}


If I want the subset to be 10%, I will set the threshold to 0.10, assuming that the random numbers DOS generates are between 0 and 1.

Thank you for your help,
-Roger

Aacini
Expert
Posts: 1914
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Randomly selected subset of lines from a text file

#2 Post by Aacini » 25 Nov 2012 22:55

A couple of points need to be clarified. Suppose a file contains 100 lines and you want 20% of them, that is, 20 lines. Does this mean that lines 81 to 100 are an acceptable result, but lines 82 to 100 are not, because that is only 19 lines? If the request is 99%, just two results are possible: lines 1-99 or lines 2-100. Is this correct? If so, the Batch file below solves your problem:

Code:

@echo off
setlocal EnableDelayedExpansion

rem Usage: Get_PercentOfFile_ percent filename
rem Get a percentage of lines from filename

rem Get number of lines in file
for /F %%a in ('find /c /v "" ^< %2') do set numLines=%%a
rem Get the desired percentage of lines
set /A desiredLines=numLines*%1/100
rem Get a random starting position for such subset
set /A skipLines=(numLines-desiredLines+1)*%random%/32768
set skip=
if %skipLines% gtr 0 set skip=skip=%skipLines%
rem Show the subset of lines
for /F "%skip% delims=" %%a in (%2) do (
   echo %%a
   set /A desiredLines-=1
   if !desiredLines! equ 0 goto continue
)
:continue

However, if you want every line in the subset to be randomly selected, then the method must be changed this way:

Code:

@echo off
rem Keep the copyThisLine[] variables local so they do not persist between runs
setlocal

rem Usage: Get_PercentOfFile_ percent filename
rem Get a percentage of lines from filename

rem Get number of lines in file
for /F %%a in ('find /c /v "" ^< %2') do set numLines=%%a
rem Get the desired percentage of lines
set /A desiredLines=numLines*%1/100
rem Fill a "copyThisLine" array with "desiredLines" elements,
rem each indexed by a distinct, randomly generated line number
:nextElem
   set /A randNum=numLines*%random%/32768 + 1
   if defined copyThisLine[%randNum%] goto nextElem
   set copyThisLine[%randNum%]=TRUE
   set /A desiredLines-=1
if %desiredLines% gtr 0 goto nextElem
rem Show the subset of randomly selected lines
for /F "tokens=1* delims=:" %%a in ('findstr /N "^" %2') do (
   rem "echo(" also prints empty lines correctly instead of "ECHO is off."
   if defined copyThisLine[%%a] echo(%%b
)

The previous method should be adjusted to delete lines from a full array, instead of inserting lines into an empty array, if the number of required lines is greater than 50% of the file. Otherwise, as the array fills up, most random picks hit a line that is already selected and must be retried, so the program would take too long...
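
The adjustment just described (start from a full array and delete entries) could be sketched roughly as follows. This is only an untested outline under the same assumptions as the script above: the percentage is in %1, the file name in %2, and the file is small enough that numLines*32767 does not overflow 32-bit arithmetic. The removeLines variable and the :show label are names introduced here for illustration.

```batch
@echo off
setlocal

rem Usage: Get_PercentOfFile_ percent filename
rem Sketch: mark every line, then unmark random lines until only
rem the desired number of lines remains marked.

rem Get number of lines in file
for /F %%a in ('find /c /v "" ^< %2') do set numLines=%%a
rem Number of lines to discard (removeLines is a name assumed for this sketch)
set /A removeLines=numLines - numLines*%1/100
rem Start with a full array: every line is selected
for /L %%i in (1,1,%numLines%) do set copyThisLine[%%i]=TRUE
rem Delete randomly chosen elements from the array
:nextElem
if %removeLines% leq 0 goto show
set /A randNum=numLines*%random%/32768 + 1
rem Retry if this line was already removed
if not defined copyThisLine[%randNum%] goto nextElem
set copyThisLine[%randNum%]=
set /A removeLines-=1
goto nextElem
:show
rem Show the lines that are still marked
for /F "tokens=1* delims=:" %%a in ('findstr /N "^" %2') do (
   if defined copyThisLine[%%a] echo(%%b
)
```

When more than half of the lines are wanted, fewer than half have to be removed, so each random pick is more likely than not to find a line that is still marked and the retry loop stays short.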

I hope it helps.
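
As a side note, the per-line test from the pseudocode in the question can also be written directly. %random% does not return a value between 0 and 1 but an integer between 0 and 32767, so the threshold has to be scaled to that range. The following is a minimal sketch, not a tested solution; the script name Get_RandomLines_ and the threshold variable are assumptions made for this example.

```batch
@echo off
setlocal EnableDelayedExpansion

rem Usage: Get_RandomLines_ percent filename   (hypothetical script name)
rem Echo each line with a probability of about percent/100

rem %random% is an integer from 0 to 32767, so scale the threshold to it
set /A threshold=32768*%1/100

rem findstr /N prefixes every line with "number:", which keeps empty lines
for /F "tokens=1* delims=:" %%a in ('findstr /N "^" %2') do (
   rem !random! gives a new value on every iteration
   rem Caveat: with delayed expansion on, lines containing "!" are altered
   if !random! lss %threshold% echo(%%b
)
```

Note that this method selects about the requested percentage only on average: the exact number of lines shown changes from run to run, unlike the two scripts above, which always show a fixed count.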

Antonio
