Safely parse nearly any CSV with pure batch
Posted: 30 May 2019 23:07
Quite some time ago I developed a hybrid JScript/batch parseCSV.bat utility to assist with parsing CSV files with FOR /F. It could handle any CSV as long as you didn't need to parse more than 32 tokens, and no line approached 8191 bytes.
Much later, Aacini started a thread where routines were developed to easily parse more than 32 columns. But it did nothing to handle delimiters, quotes, or newlines in values, or empty values.
Recently I decided to demonstrate how to combine the two techniques, such that you can safely parse any CSV as long as no line approaches the 8191 limit.
But the above relies on CSCRIPT JScript to prepare each line for parsing by FOR /F. Over at StackOverflow I developed a technique to parse a CSV using pure batch. The key hurdle was how to differentiate quoted delimiter literals from field delimiters. The solution is a variant of a technique developed by jeb that relies on the difference between how phase 2 treats quoted "^,^," vs unquoted ^,^,
The StackOverflow pure batch CSV parser can handle most CSV complications:
test.csv
The code below demonstrates how to parse the CSV above and store each column for each row in a 2 dimensional array. Since it supports delayed expansion, you can do pretty much anything you want with each token within the inner-most loop.
The only things the code cannot handle are lines approaching 8191 bytes, and newlines within values. I don't think it is possible to solve either of these remaining problems with pure batch.
--OUTPUT--
Dave Benham
Much later, Aacini started a thread where routines were developed to easily parse more than 32 columns. But it did nothing to handle delimiters, quotes, or newlines in values, or empty values.
Recently I decided to demonstrate how to combine the two techniques, such that you can safely parse any CSV as long as no line approaches the 8191 limit.
But the above relies on CSCRIPT JScript to prepare each line for parsing by FOR /F. Over at StackOverflow I developed a technique to parse a CSV using pure batch. The key hurdle was how to differentiate quoted delimiter literals from field delimiters. The solution is a variant of a technique developed by jeb that relies on the difference between how phase 2 treats quoted "^,^," vs unquoted ^,^,
The StackOverflow pure batch CSV parser can handle most CSV complications:
- A line can begin with any character
- A field may be empty
- A field value may contain any poison characters
- A field value may contain the field delimiter
- A field value may contain doubled quote literals
- A field may contain ^ and/or !, and delayed expansion can be used
- Line length cannot approach 8191
- No field may contain newline
- Cannot parse more than 32 tokens
test.csv
Code: Select all
C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33,C34,C35,C36
;,,"""quoted ,"" and non quoted ,",(<@a@c&|>)^,"(<@a@c,&|>)^ ""(<@a@c,&|>)^""",(<@a@c&|>)^!,"(<@a@c,&|>)^! ""(<@a@c,&|>)^!""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,Last Column!
The only things the code cannot handle are lines approaching 8191 bytes, and newlines within values. I don't think it is possible to solve either of these remaining problems with pure batch.
Code: Select all
@echo off
setlocal enableDelayedExpansion
call :defineFor For36InA A 36 ","
set "row=0"
for /f usebackq^ delims^=^ eol^= %%A in ("test.csv") do (
set /a row+=1
call :prepareLine A
for /f delims^=^ eol^= %%A in ("!ln!") do (
setlocal disableDelayedExpansion
%For36InA% (
endlocal
for /l %%C in (1 1 36) do call :decodeAndStoreToken array.!row!.%%C "," !$%%C!
)
)
)
:: Show results
for /l %%R in (1 1 %row%) do for /l %%C in (1 1 36) do echo Row %%R Col %%C = [!array.%%R.%%C!]
exit /b
:prepareLine C
::
:: Prepares the CSV line in FOR variable %%C to be parsed by FOR /F
:: and stores the result in variable ln
::
:: Quoted , literals within a field are encoded as @c
:: @ literals are encoded as @a
:: All fields are enclosed within quotes
:: Doubled quote literals "" within a field are undoubled to "
::
:: Delayed expansion must be enabled when :prepareLine is called
::
:: If CSV delim is not , then modify lines as indicated by %= comment =%
::
setlocal disableDelayedExpansion
for %%. in (.) do set "ln=%%%1"
set "ln=,%ln:"=""%," %= Replace , with actual delim =%
set "ln=%ln:^=^^^^%"
set "ln=%ln:&=^&%"
set "ln=%ln:|=^|%"
set "ln=%ln:<=^<%"
set "ln=%ln:>=^>%"
set "ln=%ln:!=^^!%"
set "ln=%ln:,=^,^,%" %= Replace , with actual delim =%
set ^"ln=%ln:""="%^"
set "ln=%ln:"=""%"
set "ln=%ln:@=@A%"
set "ln=%ln:^,^,=@D%" %= Replace , with actual delim =%
endlocal & set "ln=%ln:""="%" !
set "ln=!ln:,,"=,,!" %= Replace , with actual delim =%
set "ln=!ln:",,=,,!" %= Replace , with actual delim =%
set "ln=!ln:~2,-2!"
set "ln=!ln:^=^^^^!"
set "ln=%ln:!=^^^!%"
set "ln=!ln:""="!"
set "ln="!ln:,,=","!"" ! %= Replace , with actual delim =%
exit /b
::==============================================================================================
:: The remainder of this script is constant for any CSV with a fixed number of fields per line,
:: regardless how many columns, or what delimiters are used, etc.
:decodeAndStoreToken RtnVar Delim TokenName
::
:: Decodes FOR /F %%TokenName and stores result in RtnVar
:: @D becomes Delim
:: @A becomes @
::
for %%. in (.) do set "%1=%%~%~3" !
if defined %1 (
set "%1=!%1:@D=%~2!"
set "%1=!%1:@A=@!"
)
exit /b
:defineFor ForMacroName InputVar TokenCount [DelimChars]
::
:: Defines a macro to be used for parsing an arbitrary number of tokens from
:: a FOR variable string. The macro always parses one additional token to hold
:: any remainder of the line that lies beyond the TokenCount tokens.
::
:: ForMacroName = The name of the macro variable to be created.
::
:: InputVar = The name of the FOR variable that contains the string of tokens.
::
:: TokenCount = The number of tokens to parse.
:: The maximum value is 2304 (256*9)
::
:: DelimChars = An optional string of one or more characters, each of which
:: is treated as a token delimiter. Default is "<tab><space>".
:: If <space> is included in the string, then it must be the
:: last character in the string.
::
:: Tokens are accessed by $n variables.
:: For example, %%%$45% would represent the 45th token.
::
:: FOR /F modifiers may be freely used. For example, %%~nx%$10% would treat the
:: 10th token as a file path, and would expand to the file name and extension.
::
:: Normally, a single FOR /F is limited to 31 tokens, but the macro supports
:: many more, theoretically as many as 2304. However, each line to be parsed
:: must be less than 8191 characters in length.
::
:: This function may be called with delayed expansion enabled or disabled.
:: It is generally recommended that the macro be used with delayed expansion
:: disabled so that tokens containing ! are not corrupted.
::
:: This function automatically calls :defineForChars to define enough $n
:: variables to satisfy the TokenCount+1 tokens.
::
:: Example usage - Suppose you want to parse a well behaved CSV file named
:: test.csv that contains 300 columns. All lines must have the same number of
:: columns, and no column value may contain a comma.
::
:: The following code will correctly parse each data line of test.csv:
::
:: @echo off
:: setlocal disableDelayedExpansion
:: call :defineFor For300InA A 300 ","
:: for /f "skip=1 delims=" %%A in (test.csv) do %For300InA% (
:: echo token 1 = %%%$1%
:: echo token 2 = %%%$2%
:: echo ...
:: echo token 300 = %%%$300%
:: )
::
:: If the first token might begin with any character, including the default
:: EOL character, then the FOR /F line should be changed as follows:
::
:: for /f skip^=1^ delims^=^ eol^= %%A in (test.csv) do %For300InA% (
::
if %$max%0 gtr %~30 goto :defineForInternal
set /a "$max=(%~3)/256+1"
call :defineForChars %$max%
:defineForInternal
setlocal enableDelayedExpansion
set "delims=%~4"
if not defined delims set "delims= "
set "in=%~2"
set "macro="
set /a max=31, end=0
for /l %%N in (1 31 %~3) do (
if %%N neq 1 set "in=!$%%N!"
set /a end+=31
if !end! gtr %~3 set /a "max=%~3-%%N+1"
set "macro=!macro! for /f "eol=!delims:~0,1! tokens=1-!max!* delims=!delims!" %%!$%%N! in ("%%!in!") do"
)
for /f "delims=" %%A in ("!macro! ") do endlocal & set "%~1=%%A"
exit /b
:defineForChars Count
::
:: Defines variables to be used as FOR /F tokens, from $1 to $n, where n = Count*256
:: Also defines $max = Count*256.
:: No other variables are defined or tampered with.
::
:: Once defined, the variables are very useful for parsing lines with many tokens, as
:: the values are guaranteed to be contiguous within the FOR /F mapping scheme.
::
:: For example, you can use $1 as a FOR variable by using %%%$1%.
::
:: FOR /F "TOKENS=1-31" %%%$1% IN (....) DO ...
::
:: %%%$1% = token 1, %%%$2% = token 2, ... %%%$31% = token 31
::
:: This routine never uses SETLOCAL, and works regardless whether delayed expansion
:: is enabled or disabled.
::
:: Three temporary files are created and deleted in the %TEMP% folder, and the active
:: code page is temporarily set to 65001, and then restored to the starting value
:: before returning. Once defined, the $n variables can be used with any code page.
::
for /f "tokens=2 delims=:." %%P in ('chcp') do call :DefineForCharsInternal %1
exit /b
:defineForCharsInternal
set /a $max=%1*256
>"%temp%\forVariables.%~1.hex.txt" (
echo FF FE
for %%H in (
"0 1 2 3 4 5 6 7 8 9 A B C D E F"
) do for /l %%N in (1 1 %~1) do for %%A in (%%~H) do for %%B in (%%~H) do (
echo %%A%%B 0%%N 0D 00 0A 00
)
)
>nul certutil.exe -decodehex -f "%temp%\forVariables.%~1.hex.txt" "%temp%\forVariables.%~1.utf-16le.bom.txt"
>nul chcp 65001
>"%temp%\forVariables.%~1.utf8.txt" type "%temp%\forVariables.%~1.utf-16le.bom.txt"
<"%temp%\forVariables.%~1.utf8.txt" (for /l %%N in (1 1 %$max%) do set /p "$%%N=")
for %%. in (dummy) do >nul chcp %%P
del "%temp%\forVariables.%~1.*.txt"
exit /b
Code: Select all
Row 1 Col 1 = [C1]
Row 1 Col 2 = [C2]
Row 1 Col 3 = [C3]
Row 1 Col 4 = [C4]
Row 1 Col 5 = [C5]
Row 1 Col 6 = [C6]
Row 1 Col 7 = [C7]
Row 1 Col 8 = [C8]
Row 1 Col 9 = [C9]
Row 1 Col 10 = [C10]
Row 1 Col 11 = [C11]
Row 1 Col 12 = [C12]
Row 1 Col 13 = [C13]
Row 1 Col 14 = [C14]
Row 1 Col 15 = [C15]
Row 1 Col 16 = [C16]
Row 1 Col 17 = [C17]
Row 1 Col 18 = [C18]
Row 1 Col 19 = [C19]
Row 1 Col 20 = [C20]
Row 1 Col 21 = [C21]
Row 1 Col 22 = [C22]
Row 1 Col 23 = [C23]
Row 1 Col 24 = [C24]
Row 1 Col 25 = [C25]
Row 1 Col 26 = [C26]
Row 1 Col 27 = [C27]
Row 1 Col 28 = [C28]
Row 1 Col 29 = [C29]
Row 1 Col 30 = [C30]
Row 1 Col 31 = [C31]
Row 1 Col 32 = [C32]
Row 1 Col 33 = [C33]
Row 1 Col 34 = [C34]
Row 1 Col 35 = [C35]
Row 1 Col 36 = [C36]
Row 2 Col 1 = [;]
Row 2 Col 2 = []
Row 2 Col 3 = ["quoted ," and non quoted ,]
Row 2 Col 4 = [(<@a@c&|>)^]
Row 2 Col 5 = [(<@a@c,&|>)^ "(<@a@c,&|>)^"]
Row 2 Col 6 = [(<@a@c&|>)^!]
Row 2 Col 7 = [(<@a@c,&|>)^! "(<@a@c,&|>)^!"]
Row 2 Col 8 = []
Row 2 Col 9 = []
Row 2 Col 10 = []
Row 2 Col 11 = []
Row 2 Col 12 = []
Row 2 Col 13 = []
Row 2 Col 14 = []
Row 2 Col 15 = []
Row 2 Col 16 = []
Row 2 Col 17 = []
Row 2 Col 18 = []
Row 2 Col 19 = []
Row 2 Col 20 = []
Row 2 Col 21 = []
Row 2 Col 22 = []
Row 2 Col 23 = []
Row 2 Col 24 = []
Row 2 Col 25 = []
Row 2 Col 26 = []
Row 2 Col 27 = []
Row 2 Col 28 = []
Row 2 Col 29 = []
Row 2 Col 30 = []
Row 2 Col 31 = []
Row 2 Col 32 = []
Row 2 Col 33 = []
Row 2 Col 34 = []
Row 2 Col 35 = []
Row 2 Col 36 = [Last Column!]
Dave Benham