Safely parse nearly any CSV with parseCSV.bat v1.3
Posted: 19 Jun 2014 00:15
It is fairly common that someone wants to parse CSV using FOR /F. This is a simple task if you know all columns are populated, and there are no commas, newlines, or quotes within values. Assume there are 4 columns:
But things become more difficult if any of the following conditions occur:
1 - Values may be empty, with consecutive commas. FOR /F treats consecutive delimiters as one, so it will throw off the column assignment.
2 - Quoted values may contain commas. FOR /F will incorrectly treat a quoted comma as a column delimiter.
3 - Quoted values may contain newlines. FOR /F will break the line at the newline and incorrectly treat the one row as two.
4 - Quoted values may contain paired quotes that represent one quote. For example, "He said, ""Hello there"". A method is needed to convert "" into ".
Then there is a secondary problems that can crop up if delayed expansion is enabled.
5 - A FOR variable %%A will be corrupted if it contains ! (or sometimes ^) if delayed expansion is enabled when the variable is expanded.
There are fairly easy solutions for some of these issues, but solving all of them is extremely difficult (and slow) with pure batch.
I have written a hybrid JScript/batch utility called parseCSV.bat that makes it easy and relatively efficient to correctly parse nearly any CSV file with FOR /F.
EDIT 2017-02-19: Updated to version 1.1 to enable CSV transformations for purposes other than parsing by FOR /F. Also now supports escape sequences.
EDIT 2017-02-20: Updated to version 1.2 to fix a bug concerning /E with /Q:N, added the /U option, and improved performance
EDIT 2019-05-25: Updated to version 1.3 to fix nasty bug - /D was not working when combined with /E
parseCSV.bat
I have also written a script that defines a macro to assist with parsing the most problematic CSV files. See http://www.dostips.com/forum/viewtopic.php?f=3&t=1827 for background information about batch macros with arguments.
EDIT 2015-01-08: Updated to version 1.1 to fix a silly mistake in the Example usage code
define_csvGetCol.bat
Usage is extremely simple if you know there are no commas or newlines in any values, and delayed expansion is not needed:
test1.csv
test1.bat - no delayed expansion, no commas or newlines in values
--OUTPUT1--
It is also quite simple when commas are in values if you know of a character that does not exist in any value. Simply specify a unique character for the output delimiter.
test2.csv
test2.bat - no delayed expansion, no newlines or pipes in values EDIT: Note that the entire option must be quoted if the delimiter is a poison character
--OUTPUT2--
It only takes a bit more code if values may contain newlines or if you don't know of a character that does not appear in any value. This solution encodes newlines, delimiters, and slashes as \N, \D, and \S. Delayed expansion is needed within the loop to decode the values, so ! and ^ must be escaped as ^! and ^^.
test3.csv
test3.bat - Allow virtually any valid CSV, without using a macro.
--OUTPUT3--
test4.bat - Allow virtually any valid CSV, but now use the %csvGetCol% macro.
Output is identical to test3.bat
If the CSV file is very large, then it is much more efficient to save the output of parseCSV.bat to a temporary file, and then use the FOR /F loop to read the temporary file.
There are still a couple inherent limitations that are true for all FOR /F usage:
1) A single FOR /F cannot parse more than 32 columns. Update 2019-05-30: See this post for methods to enable parseCSV to parse any number of columns.
2) Batch line length restriction of 8191 characters can still be a problem.
Dave Benham
Code: Select all
@echo off
for /f "tokens=1-4 delims=," %%A in (test.csv) do (
echo ----------------------
echo A=%%~A
echo B=%%~B
echo C=%%~C
echo D=%%~D
echo(
)
1 - Values may be empty, with consecutive commas. FOR /F treats consecutive delimiters as one, so it will throw off the column assignment.
2 - Quoted values may contain commas. FOR /F will incorrectly treat a quoted comma as a column delimiter.
3 - Quoted values may contain newlines. FOR /F will break the line at the newline and incorrectly treat the one row as two.
4 - Quoted values may contain paired quotes that represent one quote. For example, "He said, ""Hello there"". A method is needed to convert "" into ".
Then there is a secondary problems that can crop up if delayed expansion is enabled.
5 - A FOR variable %%A will be corrupted if it contains ! (or sometimes ^) if delayed expansion is enabled when the variable is expanded.
There are fairly easy solutions for some of these issues, but solving all of them is extremely difficult (and slow) with pure batch.
I have written a hybrid JScript/batch utility called parseCSV.bat that makes it easy and relatively efficient to correctly parse nearly any CSV file with FOR /F.
EDIT 2017-02-19: Updated to version 1.1 to enable CSV transformations for purposes other than parsing by FOR /F. Also now supports escape sequences.
EDIT 2017-02-20: Updated to version 1.2 to fix a bug concerning /E with /Q:N, added the /U option, and improved performance
EDIT 2019-05-25: Updated to version 1.3 to fix nasty bug - /D was not working when combined with /E
parseCSV.bat
Code: Select all
@if (@X)==(@Y) @end /* harmless hybrid line that begins a JScrpt comment
::************ Documentation ***********
::parseCSV.bat version 1.3
:::
:::parseCSV [/option]...
:::
::: Parse stdin as CSV and write it to stdout in a way that can be safely
::: parsed by FOR /F. All columns will be enclosed by quotes so that empty
::: columns may be preserved. It also supports delimiters, newlines, and
::: escaped quotes within quoted values. Two consecutive quotes within a
::: quoted value are converted into one quote by default.
:::
::: Available options:
:::
::: /I:string = Input delimiter. Default is a comma (,)
:::
::: /O:string = Output delimiter. Default is a comma (,)
:::
::: The entire option must be quoted if specifying poison character
::: or whitespace literals as a delimiters for /I or /O.
:::
::: Examples: pipe = "/I:|"
::: space = "/I: "
:::
::: Standard JScript escape sequences can also be used.
:::
::: Examples: tab = /I:\t or /I:\x09
::: backslash = /I:\\
:::
::: /E = Encode output delimiter literal within value as \D
::: Encode newline within value as \N
::: Encode backslash within value as \S
:::
::: /D = escape exclamation point and caret for Delayed expansion
::: ! becomes ^!
::: ^ becomes ^^
:::
::: /L = treat all input quotes as quote Literals
:::
::: /Q:QuoteOutputFormat
:::
::: Controls output of Quotes, where QuoteOutputFormat may be any
::: one of the following:
:::
::: L = all columns quoted, quote Literals output as " (Default)
::: E = all columns quoted, quote literals Escaped as ""
::: N = No columns quoted, quote literals output as "
:::
::: The /Q:E and /Q:N options are useful for transforming data for
::: purposes other than parsing by FOR /F
:::
::: /U = Write unix style lines with newline (\n) instead of the default
::: Windows style of carriage return and linefeed (\r\n).
:::
:::parseCSV /?
:::
::: Display this help
:::
:::parseCSV /V
:::
::: Display the version of parseCSV.bat
:::
:::parseCSV.bat was written by Dave Benham. Updates are available at the original
:::posting site: http://www.dostips.com/forum/viewtopic.php?f=3&t=5702
:::
::************ Batch portion ***********
@echo off
if "%~1" equ "/?" (
setlocal disableDelayedExpansion
for /f "delims=: tokens=*" %%A in ('findstr "^:::" "%~f0"') do echo(%%A
exit /b 0
)
if /i "%~1" equ "/V" (
for /f "delims=:" %%A in ('findstr /bc:"::%~nx0 version " "%~f0"') do echo %%A
exit /b 0
)
cscript //E:JScript //nologo "%~f0" %*
exit /b 0
************ JScript portion ***********/
var args = WScript.Arguments.Named,
stdin = WScript.Stdin,
stdout = WScript.Stdout,
escape = args.Exists("E"),
literalQ = args.Exists("L"),
escapeQ = (args.Item("Q")&&args.Item("Q").toUpperCase()=="E"),
quoteCol = (args.Item("Q")&&args.Item("Q").toUpperCase()=="N") ? '' : '"',
delayed = args.Exists("D"),
inDelim = args.Item("I") ? eval('"'+args.Item("I")+'"') : ",",
outDelim = args.Item("O") ? eval('"'+args.Item("O")+'"') : ",",
newline = args.Exists("U") ? "\n" : "\r\n",
quote = false,
ln, c, n, out;
while (!stdin.AtEndOfStream) {
ln=stdin.ReadLine();
out="";
if (!quote) stdout.Write(quoteCol);
for (n=0; n<ln.length; n++ ) {
c=ln.charAt(n);
if (c == '"') {
if (literalQ) {
if (escapeQ) c+='"';
} else if (quote && ln.charAt(n+1) == '"') {
n++;
if (escapeQ) c+='"';
} else {
quote=!quote;
continue;
}
}
else if (c == inDelim && !quote) c=quoteCol+outDelim+quoteCol;
else {
if (escape) {
if (c == outDelim) c="\\D";
if (c == "\\") c="\\S";
}
if (delayed) {
if (c == "!") c="^!";
if (c == "^") c="^^";
}
}
out+=c;
}
out += (quote) ? ((escape) ? "\\N" : newline) : quoteCol+newline;
stdout.Write(out);
}
EDIT 2015-01-08: Updated to version 1.1 to fix a silly mistake in the Example usage code
define_csvGetCol.bat
Code: Select all
::define_csvGetCol.bat version 1.1
::
:: Defines variable LF and macro csvGetCol to be used with
:: parseCSV.bat to parse nearly any CSV file.
::
:: This script must be called with delayedExpansion disabled.
::
:: The %csvGetCol% macro must be used with delayedExpansion enabled.
::
:: Example usage:
::
:: @echo off
:: setlocal disableDelayedExpansion
:: call define_csvGetCol
:: setlocal enableDelayedExpansion
:: for /f "tokens=1-3 delims=," %%A in ('parseCSV /d /e ^<test.csv') do (
:: %== Load and decode column values ==%
:: %csvGetCol% A "," %%A
:: %csvGetCol% B "," %%B
:: %csvGetCol% C "," %%C
:: %== Display the result ==%
:: echo ----------------------
:: for %%V in (A B C) do echo %%V=!%%V!
:: echo(
:: )
::
:: Written by Dave Benham
::
:: Delayed expansion must be disabled during macro definition
:: Define LF to contain a linefeed (0x0A) character
set ^"LF=^
^" The empty line above is critical - DO NOT REMOVE
:: define a newline with line continuation
set ^"\n=^^^%LF%%LF%^%LF%%LF%^^"
:: Define csvGetCol
:: %csvGetCol% envVarName "Delimiter" FORvar
set csvGetCol=for %%# in (1 2) do if %%#==2 (%\n%
setlocal enableDelayedExpansion^&for /f "tokens=1,2*" %%1 in ("!args!") do (%\n%
endlocal^&endlocal%\n%
set "%%1=%%~3"!%\n%
if defined %%1 (%\n%
for %%L in ("!LF!") do set "%%1=!%%1:\N=%%~L!"%\n%
set "%%1=!%%1:\D=%%~2!"%\n%
set "%%1=!%%1:\S=\!"%\n%
)%\n%
)) else setlocal disableDelayedExpansion ^& set args=
test1.csv
Code: Select all
"value1 with ""quotes""",value2: No problem!,value3: 2^3=8,value4: (2^2)!=16
value1,,value3,value4
value1,,,value4
value1,,,
,,,value4
Code: Select all
@echo off
for /f "tokens=1-4 delims=," %%A in ('parseCSV ^<test1.csv') do (
echo -------------
echo(A=%%~A
echo(B=%%~B
echo(C=%%~C
echo(D=%%~D
echo(
)
Code: Select all
-------------
A=value1 with "quotes"
B=value2: No problem!
C=value3: 2^3=8
D=value4: (2^2)!=16
-------------
A=value1
B=
C=value3
D=value4
-------------
A=value1
B=
C=
D=value4
-------------
A=value1
B=
C=
D=
-------------
A=
B=
C=
D=value4
test2.csv
Code: Select all
"value1 with ""quotes""","value2, No problem!","value3, 2^3=8","value4, (2^2)!=16"
value1,,value3,value4
value1,,,value4
value1,,,
,,,value4
Code: Select all
@echo off
for /f "tokens=1-4 delims=|" %%A in ('parseCSV "/o:|" ^<test2.csv') do (
echo -------------
echo(A=%%~A
echo(B=%%~B
echo(C=%%~C
echo(D=%%~D
echo(
)
Code: Select all
-------------
A=value1 with "quotes"
B=value2, No problem!
C=value3, 2^3=8
D=value4, (2^2)!=16
-------------
A=value1
B=
C=value3
D=value4
-------------
A=value1
B=
C=
D=value4
-------------
A=value1
B=
C=
D=
-------------
A=
B=
C=
D=value4
test3.csv
Code: Select all
"2^3=8","(2^2)!=16","Success!",Value4
value1,value2,value3,value4
,,,value4
"value1","value2","value3","value4"
"He said, ""Hey cutie.""","She said, ""Drop dead!""","value3 line1
value3 line2",c:\Windows
Code: Select all
@echo off
setlocal enableDelayedExpansion
:: Define LF to contain a linefeed (0x0A) character
set ^"LF=^
^" The empty line above is critical - DO NOT REMOVE
for /f "tokens=1-4 delims=," %%A in ('parseCSV /e /d ^<test3.csv') do (
%== Load columns with encoded values. The trailing ! is important ==%
set "A=%%~A"!
set "B=%%~B"!
set "C=%%~C"!
set "D=%%~D"!
%== Decode values ==%
for %%L in ("!LF!") do for %%V in (A B C D) do if defined %%V (
set "%%V=!%%V:\N=%%~L!"
set "%%V=!%%V:\D=,!"
set "%%V=!%%V:\S=\!"
)
%== Print results ==%
echo ---------------------
for %%V in (A B C D) do echo(%%V=!%%V!
echo(
)
Code: Select all
---------------------
A=2^3=8
B=(2^2)!=16
C=Success!
D=Value4
---------------------
A=value1
B=value2
C=value3
D=value4
---------------------
A=
B=
C=
D=value4
---------------------
A=value1
B=value2
C=value3
D=value4
---------------------
A=He said, "Hey cutie."
B=She said, "Drop dead!"
C=value3 line1
value3 line2
D=c:\Windows
Code: Select all
@echo off
:: Delayed expansion must be disabled during macro definition
setlocal disableDelayedExpansion
call define_csvGetCol
:: Delayed expansion must be enabled when using %csvGetCol%
setlocal enableDelayedExpansion
for /f "tokens=1-4 delims=," %%A in ('parseCSV /e /d ^<test3.csv') do (
%== Load and decode column values ==%
%csvGetCol% A "," %%A
%csvGetCol% B "," %%B
%csvGetCol% C "," %%C
%csvGetCol% D "," %%D
%== Print results ==%
echo ---------------------
for %%V in (A B C D) do echo(%%V=!%%V!
echo(
)
If the CSV file is very large, then it is much more efficient to save the output of parseCSV.bat to a temporary file, and then use the FOR /F loop to read the temporary file.
There are still a couple inherent limitations that are true for all FOR /F usage:
1) A single FOR /F cannot parse more than 32 columns. Update 2019-05-30: See this post for methods to enable parseCSV to parse any number of columns.
2) Batch line length restriction of 8191 characters can still be a problem.
Dave Benham