Yes, the ? makes the * non-greedy. That command finds lines with at least 9 commas and turns them into blank lines. The search regex can be dramatically simplified, and the command modified to preserve only lines that have at least 9 commas:
Code:
jrepl "(.*?,){9}.*" "$0" /jmatch /f test.csv
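To show what that search pattern accepts, here is a minimal JavaScript sketch of the same test outside of JREPL. The sample rows and the name keepLongLines are made up for illustration, and this only mimics the filtering idea, not JREPL itself.

```javascript
// Hypothetical sample data; the real input would be the lines of test.csv.
var lines = [
  "a,b,c,d,e,f,g,h,i,j",      // 9 commas
  "a,b,c",                    // 2 commas
  "1,2,3,4,5,6,7,8,9,10,11"   // 10 commas
];

// Same pattern as the jrepl search: nine lazy "anything up to a comma" groups.
var atLeastNineCommas = /(.*?,){9}.*/;

// Keep only the lines that contain at least 9 commas.
function keepLongLines(input) {
  var kept = [];
  for (var i = 0; i < input.length; i++) {
    if (atLeastNineCommas.test(input[i])) {
      kept.push(input[i]);
    }
  }
  return kept;
}
```

Calling keepLongLines(lines) keeps the first and third rows and drops the short one.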
But that doesn't really help solve the problem if you plan on using just a single jrepl pass. Whenever a logical line is split into two, the first \LF must be turned into a space but the last one must not, yet the physical line that ends with the valid \LF does not itself have 9 commas.
The solution is to use 3 jrepl passes. The first pass identifies "valid" \LF and doubles them. The second pass converts \CR\LF that do not precede \LF into a space, and the final pass converts \LF\LF back into \LF. I use [\s\S] instead of a dot so that it can match \LF.
Code:
jrepl "([\s\S]*?,){9}.*\n" "$&\n" /x /m /f test.csv | jrepl "\r?\n\n|\r?\n" "$&| " /m /t "|" | jrepl "\n\n" "\n" /x /m /o test.csv.new
move /y test.csv.new test.csv >nul
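The three passes can also be followed step by step on an in-memory string. The following is only a rough JavaScript sketch of the same idea, not a replacement for the pipeline: the function name repairThreePass is hypothetical, and it is exercised here with bare \LF terminators only, so \CR handling is not covered.

```javascript
// Sketch of the three-pass idea on a string (names are illustrative only).
function repairThreePass(text) {
  // Pass 1: double the \n that ends each run containing 9 commas.
  var marked = text.replace(/([\s\S]*?,){9}.*\n/g, "$&\n");

  // Pass 2: keep terminators that were doubled, turn all others into a space.
  var joined = marked.replace(/\r?\n\n|\r?\n/g, function (m) {
    return m.slice(-2) === "\n\n" ? m : " ";
  });

  // Pass 3: collapse the doubled \n back into a single \n.
  return joined.replace(/\n\n/g, "\n");
}

// Hypothetical input: the first logical line is broken after "b".
var broken = "a,b\nc,d,e,f,g,h,i,j,k\n1,2,3,4,5,6,7,8,9,10\n";
var repaired = repairThreePass(broken);
```

With that input, the stray \LF after "b" becomes a space and both valid line terminators survive.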
I put the word "valid" in quotes because it is impossible to know for sure which \LF is valid if unwanted \LF can occur in the first column, and can also occur in the last column.
The above assumes the first occurring \LF after 9 commas is valid, so it does not allow \LF in the last column.
Using JScript with the /J and /JBEG options, it is possible to do everything in one pass.
I keep a running count of which column I am in, as identified by commas. If I find a (\CR)\LF while in the 10th column, then I preserve the line terminator and reset the count to 1; otherwise I replace it with a space. I seriously abuse the condition?expression:expression construct with an ugly hack, but it works.
Code:
jrepl ",|\r?\n" "field+=1;$0|field==10?((field=1)==1?$0:''):' '" /j /jbeg "var field=1" /t "|" /m /f test.csv /o -
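For anyone who finds the ternary hack hard to read, the same column-counting logic can be spelled out with plain if statements. This is a minimal JavaScript sketch under my own naming (repairOnePass and the columns parameter are hypothetical), operating on a string rather than on a file.

```javascript
// Sketch of the one-pass column counter (names are illustrative only).
function repairOnePass(text, columns) {
  var field = 1; // column we are currently in, like /jbeg "var field=1"
  return text.replace(/,|\r?\n/g, function (m) {
    if (m === ",") {
      field += 1;      // a comma moves us to the next column
      return m;
    }
    if (field === columns) {
      field = 1;       // terminator in the last column: keep it, start over
      return m;
    }
    return " ";        // terminator anywhere else: replace with a space
  });
}

// Hypothetical input: the first logical line is broken after "b".
var fixed = repairOnePass("a,b\nc,d,e,f,g,h,i,j,k\n1,2,3,4,5,6,7,8,9,10\n", 10);
```

Like the jrepl command, this treats every comma as a delimiter and assumes the last column never contains a line break.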
One final note - all the discussion so far assumes that all commas are column delimiters. But the CSV format allows for comma literals within values if the field is quoted. In reality, a field can also contain \LF if it is quoted, though not many parsers support that feature. The other complicating feature is that quote literals are doubled within a quoted field. I don't think I would use JREPL to support these features with this problem, and I certainly wouldn't attempt it with pure batch. I would probably write a custom JScript script.
I've written the hybrid parseCSV.bat that allows FOR /F to parse nearly any properly formatted CSV. But I don't see how it helps with this problem given that the CSV is not valid. However, the JScript within might be a good starting point in developing a custom JScript solution to this problem.
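As a rough idea of what such a custom script might look like, here is a minimal JavaScript sketch that extends the column counter with a quote flag. It is not taken from parseCSV.bat; the name repairQuotedCsv and the columns parameter are hypothetical, and it assumes that quotes are balanced and that unwanted line breaks only occur outside of quoted fields.

```javascript
// Sketch of a quote-aware column counter (names are illustrative only).
function repairQuotedCsv(text, columns) {
  var out = "";
  var field = 1;        // column we are currently in
  var inQuotes = false; // true while inside a quoted field
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var eol = "";
    // Only line terminators outside of quotes are candidates for repair.
    if (!inQuotes && ch === "\r" && text.charAt(i + 1) === "\n") {
      eol = "\r\n";
      i += 1;
    } else if (!inQuotes && ch === "\n") {
      eol = "\n";
    }
    if (eol) {
      if (field === columns) {
        field = 1;      // last column: keep the terminator
        out += eol;
      } else {
        out += " ";     // earlier column: join the pieces with a space
      }
      continue;
    }
    if (ch === '"') {
      inQuotes = !inQuotes; // a doubled quote toggles twice, so state is unchanged
    } else if (ch === "," && !inQuotes) {
      field += 1;           // only unquoted commas delimit columns
    }
    out += ch;
  }
  return out;
}
```

Toggling the flag on every quote character is a shortcut that handles doubled quote literals without looking ahead, but it does nothing to detect malformed quoting.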
Dave Benham