I imagine FindRepl.bat could handle this, but I don't know that utility well.
JREPL.BAT can definitely handle this.
There are some complicating factors to consider.
1) Apostrophes should not be removed within contractions or possessives: "can't" should not become "can t".
2) Hyphenated words should remain hyphenated: "mother-in-law" should not become "mother in law"
3) Words split across multiple lines via hyphen should be collapsed into a single word. The text may be double (or more) spaced as in your example. "book-\n\nkeeper" should become "bookkeeper".
4) All words should be converted to lower case, unless they are proper nouns. But I don't know how to detect proper nouns.
JREPL can solve the problem in 3 steps:
1) Use JREPL to remove all unwanted punctuation and white space, and put each word on a separate line.
The /M option is needed because the search spans multiple lines. The /I option makes the search ignore case. The /X option enables the \n escape sequence in the replacement expression. The /T option processes a list of find/replace pairs, delimited here by "/". The captured expression numbering looks odd because each alternate is implicitly given its own capture group number, which is why the first replacement uses $2 and $3 rather than $1 and $2.
- collapse a hyphenated word across multiple lines into a single word on one line:
"([a-z])-(?:\r?\n)+([a-z])" --> "$2$3"
- replace consecutive white space, optionally with punctuation before or after, with a single new line:
"[^a-z0-9]*\s+[^a-z0-9]*" --> "\n"
- remove leading punctuation from the beginning of the first line:
"^[^a-z0-9]+" --> ""
- remove trailing punctuation at the end in case last line is missing \n:
"[^a-z0-9]+$" --> "\n"
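For readers without JREPL at hand, the four find/replace pairs above can be approximated in Python. This is only a sketch, not JREPL itself: it assumes that /T behaves like one alternation scanned in a single pass, and the names PATTERN, _replace and one_word_per_line are made up for illustration. Wrapping each alternate in its own group makes the two letter captures groups 2 and 3, which mirrors the "$2$3" replacement.

```python
import re

# One alternation with four named branches, one per find/replace pair.
# Assumption: this single-pass alternation approximates what /T does.
PATTERN = re.compile(
    r"(?P<split>([a-z])-(?:\r?\n)+([a-z]))"   # word split across lines by a hyphen
    r"|(?P<gap>[^a-z0-9]*\s+[^a-z0-9]*)"      # white space plus adjacent punctuation
    r"|(?P<lead>^[^a-z0-9]+)"                 # punctuation at the start of a line
    r"|(?P<trail>[^a-z0-9]+$)",               # punctuation at the very end
    re.IGNORECASE | re.MULTILINE,
)

def _replace(match):
    """Return the replacement that belongs to whichever branch matched."""
    if match.group("split") is not None:
        # Groups 2 and 3 are the letters on either side of the line break.
        return match.group(2) + match.group(3)
    if match.group("lead") is not None:
        return ""
    # Both the "gap" and "trail" branches become a single newline.
    return "\n"

def one_word_per_line(text):
    """Strip unwanted punctuation and put each word on its own line."""
    return PATTERN.sub(_replace, text)

if __name__ == "__main__":
    sample = '("One sep-\n\narate, word!")'
    print(repr(one_word_per_line(sample)))
```

Apostrophes and hyphens inside a word survive because none of the four branches can match punctuation that has a letter or digit on both sides.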
2) Sort the result with SORT.
3) Use JREPL to remove duplicates and convert everything to lower case. The /J option allows use of the toLowerCase() method in the replacement value, and the /I option lets the \1 back-reference match duplicates that differ only in case.
Code:
jrepl "([a-z])-(?:\r?\n)+([a-z])/[^a-z0-9]*\s+[^a-z0-9]*/^[^a-z0-9]+/[^a-z0-9]+$" ^
"$2$3/\n//\n" /i /m /x /t "/" /f test.txt | ^
sort | ^
jrepl "(.*\n)\1*" "$1.toLowerCase()" /i /j /m
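The sort and de-duplicate stages of that pipeline can be illustrated in Python as well. Again this is a rough sketch rather than a replacement for the command: the function name unique_lowercase is hypothetical, and the case-insensitive sort key is an assumption made so that variants such as "The" and "the" end up on adjacent lines.

```python
import re

def unique_lowercase(words_text):
    """Sort one-word-per-line text, drop duplicates, and lower-case the rest.

    Every line, including the last, is expected to end with a newline,
    which is what the first step of the pipeline produces.
    """
    # Step 2: sort the lines. The case-insensitive key is an assumption.
    lines = sorted(words_text.splitlines(keepends=True), key=str.lower)
    # Step 3: a line followed by case-insensitive repeats of itself
    # collapses to a single lower-cased copy.
    return re.sub(
        r"(.*\n)\1*",
        lambda m: m.group(1).lower(),
        "".join(lines),
        flags=re.IGNORECASE,
    )

if __name__ == "__main__":
    print(unique_lowercase("The\nan\nthe\nAn\nthe\n"), end="")
```

The back-reference only removes duplicates that sit next to each other, which is why the sort has to come first.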
The above has the following limitations:
1) I cannot detect proper nouns, so they lose their capital letters.
2) I cannot detect when a naturally hyphenated word like "mother-in-law" is split across multiple lines, so "mother-\n\nin-law" incorrectly becomes "motherin-law".
3) I assume only the 26 English letters are used. Non-English letters in the extended ASCII range are treated as punctuation, and will be stripped if they appear before or after white space.
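Limitations 2 and 3 are easy to reproduce in isolation. The following Python lines are a sketch that applies two of the regular expressions on their own (so the back-references are \1 and \2 here, not $2 and $3); the sample strings and variable names are invented for the demonstration.

```python
import re

FLAGS = re.IGNORECASE

# Limitation 2: the rule that rejoins split words cannot tell a genuine
# hyphen from a line-break hyphen, so the first hyphen is lost.
joined = re.sub(r"([a-z])-(?:\r?\n)+([a-z])", r"\1\2",
                "mother-\n\nin-law", flags=FLAGS)

# Limitation 3: "é" is outside a-z, so it counts as punctuation and is
# swallowed together with the white space that follows it.
stripped = re.sub(r"[^a-z0-9]*\s+[^a-z0-9]*", "\n",
                  "café society", flags=FLAGS)

print(joined)          # motherin-law
print(repr(stripped))  # 'caf\nsociety'
```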
There are probably other issues that I am not aware of - language parsing is complicated.
Here is the test text that I used:
Code:
("First line, with comma character.")
Second line with punctuation mark!
Third line with question mark?
Fourth line with two "double quotes".
This 5th line has an ordinal number.
Hyphenated words are not a high-minded concept.
A word may be split across two sep-
arate lines using a hyphen.
Contractions mustn't be altered.
("Last line without newline at end.")
And here is my result:
Code:
5th
a
across
altered
an
are
at
be
character
comma
concept
contractions
double
end
first
fourth
has
high-minded
hyphen
hyphenated
last
line
lines
mark
may
mustn't
newline
not
number
ordinal
punctuation
question
quotes
second
separate
split
third
this
two
using
with
without
word
words
Dave Benham