A tokenization decomposes a given string into multiple strings in a unique way.
When using a simple (purely delimiter-based) tokenizer, there are only two possible ways to achieve that:
1) Each delimiter starts a new token, so empty words are tokens.
2) The first character in a word starts a new token, so empty words aren't recognized as tokens.
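To make the difference concrete, here is a minimal Python sketch of both definitions. The function names and the sample string are made up for illustration; this is not how any real tokenizer is implemented.

```python
def split_keep_empty(text, delims=","):
    """Definition 1: every delimiter starts a new token, so empty words are tokens."""
    tokens, current = [], ""
    for ch in text:
        if ch in delims:
            tokens.append(current)  # close the current (possibly empty) token
            current = ""
        else:
            current += ch
    tokens.append(current)          # the last token ends at the end of the string
    return tokens


def split_skip_empty(text, delims=","):
    """Definition 2: only a non-delimiter character starts a token, so empty words vanish."""
    return [tok for tok in split_keep_empty(text, delims) if tok]


print(split_keep_empty("a,,b,"))  # ['a', '', 'b', '']
print(split_skip_empty("a,,b,"))  # ['a', 'b']
```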
For whatever reason, Microsoft decided to use the second definition.
This is just a guess, so I don't know for sure, but:
It might be because it is easier to simulate the first tokenization using the second (you just need to replace each separator with a separator followed by any non-delimiter character when calling for, and remove the first character when accessing that value) than the other way around (you have to remove the empty words, so you first need to know how many tokens you will get, which is anything but trivial).
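The replacement trick can be sketched in Python as follows. The names and the filler character "#" are assumptions chosen for illustration. The sketch also prefixes the whole string with the filler so that the first token loses a character just like every other token; that step is an addition of this sketch and is not spelled out above.

```python
def split_skip_empty(text, delim=","):
    """Definition 2: empty words are not tokens."""
    return [tok for tok in text.split(delim) if tok]


def split_keep_empty_via_skip(text, delim=",", filler="#"):
    """Simulate definition 1 on top of a definition 2 tokenizer.

    filler must be a single character that is not a delimiter.
    """
    # Put a filler in front of every word so that no word is empty any more.
    padded = filler + text.replace(delim, delim + filler)
    # Tokenize with definition 2, then drop the filler from each token.
    return [tok[1:] for tok in split_skip_empty(padded, delim)]


print(split_keep_empty_via_skip("a,,b,"))  # ['a', '', 'b', '']
```

The filler never has to be counted or searched for afterwards, which is why this direction is the easy one.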
Also, it is more flexible when accessing tokens defined by length that use some well-defined characters for padding, as was done, for example, to store bank account numbers and bank routing numbers. Example:
1234567890,1234567890,1234567890
1, 12, 123
1000 ,1001 ,10002
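One possible reading of the example above, as a Python sketch: if both the comma and the padding blank are treated as delimiters, the second definition returns the three values of each row no matter how they are padded, while the first definition also returns the empty words between a comma and a blank. The function names and the choice of comma plus blank as delimiters are assumptions made for illustration.

```python
def split_on_any(line, delims):
    """Definition 1: every delimiter character ends a token, empty words included."""
    tokens, current = [], ""
    for ch in line:
        if ch in delims:
            tokens.append(current)
            current = ""
        else:
            current += ch
    tokens.append(current)
    return tokens


def skip_empty(line, delims=", "):
    """Definition 2: same split, but empty words are not tokens."""
    return [tok for tok in split_on_any(line, delims) if tok]


rows = [
    "1234567890,1234567890,1234567890",
    "1, 12, 123",
    "1000 ,1001 ,10002",
]
for row in rows:
    # Definition 2 always yields exactly three tokens per row here.
    print(skip_empty(row))

print(split_on_any("1, 12, 123", ", "))  # ['1', '', '12', '', '123']
```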
penpen