Script to detect type (encoding) of files

siberia-man · #1 Post by **siberia-man** » 15 Oct 2021 12:11

A few days ago I decided to practice some insane programming and wrote this script. It's based on an idea I found in a StackOverflow answer, given (I guess) by our colleague npocmaka_. I extended his idea to detect the encoding of files both with and without a BOM. The script also doesn't fail on empty files and directories and reports them properly.

Examples:

Code:

>type example-utf8-with-bom.txt
я╗┐In math the Greek letter ╧А stands for 3.1415926

>type example-utf8-without-bom.txt
In math the Greek letter ╧А stands for 3.1415926

>file-detect-enc.bat example-utf8-*.txt
example-utf8-with-bom.txt: UTF-8
example-utf8-without-bom.txt: UTF-8
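The odd characters printed by `type` above look like UTF-8 bytes rendered in an OEM code page. A minimal Python sketch of that effect, assuming the console used code page 866 (the post does not say which code page was active); the helper name `as_console_sees` is made up for illustration:

```python
def as_console_sees(text, codepage="cp866"):
    """Encode text as UTF-8, then decode the raw bytes in an OEM code page.

    This mimics a console that prints a UTF-8 file without knowing its
    encoding. The code page is an assumption, not stated in the thread.
    """
    return text.encode("utf-8").decode(codepage)


# U+FEFF is the BOM; in UTF-8 it becomes the bytes EF BB BF.
with_bom = as_console_sees("\ufeffIn math the Greek letter \u03c0")
without_bom = as_console_sees("In math the Greek letter \u03c0")
print(with_bom)
print(without_bom)
```

Under that assumption, the three leading characters of the first line are the BOM bytes, and the two characters replacing the Greek letter are the bytes cf 80.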

Honestly, it looks like a very cheap and slow version of file (unix). However, it works, and I haven't found any issues so far.

Here is the latest version at the time of writing. The current version is available at this link: https://github.com/ildar-shaimordanov/c ... ct-enc.bat.

Code:

::Usage: file-detect-enc [OPTIONS] FILE...
::
::Detect type (encoding) of FILEs.
::
::  -b, --brief  Don't prepend filenames to output

@echo off

setlocal enabledelayedexpansion

set "enc_brief="
if /i "%~1" == "-b"      set "enc_brief=1"
if /i "%~1" == "--brief" set "enc_brief=1"
if defined enc_brief shift /1

if "%~1" == "" goto :print_usage

:: ========================================================================

:: The following settings are based on information from the table
:: https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
set "enc_val_EFBBBF=UTF-8"
set "enc_val_FEFF=UTF-16BE"
set "enc_val_FFFE=UTF-16LE"
set "enc_val_0000FEFF=UTF-32BE"
set "enc_val_FFFE0000=UTF-32LE"
set "enc_val_2B2F76=UTF-7"
set "enc_val_F7644C=UTF-1"
set "enc_val_DD736673=UTF-EBCDIC"
set "enc_val_0EFEFF=SCSU"
set "enc_val_FBEE28=BOCU-1"
set "enc_val_84319533=GB-18030"

:: ========================================================================

set "enc_hexfile=%TEMP%\enc_hexfile"

:enc_begin_loop
if "%~1" == "" (
	del /f /q "%enc_hexfile%" 2>nul
	goto :EOF
)

for %%f in ( "%~1" ) do (
	call :detect_type "%%~f"

	if defined enc_brief (
		if defined enc_found echo:!enc_found!
	) else (
		echo:%%~f: !enc_found!
	)
)

shift /1
goto :enc_begin_loop

:: ========================================================================

:print_usage
for /f "usebackq tokens=* delims=:" %%s in ( "%~f0" ) do (
	if /i "%%s" == "@echo off" goto :EOF
	echo:%%s
)
goto :EOF

:: ========================================================================

:detect_type
set "enc_found="

set "enc_srcfile=%~1"

if not exist "%~1" (
	echo:File not found: "%~1">&2
	exit /b 1
)
if exist "%~1\" (
	set "enc_found=directory"
	goto :EOF
)
if %~z1 equ 0 (
	set "enc_found=empty"
	goto :EOF
)

:: https://stackoverflow.com/a/16238102/3627676
:: https://ss64.com/nt/certutil.html
:: https://www.dostips.com/forum/viewtopic.php?p=57918#p57918
:: https://docs.microsoft.com/en-gb/windows/win32/api/wincrypt/nf-wincrypt-cryptbinarytostringa
certutil -encodehex -f "%enc_srcfile%" "%enc_hexfile%" 4 >nul || (
	echo:Internal error: !errorlevel!>&2
	exit /b 1
)

set "enc_utf8_sequence="
set /a enc_utf8_require=0

set "enc_firstline=1"
for /f "usebackq delims=" %%s in ( "%enc_hexfile%" ) do (
	rem Most files (especially binaries) have in their beginning the
	rem magic number, or header, the group of bytes identifying the
	rem file type. Here we can analyze the header for magic number
	rem existence and quit immediately, if it's found. Otherwise,
	rem we continue analysis with the same line.
	if defined enc_firstline for /f "usebackq tokens=1-4" %%a in ( '%%s' ) do (
		set "enc_firstline="
		set "enc_bytes=%%a%%b%%c%%d"

		for /l %%n in ( 8, -2, 4 ) do for %%s in (
			enc_val_!enc_bytes:~0^,%%n!
		) do (
			set "enc_found=!%%s!"
			if defined enc_found goto :EOF
		)
	)

	rem https://en.wikipedia.org/wiki/UTF-8#Encoding
	rem 0000-007f		00-7f	-----	-----	-----
	rem 0080-07ff		c0-df	80-bf	-----	-----
	rem 0800-ffff		e0-ef	80-bf	80-bf	-----
	rem 10000-10ffff	f0-f7	80-bf	80-bf	80-bf
	for %%b in ( %%s ) do if 0x%%b lss 0x80 (
		rem 00-7f
		set "enc_utf8_sequence="
		set /a enc_utf8_require=0
	) else if 0x%%b gtr 0xf7 (
		rem f8-ff
		set "enc_utf8_sequence="
		set /a enc_utf8_require=0
	) else (
		rem 80-f7
		set "enc_utf8_sequence=!enc_utf8_sequence!%%b"

		if 0x%%b geq 0xf0 (
			rem f0-f7
			set /a enc_utf8_require=3
		) else if 0x%%b geq 0xe0 (
			rem e0-ef
			set /a enc_utf8_require=2
		) else if 0x%%b geq 0xc0 (
			rem c0-df
			set /a enc_utf8_require=1
		) else if !enc_utf8_require! gtr 0 (
			rem 80-bf
			set /a enc_utf8_require-=1
			if !enc_utf8_require! equ 0 (
				set "enc_found=UTF-8"
				goto :EOF
			)
		)
	)
)
goto :EOF

:: ========================================================================

:: EOF
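
The BOM lookup in the script tries the first 8, then 6, then 4 hex digits of the file against the `enc_val_*` table, so longer signatures win over shorter ones. A rough Python sketch of the same idea, for illustration only (the names `BOMS` and `detect_bom` are made up; the table values are copied from the script):

```python
# Hex signature -> encoding name, copied from the enc_val_* settings above.
BOMS = {
    "EFBBBF": "UTF-8",
    "FEFF": "UTF-16BE",
    "FFFE": "UTF-16LE",
    "0000FEFF": "UTF-32BE",
    "FFFE0000": "UTF-32LE",
    "2B2F76": "UTF-7",
    "F7644C": "UTF-1",
    "DD736673": "UTF-EBCDIC",
    "0EFEFF": "SCSU",
    "FBEE28": "BOCU-1",
    "84319533": "GB-18030",
}


def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    head = bytes(data[:4]).hex().upper()
    # Longest match first: FFFE0000 (UTF-32LE) must be tried before FFFE (UTF-16LE).
    for digits in (8, 6, 4):
        name = BOMS.get(head[:digits])
        if name:
            return name
    return None


print(detect_bom(b"\xff\xfe\x00\x00"))  # four-byte signature is tried first
print(detect_bom(b"\xff\xfeA\x00"))     # falls through to the two-byte signature
```

The order of the lengths matters: with the shortest prefix first, every UTF-32LE file would be reported as UTF-16LE.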

AR Coding · #2 Post by **AR Coding** » 15 Oct 2021 12:28

Nice work!

Just wondering: why are there two "/f"s in the FOR loop at the :print_usage label?

siberia-man · #3 Post by **siberia-man** » 15 Oct 2021 12:58

why are there two "/f"s in the FOR loop

A misprint. It works, but it needs to be fixed.

Updated: thank you for the thorough code review. I fixed it in the repo and here.

siberia-man · #4 Post by **siberia-man** » 15 Oct 2021 13:43

One more error I've just found and fixed: some variables were named bom_val_*, whereas they must be named enc_val_*. Quite a silly mistake.

#5 Post by **aGerman** » 16 Oct 2021 05:05

Ildar,

I might be wrong but I have the impression that Christopher Wellons' lengths array could help you to determine the required number of bytes of a character.
https://github.com/skeeto/branchless-ut ... ter/utf8.h
The index for a certain length in the array is (value of the byte >> 3). Probably you can put as much as reasonably fits into a single SET /A statement each.
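
For readers who don't want to open the header, the lookup described above can be sketched in Python like this. The table values follow the `lengths` array in branchless-utf8 as far as I know it; the names `LENGTHS` and `utf8_length` are made up for illustration:

```python
# One entry per value of (byte >> 3), i.e. per group of eight byte values:
#   0x00-0x7F -> 1 (ASCII), 0x80-0xBF -> 0 (continuation byte, not a lead),
#   0xC0-0xDF -> 2, 0xE0-0xEF -> 3, 0xF0-0xF7 -> 4, 0xF8-0xFF -> 0 (invalid).
LENGTHS = [1] * 16 + [0] * 8 + [2] * 4 + [3] * 2 + [4, 0]


def utf8_length(lead_byte):
    """Length of the UTF-8 sequence started by lead_byte, or 0 if it cannot start one."""
    return LENGTHS[lead_byte >> 3]


for b in (0x41, 0xCF, 0xE2, 0xF0, 0x80, 0xFF):
    print(f"{b:02x} -> {utf8_length(b)}")
```

The table has only 32 entries because the top five bits of a byte are enough to tell the sequence length.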

Steffen

siberia-man · #6 Post by **siberia-man** » 16 Oct 2021 07:23

Hi Steffen,

Thank you for the hint and link to the code. Also I found the full explanation here https://nullprogram.com/blog/2017/10/06/. It's really impressive work. I am going to meditate on this.

#7 Post by **aGerman** » 16 Oct 2021 08:15

Hi Ildar,
Unfortunately, Batch lacks quite a few syntactic features. If we're unlucky, the code gets too complicated and eats up the performance gain, I don't know. Good work so far anyway. Thanks for sharing!

Steffen

#8 Post by **aGerman** » 17 Oct 2021 07:12

Proof of concept
*** obsolete code removed, see further down ... ***

This is only for the evaluation of UTF-8. It doesn't include the BOM check, the file name is currently hard-coded, and it behaves fundamentally differently:
- It relies on getting a file with at least 4 bytes length.
- It checks up to 511 bytes of a file, and it continues as long as it doesn't find an invalid sequence.
- It treats ASCII as valid, since ASCII is indeed valid UTF-8.
I suspect that using GOTO to perform the loop makes it quite slow. I didn't find any good workaround though. Ideally we could use a FOR /L loop.

Steffen

siberia-man · #9 Post by **siberia-man** » 17 Oct 2021 13:06

Hi Steffen

I guess it doesn't work properly. I replaced the second line

Code:

set "file=u8nobom.txt"

with

Code:

set "file=%~f0"

and it returned

is UTF-8
Press any key to continue . . .

And it's slower than my script. I'm afraid we won't get any advantage with the branchless-utf8 technique.

#10 Post by **aGerman** » 17 Oct 2021 13:26

Hi Ildar,

I remarked the main differences:

aGerman wrote: ↑
17 Oct 2021 07:12
- It checks up to 511 bytes of a file, and it continues as long as it doesn't find an invalid sequence.
- It treats ASCII as valid, since ASCII is indeed valid UTF-8.

So, yes, it is absolutely correct that the script code is valid UTF-8. ASCII is a subset of UTF-8. That's where your implementation still fails ¯\_(ツ)_/¯
However, I agree that using GOTO is a bad idea in terms of performance. Maybe I'll find a better way one day.
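
To illustrate the difference in behavior, here is a rough Python port of the byte loop from the first post, next to a strict check that uses Python's own decoder. It is a sketch for comparison only; the function names are made up, and the strict check is Python's decoder, not either batch script:

```python
def first_multibyte_heuristic(data):
    """Report "UTF-8" as soon as one complete multibyte sequence is seen.

    Mirrors the loop in file-detect-enc.bat: pure ASCII input never
    completes a sequence, so the result stays empty (None here).
    """
    require = 0
    for b in data:
        if b < 0x80 or b > 0xF7:
            require = 0          # ASCII or f8-ff: reset
        elif b >= 0xF0:
            require = 3          # f0-f7: three continuation bytes expected
        elif b >= 0xE0:
            require = 2          # e0-ef
        elif b >= 0xC0:
            require = 1          # c0-df
        elif require > 0:        # 80-bf: continuation byte
            require -= 1
            if require == 0:
                return "UTF-8"
    return None


def is_strict_utf8(data):
    """True if the whole input decodes as UTF-8 (ASCII included)."""
    try:
        bytes(data).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_only = b"@echo off"
print(first_multibyte_heuristic(ascii_only), is_strict_utf8(ascii_only))
```

The port also suggests the opposite case: input such as cf 80 ff is reported as UTF-8 by the heuristic, because it stops at the first complete sequence and never sees the invalid byte after it.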

Steffen

siberia-man · #11 Post by **siberia-man** » 17 Oct 2021 14:10

aGerman wrote: ↑
17 Oct 2021 13:26
I remarked the main differences:

Sorry. I didn't notice that.

#12 Post by **aGerman** » 17 Oct 2021 14:22

No problem, Ildar.

Well, actually there's a possibility to use a FOR /L. We need to create another cmd process though.

Code:

@if "%~1"=="::check::" (goto check) else echo off &setlocal DisableDelayedExpansion

set "file=u8nobom.txt"

set "me=%~fs0"
setlocal EnableDelayedExpansion
:: write all hex values in a row without spaces
certutil -encodehex -f "!file!" "!temp!\!file!.hex" 12 >nul || exit /b 1
:: read only up to 1023 characters (limit of set /p)
<"!temp!\!file!.hex" set /p "s="
del "!temp!\!file!.hex"
:: try to determine the encoding by a Byte Order Mark
set "bomFFFE0000=UTF-32 LE" &set "bom0000FEFF=UTF-32 BE" &set "bomDD736673=UTF-EBCDIC" &set "bom84319533=GB-18030"
set "bom2B2F7638=UTF-7"     &set "bom2B2F7639=UTF-7"     &set "bom2B2F762B=UTF-7"      &set "bom2B2F762F=UTF-7"
set "bomEFBBBF=UTF-8"       &set "bomF7644C=UTF-1"       &set "bom0EFEFF=SCSU"         &set "bomFBEE28=BOCU-1"
set "bomFFFE=UTF-16 LE"     &set "bomFEFF=UTF-16 BE"
set "enc="&for %%i in (8 6 4) do if not defined enc for %%j in (bom!s:~^,%%i!) do set "enc=!%%j!"
if defined enc (
  echo !enc!
  pause
  goto :eof
)

:: compute the length of s
set s=A%s%
set "lastX=0"
for /l %%i in (12 -1 0) do (
  set /a "lastX|=1<<%%i"
  for %%j in (!lastX!) do if "!s:~%%j,1!"=="" set /a "lastX&=~1<<%%i"
)
:: last index is the length (rounded to the greatest even number that is not greater than the length), minus 7 (because we want to read 8 hex chars at a time, but 1 char is appended for length measuring)
set /a "lastX=(lastX/2)*2-7"
if %lastX% lss 1 (
  REM maybe work around strings with less than 4 bytes here
  goto :eof
)
:: perform the UTF-8 check
"%comspec%" /q /d /von /c "!me!" ::check::
if errorlevel 2 (        REM errorlevel 2..7
  echo UTF-8 (w\ multibyte sequences^)
) else if errorlevel 1 ( REM errorlevel exactly 1
  echo ASCII only
) else (                 REM errorlevel 0
  echo no UTF-8
)
pause
goto :eof

:check
:: Lengths array of multibyte sequences, based on https://github.com/skeeto/branchless-utf8/blob/master/utf8.h, but bytes 0x00..0x07 and 0xF8..0xFF are explicitly treated as invalid
set /a "L0=8,L1=1,L2=1,L3=1,L4=1,L5=1,L6=1,L7=1,L8=1,L9=1,L10=1,L11=1,L12=1,L13=1,L14=1,L15=1,L16=0,L17=0,L18=0,L19=0,L20=0,L21=0,L22=0,L23=0,L24=2,L25=2,L26=2,L27=2,L28=3,L29=3,L30=4,L31=8"
:: Shift array used to shift out lengths that don't belong to the expected continuation bytes, initialize return value
:: x1..x4 are the indexes of the bytes we want to read in the hex string (begins with 1 because an "A" is prepended for length measuring)
set /a "Sh0=0,Sh1=9,Sh2=6,Sh3=3,Sh4=0,ret=0,x1=1,x2=3,x3=5,x4=7"
for /l %%i in () do (
  REM indexes in the lengths array
  for /f "tokens=1-4" %%j in ("!x1! !x2! !x3! !x4!") do set /a "i1=0x!s:~%%j,2!>>3,i2=0x!s:~%%k,2!>>3,i3=0x!s:~%%l,2!>>3,i4=0x!s:~%%m,2!>>3"
  REM lengths, new indexes in the hex string, and updated return value
  set /a "n1=L!i1!,n2=L!i2!,n3=L!i3!,n4=L!i4!,x1+=n1*2,x2=x1+2,x3=x1+4,x4=x1+6,ret|=n1"
  REM the length of the first byte must not be 0, while the lengths of all continuation bytes have to be all 0, also no byte shall be less than 0x08 or greater than 0xF7
  set /a "chk=(^!n1)|(((n2<<6)|(n3<<3)|n4)>>Sh!n1!)|((n1|n2|n3|n4)>>3)"
  if !chk! neq 0 exit 0
  if !x1! gtr !lastX! exit !ret!
)
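
To make the arithmetic in the :check loop easier to follow, here is a rough Python port of the table lookup and the `chk` expression. It is only a sketch: the names `LENGTHS`, `SHIFTS` and `check_utf8` are made up, and padding the input with three ASCII bytes is my own substitute for the `lastX` bookkeeping, so the tail of the data is checked too. Like the batch code, it only checks sequence lengths and does not reject overlong forms or surrogates.

```python
# Same values as L0..L31 above: index is (byte >> 3); 8 marks bytes that
# are always rejected (0x00-0x07 and 0xF8-0xFF), 0 marks continuation bytes.
LENGTHS = [8] + [1] * 15 + [0] * 8 + [2] * 4 + [3] * 2 + [4] + [8]
# Same values as Sh0..Sh4: shifts out the lengths of bytes that do not
# belong to the current sequence.
SHIFTS = [0, 9, 6, 3, 0]


def check_utf8(data):
    """Return 0 for invalid input, 1 for ASCII only, 2..7 if multibyte sequences occur.

    Empty input returns 0; the batch version leaves short input unhandled.
    """
    buf = bytes(data) + b"AAA"   # padding so four bytes can always be read
    end = len(buf) - 3
    pos = 0
    ret = 0
    while pos < end:
        n1, n2, n3, n4 = (LENGTHS[b >> 3] for b in buf[pos:pos + 4])
        shift = SHIFTS[n1] if n1 < len(SHIFTS) else 0
        chk = (
            int(n1 == 0)                                  # continuation byte in lead position
            | (((n2 << 6) | (n3 << 3) | n4) >> shift)     # expected continuation bytes must be 0
            | ((n1 | n2 | n3 | n4) >> 3)                  # any always-invalid byte
        )
        if chk:
            return 0
        ret |= n1
        pos += n1
    return ret


print(check_utf8(b"plain ASCII"))
print(check_utf8("a\u03c0".encode("utf-8")))
```

The return values follow the errorlevel convention of the batch script: 0 means no UTF-8, exactly 1 means ASCII only, anything greater means at least one multibyte sequence was seen.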

Steffen

siberia-man · #13 Post by **siberia-man** » 18 Oct 2021 01:55

My earliest thought was to use FINDSTR. But it has the very important "Character limits for command line search strings" described here: https://ss64.com/nt/findstr-escapes.html. I tried to play with the recommendation given there without any success. I tried both FINDSTR /r "RE" FILENAME and TYPE FILENAME | FINDSTR /r "RE". In the second case I presumed that the input arrives in binary mode.

Given all these limitations, this is unlikely to be a good approach:

Code:

findstr /r "[\x80-\xf8]" FILENAME

One more thing I could suggest is to use negation, but I am not sure that this solution is 100% reliable:

Code:

findstr /r "[^\x01-\x7f\xf8-\xff]" FILENAME

#14 Post by **aGerman** » 18 Oct 2021 02:52

Hmm, I think you actually shouldn't use TYPE. It needs some further investigation, but it seems to me that TYPE already changes the encoding. FINDSTR might be an option to find low-order bytes, which may indicate binary data. Besides the limits you mentioned, I think it can't be used to determine UTF-8, because it's critical to know the order of lead and continuation bytes. The byte values as such are used in all kinds of encodings.
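
A small Python sketch of why a byte-class match alone is not enough. The pattern below only stands in for the FINDSTR character class; Python's `re` module is not FINDSTR, and the helper names are made up:

```python
import re

# Matches any single byte in the range 0x80-0xF7, regardless of its neighbours.
HIGH_BYTE = re.compile(rb"[\x80-\xf7]")


def has_high_byte(data):
    """True if at least one byte falls into 0x80-0xF7."""
    return HIGH_BYTE.search(data) is not None


def decodes_as_utf8(data):
    """True if lead and continuation bytes appear in a valid order."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_text = "caf\u00e9".encode("latin-1")   # bytes 63 61 66 e9
utf8_text = "caf\u00e9".encode("utf-8")       # bytes 63 61 66 c3 a9
print(has_high_byte(latin1_text), decodes_as_utf8(latin1_text))
print(has_high_byte(utf8_text), decodes_as_utf8(utf8_text))
```

Both inputs contain a byte in the class, but only the second has a lead byte followed by the continuation byte it requires.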

Steffen

siberia-man · #15 Post by **siberia-man** » 18 Oct 2021 03:47

I tested TYPE on some examples like:

Code:

type example-utf8-without-bom.txt | hexdump -C

It showed the correct sequence cf 80. I know it's not a complete proof.

Probably something like setlocal enabledelayedexpansion & findstr /r "[U+0080-U+10FFFF]" (in pseudo-code) could help us, but I am not sure. And I'm a bit too lazy to puzzle over this right now. (-:

it's critical to know the order of lead- and continuation bytes

Totally agree.
