Multi-process Advanced Encryption Standard (AES)

Magialisk
Posts: 104
Joined: 25 Jul 2013 19:00

Re: Multi-process Advanced Encryption Standard (AES)

#46 Post by Magialisk » 26 Nov 2013 10:12

I was all ready to post a reply contradicting your finding, because I tested it myself in single-thread operation, but then I realized what you were actually saying. If you're running on a PC that only *has* a single CPU core, numCPU will in fact get set to zero by the machine's own count, not the user's input, and the mod operation will fail.

I built in a number of validations against users "inputting" stupid values for the number of threads, such as zero, negatives, or more CPUs than the machine has, but it never occurred to me to check whether the machine only has one core. I guess you left the hyperthreading turned off on your 10-year-old P4, eh? :wink:

In fact, if the machine only has one core it will trip my "IF numCPU LEQ 0" validation; however, all that check does is reset numCPU to NUMBER_OF_PROCESSORS-1, so you're back to zero...

Thank you very much for pointing this out, the solution is obvious and I'll update the first post in a moment. In this block of code:

Code: Select all

IF "%~3"=="" set /a numCPU=%NUMBER_OF_PROCESSORS%-1
IF %numCPU% LEQ 0 set /a numCPU=%NUMBER_OF_PROCESSORS%-1
IF %numCPU% GTR %NUMBER_OF_PROCESSORS% set /a numCPU=%NUMBER_OF_PROCESSORS%-1
The second IF statement should be changed to:

Code: Select all

IF %numCPU% LEQ 0 set /a numCPU=1
This will penalize anyone who enters a number like "-5" as the CPU count, forcing them to run in single-process mode instead of the previous behavior, which defaulted to the maximum core count. In my opinion that's a just punishment for being so parametrically challenged! :lol:
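
A minimal sketch of the corrected validation as a stand-alone script, assuming the requested process count arrives as the first argument (the posted script reads it from %~3) and is a plain integer. One difference here is my own assumption rather than part of the posted fix: the upper clamp runs before the lower clamp. On a single-core machine NUMBER_OF_PROCESSORS-1 is zero, so with the checks in the posted order a request larger than the core count would still end at zero.

```batch
@echo off
setlocal
rem Sketch only: %~1 stands in for the script's %~3 argument.
set "numCPU=%~1"

rem No argument given: default to one fewer process than logical processors.
if not defined numCPU set /a numCPU=%NUMBER_OF_PROCESSORS%-1

rem Too many requested: fall back to the same default (zero on a single core).
if %numCPU% GTR %NUMBER_OF_PROCESSORS% set /a numCPU=%NUMBER_OF_PROCESSORS%-1

rem Zero, negative, or a single-core default of zero: run one process.
if %numCPU% LEQ 0 set numCPU=1

echo Using %numCPU% worker process(es)
```

Called with no argument on a one-core machine, or with -5 on any machine, this should settle on a single process.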

einstein1969
Expert
Posts: 960
Joined: 15 Jun 2012 13:16
Location: Italy, Rome

Re: Multi-process Advanced Encryption Standard (AES)

#47 Post by einstein1969 » 26 Nov 2013 11:10

Thanks for the fix.

But there is a problem.

The old version, with 2 PIDs and the trick I suggested, ran on my single-processor machine (an AMD MK-36)

with this timing: ...AES Encryption Complete (Elapsed Time: 00:01:04.16)

The new version from 25 Nov, with 2 processes, takes nearly twice as long to encrypt!

"Time: 00:01:55.04"

With the new version, a single process is slower still: Time: 00:02:57.82 :shock:

I investigated and found that the solution is to add a delay in this part of the code:

Code: Select all

:waitOnPipesClosed
ping 192.0.2.0 -n 1 -w 500 > nul
IF EXIST _pipe* GOTO waitOnPipesClosed


With this trick the time drops to 00:00:46.31 (about 30% less than the older version!) 8)

Also, you should use one more PID to make full use of the CPU on multiprocessor systems. That way there is a gain on multiprocessor machines too!

In addition, if you apply the tricks I used in the other thread, the time could drop by a further 50% or more. 8)

Einstein1969

Magialisk
Posts: 104
Joined: 25 Jul 2013 19:00

Re: Multi-process Advanced Encryption Standard (AES)

#48 Post by Magialisk » 26 Nov 2013 13:25

I probably should have put some more explanation here when I uploaded the new code. The ping delay you added to the "wait" function is exactly why that subroutine exists at all, instead of just being inline code in the main function. In my experiments, on my two machines, adding a ping delay there does absolutely no good at all. However, my gut feeling told me that it should do some good (see next paragraph), particularly on lower core machines, so I left it as a separate function that could be tweaked to tune to your system.

Additionally, running an extra PID on higher multiprocessor systems does more harm than good. The main controller program does its work of filling the pipes and then goes into a very tight "IF NOT DONE GOTO" loop. This loop consumes an entire CPU core, unless you do something to slow it down, for example using the ping delay we're both talking about. For this reason 1 less PID than cores gives the best results on my higher core count machines. Otherwise the 1 controller plus N PIDs are fighting for N cores, and performance overall goes down.

In my own testing on my 12-thread machine the N-1 PIDs and no ping delay was the best timing I could get, but on a different machine, especially one with only 1-2 cores, there could be significant advantages to adding a long ping delay to the controller, giving a second PID a chance to do some work in the downtime.

So long story short, I agree with your observations, and wrote the code in a way that accommodates this tweaking, however the tweaks do not scale as well up to large multi-core machines. Logically you would think slowing down the controller and allowing a 12th PID some run time would be a good idea, but on my 12-thread CPU it doesn't help :? I was foolish in not explaining these details when I posted the code, but thank you for pointing it out for others who might want to test and tweak.

I also took a second look at the optimizations you proposed in the linked post, but I believe I'm missing something? The two optimizations you proposed there are minimizing CMD env size, which I did incorporate in this version of the code (thanks again BTW that approach gave me HUGE gains) and the second optimization was "not sparse monodimensional fixed size arrays of integer". I honestly don't even know what that second thing means, but if you have an example that I could study I'd love to make the algorithm even faster.

einstein1969
Expert
Posts: 960
Joined: 15 Jun 2012 13:16
Location: Italy, Rome

Re: Multi-process Advanced Encryption Standard (AES)

#49 Post by einstein1969 » 26 Nov 2013 15:13


With the old algorithm I had tried the PING delay, and it actually gave no advantage.

With the new algorithm I tested multiple processes, and on a single processor the run always took the same time. So I had taken for granted that performance had improved (but only slightly). Apparently I was wrong. The fact is that the PING is necessary on a single processor, so you can enable it when it is needed.

As for the other tips, have a look at Antonio's code in his equivalent of the ":aes_start" function. It uses far fewer SET /A commands, so it performs much better. That trick can gain up to 50% on a single processor. You just have to carry the technique over to your implementation. I used it with % variables in my raytracing demo.

This is an attempt I made some time ago to reduce the number of SET /A commands (it still needs to be finished and checked, and there may be errors).

Code: Select all

FOR /L %%a IN (1,1,%numRounds%) DO (
   FOR /L %%b IN (0,1,3) DO (
       
      IF %%b gtr 0 (set /A newc=-%%b+4, newc1=1-%%b, newc2=2-%%b, newc3=3-%%b) else set /A newc=0, newc1=1-%%b, newc2=2-%%b, newc3=3-%%b
      IF !newc1! LSS 0 set /A newc1+=4
      IF !newc2! LSS 0 set /A newc2+=4
      IF !newc3! LSS 0 set /A newc3+=4

      for /f "tokens=1,2,3,4 delims= " %%d in ("!state[%%b][0]! !state[%%b][1]! !state[%%b][2]! !state[%%b][3]!") do set /a tempstate[%%b][!newc!]=!S[%%d]!, tempstate[%%b][!newc1!]=!S[%%e]!, tempstate[%%b][!newc2!]=!S[%%f]!, tempstate[%%b][!newc3!]=!S[%%g]!
   )
   rem set stateTarget=tempstate
   FOR /L %%c IN (0,1,3) DO (
      set stateTarget=tempstate
      IF NOT %%a==%numrounds% (
         FOR %%d IN (!tempstate[0][%%c]!) DO FOR %%e IN (!tempstate[1][%%c]!) DO FOR %%f IN (!tempstate[2][%%c]!) DO FOR %%g IN (!tempstate[3][%%c]!) DO set /a state[0][%%c]=!G2[%%d]!^^^^!G3[%%e]!^^^^!tempstate[2][%%c]!^^^^!tempstate[3][%%c]!, state[1][%%c]=!tempstate[0][%%c]!^^^^!G2[%%e]!^^^^!G3[%%f]!^^^^!tempstate[3][%%c]!, state[2][%%c]=!tempstate[0][%%c]!^^^^!tempstate[1][%%c]!^^^^!G2[%%f]!^^^^!G3[%%g]!, state[3][%%c]=!G3[%%d]!^^^^!tempstate[1][%%c]!^^^^!tempstate[2][%%c]!^^^^!G2[%%g]!
         set stateTarget=state
      )
      set /A keyword=%%a*4+%%c
      FOR %%d IN (!keyword!) DO FOR %%e IN (!stateTarget!) DO set /a state[0][%%c]=!key[0][%%d]!^^^^!%%e[0][%%c]!, state[1][%%c]=!key[1][%%d]!^^^^!%%e[1][%%c]!, state[2][%%c]=!key[2][%%d]!^^^^!%%e[2][%%c]!, state[3][%%c]=!key[3][%%d]!^^^^!%%e[3][%%c]!
   )
)


Once you have worked on this optimization we can try another. But it's better to do this one first, because it will be useful later anyway.
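
The core of the technique in the code above is that a single SET /A can perform several comma-separated assignments, so fewer commands have to be parsed and executed per iteration. A small timing sketch, assuming nothing about the AES script itself (the variable names and the iteration count are arbitrary), that lets you compare the two styles on your own machine:

```batch
@echo off
setlocal
rem Style 1: four separate SET /A commands per iteration.
set "t0=%time%"
for /L %%i in (1,1,20000) do (
    set /a a=%%i+1
    set /a b=%%i+2
    set /a c=%%i+3
    set /a d=%%i+4
)
set "t1=%time%"

rem Style 2: one SET /A command doing the same four assignments.
for /L %%i in (1,1,20000) do set /a "a=%%i+1, b=%%i+2, c=%%i+3, d=%%i+4"
set "t2=%time%"

echo Four separate SET /A: %t0% to %t1%
echo One combined SET /A:  %t1% to %t2%
```

How large the difference is will depend on the machine and on the size of the environment, so treat the numbers as a rough guide only.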

A question:

Why the four carets (^^^^)? Do they become one?

Einstein1969
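
On the four carets, a sketch based on my understanding of how cmd.exe parses a line, not something confirmed in this thread: an unquoted caret is an escape character, so ^^ is needed to pass one literal ^ (the XOR operator) to SET /A. When delayed expansion is enabled and the line contains a "!", carets are processed a second time during the delayed-expansion pass, so ^^^^ is reduced to ^^ and then to ^. Quoting the expression should protect the carets from the first pass only.

```batch
@echo off
setlocal EnableDelayedExpansion
set /a "a=12, b=10"

rem No exclamation mark on the line: one escape pass, so two carets reach SET /A as one.
set /a x1=a^^b

rem Exclamation marks on the line: a second caret pass runs, so four carets are needed.
set /a x2=!a!^^^^!b!

rem Quoted expression: the first pass leaves carets alone, the second still halves them.
set /a "x3=!a!^^!b!"

echo x1=%x1% x2=%x2% x3=%x3%
```

If that reading of the parser is right, all three results should be the same XOR value.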

Magialisk
Posts: 104
Joined: 25 Jul 2013 19:00

Re: Multi-process Advanced Encryption Standard (AES)

#50 Post by Magialisk » 26 Nov 2013 17:00

einstein, thanks for the clarification. I'm still trying to digest your changes; since the logical flow of the loops is different, I need to make sure I understand the new flow. It took me a while to redesign my algorithms to combine the Sub and Shift, and combine Mix and XOR. Now I have to understand how your rearrangement of those functions is working.

As you can see I didn't implement everything that Antonio did in his version of the code either because that's his code and this is my code, and I want to understand what I'm doing and write it myself. You guys have both provided great suggestions and eventually once I chew on them a bit I've been able to incorporate the stylistic improvements. If I just copied your code or his directly though, it would run faster but it wouldn't do me much good in an educational sense :D

P.S. - I was going to propose a compromise regarding the ping delay. I think the delay subroutine should use the CPU count and, if it's between 1 and 3, add some delay there. So something like:
1-3 CPU: controller + 'n' PIDs, 500ms sleep
4+ CPU: controller + 'n-1' PIDs, no sleep

Have you tested whether 500ms is the best value? I would suspect longer delays (to a point) could give better results. Even an extra 2000ms wait at the end of an operation for the controller to wake back up would be nothing if the longer sleep throughout let the operation finish 10 seconds faster overall... Let me know what you think or if you've done any testing. In fact, maybe that's what I need to do on my higher core counts, is sleep for longer... I think I'll poke at this and see what happens.
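
The compromise above could be sketched roughly like this. The names numPID and sleepMs are hypothetical, not taken from the posted script, and the workers are assumed to delete their _pipe* files when they finish. The ping trick also assumes 192.0.2.0 never answers; on a machine with no usable network route, ping may return immediately and give no delay at all.

```batch
@echo off
setlocal
rem Hypothetical names: numPID and sleepMs are illustrative only.
if %NUMBER_OF_PROCESSORS% LEQ 3 (
    rem 1-3 CPUs: controller + n PIDs, controller sleeps between polls.
    set /a numPID=%NUMBER_OF_PROCESSORS%
    set "sleepMs=500"
) else (
    rem 4+ CPUs: controller + n-1 PIDs, no sleep.
    set /a numPID=%NUMBER_OF_PROCESSORS%-1
    set "sleepMs=0"
)
echo Workers: %numPID%, controller sleep: %sleepMs% ms

rem The worker processes would be started here.

call :waitOnPipesClosed
echo All pipes closed.
exit /b 0

:waitOnPipesClosed
rem Wait up to sleepMs for a reply that should never arrive, then poll again.
if %sleepMs% GTR 0 ping 192.0.2.0 -n 1 -w %sleepMs% >nul
if exist _pipe* goto waitOnPipesClosed
exit /b 0
```

Keeping the threshold and the sleep value in variables makes it easy to try other settings, such as a longer sleep on high core counts.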

einstein1969
Expert
Posts: 960
Joined: 15 Jun 2012 13:16
Location: Italy, Rome

Re: Multi-process Advanced Encryption Standard (AES)

#51 Post by einstein1969 » 27 Nov 2013 06:25


Magialisk wrote:einstein, thanks for the clarification. I'm still trying to digest your changes; since the logical flow of the loops is different, I need to make sure I understand the new flow. It took me a while to redesign my algorithms to combine the Sub and Shift, and combine Mix and XOR. Now I have to understand how your rearrangement of those functions is working.

As you can see I didn't implement everything that Antonio did in his version of the code either because that's his code and this is my code, and I want to understand what I'm doing and write it myself. You guys have both provided great suggestions and eventually once I chew on them a bit I've been able to incorporate the stylistic improvements. If I just copied your code or his directly though, it would run faster but it wouldn't do me much good in an educational sense :D


Take all the time you need to make the changes.

Code written by others is always difficult to read. It requires a huge effort, but the effort opens the mind. Sometimes you do not need to know what a piece of code does; you only need to know the technique used. Once you understand the technique, you can apply it to anything. It's also obvious that we then have to use our own heads to write something, because the goal is mastery of what we're doing.

So I agree with you: don't copy, but imitate in order to understand, make it your own, and then put it into practice.


Magialisk wrote:P.S. - I was going to propose a compromise regarding the ping delay. I think the delay subroutine should use the CPU count and, if it's between 1 and 3, add some delay there. So something like:
1-3 CPU: controller + 'n' PIDs, 500ms sleep
4+ CPU: controller + 'n-1' PIDs, no sleep

Have you tested whether 500ms is the best value? I would suspect longer delays (to a point) could give better results. Even an extra 2000ms wait at the end of an operation for the controller to wake back up would be nothing if the longer sleep throughout let the operation finish 10 seconds faster overall... Let me know what you think or if you've done any testing. In fact, maybe that's what I need to do on my higher core counts, is sleep for longer... I think I'll poke at this and see what happens.


When polling for all the processes to finish, we can choose a low polling frequency to consume less CPU.

- High frequency consumes CPU but reports promptly when all processes finish. The time between the end of all processes and the moment we notice it is short. How short should it be?

- Low frequency consumes little CPU, but we notice the end of all processes only after a long time. This adds to the execution time. How much CPU does it consume?

As you can see, there are two parameters to determine. You have to reach a compromise.

Low CPU + low responsiveness is the goal in this case.

But ping has a problem with timeouts under 500 ms (I've shown this here)

The command used is: " ping 192.0.2.0 -n 1 -w %delay% > nul "

At 500 ms the CPU usage is very low (3.2% measured on my machine). On a multicore system this value must be divided by the number of cores, i.e. about <1.6% on a dual processor, <0.8% (<1%) on a quad core, etc. (the multiprocessor/multicore values are calculated, not measured).

At 1000 ms it uses 1.7% (measured).

At 2000 ms it uses about 1.0% (measured); <0.25% on a quad core (calculated).

I used typeperf with 20-second samples on an idle system to measure.

Edit: In reality, on a multicore system the formula is not a simple division by the number of cores. The consumption is lower.

einstein1969
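
For anyone who wants to repeat the measurement, a sketch of one way to sample CPU usage with typeperf while the polling loop runs in another window. The exact command used above is not given, so the counter and the sampling settings here are assumptions (one sample per second for 20 seconds). Counter names are localized, so this path assumes an English-language Windows.

```batch
@echo off
rem Sketch: sample total CPU usage once per second, 20 samples in all.
rem The doubled percent sign is needed inside a batch file to pass a literal one.
typeperf "\Processor(_Total)\%% Processor Time" -si 1 -sc 20
```

Running it once on an idle system and once with the ping loop active gives a baseline to subtract.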

Magialisk
Posts: 104
Joined: 25 Jul 2013 19:00

Re: Multi-process Advanced Encryption Standard (AES)

#52 Post by Magialisk » 27 Nov 2013 11:02

einstein1969 wrote:At 500 ms the CPU usage is very low (3.2% measured on my machine).
At 1000 ms it uses 1.7% (measured).
At 2000 ms it uses about 1.0% (measured).

This is exactly the type of data I needed, thank you very much for providing it. Because CPU usage is so low even at 500ms delay, there seems to be little reason to delay any longer. Even a 2s sleep only improves child performance by (max) 2.3%, reducing elapsed time by 1.4 seconds out of every minute of operation. The 1s sleep is a little better balanced, (max) 1.5% speedup, cutting out 0.93s per minute of operation.

If I suspected anyone were going to use this to encrypt files that take 5 minutes to process, and do it on single-core processors, I'd lean towards the 1s delay. Otherwise, for short data sets that take 10s-30s to encrypt the longer sleep durations won't pay for themselves. I'll modify the code to add 500ms sleep on low (1-2) core count machines. Thanks!
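
One reading of the arithmetic that reproduces these figures, assuming the child gets whatever CPU share the controller does not use on a single core, and that the per-minute saving is approximated as the speedup fraction times 60 s:

```latex
\begin{align*}
\text{2 s vs. 500 ms:}\quad & \frac{100 - 1.0}{100 - 3.2} = \frac{99.0}{96.8} \approx 1.023
  \;\Rightarrow\; 2.3\%, \qquad 0.023 \times 60\,\mathrm{s} \approx 1.4\,\mathrm{s} \\
\text{1 s vs. 500 ms:}\quad & \frac{100 - 1.7}{100 - 3.2} = \frac{98.3}{96.8} \approx 1.0155
  \;\Rightarrow\; 1.5\%, \qquad 0.0155 \times 60\,\mathrm{s} \approx 0.93\,\mathrm{s}
\end{align*}
```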

einstein1969 wrote:- High frequency consumes CPU but reports promptly when all processes finish. The time between the end of all processes and the moment we notice it is short. How short should it be?
- Low frequency consumes little CPU, but we notice the end of all processes only after a long time. This adds to the execution time. How much CPU does it consume?

This is what I was getting at above, and in my previous post. A longer delay should use less CPU (but only on low core counts, and as you showed not *much* less even in that case) at the expense of not detecting the finish of the child threads as rapidly. If the performance increase was high enough, it would be worth an extra delay at the end before noticing the children were finished; that delay would pay for itself through faster execution. However, your data and my calculations show the gain will usually not be large enough to justify a longer sleep, unless you're encrypting large files over several minutes. Thanks again for all of your contributions, especially the tests on a single processor machine.
