VoiceAction does not use two different modes. How it behaves
depends entirely on the program you are using it with.
It assumes that you speak one word at a time. That is, if
you want to say "Welcome to the land of .." then, as
soon as you say 'Welcome', the hand animation will display the
stop image* and VoiceAction will process 'Welcome'.
On 586+ processors this takes 1 or 2 seconds (with most
time-consuming features disabled).
Once the processing is over, the hand will point a finger,
indicating that you can say the next word.
For eyes-free dictation, if 'Play_ the wave_file_for_'Start''
is filled in, then it will play that wave file.
In this case it would of course be silly to use a voice
recording that says 'start' after each word you dictate. A
recording of a short instrumental beep is advisable instead and
will sound good (how about a guitar chord?).
*Note that the 'Hand Animation' images for start and stop are also programmable.
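
The one-word-at-a-time cycle described above can be pictured with a
small sketch. This is not VoiceAction's actual code; the function
name, the event strings and the 'process' callback are all
hypothetical and only mirror the order of steps given in the text
(stop image, processing, start image, optional start wave file).

```python
from typing import Callable, List, Optional


def dictation_cycle(
    words: List[str],
    process: Callable[[str], str],
    start_wave: Optional[str] = None,
) -> List[str]:
    """Return the sequence of events for dictating `words` one at a time.

    All names here are illustrative assumptions, not VoiceAction's API.
    """
    events: List[str] = []
    for word in words:
        # As soon as a word has been spoken, the hand shows the stop image.
        events.append("show stop image")
        # The word is processed (the slow step, 1 or 2 seconds in the text).
        events.append("processed: " + process(word))
        # Processing is over: the hand points a finger for the next word.
        events.append("show start image")
        # Eyes-free dictation: play the configured start sound, if any.
        if start_wave:
            events.append("play " + start_wave)
    return events


if __name__ == "__main__":
    # 'beep.wav' is a made-up file name standing in for a short beep.
    for event in dictation_cycle(["Welcome", "to"], str.lower, "beep.wav"):
        print(event)
```

Leaving 'start_wave' empty corresponds to relying on the hand
animation alone; setting it corresponds to eyes-free dictation.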