I still don't understand how to solve the problem though.

I can calculate the delay required to make the processed signal align, but the processing is still applied relative to the audio buffer (sound card) boundaries, that is, "relative to the host playhead", rather than relative to the start of the audio.
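To make the mismatch concrete, here is a rough sketch of what I mean. All names and numbers are made up for illustration (a block size of 1024 samples and audio starting at sample 300 on the host timeline); this is not taken from any plugin API.

```python
def block_phase(playhead_sample: int, audio_start_sample: int, block_size: int) -> int:
    """Offset of a playhead position within a block grid counted from the audio start."""
    return (playhead_sample - audio_start_sample) % block_size


# Hypothetical values, assumed only for this illustration.
block_size = 1024   # samples per processing block
audio_start = 300   # where the audio begins on the host timeline, in samples

# Block boundaries when processing is aligned to the host playhead.
host_aligned = [n * block_size for n in range(3)]

# Block boundaries when processing is aligned to the audio start instead.
audio_aligned = [audio_start + n * block_size for n in range(3)]

# How far into an audio-relative block each host-aligned boundary falls.
offsets = [block_phase(p, audio_start, block_size) for p in host_aligned]

print(host_aligned)
print(audio_aligned)
print(offsets)
```

The offset is the same non-zero value for every block, so delaying the output does not help: the blocks themselves are cut at the wrong places relative to the audio.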

If only there were something I could do at the beginning of the audio file to tell the processing "this is the audio start". But I can't seem to find anything.
