I think the problem is that you want to make a mask, transform it, then combine it with something else. You can definitely do this, but it takes a bit more work.

(These all require a recent +dev build, though most of the techniques also work in 5.965.)

Here's one way to do it:

Track 2 has a video.
Track 1 has a green background generator => text overlay => track opacity/zoom/pan (modified to fill the background with green rather than black; see the comment inside it) => Chroma-key, which combines track 2 with track 1.

This is pretty fast but doesn't allow partial transparency.
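
To show why a hard key gives all-or-nothing transparency, here is a rough per-pixel sketch in Python. It is not the program's actual Chroma-key filter; the function names, the pure green key color, and the `tolerance` parameter are all assumptions made up for illustration.

```python
# Illustrative only: a hard chroma key on RGB tuples, not the editor's real filter.
KEY_GREEN = (0, 255, 0)  # assumed key color (the green background fill)

def chroma_key_pixel(overlay, video, key=KEY_GREEN, tolerance=0):
    """Pick the video pixel where the overlay matches the key color,
    otherwise keep the overlay pixel. There is no in-between result."""
    distance = max(abs(o - k) for o, k in zip(overlay, key))
    return video if distance <= tolerance else overlay

def chroma_key_frame(overlay_frame, video_frame, key=KEY_GREEN, tolerance=0):
    """Apply the hard key to two frames stored as lists of rows of RGB tuples."""
    return [
        [chroma_key_pixel(o, v, key, tolerance) for o, v in zip(o_row, v_row)]
        for o_row, v_row in zip(overlay_frame, video_frame)
    ]

# One-row example: a key-colored pixel, a white text pixel, and a soft edge pixel.
overlay = [[(0, 255, 0), (255, 255, 255), (128, 255, 128)]]
video = [[(10, 20, 30), (40, 50, 60), (70, 80, 90)]]
result = chroma_key_frame(overlay, video)
```

The first pixel comes from the video, the second stays white, and the soft edge pixel stays exactly as it was in the overlay. Each output pixel is taken entirely from one source or the other, so nothing is ever blended.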

Here's another way, which is probably about as fast but requires going to RGBA:

Track 2 has a video.
Track 1 has a black background generator => text overlay => track opacity/zoom/pan => custom RGBA red-to-alpha channel converter => image overlay.

This allows partial transparency, but because red is copied to alpha, the colors you can use in your overlay image are limited. You could change which channel is used to allow some different combinations, but that's not ideal either.
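
Here is a rough Python sketch of the red-to-alpha idea followed by an ordinary alpha blend, to make the color limitation concrete. It is not the custom converter or the image overlay filter themselves; `red_to_alpha` and `blend_over` are hypothetical names, and 8-bit channels with straight (non-premultiplied) alpha are assumed.

```python
# Illustrative only: copy red into alpha, then blend the overlay over the video.
def red_to_alpha(pixel):
    """Turn an RGB pixel into RGBA by reusing the red value as alpha."""
    r, g, b = pixel[:3]
    return (r, g, b, r)

def blend_over(fg_rgba, bg_rgb):
    """Straight alpha blend of one RGBA pixel over one RGB pixel (8-bit, rounded)."""
    r, g, b, a = fg_rgba
    return tuple(
        (c * a + bg * (255 - a) + 127) // 255
        for c, bg in zip((r, g, b), bg_rgb)
    )

video_pixel = (200, 100, 0)

white_text = blend_over(red_to_alpha((255, 255, 255)), video_pixel)  # red 255: opaque
black_fill = blend_over(red_to_alpha((0, 0, 0)), video_pixel)        # red 0: transparent
grey_edge = blend_over(red_to_alpha((128, 128, 128)), video_pixel)   # red 128: about half
pure_blue = blend_over(red_to_alpha((0, 0, 255)), video_pixel)       # red 0: transparent
```

Grey edge pixels blend smoothly, which is the partial transparency you want. The pure blue pixel, however, vanishes completely because its red value is zero, and that is the limitation on overlay colors in practice.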
