I bet that the data might be camera information. A stream file [STR] can interleave more than just audio and video. It's actually a container format with many "tracks" that are up to the programmer to fill. Usually there is a flag in the header to separate system data from user-created data. This way system data can be shuffled to the MDEC and the user frames can be processed.
(Ok, I need to pull out my PSX notes from about 8 years ago.....)
Run down of the STR format...
When the CD-ROM is streaming data, you can bypass the BIOS and get data from point A (CD) to point B (Memory) using DMA. Most of the time this is just a way to quickly shovel movie data directly into the MDEC decoder chip without having to lock the whole machine. An STR file is mastered into mode 2 data, meaning that you get more data per CD sector, at the loss of some data integrity subchannels. Let's just hope that there isn't a scratch in the mode 2 data, or you might be having a hard time.....
An STR file is basically an image of the sectors directly on the disk; I'm guessing you guys already knew that. Because each mode 2 sector is 2048 bytes long, each STR section (or "frame", I think the PSX devs called them) is also 2048 bytes long.
Each frame has a header, 32 bytes long, and the rest is data. The data can be anything (2048-32=2016 bytes left over for whatever you want), but the header must show up at the beginning of each "sector" (or frame) so that the PSX's CD-ROM data controller doesn't go nuts trying to figure out what to do with the streaming data once it's read. (MDEC data is a little more time-critical, so it's nice to let the CD-ROM controller know what DMA path to fire the video stream down.)
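Here is a rough sketch of that layout in Python, assuming the STR file really is a flat run of 2048-byte frames as described above. The function name split_frames is just something made up for illustration.

```python
SECTOR_SIZE = 2048   # frame size given above
HEADER_SIZE = 32     # header size given above


def split_frames(image: bytes):
    """Yield (header, payload) pairs for each whole 2048-byte frame."""
    # Ignore any trailing partial frame rather than guessing at its contents.
    whole = len(image) - len(image) % SECTOR_SIZE
    for offset in range(0, whole, SECTOR_SIZE):
        frame = image[offset:offset + SECTOR_SIZE]
        # First 32 bytes are the header, the remaining 2016 are data.
        yield frame[:HEADER_SIZE], frame[HEADER_SIZE:]
```

You would feed it the raw file contents, e.g. the result of open(path, "rb").read().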
The header has a breakdown as follows.
The first two bytes are the STR "magic number". It starts with "0110xxxxxxxxxxxx". The other bits don't seem to be used in any real way.
The second two bytes are the data type. The only thing that is important here is the very first bit (MSB). If it's a "1" then the data in the frame is a PSX hardware format (for example, MDEC), and if it's a "0" that means the data is user-created and what's in it is anyone's game. If this header section shows that this frame contains user-defined data, the rest of the two bytes can be filled with custom header data for the 2016 bytes of data contained in the frame.
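A minimal sketch of those two checks follows. It rests on two assumptions that are not confirmed above: that "starts with 0110" refers to the top four bits of the first byte in file order, and that the type field is stored little-endian (like everything else on the PSX's CPU). The name classify_frame is made up for illustration.

```python
import struct


def classify_frame(header: bytes) -> str:
    """Return 'not-str', 'hardware' or 'user' for a frame header."""
    # Assumption: "0110xxxxxxxxxxxx" means the high nibble of byte 0 is 0110.
    if header[0] >> 4 != 0b0110:
        return "not-str"
    # Assumption: the type is a little-endian 16-bit value at offset 2.
    (data_type,) = struct.unpack_from("<H", header, 2)
    # MSB set means a PSX hardware format (e.g. MDEC), clear means user data.
    return "hardware" if data_type & 0x8000 else "user"
```

If the byte order guess is wrong, the MSB test will be looking at the wrong byte, so check it against a real file before trusting it.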
For those following along, the rest of the 32-byte header is as follows.
-Sector number (2 bytes)
-Sector size (2 bytes)
-Frame number (4 bytes)
-Frame size (4 bytes)
-Scratchpad (16 bytes)
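Put together, the field sizes add up to exactly 32 bytes (2+2+2+2+4+4+16). The sketch below unpacks them with Python's struct module. Little-endian byte order is an assumption on my part, and the field names and the parse_header helper are made up for illustration.

```python
import struct
from collections import namedtuple

# Assumption: all multi-byte fields are little-endian.
# magic, type, sector number, sector size (2 bytes each),
# frame number, frame size (4 bytes each), scratchpad (16 bytes).
HEADER_FORMAT = "<HHHHII16s"

StrHeader = namedtuple(
    "StrHeader",
    "magic data_type sector_number sector_size "
    "frame_number frame_size scratchpad",
)


def parse_header(frame: bytes) -> StrHeader:
    """Unpack the 32-byte header at the start of a frame."""
    if len(frame) < struct.calcsize(HEADER_FORMAT):
        raise ValueError("frame is shorter than the 32-byte header")
    return StrHeader._make(struct.unpack_from(HEADER_FORMAT, frame))
```

Dumping sector_number and frame_number for every frame in a file is a quick way to see whether the layout guess holds up.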
Now, if the frame contains MDEC data, its header is a little different. First, the MSB in the type field is set to "1" (actually, I think the type is 0x8001, but I don't have an STR file handy to check) and the rest of the frame header format is the same, save for the 16 byte scratchpad. Here the first two bytes are the movie width and the second two bytes are the movie height. This is where you are going to find FF7's "funny" movie resolution.
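To pull the resolution out, something like the following should do. It assumes little-endian fields, and that the scratchpad starts at offset 16 (after the 2+2+2+2+4+4 bytes of fields listed above). The name mdec_dimensions is made up for illustration.

```python
import struct

# 2+2+2+2+4+4 bytes of fields come before the 16-byte scratchpad.
SCRATCHPAD_OFFSET = 16


def mdec_dimensions(header: bytes):
    """Return (width, height) for a hardware-format frame, else None."""
    # Assumption: the type is a little-endian 16-bit value at offset 2.
    (data_type,) = struct.unpack_from("<H", header, 2)
    if not data_type & 0x8000:
        return None  # user-defined frame, scratchpad layout is unknown
    # First two scratchpad bytes are width, the next two are height.
    width, height = struct.unpack_from("<HH", header, SCRATCHPAD_OFFSET)
    return width, height
```

Note that this only tests the MSB, so it treats any hardware-format frame as MDEC; tighten the check once the exact type value is confirmed.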
I hope this breaks down the frames for you.
Also, another thing. The user data interspersed in FF7's movie data is most likely camera or animation data. (CLUT data for the drop in color depth?) You will have to take that unknown and compare it to PC data related to the movies, as I'm pretty sure they couldn't jam the extra streaming data into the PC movie formats.
Also, be aware, the user-defined header may contain valuable information, such as telling the system to switch color depth during movie playback, or to kick off a script event.
That should be helpful...
---EDIT---
I've got conflicting notes when it comes to XA-Audio interleaving.
If the sector is formatted for XA Audio data, there might be an 8 byte header before the "real" header and 280 bytes of trailing padding at the bottom of the frame.
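If those numbers are right, such a sector would be 8+2048+280=2336 bytes long. Below is a tentative sketch of peeling the wrapper off to get back to the plain 2048-byte frame. It takes the sizes from my notes at face value, and strip_xa_wrapper is a made-up name.

```python
XA_PREFIX = 8      # extra header before the "real" header (per the notes)
FRAME_SIZE = 2048  # the ordinary frame, 32-byte header included
XA_TRAILER = 280   # trailing padding (per the notes)
XA_SECTOR_SIZE = XA_PREFIX + FRAME_SIZE + XA_TRAILER  # 2336 bytes


def strip_xa_wrapper(sector: bytes) -> bytes:
    """Return the 2048-byte frame inside a 2336-byte XA-style sector."""
    if len(sector) != XA_SECTOR_SIZE:
        raise ValueError("expected a %d-byte sector" % XA_SECTOR_SIZE)
    # Drop the 8-byte prefix and the 280-byte trailer.
    return sector[XA_PREFIX:XA_PREFIX + FRAME_SIZE]
```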
Does this help?