Author Topic: Two ideas: Binary data schema and auto-reverse-engineering tool  (Read 584 times)

picklejar

  • Crazy poster
  • *
  • Posts: 143
  • Karma: 10
Idea #1: Language-agnostic Binary data schema

So... most of our projects involve consuming, editing, and writing binary data like Final Fantasy game asset files.

This community has been awesome at bringing people together to reverse engineer the file formats, publish their findings on the wiki, and share source code.

So now, it would be super nice if we could go just one step further and create actual schema definitions, in a structured format. "That way, we could auto-generate documentation. And most importantly, I wouldn't have to develop code to parse the data, because it can just be parsed automatically, by my program, which by the way is written in... uh, my preferred programming language..."

Yeah. Exactly. That's why we don't have schema files. Everyone has different needs, prefers different programming languages, etc.

But still... just imagine the possibilities if we could agree on a language-agnostic schema. This is a common problem and there are solutions for it. For example, perhaps we could use one of these: https://en.wikipedia.org/wiki/Data_exchange#Data_exchange_languages

Even if the language-agnostic schema isn't super-friendly to your preferred programming language, it would not matter if there were utilities to generate documentation for you to read and understand it in a user-friendly way, and if there were utilities to translate the schema to data structures in your preferred programming language. Right?

Well, for new projects, yeah, that might be fantastic. But... for existing projects... well, suppose we did come up with such a plan and even developed the tools to generate docs and code in everyone's preferred language. Would we expect the authors of existing tools to re-factor their code to use the new auto-generated code? Probably not.

But, if you are such an author, then you probably have some good insight into what kind of features you would want such auto-generated code to have, in your preferred programming language. So, perhaps you would be interested in contributing ideas, or even code, to translating universal schema to "code in my preferred language".

And long-term, imagine the possibilities with future tools!

I'm sure this community has talked about this before numerous times, so my main question is, do we have a forum topic dedicated to this? If not, can we create one? And if there are any individual threads where this has been discussed at length, can you point to them?

Idea #2: Auto-reverse-engineering tool

It would be cool to have a tool to auto-reverse-engineer new file formats. The tool could use a repository of known file formats / characteristics (like "magic numbers", other patterns to look for, etc.). It could even be interactive and show different interpretations to the user and let the user guide the tool to choose the right interpretation. Finally, it could output schemas for the user. What format should the schema be in, though? See idea #1 above. (See?! That's another reason we need it!)
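To make the idea a bit more concrete, here is a minimal Python sketch of the two pieces described above: a lookup against a repository of known "magic numbers", and a helper that shows several interpretations of the same bytes so a user can pick the plausible one. The table `KNOWN_MAGIC` and the function names are hypothetical; a real tool would need a far larger signature repository and many more heuristics.

```python
# Hypothetical signature table: a real repository would be much larger.
KNOWN_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\x1f\x8b": "gzip stream",
    b"PK\x03\x04": "ZIP archive",
}


def guess_formats(data: bytes) -> list:
    """Return the names of all known formats whose magic number starts the data."""
    return [name for magic, name in KNOWN_MAGIC.items() if data.startswith(magic)]


def candidate_interpretations(data: bytes, offset: int = 0) -> dict:
    """Read the 4 bytes at `offset` under several integer layouts.

    An interactive tool could show these side by side and let the user
    choose the one that looks like a plausible count, size, or pointer.
    """
    chunk = data[offset:offset + 4]
    if len(chunk) < 4:
        raise ValueError("need at least 4 bytes at the given offset")
    return {
        "uint32 little-endian": int.from_bytes(chunk, "little"),
        "uint32 big-endian": int.from_bytes(chunk, "big"),
        "uint16 little-endian (first 2 bytes)": int.from_bytes(chunk[:2], "little"),
    }
```

For example, `guess_formats(open(path, "rb").read(16))` would flag a gzip-wrapped asset before any deeper analysis is attempted.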

codemann8

  • Cool newbie
  • *
  • Posts: 52
  • Karma: 3
Tbh, I've been wanting to create such a program since the dawn of time. The various nuances from file to file, as I'm sure you're aware, are the reason I've never gone down such a path. There is, however, already an existing hex editor, 010 Editor, that can "kinda" do what you've explained. You can define a schema and apply it to whatever file you open in the program. It lacks the "smart" features that would make the pattern recognition flexible and dynamic, but it's a start.
« Last Edit: 2019-04-30 18:12:11 by codemann8 »

picklejar

Another challenge to keep in mind: There's a difference between the "original data" and "serialized data". So there are really two different schemas we're talking about here.

As a simple example, suppose a game had an "array of all Weapon objects", and each Weapon had a name and an attack value, and suppose the total number of weapons is not set in stone (until file-write time).


The schema for the "original data" might be something simple like:

schema for "AllWeapons" object:
  • weapons: List of Weapon
sub-schema for "Weapon" object:
  • name: string
  • attackPower: uint16
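As a rough illustration of the "translate the schema to data structures in your preferred language" idea, here is a minimal Python sketch that turns the "Weapon" sub-schema above into dataclass source code. The dictionary layout (`name`, `fields`, `type`) and the `TYPE_MAP` table are assumptions made for this example, not an agreed schema format.

```python
# Hypothetical mapping from schema type names to Python type names.
TYPE_MAP = {"string": "str", "uint8": "int", "uint16": "int", "uint32": "int"}

# The "Weapon" sub-schema, written as a plain dictionary (assumed layout).
WEAPON_SCHEMA = {
    "name": "Weapon",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "attackPower", "type": "uint16"},
    ],
}


def generate_dataclass(schema: dict) -> str:
    """Emit Python source for a dataclass that mirrors the given schema."""
    lines = ["@dataclass", f"class {schema['name']}:"]
    for field in schema["fields"]:
        lines.append(f"    {field['name']}: {TYPE_MAP[field['type']]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_dataclass(WEAPON_SCHEMA))
```

The same schema dictionary could feed other generators (C structs, documentation pages), which is the point of keeping the schema language-agnostic.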


The schema for the "serialized data", however, might be:

schema for "all-weapons.bin" binary file:
  • fileSize: uint64
  • numWeapons: uint32
  • weaponSections: sequence of ${numWeapons} WeaponSections
  • @integrityCheck: ${_data}.length == ${fileSize}
sub-schema for "WeaponSection" (serialized/chunk)
  • sectionSize: uint32
  • nameLength: uint32
  • name: string of length ${nameLength}
  • attackPower: uint16
  • @integrityCheck: ${_data}.length == ${sectionSize}

So the "schemas" are different. The latter "schema" has "size" information, but that's only needed for deserialization logic.
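A minimal Python sketch of a reader for the serialized layout above, producing the "original data" objects. Several details are assumptions, because the schema does not state them: integers are little-endian, names are UTF-8, `fileSize` counts the whole file including its own 8 bytes, and `sectionSize` counts the whole section including its own 4 bytes.

```python
import struct
from dataclasses import dataclass


@dataclass
class Weapon:
    name: str
    attack_power: int


def parse_all_weapons(data: bytes) -> list:
    """Deserialize the all-weapons.bin layout into a list of Weapon objects."""
    # Header: fileSize (uint64) followed by numWeapons (uint32), little-endian (assumed).
    file_size, num_weapons = struct.unpack_from("<QI", data, 0)
    if file_size != len(data):
        raise ValueError("fileSize does not match the actual data length")
    offset = struct.calcsize("<QI")

    weapons = []
    for _ in range(num_weapons):
        section_start = offset
        section_size, name_length = struct.unpack_from("<II", data, offset)
        offset += struct.calcsize("<II")
        name = data[offset:offset + name_length].decode("utf-8")  # encoding assumed
        offset += name_length
        (attack_power,) = struct.unpack_from("<H", data, offset)
        offset += struct.calcsize("<H")
        # Integrity check: the bytes consumed must equal sectionSize (assumed inclusive).
        if offset - section_start != section_size:
            raise ValueError("sectionSize does not match the section contents")
        weapons.append(Weapon(name, attack_power))
    return weapons
```

Note that the size fields never reach the `Weapon` objects: they are consumed by the reader and used only for the integrity checks.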

Also, the serialized data might be encoded with gzip encoding or lzs encoding, whereas the original data might not.

Having the schema of just the original data is not good enough, because there are multiple ways to serialize it.

Having the schema of just the serialized data is much better, but ideally the schema should also somehow define how it is deserialized into the original data.

Also, deserializers can often work by reading data "in order", but serializers often work backwards. For example, they often write the total file size at the beginning of the file, even though it is the last thing they know.
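A minimal Python sketch of that "backwards" order for the all-weapons.bin example: each body is built first, and the size that precedes it in the file is filled in afterwards. As assumptions not stated in the schema, integers are little-endian, names are UTF-8, and each size field counts its own bytes.

```python
import struct


def serialize_all_weapons(weapons) -> bytes:
    """Serialize an iterable of (name, attack_power) tuples to the binary layout."""
    sections = []
    for name, attack_power in weapons:
        encoded = name.encode("utf-8")  # encoding assumed
        body = struct.pack("<I", len(encoded)) + encoded + struct.pack("<H", attack_power)
        # sectionSize comes first in the section but is only known once the body exists.
        sections.append(struct.pack("<I", 4 + len(body)) + body)

    payload = struct.pack("<I", len(sections)) + b"".join(sections)
    # fileSize is the first field in the file and the last value computed.
    return struct.pack("<Q", 8 + len(payload)) + payload
```

A streaming writer that cannot buffer the payload would instead write a placeholder size, write the body, then seek back and patch the size in.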

So the challenge is how a schema can define all of this stuff, instead of just one format, or the other format, or the transformation in between.

And yeah, as you pointed out, codemann8, each file will definitely have its own nuances, so the schema definitely needs to support more than just primitive data types like uint8 and such.

But hey... I'm sure we could come up with something!

picklejar

codemann8: Yeah, that hex editor, 010 Editor, looks pretty sweet. Thanks for sharing! Here's a screenshot for others interested:


sithlord48

  • No life
  • *
  • Posts: 1522
  • Karma: 33
  • Dark Lord of the Savegame
    • Blackchocobo
This is why I started ff7tk, so we can have one code base that works everywhere. Granted, yes, it needs some more parts, but a lot is there.
There is documentation for all the current parts here http://sithlord48.github.io/ff7tk/
And ff7tk has full translations for the 5 released languages as well as the retranslation.
« Last Edit: 2019-05-01 21:39:34 by sithlord48 »

picklejar

Was discussing this in Discord with antiquechrono and... check this out:

https://kaitai.io/

sithlord48

Well, one problem I see with that is you can't write, and they do not have a timeline for adding write support using the schemas: http://doc.kaitai.io/faq.html#writing
No need to add some random dependency. Just use something like XML or JSON, since they can be easily parsed and have very wide support.
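A minimal Python sketch of what a JSON-based schema could look like, using the "WeaponSection" layout from earlier in the thread, together with a tiny generic reader driven by it. The JSON keys (`fields`, `type`, `length`) are invented for this example and are not an existing standard; little-endian integers and UTF-8 strings are also assumptions.

```python
import json
import struct

# Hypothetical JSON schema: "length" names an earlier field holding the byte count.
SCHEMA_JSON = """
{
  "name": "WeaponSection",
  "fields": [
    {"name": "sectionSize", "type": "uint32"},
    {"name": "nameLength",  "type": "uint32"},
    {"name": "name",        "type": "string", "length": "nameLength"},
    {"name": "attackPower", "type": "uint16"}
  ]
}
"""

# struct format codes for the primitive integer types (little-endian assumed).
INT_FORMATS = {"uint8": "<B", "uint16": "<H", "uint32": "<I", "uint64": "<Q"}


def read_record(schema: dict, data: bytes, offset: int = 0):
    """Read one record described by `schema`; return (values, next_offset)."""
    values = {}
    for field in schema["fields"]:
        if field["type"] == "string":
            length = values[field["length"]]  # look up the earlier length field
            values[field["name"]] = data[offset:offset + length].decode("utf-8")
            offset += length
        else:
            fmt = INT_FORMATS[field["type"]]
            values[field["name"]] = struct.unpack_from(fmt, data, offset)[0]
            offset += struct.calcsize(fmt)
    return values, offset
```

Usage would be `read_record(json.loads(SCHEMA_JSON), data)`. A writer could in principle walk the same field list in reverse-dependency order, which is the part Kaitai does not cover yet.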

picklejar

Thanks for looking at that. Very good point. So Kaitai is worthless for editing, then, until they add write/serialization support.

niemasd

  • Fast newbie
  • *
  • Posts: 12
  • Karma: 3
I would definitely be interested in making this happen and helping contribute to it. I've been writing not-so-formal schemas in my tool's wiki for the file formats I've been implementing, so feel free to take what you want from there: https://github.com/niemasd/PyFF7/wiki