Author Topic: Micky's code and FF7 text conversion (small discussion time?)  (Read 7539 times)

Cyberman

  • *
  • Posts: 1572
    • View Profile
All right, there are a few burning 'issues' with the text in the system.
First, we don't really need to convert FF7 text to ASCII (only the other way around), and in any case converting FF7 text to plain ASCII is not a good idea, in my view.
I believe what is needed is unicode to FF7 text conversion instead.
Reasons?
1) It expands the number of symbols that can be used in the engine to an international set, for testing or what have you.
2) It will help debug (debunk) the proper Kanji, Katakana, Hiragana, etc. symbols used.

We want SYMMETRICAL functions for this.  That is, we should be able to take Unicode and make an FF7 string, then take an FF7 string and make Unicode output.  For the actual engine, generating Unicode symbols from FF7 text is not particularly useful; the other direction is useful for handling XML data to be inserted into the engine for debugging.  However, generating Unicode to some sort of debugging console is, I think, useful for working out the Japanese encoding.  Anything the FF7-to-Unicode function cannot translate should be hex-encoded like PostScript (i.e. 0xEA 0x02 is emitted as "<EA02>").
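That hex fallback could look like this minimal sketch; the function name and formatting are illustrative, not from any existing Q-Gears code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Emit a PostScript-style hex escape such as "<EA02>" for any FF7
// byte sequence that has no Unicode mapping: two-byte sequences
// become four hex digits, single bytes become two.
std::string hexEscape(const std::uint8_t* bytes, std::size_t count) {
    std::string out = "<";
    char buf[3];
    for (std::size_t i = 0; i < count; ++i) {
        std::snprintf(buf, sizeof buf, "%02X", bytes[i]);
        out += buf;
    }
    out += ">";
    return out;
}
```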

Since I do not have the Japanese version of FF7 (or the International version), this creates a few problems for me: I am not able to do any of the symbol matching against the code set. I'm fairly certain the US version and the Japanese version have a similar set of symbols; the US version likely just has the 2-byte prefix symbols removed for the Katakana, Hiragana, and Kanji symbols.  Micky made a table to convert FF7 codes to Unicode, but I'm not sure if it is a full symbol set (all 256 base codes).  I don't think it covers the prefix symbols for multibyte-encoded symbols (as the Japanese symbols would be).  If it covers the full 256-symbol set, then running Japanese FF7 will require the engine to actually know it's running the Japanese version (which, to be blunt, would suck, because I would rather the engine be version-agnostic).

This really only applies for debugging dialog output.

Now what to call this function.
FF7TextToDebug?
FF7TextToUnicode?
For debug output it would be good to have specialized text for the Cloud, Tifa, Cid, etc. name macros (party1, party2, party3), color indicators, and so on, so that there is some sense to what someone is looking at.  For converting to plain Unicode I suppose hex-code output will work (i.e. <B2> <B3> <D2> instead of {DialogBegin} {DialogEnd} {Grey}).

Do we need 2 functions (debug and unicode output) or just one with a flag of the type of output?
FF7TextToUnicode(void *FF7Text, int Conversion)
where conversion is
FF7None {any symbols that cannot convert to unicode are ignored}
FF7Unicode {any symbols that cannot convert to unicode are in HEX format}
FF7Debug {any symbols that cannot convert to unicode are in detailed debug format}
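The one-function-plus-flag approach might be sketched like this. All names are placeholders; the "known code" mappings are invented for illustration (only the 0xB2/0xB3 dialog codes come from the discussion above, and they are not claimed to match the real FF7 table):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

enum FF7Conversion { FF7None, FF7Unicode, FF7Debug };

// Translate one FF7 byte to a UTF-8 fragment, applying the chosen
// policy for untranslatable codes. The tiny tables here are
// stand-ins for a full font-derived mapping.
std::string convertByte(std::uint8_t code, FF7Conversion mode) {
    switch (code) {                // known printable codes (invented)
        case 0x10: return "A";
        case 0x11: return "B";
    }
    switch (mode) {                // policy for everything else
        case FF7None:
            return "";             // silently dropped
        case FF7Unicode: {
            char buf[8];           // hex escape, e.g. "<B2>"
            std::snprintf(buf, sizeof buf, "<%02X>", code);
            return buf;
        }
        case FF7Debug:
            switch (code) {        // detailed names for control codes
                case 0xB2: return "{DialogBegin}";
                case 0xB3: return "{DialogEnd}";
            }
            return "{Unknown}";
    }
    return "";
}
```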

Before this discussion gets weird: I am reading this FAQ about Unicode and Linux; obviously we want to make this code more universal than it would be using Windows-specific code. :D
For UTF-8 encoding I don't think this will be too big an issue, since most of the symbols in FF7 fit within the UCS encoding ranges (maybe even Circle, Square, Triangle, and X too ;) ).
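Since most FF7 glyphs would fall in the Basic Multilingual Plane, hand-rolling the 1-3 byte UTF-8 forms is short enough that no library is strictly needed. A sketch (surrogate pairs and 4-byte forms deliberately omitted):

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point from the Basic Multilingual Plane
// (which covers the kana, the common kanji, and Latin letters)
// into its 1-3 byte UTF-8 form.
std::string utf8Encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                    // 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                            // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                            // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```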

Anyhow, this is something that likely needs to be done. As far as I know there are only 3 official translations: Europe, US, and Japan? Unless Europe was further subdivided? I can't seem to find a list of the official translations, so if anyone knows what languages it was officially translated into, I would be glad to hear about it.

Cyb

Vehek

  • *
  • Posts: 215
    • View Profile
Well, the Great FF7 FAQ has 1.02 versions of English, French, German, and Spanish PC versions. I'd assume that the PSX version also had translations into those other languages.
« Last Edit: 2007-02-09 00:10:54 by Vehek »

NeXaR

  • *
  • Posts: 42
    • View Profile
I can confirm the Spanish version. BTW, maybe watching the ending credits video can help; it usually mentions the credits for all the localization teams in the European area, I think.

Cyberman

  • *
  • Posts: 1572
    • View Profile
Back to Unicode. I've been doing a bit of research, apart from SQL and PHP, that is. In any case, I've found a killer library that I think is a bit of overkill; anyone on the project, please let me know your humble opinion (or not, I don't care).  International Components for Unicode (ICU) seems to cover everything but the kitchen sink.  A lighter-weight library, or one that doesn't require people to compile the big ICU library, would be great. Perhaps simple UTF-16 support? It's not likely we need the myriad of other scripts immediately (unless people start translating FF7 into bantonese and Inuit languages). I suppose this is a good start in any case; wchar support under Linux shouldn't be hard.  Off to play encode-and-decode et al.

Cyb

halkun

  • Global moderator
  • *
  • Posts: 2097
  • NicoNico :)
    • View Profile
    • Q-Gears Homepage
I'm game ^_^

If someone can find the Japanese encoding for FF7, I'll write up a unicode table for it.

I could use some brushing up on my kanji anyway. Anyone know any Japanese romhackers?

Akari

  • Moderator
  • *
  • Posts: 766
    • View Profile
Quote from: halkun
> I'm game ^_^
>
> If someone can find the Japanese encoding for FF7, I'll write up a unicode table for it.
>
> I could use some brushing up on my kanji anyway. Anyone know any Japanese romhackers?

I have the Japanese version of FF7, and some time ago I worked with the Japanese version of Xenogears. I think FF7 uses very much the same system. Here is my table for Xenogears: http://omake.ru/bakari/xenogears/xeno.tbl
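Assuming Akari's .tbl file uses the common romhacking table format of `HEXCODE=text` lines (I have not inspected the file, so this is a guess about its layout), loading it could look like this sketch:

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parse a romhacking-style .tbl stream ("HEXCODE=text" per line)
// into a code -> replacement-text map; keys may be 1 or 2 bytes
// wide, which covers both plain and S-JIS-style double-byte codes.
std::map<std::uint32_t, std::string> parseTbl(std::istream& in) {
    std::map<std::uint32_t, std::string> table;
    std::string line;
    while (std::getline(in, line)) {
        auto eq = line.find('=');
        if (eq == std::string::npos || eq == 0) continue;   // no key
        std::string key = line.substr(0, eq);
        if (key.find_first_not_of("0123456789abcdefABCDEF") !=
            std::string::npos) continue;                    // not hex
        std::uint32_t code =
            static_cast<std::uint32_t>(std::stoul(key, nullptr, 16));
        table[code] = line.substr(eq + 1);                  // keep text verbatim
    }
    return table;
}
```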

Cyberman

  • *
  • Posts: 1572
    • View Profile
All righty, halkun: a Unicode Hiragana table and a Unicode Katakana table to start. The Kanji symbol maps are a bit harder to find. I've been looking for these and feel like I've been running in circles (see Kanji, see Kanji run like the wind and never get a dang code map for it).  If you can wade through the Unicode 16-bit data map and find the descriptions of the symbols, good luck (mutter).
You can search for Hiragana and Katakana, but I'm not sure what to search for for the Kanji symbols.  I'm not positive, but I think these are the code tables you need; you can look for code map sections here.

Cyb

This turned out to be somewhat fun, but there was quite a bit of digging involved (sigh).
« Last Edit: 2007-02-09 16:38:28 by Cyberman »

halkun

  • Global moderator
  • *
  • Posts: 2097
  • NicoNico :)
    • View Profile
    • Q-Gears Homepage
I have one better....

I have a program that can cross-reference Unicode, S-JIS, EUC, and Nelson (dictionary) numbers, with input by kun-reading, on-reading, radical type or number, stroke order, or handwritten sample.

However, for me to start, I need the kanji lookup (DTE?) table for FF7, kind of like what Akari posted, but for FF7's charset, unless they are the same. I don't want to start if a particular table only *might* be the lookup for FF7.

Micky

  • *
  • Posts: 300
    • View Profile
I think you've got the idea right about why I chose to do the ff7-unicode translation. One reason, of course, was that once the string is decompressed/decoded, as is required for the kernel.bin strings that my code displays, you've got a normal wchar string that you can view in a debugger. If you left it in FF7 format you'd have to do the conversion every time you wanted to view or print it without the FF7 font. Then of course, if people want to include unofficial localisations: if they're using qgears, these can be in Unicode, or even in their locally preferred 8-bit charset, and a loss-less conversion is possible to display them. And for hacks of the original you just need another font-to-unicode lookup table.
On the other hand, if new characters are required for displaying text, for example for all the European languages with their accents and umlauts, or for Greek or Cyrillic letters, you can extend the font with the required characters without losing any of the existing ones - 16 bits should be enough to encode the glyphs for all living languages.
You'll have to decide if you need a special library for Unicode, though. The most FF7 does is display fixed strings, so in most cases you should be OK with the wchar versions of the str* functions and wsprintf, which are already provided by most C libraries. That should be enough for most Latin-based scripts and EA scripts. From what I understand it gets nasty with Hebrew or Arabic writing, where you have to handle lots of ligatures, or even a change of writing direction inside a sentence. But I'm sure that can wait until more of qgears is working, and until you've got support from somebody who knows those scripts.
By the way, the Kanji lookup can be quite hard. I've got another game where they removed all unnecessary characters from the Kanji font. The easiest way there was to get somebody who can read Kanji to look at the font texture and type it into a text-file.
Oh, and one idea for "special" characters: there is a vendor-specific (Private Use) area where you could allocate those codes (for example the player-name macros or the PlayStation buttons), and in the XML input/output you could just encode them as entities (&cloud; &tifa; &psx_cross; &psx_circle;) or directly as Unicode code points (&#xfe00; ....).
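Micky's Private Use Area idea could be sketched like this; the specific code points and entity names are hypothetical, chosen only to show the scheme, not taken from any real table:

```cpp
#include <map>
#include <string>

// Illustrative Private Use Area assignments (U+E000-U+F8FF) for
// FF7's "special" codes. Both the entity names and the code points
// are invented for this example.
const std::map<std::string, char32_t> kEntities = {
    {"cloud",      0xE000},  // party member 1 name macro
    {"tifa",       0xE001},  // party member 2 name macro
    {"psx_cross",  0xE010},  // PlayStation cross button glyph
    {"psx_circle", 0xE011},  // PlayStation circle button glyph
};

// Resolve an XML entity name like "cloud" to its PUA code point,
// or 0 if the name is unknown.
char32_t resolveEntity(const std::string& name) {
    auto it = kEntities.find(name);
    return it == kEntities.end() ? 0 : it->second;
}
```

With a table like this, the XML loader can turn `&cloud;` into U+E000 on input and emit the entity again on output, so the special codes round-trip losslessly.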

My ff7->unicode table was made from the font graphics, by reading each line of the texture and writing down the Unicode character visible at that position. Because the strings in kernel.bin seem to use different escape codes from the field strings, the expansion should be in two phases: kernel.bin or field-text escape-code expansion, and then conversion of the result into Unicode. Or some mixture. I'm sure you'll work something out.
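The two-phase expansion might be sketched as follows. The 0xF9 "repeat previous byte" escape is invented purely for illustration and is not FF7's real kernel.bin compression; the point is only the phase split:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Phase 1: expand source-specific escape codes into plain FF7 codes.
// Hypothetical escape: 0xF9 N means "repeat the previous byte N times".
std::vector<std::uint8_t> expandEscapes(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0xF9 && i + 1 < in.size() && !out.empty()) {
            for (std::uint8_t n = in[++i]; n > 0; --n)
                out.push_back(out.back());
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Phase 2: map each expanded code through a font-derived 256-entry table.
std::u16string toUnicode(const std::vector<std::uint8_t>& codes,
                         const char16_t* table /* 256 entries */) {
    std::u16string out;
    for (std::uint8_t c : codes) out += table[c];
    return out;
}
```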

BTW: UTF-8 is mostly useful for fixed strings that you're not going to edit, or at most concatenate. But then again, with the amount of memory available nowadays you could work on all strings expanded into 16 bits, and only encode/decode to UTF-8 when saving to disc or loading XML files. And libxml, for example, automatically converts a document from whatever encoding it is saved in into UTF-8, so you don't have to worry about that.

Accented characters can be saved in two forms in Unicode: for example, ä could be the character itself, or "a" + a combining umlaut. To avoid extra work on your side, you could require the text for localisations to be pre-composed and normalised. I think some editors give you a choice about how to save out a Unicode document.
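The two forms are plain to see at the code-unit level; a small sketch comparing precomposed U+00E4 with decomposed "a" + U+0308 COMBINING DIAERESIS:

```cpp
#include <string>

// Precomposed: one code point, U+00E4 LATIN SMALL LETTER A WITH DIAERESIS.
const std::u16string precomposed = u"\u00E4";

// Decomposed: two code points, 'a' followed by U+0308 COMBINING DIAERESIS.
const std::u16string decomposed = u"a\u0308";

// Both render as the same glyph, but they are different code-unit
// sequences, so a naive strcmp-style comparison treats them as
// different strings. That is why requiring pre-composed, normalised
// input (or normalising with a library such as ICU) matters for any
// lookup keyed on string contents.
bool sameCodeUnits(const std::u16string& a, const std::u16string& b) {
    return a == b;
}
```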
« Last Edit: 2007-02-09 18:06:15 by Micky »