-
September 19, 2010, 09:14 PM
#1
DB files schema autodetection
ETW uses db files, which are nasty binaries without any embedded schema information. So far the only way to figure them out was manually with hex editor. Here's a quick guide to schema autodetection - it seemed hopeless at first, but it turned out to work fairly decently.
First you need to extract the file from .pack. Pack File extractor will do that just fine. Once extracted db file contains:
- Version number header. Either byte 01, or nine bytes FC FD FE FF xx 00 00 00 01.
- Uint32 number of rows.
- Actual data.
Every row contains identical fields, but we don't yet have any idea where any particular field or even row begins. Fortunately there are very few data types, and if row fields are (TypeA TypeB TypeC), it will just repeat cyclically known number of times until reaching the end.
There are very few data types in use:
- Strings - encoded as uint16 length (in characters), and then uint16 individual unicode codepoints. Characters are 99% pure ASCII, so it typically looks like 03 00 6C 00 6F 00 6C 00. Empty string is 00 00.
- Nullable strings - either 00 to indicate null, or 01 string. Non-null empty string is 01 00 00 - of whatever use that might be. Example 01 03 00 6C 00 6F 00 6C 00.
- Booleans - one byte 00 or 01. No other values possible.
- Int32 - four bytes, low-endian, signed
- Float (single precission) - four bytes
That's pretty much it. There's a tiny number of places that use 2-byte integers, or possibly something weirder, but nearly all tables can be decoded with just these.
It is really easy to check what could possibly be at given position:
- Strings have variable length so if we make a mistake total length of data file won't match. Also checking for ASCII is nearly fool-proof once you allow a few exceptions per string. False positive rate will be very low.
- Distinguishing (Boolean, String) combination from optional string. If boolean is always 01 (true), they will look identicaly, but then it's not a terribly interesting field. If optionality is ever used, total length will not match if we make a mistake.
- Distinguishing int32 from floats - now it's far easier than you think. Ints almost always have top bits all-zero or all-one. Float format is (sign bit, 8 bits for exponent, the rest). If it's 00000000, it fits both types equally well. If it's 80000000 it's almost certainly a float (negative zero). Otherwise look at exponent bits (x>>23)&0xFF - value 128 there means 1.0, so it will almost alawys be something near that. Let's say 96 to 160. For int it will almost always be 0 or 255. Just check every value of a field and you'll pretty much guaranteed to get >90% indication. The only mistakes in practice are when all numbers are 0.
- Distinguishing int32/floats from multiple booleans - there are a few long runs of booleans in db. They are pretty easy, as 01 01 are extremely rare in top two bytes of either int32 or floats, and any value from 02 to FF makes it absolutely certain it's not a boolean. The only problem would be again a column with just zeros, which could be int32, float, or four booleans (or two empty strings, or four null strings) just as well.
This is all you need. Just iterate over every possible schema type, and check if it works. Now combination count is exponential, but a bit of dynamic programming and we're detecting types of 90% db tables in seconds. Some extra tricks will be needed to make autodetection work for massive tables like unit_stats, or tables with weird data types like int16, but I got every table I wanted without running into any of them. As verification is very fast even for such massive tables, it's always possible to specify it manually and have program make sure it works.
I'm writing all this because I have as much clue about Windows programming as I have about Klingon ballet, so I cannot contribute any actual code to DBEditor.
If someone wants the code, email me. It's very messy, and you're probably better off with this explanation than with my code.
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules