BCSV (File format)
The content described on this page is 100% documented.
BCSV stands for Binary Comma Separated Values and is the most common data format used in both Super Mario Galaxy games. Some older GameCube titles, such as Luigi's Mansion and Donkey Kong Jungle Beat, use this data format as well. As the name suggests, BCSV is a binary variant of comma-separated values (CSV), so the data is laid out in a table-like structure. The column names are hashed for faster access. The data is flatbuffer-like and is loaded directly into memory, so it does not have to be deserialized first. The game supports reading data as signed and unsigned integers (8, 16 and 32 bits), single-precision floats and strings. Each string is a null-terminated Shift-JIS (code page 932) encoded string.

There is no consistent file extension for BCSV data: the games contain various BCSV, BANMT, BCAM, PA and TBL files. BCSV files that use TBL as their file extension are expected to be sorted in ascending order by some specific field. All BCSV files are padded to the next 32-byte boundary with '@' (0x40).
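
The padding rule can be illustrated with a minimal Python sketch. The function name `pad_to_32` is hypothetical and not taken from any existing tool; it only assumes that a file whose length is already a multiple of 32 needs no extra padding.

```python
def pad_to_32(data: bytes) -> bytes:
    """Pad data with '@' (0x40) up to the next multiple of 32 bytes."""
    remainder = len(data) % 32
    if remainder == 0:
        # Already aligned; assumed to need no additional padding.
        return data
    return data + b"@" * (32 - remainder)
```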
Header
Each BCSV file starts with a header:
Offset | Type | Description |
---|---|---|
0x00 | u32 | Entry count |
0x04 | u32 | Field count |
0x08 | u32 | Offset to the entry data section |
0x0C | u32 | The size of each entry in bytes |
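
As a rough illustration, the header could be read in Python like this. The names `BcsvHeader` and `read_header` are made up for this sketch, and big-endian byte order is an assumption (it matches the GameCube and Wii hardware, but this page does not state it).

```python
import struct
from typing import NamedTuple


class BcsvHeader(NamedTuple):
    entry_count: int   # 0x00
    field_count: int   # 0x04
    data_offset: int   # 0x08, offset to the entry data section
    entry_size: int    # 0x0C, size of each entry in bytes


def read_header(buf: bytes) -> BcsvHeader:
    # ">4I" = four big-endian u32 values (byte order is an assumption).
    return BcsvHeader(*struct.unpack_from(">4I", buf, 0))
```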
Fields Section
Right after the header comes the list of fields. The structure of a single field is as follows:
Offset | Type | Description |
---|---|---|
0x00 | u32 | Name hash |
0x04 | u32 | Bitmask (often the data type's max value) |
0x08 | u16 | Offset to the data under this field in an individual entry |
0x0A | u8 | Data shift amount |
0x0B | u8 | The type of data that this field uses |
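
A possible way to read the field list, sketched in Python. `BcsvField` and `read_fields` are hypothetical names, and big-endian byte order is assumed. The sketch relies on the header being 0x10 bytes long and each field descriptor being 0x0C bytes long, as the tables above imply.

```python
import struct
from typing import List, NamedTuple

HEADER_SIZE = 0x10  # four u32 values
FIELD_SIZE = 0x0C   # u32 + u32 + u16 + u8 + u8


class BcsvField(NamedTuple):
    name_hash: int  # 0x00
    mask: int       # 0x04
    offset: int     # 0x08, offset of this field's data inside an entry
    shift: int      # 0x0A
    type_id: int    # 0x0B


def read_fields(buf: bytes, field_count: int) -> List[BcsvField]:
    fields = []
    for i in range(field_count):
        pos = HEADER_SIZE + i * FIELD_SIZE
        # ">IIHBB" = big-endian u32, u32, u16, u8, u8 (byte order assumed).
        fields.append(BcsvField(*struct.unpack_from(">IIHBB", buf, pos)))
    return fields
```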
Data types
Fields may cover one of the following data types:
Name | ID | Size (in bytes) | Description |
---|---|---|---|
LONG | 0x00 | 4 | 32-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount. |
STRING | 0x01 | 32 | Embedded string. Deprecated. Use STRING_OFFSET instead. |
FLOAT | 0x02 | 4 | Single-precision floating-point value. |
LONG_2 | 0x03 | 4 | 32-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount. |
SHORT | 0x04 | 2 | 16-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount. |
CHAR | 0x05 | 1 | 8-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount. |
STRING_OFFSET | 0x06 | 4 | 32-bit offset into string table. |
There is speculation that a NULL data type (ID 0x07) may exist, but this cannot be confirmed at the time of writing.
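
The mask-and-shift step for the integer types might look like the following Python sketch. The names `INT_FORMATS` and `decode_int` are hypothetical; the sketch assumes big-endian storage and reads the raw value as unsigned, since signedness is not specified.

```python
import struct

# Integer type ID -> struct format (big-endian and unsigned are assumptions).
INT_FORMATS = {0x00: ">I", 0x03: ">I", 0x04: ">H", 0x05: ">B"}


def decode_int(entry: bytes, offset: int, type_id: int, mask: int, shift: int) -> int:
    """Read a raw integer, AND it with the mask, then shift it right."""
    (raw,) = struct.unpack_from(INT_FORMATS[type_id], entry, offset)
    return (raw & mask) >> shift
```

For example, with a stored 32-bit value of 0x00001234, a field using mask 0x0000FF00 and shift 8 would yield 0x12, while a field using mask 0x000000FF and shift 0 would yield 0x34.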
Field Order & Entry Size
For efficiency and because of hardware limitations, the field offsets and total entry size are calculated from a special ordering of the fields. This only affects the order of the data within an entry, not the order of the fields in the Fields Section. When saving, a tool should ensure that the field offsets and total entry size are calculated according to this order: STRING < FLOAT < LONG < LONG_2 < SHORT < CHAR < STRING_OFFSET. A sample implementation in pyjmap, available on GitHub, shows how to calculate these properly. Another sample is libbcsv, which shows the order of the data types.
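
The ordering above can be sketched as follows. This is not the pyjmap or libbcsv code; `layout_fields` is a hypothetical helper that assumes values are packed back to back in the stated order with no padding between them, and that the total entry size is rounded up to a multiple of four bytes.

```python
# Type ID -> size in bytes, taken from the data types table.
TYPE_SIZES = {0x00: 4, 0x01: 32, 0x02: 4, 0x03: 4, 0x04: 2, 0x05: 1, 0x06: 4}

# STRING < FLOAT < LONG < LONG_2 < SHORT < CHAR < STRING_OFFSET
TYPE_ORDER = [0x01, 0x02, 0x00, 0x03, 0x04, 0x05, 0x06]


def layout_fields(type_ids):
    """Return (offsets, entry_size); offsets[i] belongs to type_ids[i]."""
    offsets = [0] * len(type_ids)
    # sorted() is stable, so fields of the same type keep their relative order.
    order = sorted(range(len(type_ids)), key=lambda i: TYPE_ORDER.index(type_ids[i]))
    pos = 0
    for i in order:
        offsets[i] = pos
        pos += TYPE_SIZES[type_ids[i]]
    # Round the entry size up to a multiple of four bytes (assumed rule).
    entry_size = (pos + 3) & ~3
    return offsets, entry_size
```

Note that the field list itself keeps its original order; only the offsets stored in each field descriptor change.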
Data Section
Contains the individual data entries. The structure of their data is specified by the BCSV's fields. Each entry is aligned to four bytes. When saving, a tool should write each entry's values in the order described under Field Order & Entry Size. The total number of individual values (one per field per entry) should be header.entrycount * header.fieldcount.
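
Locating a single value inside the data section could be done as in this sketch. `value_position` and `read_float` are hypothetical helpers, and big-endian byte order is assumed. The position is the data section offset from the header, plus the entry index times the entry size, plus the field's offset within an entry.

```python
import struct


def value_position(data_offset: int, entry_size: int, entry_index: int, field_offset: int) -> int:
    """Absolute file position of one field's value in one entry."""
    return data_offset + entry_index * entry_size + field_offset


def read_float(buf: bytes, data_offset: int, entry_size: int, entry_index: int, field_offset: int) -> float:
    pos = value_position(data_offset, entry_size, entry_index, field_offset)
    # ">f" = big-endian single-precision float (byte order is an assumption).
    return struct.unpack_from(">f", buf, pos)[0]
```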
String Pool
Right after the data comes the string pool which contains all strings used within the BCSV. Here are some rules regarding the string pool:
- ALWAYS starts directly after the Data Section.
- String entries are UNIQUE.
- String entries are NUL terminated.
- It should always be possible to find the string pool's start by using header.entrydataoff + header.entrycount * header.entrysize.
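
These rules suggest a lookup along the following lines. `string_pool_start` and `read_pool_string` are hypothetical names, and the sketch assumes that STRING_OFFSET values are relative to the start of the string pool. Python's `cp932` codec is used for the code page 932 decoding.

```python
def string_pool_start(data_offset: int, entry_count: int, entry_size: int) -> int:
    """The pool begins directly after the last data entry."""
    return data_offset + entry_count * entry_size


def read_pool_string(buf: bytes, pool_start: int, offset: int) -> str:
    """Read one NUL-terminated code page 932 string from the pool."""
    begin = pool_start + offset
    end = buf.index(b"\x00", begin)  # raises ValueError if no terminator is found
    return buf[begin:end].decode("cp932")
```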