BCSV (File format)

From Luma's Workshop
The content described on this page is 100% documented.

BCSV stands for Binary Comma Separated Values and is the most common data format used in both Super Mario Galaxy games. Some older GameCube titles, such as Luigi's Mansion and Donkey Kong Jungle Beat, use this data format as well. As the name suggests, BCSV is a binary variant of comma-separated values (CSV): the data is laid out in a table-like structure of entries (rows) and fields (columns). The column names are stored as hashes for faster access.

The data is FlatBuffers-like and is loaded directly into memory, so it does not have to be deserialized first. The game supports reading data as signed and unsigned integers (8, 16 and 32 bit), single-precision floats and strings. Each string is a null-terminated SHIFT-JIS (Codepage 932) encoded string. All BCSV files are padded up to the next 32-byte boundary with '@' (0x40).

There is no consistent file extension for BCSV data. Instead, the games store BCSV data in files with the extensions BCSV, BANMT, BCAM, PA and TBL. BCSV files that use TBL as their file extension are expected to be sorted in ascending order by some specific field.
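The padding rule can be sketched as follows. This is a minimal illustration rather than code from any existing tool: the function name pad_to_32 is hypothetical, and it assumes that a file whose length is already a multiple of 32 receives no extra padding.

```python
def pad_to_32(data: bytes) -> bytes:
    """Pad data with '@' (0x40) up to the next multiple of 32 bytes."""
    remainder = len(data) % 32
    if remainder == 0:
        # Assumption: an already aligned file is left unchanged.
        return data
    return data + b"@" * (32 - remainder)
```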

Header

Each BCSV file starts with a header:

Offset  Type  Description
0x00    u32   Entry count
0x04    u32   Field count
0x08    u32   Offset to the entry data section
0x0C    u32   The size of each entry in bytes
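As an illustration, the 16-byte header could be read with Python's struct module as sketched below. The names Header and read_header are hypothetical. The byte order is passed as a parameter and defaults to big-endian, which is an assumption based on the GameCube and Wii being big-endian platforms.

```python
import struct
from typing import NamedTuple


class Header(NamedTuple):
    entry_count: int
    field_count: int
    data_offset: int  # offset to the entry data section
    entry_size: int   # size of each entry in bytes


def read_header(buf: bytes, endian: str = ">") -> Header:
    # Four consecutive u32 values starting at offset 0x00 (16 bytes in total).
    return Header(*struct.unpack_from(endian + "4I", buf, 0))
```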

Fields Section

Right after the header comes the list of fields. The structure of a single field is as follows:

Offset  Type  Description
0x00    u32   Name hash
0x04    u32   Bitmask (often the data type's max value)
0x08    u16   Offset of this field's data within an individual entry
0x0A    u8    Data shift amount
0x0B    u8    Data type of this field
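A field list with this 12-byte layout could be parsed as in the following sketch. Field and read_fields are hypothetical names, the field list is assumed to begin at offset 0x10 (directly after the 16-byte header), and big-endian byte order is again an assumed default.

```python
import struct
from typing import List, NamedTuple


class Field(NamedTuple):
    name_hash: int
    mask: int
    offset: int   # offset of the field's data within an entry
    shift: int
    type_id: int


def read_fields(buf: bytes, field_count: int, endian: str = ">") -> List[Field]:
    fields = []
    for i in range(field_count):
        # Each descriptor is u32, u32, u16, u8, u8 = 12 bytes.
        start = 16 + i * 12
        fields.append(Field(*struct.unpack_from(endian + "IIHBB", buf, start)))
    return fields
```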

Data types

Fields may cover one of the following data types:

Name           ID    Size (in bytes)  Description
LONG           0x00  4                32-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount.
STRING         0x01  32               Embedded string. Deprecated. Use STRING_OFFSET instead.
FLOAT          0x02  4                Single-precision floating-point value.
LONG_2         0x03  4                32-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount.
SHORT          0x04  2                16-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount.
CHAR           0x05  1                8-bit integer. Signedness is not specified. ANDed with the bitmask and shifted right by the field's shift amount.
STRING_OFFSET  0x06  4                32-bit offset into string table.

There is speculation that a NULL data type (ID 7) may exist, but this cannot be confirmed at the time of writing.
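The mask-and-shift rule for the integer types can be sketched as follows. read_int and read_float are hypothetical helpers; because signedness is not specified, this sketch reads raw values as unsigned, and it assumes big-endian byte order by default.

```python
import struct

# Type ID -> struct format for the integer types (LONG, LONG_2, SHORT, CHAR).
INT_FORMATS = {0x00: "I", 0x03: "I", 0x04: "H", 0x05: "B"}


def read_int(entry: bytes, offset: int, type_id: int,
             mask: int, shift: int, endian: str = ">") -> int:
    # Assumption: the raw value is treated as unsigned.
    (raw,) = struct.unpack_from(endian + INT_FORMATS[type_id], entry, offset)
    # AND with the field's bitmask, then shift right by its shift amount.
    return (raw & mask) >> shift


def read_float(entry: bytes, offset: int, endian: str = ">") -> float:
    # FLOAT fields are plain single-precision values; no mask or shift.
    return struct.unpack_from(endian + "f", entry, offset)[0]
```

Because of the mask and shift, several small values can share one integer slot in an entry: two fields may point at the same offset with different masks.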

Field Order & Entry Size

For efficiency and because of hardware limitations, the field offsets and the total entry size are calculated according to a special ordering of the fields by data type. This ordering only affects where each field's data is placed within an entry; it does not change the order of the fields in the Fields Section. When saving, a tool should calculate the field offsets and the total entry size using this order: STRING < FLOAT < LONG < LONG_2 < SHORT < CHAR < STRING_OFFSET. A sample implementation in pyjmap, available on GitHub, shows how to calculate these values properly. Another sample, in libbcsv, shows the order of the data types.
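One way this calculation might look is sketched below. It is not taken from pyjmap or libbcsv. The sketch assumes that fields are packed back to back in the stated order without padding between them, that fields of the same type keep their declaration order, and that the entry size is rounded up to a multiple of four bytes. compute_layout and the two tables are hypothetical names.

```python
# Type ID -> size in bytes, from the data types table.
TYPE_SIZES = {0x00: 4, 0x01: 32, 0x02: 4, 0x03: 4, 0x04: 2, 0x05: 1, 0x06: 4}

# STRING < FLOAT < LONG < LONG_2 < SHORT < CHAR < STRING_OFFSET
LAYOUT_ORDER = [0x01, 0x02, 0x00, 0x03, 0x04, 0x05, 0x06]


def compute_layout(type_ids):
    """Return (offsets, entry_size); offsets[i] belongs to type_ids[i]."""
    # sorted() is stable, so same-type fields keep their declaration order.
    by_rank = sorted(range(len(type_ids)),
                     key=lambda i: LAYOUT_ORDER.index(type_ids[i]))
    offsets = [0] * len(type_ids)
    position = 0
    for i in by_rank:
        offsets[i] = position
        position += TYPE_SIZES[type_ids[i]]
    # Assumption: the entry size is rounded up to a multiple of four bytes.
    entry_size = (position + 3) & ~3
    return offsets, entry_size
```

For fields declared as CHAR, LONG, FLOAT, SHORT, the data is placed as FLOAT, LONG, SHORT, CHAR, while the returned offsets stay in declaration order.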

Data Section

This section contains the individual data entries. The structure of each entry is specified by the BCSV's fields. Each entry is aligned to four bytes. When saving, a tool should write the field data within each entry according to the Field Order. The section holds header.entrycount entries, which amounts to header.entrycount * header.fieldcount individual values.
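Since every entry has the same size, entry i can be located by simple arithmetic. The sketch below uses a hypothetical helper, iter_entries, whose parameters correspond to the header values; it yields the raw bytes of each entry.

```python
def iter_entries(buf: bytes, entry_count: int, data_offset: int, entry_size: int):
    """Yield the raw bytes of each entry in the data section."""
    for i in range(entry_count):
        # Entry i starts entry_size * i bytes into the data section.
        start = data_offset + i * entry_size
        yield buf[start:start + entry_size]
```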

String Pool

Right after the data comes the string pool which contains all strings used within the BCSV. Here are some rules regarding the string pool:

  • ALWAYS starts directly after the Data Section.
  • String entries are UNIQUE.
  • String entries are NUL-terminated.
  • It should always be possible to find the string pool's start by using header.entrydataoff + header.entrycount * header.entrysize.
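These rules can be sketched in code as follows. All three function names are hypothetical. The sketch assumes that a STRING_OFFSET value is relative to the start of the string pool, and it uses Python's cp932 codec for the Codepage 932 encoding.

```python
def string_pool_start(data_offset: int, entry_count: int, entry_size: int) -> int:
    # The pool starts directly after the last entry of the data section.
    return data_offset + entry_count * entry_size


def read_pool_string(buf: bytes, pool_start: int, string_offset: int) -> str:
    # Assumption: string_offset is relative to the start of the pool.
    begin = pool_start + string_offset
    end = buf.index(b"\x00", begin)  # strings are NUL-terminated
    return buf[begin:end].decode("cp932")


def build_string_pool(strings):
    """Return (pool_bytes, offsets), storing each distinct string only once."""
    pool = bytearray()
    offsets = {}
    for s in strings:
        if s not in offsets:
            offsets[s] = len(pool)
            pool += s.encode("cp932") + b"\x00"
    return bytes(pool), offsets
```

When writing a file, the offsets returned by build_string_pool would be the values stored in STRING_OFFSET fields, so repeated strings share a single pool entry.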