SimpleDataFormat

This is a binary stream format for storing a data structure organized as a Smalltalk-like dictionary. This dictionary can contain dictionaries, arrays, 7-bit ASCII strings, integers (1, 2, 4, or 8 bytes), floats (ASCII or IEEE), date-times, blobs (binary data), comments (still undecided), or another SimpleDataFormat stream. The format will be self-describing, given knowledge of just these types. It should also be powerful enough to describe most of the structures needed by the operating system. The goal is to have a powerful language that can read in this tree with one command and begin to access it. It will also allow the language to serialize the structure with one command.

8/17/98 - More has been fleshed out about this format. I originally implemented a format like this in my early tests. However, I quickly learned that the real linker would need to refer to objects that were defined earlier or not yet defined. This caused me to change my mind on the format. The format will now treat all of the items as objects. When a slot in a dictionary refers to its data, it will actually be referring to a file-object-id, which identifies a record that may or may not have been defined yet. In fact, the object format will now look very much like the format in memory as well, and the slots will have type indicators (GET, SET, GETSTACK, etc.). All of these types will be considered to be individual object records.

So, the first proto that gets written out will state that a proto is getting written.....(CONTINUE HERE)

12/27/99 - I think the above date is wrong. At any rate, only reference IDs to previous objects were added. We will now go all the way and really mimic the dictionary format that is present in the system. That means that the keys will refer to the data slots, and those data slots will have action types (GET, SET, GETSTACK, etc.).

(the comment field and embedded SimpleDataFormat stream are in question)

The format will be in binary and it will be well defined. The reason it will be in binary rather than ASCII isn't for space reasons; it is more for reasons of integrity. If the format were ASCII, anybody with an editor could place data in the file. This makes checking it more difficult and slow. There will be utilities and editors to edit these files and ensure that their simple format is intact. Then higher-level programs can verify that the representation of the data is valid (i.e., it is a valid code module, etc.).

Most of the files used by the system will be in this format. For example, the programs, libraries, etc. will all be in this format. That way, any program should have the ability to modify these files easily.

CRCs are still in question, as well as MD5, etc. These routines will have to skip an element/block of bytes in order to work, which is not the optimal solution: it means an extra compare when moving the pointer. So, when scanning the data, the routine will need the start of the data, the pointer to the area to skip, and the length (including that area). For the CRC, this will be a 4-byte area; for the MD5, this will be a 16-byte area. On second thought, this doesn't sound too bad, and it gives the implementer flexibility when building the CRC/MD5. Now, since there will be multiple checksums, there will need to be a convention on which one gets priority when doing reads or writes.
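
The scan described above might look like the following. This is a minimal sketch, not the system's routine: the name sdf_crc32_skip and its parameters are hypothetical, and it assumes the CRC parameters given later in this document (polynomial 0x04c11db7, register preset to 0xffffffff, no final complement) with bits processed most significant first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: CRC over `length` bytes starting at `start`,
 * ignoring the `skip_len` bytes that begin at `skip_offset`.
 * `length` includes the skipped area, as described in the text. */
uint32_t sdf_crc32_skip(const uint8_t *start, size_t length,
                        size_t skip_offset, size_t skip_len)
{
    assert(start != NULL);
    uint32_t crc = 0xffffffffu;               /* preset register */
    for (size_t i = 0; i < length; i++) {
        /* The extra compare: step over the checksum's own storage. */
        if (i >= skip_offset && i - skip_offset < skip_len)
            continue;
        crc ^= (uint32_t)start[i] << 24;      /* assumption: MSB-first */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04c11db7u
                                      : (crc << 1);
    }
    return crc;                               /* not complemented */
}
```

With skip_len set to 4 this covers the CRC case; an MD5 routine would use the same loop shape with a 16-byte hole.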

6/7/98 - I'll just tag them on the end. They are optional and a CRC is done first, if there is one.
Also, do I need to

The types stored
--------------

Dictionaries - (key,values)
Arrays
ASCII - (7 bit only)
Integers - (1,2,4,8 byte)
Floats - (IEEE standard and ASCII)
Date-Time
Blobs

(Also, there will be a simple token for each type, followed by an optional param or count (arrays/dictionaries),
and then there will be a double check by having an end token for things with counts)
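
As an illustration of the count plus end-token double check, here is a sketch that validates one blob using the "B" and "b" tokens and the 4-byte length from the format section. The function name sdf_check_blob is hypothetical, and the sketch assumes the length counts only the data bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: returns the total encoded size of the blob at
 * `buf` (tokens and length included), or 0 if it is malformed. */
size_t sdf_check_blob(const uint8_t *buf, size_t avail)
{
    assert(buf != NULL);
    /* Smallest blob: 'B' + 4-byte length + 'b' = 6 bytes. */
    if (avail < 6 || buf[0] != 'B')
        return 0;
    uint32_t n = ((uint32_t)buf[1] << 24) | ((uint32_t)buf[2] << 16) |
                 ((uint32_t)buf[3] << 8)  |  (uint32_t)buf[4];
    if (n > avail - 6)
        return 0;                     /* data and end token must fit */
    if (buf[5 + (size_t)n] != 'b')
        return 0;                     /* the double check: count and end token agree */
    return (size_t)n + 6;
}
```

A wrong count lands the reader on a byte that is not the end token, so corruption is caught without a checksum.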

The format
----------

All data is in network byte order (big-endian, most significant byte first).

SimpleDataFormat <- <Header> <Dictionary> [CRC] [MD5]

Header:
"AqUa SimpleDaTa\nddd\n???\n"
<4 byte length>
This is a 28-byte header: the 24 bytes of text shown above, followed by the 4-byte length. The ddd is the version in ASCII without the decimal point; for example, '103' represents 1.03.
The ??? is an options indicator in ASCII. Only the first ? is defined so far, as 'c', which indicates that a CRC follows the
block.
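
A reader could check the header along these lines. This is a sketch under assumptions: the header is taken to be exactly the text shown (24 bytes) followed by the 4-byte length, the meaning of that length is left open because it is not stated here, and the names SdfHeader and sdf_parse_header are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SDF_MAGIC      "AqUa SimpleDaTa\n"
#define SDF_MAGIC_LEN  16u
#define SDF_HEADER_LEN 28u   /* 24 bytes of text + 4-byte length */

typedef struct {
    int      version;   /* "103" becomes 103, meaning 1.03 */
    int      has_crc;   /* 1 if the first option character is 'c' */
    uint32_t length;    /* the 4-byte length, network byte order */
} SdfHeader;

/* Returns 0 on success, -1 if the bytes are not a well-formed header. */
int sdf_parse_header(const uint8_t *buf, size_t len, SdfHeader *out)
{
    assert(buf != NULL && out != NULL);
    if (len < SDF_HEADER_LEN)
        return -1;
    if (memcmp(buf, SDF_MAGIC, SDF_MAGIC_LEN) != 0)
        return -1;

    /* Bytes 16..18: three ASCII digits; byte 19: newline. */
    int version = 0;
    for (size_t i = 16; i < 19; i++) {
        if (buf[i] < '0' || buf[i] > '9')
            return -1;
        version = version * 10 + (buf[i] - '0');
    }
    /* Bytes 20..22: option characters; byte 23: newline. */
    if (buf[19] != '\n' || buf[23] != '\n')
        return -1;

    out->version = version;
    out->has_crc = (buf[20] == 'c');
    out->length  = ((uint32_t)buf[24] << 24) | ((uint32_t)buf[25] << 16) |
                   ((uint32_t)buf[26] << 8)  |  (uint32_t)buf[27];
    return 0;
}
```

The two undefined option characters are not inspected, so later versions can assign them without breaking this check.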

The next item should be a dictionary

Nil <- <NilToken>
Dictionary <- <Start of Dictionary> <4 byte Int; Reference ID> <4 byte Int; Number of Data Items> <4 byte Int; Number of Slots> (Data Items) (SlotItems) <End of Dictionary>
DataItems <- <Object>
DictionaryItem <- <String> <1 byte Int; Action Type> <1 byte Int; Data Index>
Array <- <Start of Array> <4 byte length> <Object>* <EndOfArray>
String <- (<Start of 2 byte String> <2 byte length> <AsciiCharacters>* <End of String>) |
                (<Start of 4 byte String> <4 byte length> <AsciiCharacters>* <End of String>) |
                (<Start of Internal String - lower 7 bits are length> <4 byte Int; Reference ID> <AsciiCharacters>* <End of String>)
Object <- <Dictionary> | <Array> | <String> | <Float> | <DateTime> | <Integer> | <Blob> | <SimpleDataFormatBlob> | <Reference>
Integer <- <Start of 4 byte Int> <4 byte Int>| <Start of 8 byte Int> <8 byte Int>
SimpleDataFormatBlob <- <StartOfSDFB == SimpleDataFormat>
Blob <- <Start of Blob> <4 byte Unsigned length> <Bytes of Data>* <End of Blob>
DateTime <- <Start of Date Time> "YYYY" "MM" "DD" "HH" "MM" "SS" "sss"   // sss is milliseconds
AsciiCharacters <- "\x00 - \x7f"
Float <- ??
ReferenceDefinition <- <StartOfReferenceDefinition><4 byte Int>
Reference <- <StartOfReference><4 byte Int>
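
As one worked instance of the grammar, the DateTime production is a start token followed by 17 ASCII digits and has no end token. Here is a sketch of an encoder, assuming the "T" start token from the token list; the name sdf_encode_datetime and the range checks are my own additions, not part of the format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SDF_DATETIME_LEN 18   /* 'T' + YYYY MM DD HH MM SS sss */

/* Hypothetical encoder: writes the 18-byte DateTime plus a NUL
 * terminator into `out`. Returns 0 on success, -1 on a bad field
 * or a buffer that is too small. */
int sdf_encode_datetime(char *out, size_t cap,
                        int year, int month, int day,
                        int hour, int minute, int second, int millis)
{
    assert(out != NULL);
    if (cap < SDF_DATETIME_LEN + 1)
        return -1;
    /* Every field must fit its fixed digit count. */
    if (year < 0 || year > 9999 || month < 1 || month > 12 ||
        day < 1 || day > 31 || hour < 0 || hour > 23 ||
        minute < 0 || minute > 59 || second < 0 || second > 59 ||
        millis < 0 || millis > 999)
        return -1;
    int n = snprintf(out, cap, "T%04d%02d%02d%02d%02d%02d%03d",
                     year, month, day, hour, minute, second, millis);
    return (n == SDF_DATETIME_LEN) ? 0 : -1;
}
```

Because the record has a fixed width, a reader can consume it without a length field or an end token.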

ActionType <- (#define PROTO_ACTION_GET          0
#define PROTO_ACTION_SET          1
#define PROTO_ACTION_GETFRAME     2
#define PROTO_ACTION_SETFRAME     3
#define PROTO_ACTION_PASSTHRU     4
#define PROTO_ACTION_CODE         5
#define PROTO_ACTION_INTRINSIC    6 )

StartOfDictionary <- "{"
EndOfDictionary <- "}"
StartOfSDFB <- "A" Just use the A in Aqua
StartOfArray <- "("
EndOfArray <- ")"
StartOfInternalString <- "\x80" A byte with the high bit set; the lower seven bits are the length
StartOf2ByteString <- "["
StartOf4ByteString <- "<"
EndOfString <- "~"
StartOfDateTime <- "T"
StartOfBlob <- "B"
EndOfBlob <- "b"
StartOf4ByteInt <- "I"
StartOf8ByteInt <- "L"
StartOfReferenceDefinition <- "R"
StartOfReference <- "r"
NilToken <- "N"
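
Putting the tokens together, a dictionary with one 4-byte integer data item and one GET slot could be written as sketched below. This is an illustration under assumptions rather than the system's writer: the SdfWriter type and the sdf_write_* names are hypothetical, the slot key uses the 2-byte string form, and the integer is treated as unsigned because signedness is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROTO_ACTION_GET 0

/* Hypothetical bounded byte writer. */
typedef struct {
    uint8_t *buf;
    size_t   cap;
    size_t   len;
    int      overflow;   /* set if a write did not fit */
} SdfWriter;

void sdf_write_byte(SdfWriter *w, uint8_t b)
{
    assert(w != NULL && w->buf != NULL);
    if (w->len < w->cap) w->buf[w->len++] = b;
    else                 w->overflow = 1;
}

void sdf_write_u32(SdfWriter *w, uint32_t v)   /* network byte order */
{
    sdf_write_byte(w, (uint8_t)(v >> 24));
    sdf_write_byte(w, (uint8_t)(v >> 16));
    sdf_write_byte(w, (uint8_t)(v >> 8));
    sdf_write_byte(w, (uint8_t)v);
}

/* String <- "[" <2 byte length> <characters> "~" */
void sdf_write_string2(SdfWriter *w, const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    assert(n <= 0xffffu);
    sdf_write_byte(w, '[');
    sdf_write_byte(w, (uint8_t)(n >> 8));
    sdf_write_byte(w, (uint8_t)n);
    for (size_t i = 0; i < n; i++) {
        assert((unsigned char)s[i] < 0x80);    /* 7-bit ASCII only */
        sdf_write_byte(w, (uint8_t)s[i]);
    }
    sdf_write_byte(w, '~');
}

/* Writes the dictionary { key -> value } with a single data item. */
void sdf_write_single_int_dict(SdfWriter *w, uint32_t ref_id,
                               const char *key, uint32_t value)
{
    sdf_write_byte(w, '{');
    sdf_write_u32(w, ref_id);
    sdf_write_u32(w, 1);                 /* number of data items */
    sdf_write_u32(w, 1);                 /* number of slots */
    sdf_write_byte(w, 'I');              /* data item 0: 4-byte int */
    sdf_write_u32(w, value);
    sdf_write_string2(w, key);           /* slot: key string ... */
    sdf_write_byte(w, PROTO_ACTION_GET); /* ... action type ... */
    sdf_write_byte(w, 0);                /* ... index of data item */
    sdf_write_byte(w, '}');
}
```

For ref_id 1, key "x", and value 42, this produces 26 bytes, beginning with "{" and ending with "}".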

CRC <- CRC32_POLY 0x04c11db7. The register is preinitialized to 0xffffffff. The final CRC is not complemented. I believe this is also the network byte order CRC.
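
A table-driven sketch of this CRC follows. It assumes each byte is fed most significant bit first, which the text does not state outright; under that assumption the parameters match the variant commonly cataloged as CRC-32/MPEG-2. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0x04c11db7u

static uint32_t sdf_crc_table[256];
static int      sdf_crc_table_ready = 0;

/* Build the 256-entry table: entry i is the register after feeding
 * byte i into an all-zero register, one bit at a time. */
void sdf_crc32_init_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = i << 24;
        for (int bit = 0; bit < 8; bit++)
            r = (r & 0x80000000u) ? (r << 1) ^ CRC32_POLY : (r << 1);
        sdf_crc_table[i] = r;
    }
    sdf_crc_table_ready = 1;
}

/* CRC over `len` bytes: preset 0xffffffff, no final complement. */
uint32_t sdf_crc32(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (!sdf_crc_table_ready)
        sdf_crc32_init_table();
    uint32_t crc = 0xffffffffu;
    for (size_t i = 0; i < len; i++)
        crc = (crc << 8) ^ sdf_crc_table[(crc >> 24) ^ data[i]];
    return crc;
}
```

Since the result is not complemented, running the same routine over the data followed by its 4-byte CRC in network byte order leaves the register at zero, which gives a reader a simple validity test.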

5/16/98 - First creation
5/17/98 - 5/18/98 - Continued
6/5/98 - Continued with more info (types/stored) and format
6/6/98 - More work and info transferred from paper. Figured out the details on CRC and whether to support multi-byte ints
6/7/98 - The basic idea is down. Need to just work out the module format.
6/15/98 - Made a few changes to what characters were used as delimiters a while ago. Now I'm changing the Array delimiters, I'm also making sure things are correct
7/8/98 - Added the change for the Todo
8/17/98 - Decided to change the format to handle the "multiple links to the same object" requirement (cycles in the graph). This removes the todo of wanting unique strings since that is done as a side effect.
3/29/99 - Added the nil object; this is just as important.
9/25/99 - Added the 'reference' object, which indicates that it refers to an object identified by an integer
9/26/99 - Changed the format so that there are two references. Also, I added the reference object to the string and to the dictionary as part of the definition (instead of using 'R'. It was ambiguous.)