Notes 01 - Serialization Protocols (JSON, Thrift, Avro)

Item: Designing Data-Intensive Applications
Rating: 5
Author: 姚钢强

2018-12-22 18:40:01, edited, Beijing


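Disclaimer: everything in this post comes from "Designing Data-Intensive Applications". These are only my notes on the parts that interested me, and I strongly recommend reading the original book.

A good system is evolvable: it can be changed easily while staying backward and forward compatible. For system design, the choice of serialization protocol is therefore especially important.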

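How to understand compatibility:

Backward compatibility: new code can correctly read data written by old code.
Forward compatibility: old code can correctly read data written by new code.

Backward compatibility is usually the easier of the two, because the engineer writing the new code knows the old data format and can handle it explicitly in code. Forward compatibility is harder: old code has to ignore the format changes introduced by newer code.

Next, let's look at some common serialization protocols.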
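Before moving on, the two directions of compatibility can be made concrete with a small sketch. The record layout, the two "versions" and all function names below are hypothetical, chosen only for illustration, and JSON is used just because it needs no extra library.

```python
import json

# Hypothetical writers: version 2 adds a favoriteNumber field.
def write_v1(name):
    """Old writer: knows only userName."""
    return json.dumps({"userName": name})

def write_v2(name, favorite_number):
    """New writer: also emits favoriteNumber."""
    return json.dumps({"userName": name, "favoriteNumber": favorite_number})

def read_v1(data):
    """Old reader: keeps the fields it knows and ignores everything else.
    Ignoring unknown fields is what makes it forward compatible."""
    doc = json.loads(data)
    return {"userName": doc["userName"]}

def read_v2(data):
    """New reader: falls back to None when an old record lacks the new field.
    Handling the missing field is what makes it backward compatible."""
    doc = json.loads(data)
    return {"userName": doc["userName"],
            "favoriteNumber": doc.get("favoriteNumber")}

old_record = write_v1("Martin")
new_record = write_v2("Martin", 1337)

backward = read_v2(old_record)  # new code reading old data
forward = read_v1(new_record)   # old code reading new data
```

Note that the old reader only works because it was written to tolerate extra fields from the start, which is why forward compatibility has to be planned for in advance.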

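Many programming languages come with their own serialization format, for example Python's pickle and Java's java.io.Serializable.

Advantages: convenient to use, because in-memory data can be saved and restored with very little code.

Disadvantages: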

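Advantages of text formats such as JSON: very readable and simple to use.

Disadvantages:

Binary variants of JSON save only a little space, and it is questionable whether that small saving is worth losing human readability. The saving is small because the encoding still has to carry the field names.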

{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

For example, the same data can be encoded with MessagePack.
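To see why the saving is small, here is a minimal sketch that encodes the document above by hand. It implements only the handful of MessagePack type codes this record needs (fixmap, fixarray, fixstr, small unsigned integers). It is not the official msgpack library, and the function name mp_encode is made up for this example.

```python
import json

def mp_encode(value):
    """Encode a tiny subset of MessagePack: short strings, small unsigned
    integers, and short lists and maps. Anything else is rejected."""
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("sketch handles only fixstr (up to 31 bytes)")
        return bytes([0xA0 | len(raw)]) + raw          # fixstr
    if isinstance(value, int) and not isinstance(value, bool):
        if 0 <= value <= 0x7F:
            return bytes([value])                      # positive fixint
        if 0 <= value <= 0xFF:
            return b"\xcc" + value.to_bytes(1, "big")  # uint 8
        if 0 <= value <= 0xFFFF:
            return b"\xcd" + value.to_bytes(2, "big")  # uint 16
        raise ValueError("sketch handles only 0..65535")
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("sketch handles only fixarray (up to 15 items)")
        return bytes([0x90 | len(value)]) + b"".join(mp_encode(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("sketch handles only fixmap (up to 15 entries)")
        out = bytes([0x80 | len(value)])               # fixmap
        for key, item in value.items():
            # The field name is written in full for every record.
            out += mp_encode(key) + mp_encode(item)
        return out
    raise TypeError(f"unsupported type: {type(value).__name__}")

doc = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

json_bytes = json.dumps(doc, separators=(",", ":")).encode("utf-8")
packed = mp_encode(doc)
```

With whitespace removed the JSON text is 81 bytes and the packed form is 66 bytes. Most of those 66 bytes are still the three field names, which is exactly the cost a schema-based encoding avoids.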

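Schema-based encodings such as Thrift write the field's tag number from the schema in place of its name. Because field names no longer have to be encoded, the output becomes much shorter. There are also special optimizations, such as variable-length integers. As long as fields are marked optional, both backward and forward compatibility work well. For example, here is the schema for encoding the same data with Thrift.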

// schema
struct Person {
    1: required string userName,
    2: optional i64 favoriteNumber,
    3: optional list<string> interests
}
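The variable-length integers mentioned above can be sketched in a few lines. Thrift's CompactProtocol and Protocol Buffers both store integers in 7-bit groups, with the top bit of each byte saying whether more bytes follow; CompactProtocol additionally applies a zigzag step so that small negative numbers stay short. The function names here are my own, not part of either library.

```python
def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one:
    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, and so on."""
    return (n << 1) ^ (n >> 63)

def encode_varint(n):
    """Encode a non-negative integer 7 bits at a time, lowest group first.
    The high bit of a byte is set when another byte follows."""
    if n < 0:
        raise ValueError("apply zigzag() to negative numbers first")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)         # last byte
            return bytes(out)

plain = encode_varint(1337)             # plain varint
compact = encode_varint(zigzag(1337))   # zigzag, then varint
```

Either way 1337 takes two bytes instead of the eight a fixed-width i64 would need.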

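Avro is an interesting protocol: its encoding contains neither field tags nor datatypes, which makes the encoded size even smaller. So how does Avro determine the types when it reads data? First, here is the schema used to encode the same data.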

// schema
record Person {
    string userName;
    union { null, long } favoriteNumber = null;
    array<string> interests;
}
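A rough, hand-written sketch of how this record is laid out in Avro's binary encoding follows. It hard-codes the Person schema instead of interpreting a schema file, so it is not the Avro library, and the function names are hypothetical. The point is that the output contains values only, in schema order.

```python
def zigzag_varint(n):
    """Avro writes int and long values as zigzag-encoded variable-length integers."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        low = z & 0x7F
        z >>= 7
        if z:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)

def encode_string(s):
    """A string is its UTF-8 length (as a long) followed by the bytes."""
    raw = s.encode("utf-8")
    return zigzag_varint(len(raw)) + raw

def encode_person(user_name, favorite_number, interests):
    """Write the fields in schema order, with no tags and no type markers."""
    out = encode_string(user_name)
    # union { null, long }: index of the chosen branch, then the value if any.
    if favorite_number is None:
        out += zigzag_varint(0)
    else:
        out += zigzag_varint(1) + zigzag_varint(favorite_number)
    # array<string>: a block with an item count, then a zero count to end the array.
    if interests:
        out += zigzag_varint(len(interests))
        for item in interests:
            out += encode_string(item)
    out += zigzag_varint(0)
    return out

encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
```

The result is 32 bytes. Nothing in those bytes says which field is which, so they can only be decoded by someone who has the exact schema they were written with.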

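The encoded binary data is decoded by walking through the writer's schema, which gives the datatype of each value and the field it belongs to. The result is then compared with the reader's schema. Even when the reader's and writer's schemas differ, the data can still be resolved by matching field names.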
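Matching by field name can be sketched as follows. The schemas here are simplified to plain Python lists, and the userID field and its default are invented for the example. Real Avro schema resolution covers more cases, such as type promotion and aliases.

```python
# Hypothetical, simplified schemas.
# Writer schema: field names in the order the values were written.
WRITER_FIELDS = ["userName", "favoriteNumber", "interests"]
# Reader schema: (field name, default) pairs, in the reader's own order.
READER_FIELDS = [("userID", -1), ("userName", None), ("interests", None)]

def resolve(values, writer_fields, reader_fields):
    """Turn values laid out in writer order into a record shaped like the reader schema."""
    # The values carry no names; the writer schema supplies them by position.
    written = dict(zip(writer_fields, values))
    record = {}
    for name, default in reader_fields:
        # A field the writer never wrote gets the reader's default.
        # A field the reader does not list (favoriteNumber here) is dropped.
        record[name] = written.get(name, default)
    return record

decoded_values = ["Martin", 1337, ["daydreaming", "hacking"]]
person = resolve(decoded_values, WRITER_FIELDS, READER_FIELDS)
```

Field order does not matter here, since the lookup goes through names rather than positions.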

With Thrift and Protocol Buffers, the reader does not need to know the writer's schema. With Avro it does, so how does the reader learn the writer's schema?

Large file with lots of records with the same schema: the writer can include its schema once at the beginning of the file.
Database with individually written records: include a version number at the beginning of every encoded record, and keep a list of schema versions in your database. A reader can fetch a record, extract the version number, and then fetch the writer's schema for that version number from the database.
Sending records over a network connection: when two processes are communicating over a bidirectional network connection, they can negotiate the schema version on connection setup and then use that schema for the lifetime of the connection.
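The database case, a version number in front of every record, can be sketched like this. The SCHEMA_VERSIONS dictionary stands in for the list of schema versions kept in the database, the one-byte prefix is an arbitrary choice, and the payload is JSON only to keep the example short. A real system would store Avro binary after the prefix.

```python
import json

# Hypothetical stand-in for the table of schema versions kept in the database.
SCHEMA_VERSIONS = {
    1: ["userName"],
    2: ["userName", "favoriteNumber"],
}

def encode_record(version, record):
    """Prefix the payload with a one-byte schema version.
    The payload holds values only, in the order given by that schema version."""
    fields = SCHEMA_VERSIONS[version]
    payload = json.dumps([record[name] for name in fields]).encode("utf-8")
    return bytes([version]) + payload

def decode_record(data):
    """Extract the version, fetch the writer's schema for it, then name the values."""
    version = data[0]
    writer_fields = SCHEMA_VERSIONS[version]
    values = json.loads(data[1:].decode("utf-8"))
    return dict(zip(writer_fields, values))

row_v1 = encode_record(1, {"userName": "Martin"})
row_v2 = encode_record(2, {"userName": "Martin", "favoriteNumber": 1337})

first = decode_record(row_v1)
second = decode_record(row_v2)
```

Records written under different schema versions can sit side by side in the same table, because each one says which schema it was written with.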

Zhihu once used Avro as its RPC protocol, but the schema exchange on every connection setup made services unstable, so it was abandoned. Thrift is used now.

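Choosing a serialization protocol, or designing a new one, calls for care: requirements differ from one scenario to another, and several factors may need to be weighed.

This short summary is only a starting point, and I hope it gives you some pointers.

If what I wrote has helped you, or if you have any questions, you are welcome to continue the discussion at 《打开引擎盖》.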