CAN should normalise the AST before it is hashed

CAN should store and address notes semantically. This means that two notes whose source text is structured differently but whose meaningful content is the same should produce the same AST, and therefore the same address.
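As a rough illustration of normalising before hashing, here is a minimal sketch. It assumes unist-style nodes represented as plain dicts, uses removal of `position` data as a stand-in normalisation step, and uses SHA-256 over canonical JSON as the address. All three choices, and the names `strip_positions` and `address`, are assumptions for the example and not CAN's actual handler.

```python
import hashlib
import json


def strip_positions(node):
    """Return a copy of a unist-style node with every `position` field removed."""
    cleaned = {k: v for k, v in node.items() if k not in ("position", "children")}
    if "children" in node:
        cleaned["children"] = [strip_positions(child) for child in node["children"]]
    return cleaned


def address(ast):
    """Hash the normalised AST (assumed scheme: SHA-256 over sorted-key JSON)."""
    canonical = json.dumps(strip_positions(ast), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_note(text, line):
    """Build a one-paragraph note whose source started on the given line."""
    position = {"start": {"line": line, "column": 1}}
    return {
        "type": "root",
        "children": [
            {
                "type": "paragraph",
                "position": position,
                "children": [{"type": "text", "value": text, "position": position}],
            }
        ],
    }


# Same content, different source layout (the second note began after blank lines).
note_a = make_note("Hello world", line=1)
note_b = make_note("Hello world", line=4)
note_c = make_note("Goodbye world", line=1)

assert address(note_a) == address(note_b)  # layout differences are normalised away
assert address(note_a) != address(note_c)  # different content, different address
```

Hashing a canonical serialisation with sorted keys keeps the address independent of key order as well as of the fields the normalisation step removes.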

For a first pass, the address-note handler should apply the following transformations to the AST before addressing and storage:

Each of these transformations removes an aspect of the source text that has no impact on the semantic meaning of the AST, normalising away the common places where differences sneak in. One thing this pipeline doesn't handle is double spaces after full stops within paragraphs. This can be handled later, possibly using the unist natural language specification and tooling.

Previously it was decided that CAN should trim each text node, but this leads to meaningful data loss, especially around links and other nodes. Dropping the trimming step doesn't significantly worsen the normalisation problem, so I think dropping it is easier than solving the natural language serialisation problem.
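To make the data loss concrete, here is a small sketch, assuming mdast-style nodes as plain dicts. The helpers `to_text` and `trim_text_nodes` are hypothetical names for this example. It shows how trimming every text node deletes the spaces that separate a link from the surrounding words.

```python
def to_text(node):
    """Concatenate the text content of a node and its descendants."""
    if node["type"] == "text":
        return node["value"]
    return "".join(to_text(child) for child in node.get("children", []))


def trim_text_nodes(node):
    """Return a copy of the tree with every text node's value stripped."""
    if node["type"] == "text":
        return {**node, "value": node["value"].strip()}
    if "children" in node:
        return {**node, "children": [trim_text_nodes(c) for c in node["children"]]}
    return dict(node)


paragraph = {
    "type": "paragraph",
    "children": [
        {"type": "text", "value": "See "},
        {
            "type": "link",
            "url": "https://example.com",
            "children": [{"type": "text", "value": "the docs"}],
        },
        {"type": "text", "value": " for details."},
    ],
}

print(to_text(paragraph))                   # See the docs for details.
print(to_text(trim_text_nodes(paragraph)))  # Seethe docsfor details.
```

The whitespace at the edges of the text nodes next to the link is part of the sentence, so trimming it changes what the note says instead of just normalising it.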