> A trucker delivering palletized cargo to a pier would have to remove each item from the pallet and place it on the dock. Longshoremen would then replace the items on the pallet for lowering into the hold, where other longshoremen would break down the pallet once more and stow each individual item - all at a cost so high that shippers knew not to send pallets to begin with.
>
> – Marc Levinson, *The Box*, p. 143
Such deliberate inefficiency is astounding. It seems utterly insane to do all that unnecessary work, but the labour unions feared they were going to be automated out of a job, so they instituted rules to guarantee there would be work for them to do.
One might be forgiven for thinking that this kind of behaviour would never happen today. We’re modern, right? Umm … no.
Consider how data is repeatedly transformed as it moves between systems.
Data being passed between two systems has to be translated because the two systems don’t use the same data structures. In some cases there is a corporate standard data model but it’s only used for transmission over the wire, not internally within systems. The individual systems want to mitigate the impact that external dependencies have on the internals of the system, so the common data model is accepted at the interface and transformed for use in the mission-critical internal algorithms.
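This boundary pattern can be sketched in a few lines. The model names and fields below are hypothetical, purely for illustration: a "common" wire model is accepted at the interface and immediately converted into the system's own internal representation, so the shared format never leaks into the mission-critical code.

```python
from dataclasses import dataclass

# Hypothetical corporate-standard wire model (illustrative only).
@dataclass
class CommonTrade:
    trade_id: str
    notional: str   # amounts often travel as strings on the wire
    currency: str

# Hypothetical internal model, shaped for this system's own algorithms.
@dataclass
class InternalTrade:
    trade_id: str
    notional_cents: int
    currency: str

def from_common(t: CommonTrade) -> InternalTrade:
    """Boundary translation: everything inside works only with InternalTrade."""
    return InternalTrade(
        trade_id=t.trade_id,
        notional_cents=round(float(t.notional) * 100),
        currency=t.currency,
    )

trade = from_common(CommonTrade("T-1", "1000000.00", "USD"))
```

The cost is exactly the one described above: every system on the wire pays for its own `from_common` (and a matching `to_common`), each written and debugged separately.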
So now when data is passed between two systems it gets translated several times. Just like the palletized cargo on the docks.
Let’s take a concrete example. Let’s assume that Murex needs to send trades to an internal risk system. The risk system uses a custom data format optimized for size and speed. Murex produces its own proprietary XML called MxML. And the enterprise defines its own standard data format known as BankML, based on FpML. To send data from Murex to the risk system, the following happens:
- Murex transforms its internal data into MxML
- MxML is translated into BankML
- BankML is translated into the risk system’s data format
Each of these translations increases the likelihood of errors such as rounding, inversion, truncation and omission. And each of these data transformations has to be defined and implemented, often by different people on different teams.
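To see how the chain loses information, here is a toy sketch. The function names and precision rules are invented for illustration (real MxML and BankML do not necessarily behave this way); the point is only that when each hop applies its own rounding, the value that arrives is not the value that was sent.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical precision rules at each hop; names are illustrative only.
def murex_to_mxml(price: Decimal) -> Decimal:
    return price.quantize(Decimal("0.000001"))  # suppose MxML carries 6 dp

def mxml_to_bankml(price: Decimal) -> Decimal:
    # suppose BankML carries only 4 dp
    return price.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

def bankml_to_risk(price: Decimal) -> float:
    return float(price)  # suppose the risk system uses binary floats

original = Decimal("101.12345678")
delivered = bankml_to_risk(mxml_to_bankml(murex_to_mxml(original)))
# Each hop is lossy: 101.12345678 -> 101.123457 -> 101.1235 -> 101.1235
```

Each rounding step here is deliberate and locally defensible, yet the composed pipeline silently changes the number, and three different teams would each own one of these functions.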
In my experience, much of the work on internal systems is spent implementing (and debugging) data transformations. The ETL tools of yesteryear were supposed to address this grunt-work but they failed to deliver on their promises. SOAP and XML tried to formalise the interface definition but it was cumbersome to work with.
The latest fad to step up to the plate is JSON over REST, which is less formal but instead introduces implicit data-structure dependencies. And JSON schemas are just as bad as XSD, if not worse. In fact, many corporate Enterprise API standards are trying to make RESTful interfaces more SOAPy, which feels dirty (pun intended).
I don’t think this is a solvable problem. Defining a single data model to rule them all is a fool’s errand, as FpML demonstrates. FIX does a pretty good job within its narrow domain, but I haven’t seen any systems that use FIX internally - only as a wire format.
I’d love to see better libraries and tooling to make these transformations easier to define, maintain and implement. Anything that reduces the total lifetime cost of data transformations would be a huge benefit to large enterprises.
But don’t sell me a bloated ETL tool.