TLDR: nowadays any software service depends on others. The problem is your data ends up being distributed and there is no easy way to join those different sources easily. I’m proposing a “joinable API” to fix that problem.
We have been talking about the vendor lock-in for years, where once you start using a provider it’s nearly impossible to move to a different one. I think the software world is like that no matter the software you use: once your software is running on production changing it is a really expensive and painful task.
Before the explosion of SaaS providers we had data exporting tools so you could more or less get your data in a standard format and import to another system. Today that’s not totally true: try to get all “your data” from your google analytics account. Don’t worry, those providers have tools that can access your data easily.
10 years ago was easy to see a monolithic services running on just one big machine with everything needed in there (user management, accounting, the service itself) today you start a company and the one week later your software is talking to several services through HTTP APIs.
That’s good, you spend your time in the thing you provide more value instead of doing billing code.
Eventually, someone will ask a question like “hey, how many users do we have in our XXXX tier, have signed up 3 times during the last month and also have ….”
So you either 1) use a platform that collects everything or 2) a one that fetches data from all those services and put data in a place you can query. in any case, querying data from a third party services is a pain. Here is the point where a lot of scrapping and “CSV file transfer” horror stories start.
Most of those operations are what we call joins. It’s easy to do when all the data is in the same place but hard to do in a distributed system.
So how do we keep those services and at the same time the possibility of using our data? why don’t we request to our service providers to have a Joinable API apart of full, easy to use and documented data dumps and their regular service API (more on this in this excellent talk)
Let’s imagine you have a regular OLTP database like Postgres, it’d be nice to be able to do:
SELECT * FROM MyTable JOIN Service('billingservice.com', 'transactions') ON ... WHERE date = yesterday() ...
fetching remote data is already supported by Postgres using FDW (and all databases have a way to fetch remote data) but the problem is getting the data from the service.
The service would need to provide:
This is what most of the distributed databases do, it’s not rocket science, a common well-tested pattern that every single service should expose so you don’t have data lock-in.