Data Science: More About Science, Less About Data
Given: Data Scientists spend about 60% of their time preparing, cleaning and governing data (source).
Given: putting Data Engineering skills into the required mix makes it so much harder to hire Data Scientists.
Given: many companies have Big Data platforms unifying their data in their source-to-Warehouse or Lake architecture. These same companies have data streaming into multiple silos in the Warehouse-to-model stage. These silos use different terms (glossary divergence) and definitions (semantic divergence), have conflicting ways of attributing things to categories (taxonomy divergence) and of calculating derived features (algorithm divergence). The silos arise from the need to enable agility in model development and re-training. At the same time, they slow down model deployment, where quality, consistency - and the ability to reuse the existing code base - become hard requirements.
A sample architecture - without queueing, not that it would make any difference for the purpose of this article
Reinventing Data Governance
To many people, "Data Governance" is associated with a boring set of rules and policies, more appropriate for a large, bureaucratic institution than a young, vibrant startup. To many technology leaders, the phrase "We need data governance" rings of accepting a sad fate: the company got too big, too messy, and developed the "Legacy" (and who would have thought, back in the day, when the grass was green and landscapes were greenfield). Although I can do nothing about the name "Data Governance" itself (but I shall try anyway, for onomatology is my hobby), I would like to demonstrate that the concept can be sexy, exciting - and extremely beneficial to the organizations that do Data Science at scale. Yes, even those with the latest and greatest Big Data / AI architectures.
What is Data Governance? To understand this, we need to define what its success looks like. In the organizations where the data is governed successfully, the following is true:
new data sources are added seamlessly and changes to the structure of existing ones are absorbed without (major) development effort. The new data sources can speak about concepts already known to the business, or carry completely new concepts with them; it makes no difference.
there is space for new insights that change how the company understands its business domain and its context. Unknown unknowns are not scary (I call this the "open knowledge domain paradigm").
the above events have minimal (80-20 rule applies) impact on the applications/services. Re-develop, re-test and re-deploy are things that don't happen often, even when the data changes do.
the Data Scientists can always find the data they require, using the terminology that is most familiar to them. They can also keep using the terminology they prefer, without having to flex their glossary to "the Global terms".
switching a feature on and off in the set does not require re-development of data delivery pipelines. Ideally, it happens instantly and without an Engineer involved.
regardless of who performs an aggregation and where, the same aggregation yields the same result. That is because both the categorization of concepts and the aggregation algorithms are managed centrally, instead of being re-invented by each user.
there is an easily accessible overview of all the derived features, as well as all the sets and all the models.
there is full lineage. It is clear at all times which feature comes from where and how it is produced.
there are events that will trigger the need to re-train the models.
Data Scientists get their data already linked together. There is a central, one-for-everyone view of what the business domain and its context look like, and how things relate to one another. Data Scientists do not need to do business analysis tasks to get their data right.
the data is of appropriate quality. This goes for both the traditional definition of data quality and aspects such as reliability and conformity (aka "outlier detection"). The definitions of quality rules are central, and hence re-usable for everyone.
Does it look like too much to wish for? If you look at the criteria above, it sounds less like "Data Governance" and more like Data Demiurgy.
If data is our prime matter (which, for data-driven business, it is), Data Demiurgy is the paradigm that organizes it into a consistent universe, with its own elements and laws of physics that govern it.
Data Demiurgy Layer
The component that is responsible for the Data Demiurgy needs to fulfill two main requirements:
It needs to be infinitely flexible, so that it can describe the real world with as much precision as the business desires. This also allows it to evolve and accommodate changes in the data. Relational models, with their strict definition of tables and columns and keys and constraints, are simply too cumbersome. An unstructured approach, however, fails to organize the primordial chaos of data into a consistent universe.
At the same time, it needs to be infinitely multidimensional. This is the only way to accommodate not just the definitions of the data entities, but also the instructions and synonyms and quality rules and algorithms, and instructions on top of these algorithms.
A semantic approach, based on RDF, is the ideal format for Data Demiurgy:
it allows describing the entities and their attributes and links in a way that is not restrictive, yet prescriptive enough to place every piece of data into its own box.
it allows annotating the entities and links and attributes and annotations themselves, which caters for a rich metadata layer right where the data is - as opposed to a separate component dedicated to Data Glossary and Lineage.
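To make the two points above a little more tangible, here is a minimal sketch in plain Python that mimics subject-predicate-object statements without any triplestore. The `ex:` names (`ex:Dog`, `ex:livesOn`, `ex:definedBy`, `ex:MarketingGlossary`) are invented for illustration, and letting a statement act as the subject of another statement is only a stand-in for the statement-level annotations that RDF tooling offers.

```python
from typing import Hashable, NamedTuple


class Triple(NamedTuple):
    subject: Hashable   # an entity label, or another Triple when annotating a statement
    predicate: str
    obj: Hashable


# Hypothetical vocabulary, invented for this sketch.
lives_on = Triple("ex:Dog", "ex:livesOn", "ex:Street")

graph = {
    # Entities and the link between them.
    Triple("ex:Dog", "rdf:type", "owl:Class"),
    Triple("ex:Street", "rdf:type", "owl:Class"),
    lives_on,
    # Metadata about an entity, stored right where the model is.
    Triple("ex:Dog", "rdfs:label", "dog"),
    # Metadata about a statement: the link itself gets annotated.
    Triple(lives_on, "ex:definedBy", "ex:MarketingGlossary"),
}


def describe(statements: set, subject: Hashable) -> dict:
    """Collect the predicate/object pairs recorded for one subject."""
    return {t.predicate: t.obj for t in statements if t.subject == subject}
```

Calling `describe(graph, "ex:Dog")` returns the class, label and link recorded for the dog entity, while `describe(graph, lives_on)` returns the annotation attached to the link itself.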
Typing this, I can already hear the voices saying: "But it's so academic! RDF is great in the research environment, but here we have data streams, we have ETL, we have reporting and ML and whatever else the slow and cumbersome RDF approach does not have to face in academia!"
I can also think of an Architect trying to imagine what the conversion of relational, semi-structured and unstructured data into RDF would look like and how it would operate with the rest of the landscape.
Although fully RDF-based data platforms do exist (and I have been part of designing and delivering one), it does not need to be as complex as a complete switch of paradigm. Remember, all we want to accomplish is to have a consistent picture of our universe and make sure all our data conforms to it. Once we can navigate the semantic representation of our business, we can find any and every piece of data in our landscape - but it does not mean that all that data needs to be in a triplestore.
Semantic Paradigm is nothing but another fancy concept without its sister, Metadata-Driven Application Architecture Paradigm.
The idea is simple:
Semantic Layer allows attaching metadata directly to the definitions of the data entities, attributes, and relationships.
That metadata can easily be modeled to reflect the desired logic of application behavior.
All the application needs to do is to read that layer and act accordingly.
This can be said as much about the applications as about data structures: the tables and views and ETL jobs alike can be created from metadata, making sure that every name and every action is consistent with the central representation of our business universe. In fact, the metadata layer is the equivalent of the laws of physics that define how the universe functions.
The diagram below describes the high-level data architecture presented in the previous section, but with a Semantic Layer supplying the metadata that steers the data flows:
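The idea can be sketched in a few lines of Python. Here a plain dictionary stands in for the Semantic Layer, and the annotation names (`export_for`, `enabled`) as well as the feature and model names are assumptions made up for this example; in practice the annotations would be read from the ontology store.

```python
# Hypothetical annotations, as an application might read them from the Semantic Layer.
SEMANTIC_LAYER = {
    "customer_age": {"export_for": {"model_a", "model_b"}, "enabled": True},
    "dog_count_street": {"export_for": {"model_a"}, "enabled": True},
    "postal_code": {"export_for": {"model_b"}, "enabled": False},
}


def features_for(model: str, layer: dict) -> list:
    """Return the features a model should receive, decided purely by metadata."""
    return sorted(
        name
        for name, meta in layer.items()
        if meta["enabled"] and model in meta["export_for"]
    )


def export_row(row: dict, model: str, layer: dict) -> dict:
    """Project a data row onto the features selected for the given model."""
    return {name: row[name] for name in features_for(model, layer)}
```

Switching `postal_code` back on is then a metadata change (`enabled: True`), not a code change: the application logic stays exactly as it is.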
Inhabitants of the Semantic Layer
The data itself is stored in a relational database. However, the entities and the relationships between these entities, as well as their attributes, are described in the Domain Ontologies. The table names and the column names conform to the labels defined for the entities and attributes in the Semantic Layer.
The Domain Ontologies can even be used to generate the DDL for the relational databases in a metadata-driven way (I have a component in my private projects called Magrathea - the planet that builds planets - that does just that).
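As a rough illustration of metadata-driven DDL generation (not a description of how Magrathea works), the sketch below renders CREATE TABLE statements from a tiny, hypothetical ontology fragment. The entity and attribute labels, the datatype names and the mapping to SQL types are all assumptions made for the example.

```python
# Hypothetical ontology fragment: entity label -> attribute label -> datatype.
ONTOLOGY = {
    "dog": {"dog_id": "integer", "name": "string", "is_pet": "boolean"},
    "street": {"street_id": "integer", "name": "string"},
}

# Assumed mapping from ontology datatypes to SQL column types.
SQL_TYPES = {"integer": "INTEGER", "string": "VARCHAR(255)", "boolean": "BOOLEAN"}


def create_table_ddl(entity: str, attributes: dict) -> str:
    """Render one CREATE TABLE statement from an entity definition."""
    columns = ",\n".join(
        f"    {label} {SQL_TYPES[datatype]}" for label, datatype in attributes.items()
    )
    return f"CREATE TABLE {entity} (\n{columns}\n);"


# One statement per entity; table and column names come straight from the labels.
ddl = [create_table_ddl(entity, attrs) for entity, attrs in ONTOLOGY.items()]
```

Because the table and column names are taken from the labels, renaming an attribute in the ontology is what changes the generated schema, which keeps the database and the Semantic Layer in step.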
To prevent Data Scientists or Data Fusion Engineers from introducing their own semantics to exploration data sources and the way these link into the main data set, the exploration data sources are also modeled in the Semantic Layer (note that to accelerate delivery, the Exploration ontology set could be modeled in a "quick & dirty" way; here, the only requirement is that each feature is known to the Semantic Layer and has a label it can be traced by throughout the landscape).
If the source is going to be used continuously, it will make sense to register it in the Data Library, where the original structure is mapped to the Semantic representation. ETL/ELT jobs can simply read the Library to make sure the data ends up in the correct table and column. Data Libraries are also great for encoding data quality rules. A DQ Engine can read these rules and build itself in such a way as to validate them.
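A minimal sketch of what such a Library entry could drive, assuming a made-up layout: each source column is mapped to a table and column from the Semantic Layer and carries a list of rule names. The source column names (`DOG_NM`, `DOG_AGE`), the rule names and the dictionary structure are all hypothetical.

```python
# Hypothetical Data Library entry for one source: source column -> target and rules.
DATA_LIBRARY = {
    "DOG_NM": {"table": "dog", "column": "name", "rules": ["not_empty"]},
    "DOG_AGE": {"table": "dog", "column": "age_years", "rules": ["non_negative"]},
}

# Rule names resolve to checks; the loader looks them up instead of hard-coding them.
RULES = {
    "not_empty": lambda value: value not in (None, ""),
    "non_negative": lambda value: isinstance(value, (int, float)) and value >= 0,
}


def load_record(source_record: dict, library: dict) -> tuple:
    """Map a source record onto semantic columns and collect rule violations."""
    mapped, violations = {}, []
    for source_column, value in source_record.items():
        entry = library[source_column]
        mapped[(entry["table"], entry["column"])] = value
        for rule in entry["rules"]:
            if not RULES[rule](value):
                violations.append((entry["column"], rule))
    return mapped, violations
```

For example, `load_record({"DOG_NM": "Rex", "DOG_AGE": -2}, DATA_LIBRARY)` places both values under their semantic column names and reports one violation of `non_negative` on `age_years`.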
Ultimately, the Semantic Layer allows for taking the logic out of the applications and putting it right with the data. Application Support ontologies contain annotations that encode how each data entity and attribute needs to be treated. For example, we can specify the features to export for Model A and the features to export for Model B. We can define aggregation logic to allow the Derived Feature Generation (DFG) component to generate its code on the fly. For example, we can specify for the entity "dog" that a count of its individuals needs to be performed within the scope of the street (to create a derived feature "number of dogs living on the street"). We can also make sure that only pet dogs are counted, hence omitting the service dogs:
*please do not take the example above for a guideline on how to model data semantically. There are a good couple of places on this diagram that call for optimization.
Our DFG would react to a new dog individual appearing in the database. It would then read the metadata and perform the operation on the main_entity wherever the condition in the filter_predicate holds.
The same DFG could be used to calculate the number of chihuahuas in the city, or the sum of total spend on dog care products in the country. We could even make it more generic by creating a generic derived predicate and letting the DFG generate the names for the derived features using the label of the main_entity, domain and the operation (e.g. "dog_count_street"). This approach can spare us from defining and governing thousands of derived features.
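The dog example can be sketched as follows. The keys main_entity, domain, operation and filter_predicate follow the terms used above, but the dictionary layout, the `role` attribute and the sample individuals are assumptions made for illustration, and only the count operation is implemented.

```python
from collections import Counter

# Hypothetical annotation as the DFG might read it from the Semantic Layer.
DERIVED_FEATURE = {
    "main_entity": "dog",
    "domain": "street",
    "operation": "count",
    "filter_predicate": ("role", "pet"),  # only individuals whose role is "pet"
}

# Made-up individuals of the main entity.
DOGS = [
    {"name": "Rex", "street": "Main Street", "role": "pet"},
    {"name": "Bella", "street": "Main Street", "role": "pet"},
    {"name": "K9", "street": "Main Street", "role": "service"},
    {"name": "Pip", "street": "Oak Lane", "role": "pet"},
]


def feature_name(meta: dict) -> str:
    """Build the derived feature name from main_entity, operation and domain."""
    return f"{meta['main_entity']}_{meta['operation']}_{meta['domain']}"


def generate(meta: dict, individuals: list) -> dict:
    """Apply the operation per domain value, keeping only individuals that pass the filter."""
    if meta["operation"] != "count":
        raise NotImplementedError(meta["operation"])
    attribute, expected = meta["filter_predicate"]
    return dict(Counter(
        ind[meta["domain"]] for ind in individuals if ind[attribute] == expected
    ))
```

With this metadata, `feature_name(DERIVED_FEATURE)` gives "dog_count_street" and `generate(DERIVED_FEATURE, DOGS)` counts two pet dogs on Main Street and one on Oak Lane, leaving the service dog out. Counting chihuahuas per city would only need different metadata, not different code.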
Preventing Glossary Divergence
Another use case for Data Demiurgy is preventing glossary divergence by storing synonyms. These can be contextual (to say, "this is the way Marketing refers to this concept, as opposed to how Engineers speak about it"). Here's an example: ask a Geospatial Engineer and a Data Scientist what the meaning of the word "feature" is, or ask a DBA, a Semantic Architect and a Data Scientist how they refer to the concept of "attribute". The picture above shows two synonyms for "address". It is possible to model synonyms in a way that links them to the user group or context. Regardless of how it is solved, so long as every label and synonym is known to the Semantic Layer, there will be no silos originating from differences in vocabulary.
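One way to picture contextual synonyms is a lookup keyed by user group and term, sketched below. The contexts, terms and canonical labels are invented for this example (they are not the synonyms shown in the picture), and a term that is unknown for a given context is rejected rather than silently accepted.

```python
# Hypothetical canonical labels known to the Semantic Layer.
CANONICAL_LABELS = {"attribute", "geographic_object", "address"}

# Hypothetical synonym registry: (context, term) -> canonical label.
SYNONYMS = {
    ("data_science", "feature"): "attribute",
    ("dba", "column"): "attribute",
    ("geospatial", "feature"): "geographic_object",
    ("marketing", "location"): "address",
}


def canonical(term: str, context: str) -> str:
    """Resolve a term used by one user group to its canonical label."""
    key = term.lower()
    if key in CANONICAL_LABELS:
        return key
    try:
        return SYNONYMS[(context, key)]
    except KeyError:
        raise KeyError(
            f"'{term}' is not known to the Semantic Layer for context '{context}'"
        ) from None
```

The same word "feature" resolves to "attribute" for `data_science` and to "geographic_object" for `geospatial`, so both groups keep their own vocabulary while pointing at one shared definition.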
There are two dangers that are inherent to Data Demiurgy:
premature inclusion: entities and sources are included in the Semantic Universe before their usefulness is confirmed and their usage is relatively stable. This can lead to continuous schema fluctuations at best (annoying, especially when you just thought you were done optimizing your partitioning and performance) and data migrations at worst (think of re-allocating an attribute from one entity to another due to a previous wrong understanding of the semantics)
semantic lock-in: practically everything in the data landscape is locked onto the Semantic Layer. Major ontology changes - for whatever reason, not limited to premature inclusion - can end up shaking the whole landscape. To be honest, this sounds grandiose but is probably not as bad as similar schema changes shaking up the application, integration and data layers in semantic-less architectures. To be very honest, ontology versioning is a MUST if you are going to be a true Data Demiurge.
That's owl for today! by Andreas Ingvar van der Hoeven