Structural versus semantic integration

Overview

Flow is a data integration and querying platform, built on top of the Taxi schema language.

The idea behind Taxi (and semantic types in general) is to allow producers of data to define both the structural contract (the labels assigned to fields and shapes of objects), and the semantic contract (what each field actually means).

A schema plays multiple roles:

  • Labels define how information is addressed within a message - e.g., firstName, lastName

  • Structure defines how a message will be laid out - what’s nested where. Sometimes data is flat (e.g., tables or CSV files), at other times there’s a nested structure (such as JSON)

  • Semantics define the contract of what information is in place at a certain location

  • Encoding defines how data is written to the wire. (Not all schemas cover this)

Most schema languages focus entirely on addressing labelling and structure, but offer little to model the semantics of a contract. And, when you’re trying to consume data, the semantics are the most important thing.

Producers of data need to be careful about labels, structure and semantics of the data they emit. For consumers, the only thing that really matters is the content. Labelling and structure are just addresses to access the information that consumers need.

Structure-first schemas

Most data schemas (and subsequent integration techniques) focus on describing and connecting data based on the structure of information.

Here are some examples of linking data based on structure, shown in a couple of different languages:

   // Server-side contracts that the producers of data have defined:
interface Actor {
   name: string
   id : number
   agentId: number
}

interface Agent {
   name: string
   id : number
}

// As a consumer - load an actor, then fetch it's agent using the agentId
function loadAgentsAndActors() {
   const actor = await loadActor();
   // Here, we're also coupled to another API call
   const agent = await loadAgent(actor.agentId)
   const castMember = {
    actorName : actor.name,
    agentName : agent.name
   }
}

In this example, we wanted to consume data with two attributes: an actor’s name, and their agent’s name. We also wanted to apply labels that were meaningful to us as consumers (actorName and agentName) that differed from the way producers originally labelled the data (simply - name).

The only way we could build this was by becoming tightly coupled to the producer’s contract.

That’s a problem because tight coupling makes it hard for producers to evolve their APIs, and brittle for consumers when they do.

Integration based on structure is brittle - changes by producers have expensive knock-on effects to consumers.

Semantic integration

Taxi and Flow provide a different style of schema and integration, by allowing schemas to define their semantic contract as well.

Semantic types describe the meaning of content in a field, rather than just how it’s written out by computers.

Field Name System Type Semantic Type

id

String

CustomerId

name

String

CustomerFirstName

houseNbr

Int

HouseNumber

ageInYears

Int

CustomerAge

The column on the left - Field name - describes how a producer has named the field in one specific system. The column in the middle - System Type - describes how a system might read or write a field. The column on the right - Semantic Type - describes the actual meaning of the text in a business sense.

By defining a semantic contract, we introduce a way for producers and consumers to discuss data that’s not coupled to structure.

Semantic contracts provide a shared language for producers and consumers to discuss data that’s not coupled to structure, so it’s not as brittle when structural contracts change.

Semantic contracts allow you to request data based on its meaning, irrespective of the shape producers are emitting it in.

Let’s look at our example query again, this time run through TaxiQL, a semantic query language:

given { ActorId = 1 }
find {
   actorName : ActorName
   agentName : AgentName
}

It’s important to call out a few details:

There’s no FROM

There’s no FROM clause here - nothing that specifies which producer, and where, the data should come from. That’s left to the query layer to determine, which means that consumers remain protected when producers inevitably change.

There’s no JOIN

Likewise, we haven’t specified how to navigate infrastructure and data contracts to describe how to link two ideas (Actor and Agent) together. In fact, for all we know, this could be coming from one table, or some complex lookup linking several entities together.

As a consumer, understanding the specifics of how data has been structured is an implementation detail we shouldn’t be concerned with.

Field names are defined by the consumer

The producing systems named these fields something else. The consumer has been able to define the shape of the data they need, and left it to the querying layer to solve. There’s enough semantic metadata present for the query layer to understand what data is needed.

This is possible because we’ve separated the structural concept from the semantic concept. Field names are now entirely up to the consumer to decide - the way it should be.

Semantic querying is centered around what data the consumer needs, and how the consumer needs it. Producer concepts are abstracted away.

Enhance structural contracts with semantic metadata

It’s not a question of structure OR semantics, or choosing one schema language to rule them all. Taxi is a great schema language, but it’s not intended as a replacement for your existing schemas.

Modern data architectures are polyglot, with a mix of APIs, databases, streaming sources, and flat files. The key to making semantic schemas work across this ecosystem is to enhance existing schema structure with semantic metadata.

Taxi supports this, with a series of extensions.

For example, in Swagger / OpenAPI:

// Partial OpenAPI example:
paths:
  /pets:
    get:
      description: |
        Returns all pets from the system that the user has access to.
      operationId: findPets
      parameters:
        - name: tags
          in: query
          description: Tags to filter by.
          required: false
          style: form
          schema:
            type: array
            items:
              # ------Below is a Taxi extension for semantic types-----------
              x-taxi-type:
                name: petstore.Tag
              # ----------End the extension----------------------------------
              type: string

Or in Protobuf:

// An example of enriching a structural contract with semantic metadata
syntax = "proto3";
import "taxilang/dataType.proto"; // Import the DataType extension
package tutorial;

message Person {
  string name = 1 [(dataType='PersonName')];
  int32 id = 2  [(dataType='PersonId')];
  string email = 3 [(dataType='PersonEmail')];
}