This page is a brief introduction to the modeling paradigm used on the FI-Platform, as well as to recommended modeling principles. Understanding the paradigm is crucial both for data modelers and for actors intending to use the models and data based on them, whereas the recommended modeling principles give additional guidance on how to model in line with the paradigm and avoid common pitfalls.

The FI-Platform modeling tool is able to produce core vocabularies and application profiles. While they serve different purposes, they share a common foundation. We will first go through this foundation and then dive into the specifics of these two model types and their use cases. When discussing the modeling paradigm, we sometimes contrast it with traditional technologies such as XML Schema and UML, as those are more familiar to many modelers and data architects.

In this guide we have highlighted the essential key facts with spiral notepad (note) symbols.

The Linked Data Modeling Paradigm

The models published on the FI-Platform are in essence Linked Data and thus naturally compatible with the linked data ecosystem. The FI-Platform nevertheless does not expect modelers or users of the models to migrate their information systems to Linked Data based ones, or to make a complete overhaul of their information management processes. There are multiple pathways for utilising the models, ranging from lightweight semantic layering to fully integrated solutions. In most cases you can think of Linked Data solutions as a semantic layer covering your information architecture, or as a translational layer between your domain and external actors or between individual information systems.

Linked Data is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It extends the Web to use URIs to name not just documents but also real-world objects and abstract concepts (in general, anything). The key issues Linked Data aims to solve are:

  • Data Silos: Information is typically stored in isolated databases with limited ability to interoperate. Linked Data was designed to break down these barriers by linking data across different sources, allowing for more comprehensive insights and analyses.
  • Semantic Disconnect: There is a lack of a common framework that could universally describe data relationships and semantics, and a disconnect between model types and associated specifications (conceptual, logical, physical, documentation, code lists, vocabularies). Linked Data describes data with the RDF (Resource Description Framework) language, which is able to encode meaning alongside data, enhancing its machine-readability and semantic interoperability.
  • Integration Complexity: Integrating data from various sources is typically complicated and costly due to heterogeneous data formats and access mechanisms, requiring peer-to-peer ETL solutions, API integrations or commonly agreed and restrictive schemas. Linked Data promotes a standard way to access data (HTTP), a common data format (RDF), and a standardized query language (SPARQL), simplifying data integration.
  • Reusability and Extension: The reuse and combination of data from various sources for new needs is typically limited or burdensome. Linked Data encourages reusing existing data by making it straightforward to find and combine data as well as to infer new data.

In essence, Linked Data offers an alternative to the typical data management solutions that rely on tailored and siloed data warehouses, catalogues, data pools, lakes, etc.

The Core Idea of Linked Data

As you can imagine based on the name, Linked Data is all about references (links or "pointers") between entities (pieces of data). In principle:

spiral notepad All entities (resources in Linked Data jargon), be they actual instance data or entities in a data model, are named with IRIs (Internationalised Resource Identifiers).

spiral notepad The names should (in most cases) be HTTP based IRIs, as this allows a standardized way to resolve the names (i.e. access the resource content named by the IRI).

spiral notepad When a client resolves the name, relevant and appropriate information about the resource should be provided.

spiral notepad The resources should refer (be linked) to other resources when it aids in discoverability, contextualising, validating, or otherwise improving the usability of the data.

How to Name Things

As mentioned, all resources are named (minted in the jargon) with identifiers, which we can then use to refer to them when needed.

spiral notepad The FI-Platform gives every resource an HTTP IRI name in the form of <https://iri.suomi.fi/model/modelName/versionNumber/localName>

spiral notepad On the Web you will typically come across mentions of URIs (Uniform Resource Identifier) and HTTP URIs more often than IRIs or HTTP IRIs. On the FI-Platform we encourage internationalisation and the use of Unicode, and mint all resources with IRIs instead of URIs, which are restricted to a subset of the ASCII characters.

spiral notepad IRIs are an extension of URIs, i.e. each URI is already a valid IRI.

spiral notepad IRIs can be mapped to URIs (by percent-encoding the path and punycoding the domain part) for backwards compatibility, but this is not recommended due to potential collisions. Instead, tools supporting IRIs and Unicode should always be preferred. You should generally never mint ambiguous identifiers (i.e. name one thing with an IRI and another thing with its URI-encoded version). The IRI will get encoded anyway when you make an HTTP request, so you should always consider the set of URIs that correspond to URI-encoded versions of your IRIs as reserved and not use them for naming resources.

spiral notepad We do not call them HTTPS IRIs even though the scheme on the FI-Platform is https, as the established naming convention does not dictate the security level of the connection. In general you should always mint HTTP-based identifiers with the https scheme and offer the resources through a secured connection, so do not mistake the HTTP IRI or HTTP URI name to mean that the insecure http protocol should be used.

In the diagram above you can see the hierarchy of the different identifier types. The identifier types based on the HTTP scheme are highlighted as they are primary in Linked Data, but the other scheme types (deprecated, current and potential upcoming ones) are of course possible (the list of IANA schemes is here: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml). Note that the diagram is conceptually imprecise in the sense that IRIs and URIs (including their HTTP subsets) are broader than URLs, which are merely Uniform Resource Locators. In other words, URLs tell where something on the Web is located for fetching, whereas URIs and IRIs give things identifiers (name them), and with a proper scheme (like an HTTP IRI) make the name resolvable. But if we simply look at the structure of these identifiers from a pragmatic perspective, the diagram is correct.

URNs are a special case: they are defined per RFC as a subset of URIs. They start with the scheme urn: followed by a local name or a potential sub-namespace. Well-known URN sub-namespaces are ISBNs, EANs, DOIs and UUIDs. An example of an ISBN URN is urn:isbn:0-123-456-789-123. URNs are not globally resolvable and always need a specific resolver. There is no central authority for querying ISBN or ISSN numbers, so the URN must be resolved against a particular service, and UUIDs mean nothing outside their specific context and are thus tightly coupled with the information system(s) they inhabit.

You are probably familiar with UML or XML namespaces, and HTTP IRIs form no exception:

spiral notepad Each IRI consists of a namespace, or a namespace combined with a local name. As with XML and UML namespaces, the idea is to group related resources together under a common, controlled prefix.

spiral notepad When you mint a new HTTP IRI for something, you can give it any name that adheres to RFC 3987 (https://datatracker.ietf.org/doc/rfc3987/), but as parts of the HTTP namespace are controlled (owned) by different parties, it is not a good principle to name things in namespaces owned by someone else unless specifically given permission.

spiral notepad For details on namespace and local name naming conventions please refer to the W3C recommendation at https://www.w3.org/TR/xml-names11/.

Entity Life-Cycle and Versioning

When resources are minted and information is served by resolving them, we generally want that information to reflect the resource that the HTTP IRI identifies. As mentioned above, there is a conceptual difference between a URL and an HTTP IRI. A URL just points somewhere, with its domain and path usually giving a hint of what to expect, but the contents can vary wildly if wanted - it does not implicitly identify anything in particular. An HTTP IRI, on the other hand, names a particular resource, so the information served when resolving it should describe that resource and stay consistent over the resource's life-cycle and versions.

The FI-Platform therefore mints every resource inside its model's versioned namespace (the https://iri.suomi.fi/model/modelName/versionNumber/ part of the IRI format shown earlier), which keeps all the resources of a given model version neatly together. On the other hand, naming resources <https://suomi.fi/localName> would not be a good idea, as resolving these addresses is controlled by the DVV web server, resulting either in unresolved requests (HTTP 404) or in clashes with other content already being served at the same URL. You can think of the localName part of the IRI as identical to the local names of XML or UML elements. The namespace is a core part of XML and UML schemas as well, but in (HTTP IRI based) linked data each element is referred to with a global identifier - thus making it unambiguous which element we are talking about even if two elements have the same local names.

How do we then identify which part of the IRI is the local name? There are two methods, but both require that you follow the RFCs correctly. Users and Web services use malformed URIs in a multitude of ways, and browsers and web servers encourage this kind of behaviour by autocorrecting many malformed URIs before the request is sent or when it is received. A common method is to use the URI fragment # to separate the namespace from the local name, but a more flexible, less problematic way (also used by the FI-Platform) is to simply use the slash / and split the IRI at the rightmost slash.

The format shown above is called a raw URI, but when looking at linked data serializations, you might stumble upon a representation familiar from XML: namespaceName:localName. This is called a CURIE (Compact URI Expression), where the namespace is given a shorter, easier-to-read prefix that is concatenated with the local name using a colon. This only aids the human-readability of the serializations, and requires that you always keep the context (prefix-to-namespace-URI mappings) together with the data. Otherwise you will end up with unusable data if you combine CURIE-form datasets that use colliding prefixes for different namespace URIs. This is not a problem when managing linked data with software, only when editing it manually.
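As an illustration, here is a minimal Turtle sketch of the same triple written once with raw IRIs and once with prefix declarations and CURIEs. The model name "example" and its version in the namespace are hypothetical, as are the prefix labels:

  # The triple written with raw IRIs only.
  <https://iri.suomi.fi/model/example/1.0.0/Person> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .

  # The same triple with prefix declarations and CURIEs (prefix:localName).
  @prefix ex:  <https://iri.suomi.fi/model/example/1.0.0/> .
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix owl: <http://www.w3.org/2002/07/owl#> .

  ex:Person rdf:type owl:Class .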

As you are aware, there are some strict limitations with URIs - such as allowing only ASCII characters. On the FI-Platform we have chosen to use IRIs instead. An IRI is a Unicode extension of a URI, meaning that both the domain and path parts of the identifier can contain a reasonably wide subset of Unicode characters, which gives tremendous flexibility for supporting the needs of different modeling domains. There are some things to consider, namely that the domain part is handled differently from the path part: all internationalized domain names need to be punycoded (represented as ASCII) and are not case-sensitive, whereas the path part is case-sensitive and supports Unicode. Also, you should be aware that every IRI can be converted to a URI for backwards compatibility, which naturally creates potential collisions, as there is no way to distinguish the ASCII version of an IRI from a URI. It is recommended to use only software that can deal with IRIs natively without requiring any conversion. Also, per the RFC, two identifiers are the same if and only if they match exactly character by character. This means, for example, that an IRI containing Scandinavian letters and its URI conversion are not the same identifier. We do have the capability to say that they are the same, but how to do that is explained later on in the text.

If you are managing data that you intend to distribute, share or transfer to other parties, it is good to adhere to the rule that all resources are given HTTP URI names from a namespace (domain or IP address) that a) you control and b) resolves to something meaningful. There is an exception though: if all the information is a) managed in a siloed environment and b) can be represented explicitly as linked data requiring no resolving, you can freely use any URIs you wish without worrying about them resolving to anything or conflicting with existing URIs outside your environment. This means that you can also mint URIs for resources before you have a framework ready for resolving them.

spiral notepad You should use primarily IRIs for naming all entities you create and only use namespaces that are under your control.

spiral notepad You should always publish your models and data with versioned IRIs to avoid unintended side-effects.

Linked Data is Atomic

It is crucial to understand that by nature, linked data is atomic and always in the form of a graph. The lingua franca of linked data is RDF (Resource Description Framework), which allows for a very intuitive and natural way of representing information. In RDF everything is expressed as triples (3-tuples): statements consisting of three components (resources). You can think of triples simply as rows of data in a three-column data structure: the first column holds the subject resource (from whose point of view the statement is made), the second column holds the predicate resource (what is being stated about the subject), and the third column holds the object or value resource of the statement. Simplified to the extreme, "Finland is a country" is a statement in this form:
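A rough sketch of that statement as a single triple, using informal names rather than real IRIs:

  Finland --[ is a ]--> Country        (subject --[ predicate ]--> object)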

If all data is explicitly in RDF, it means we have a fully atomic dataset where everything from the types of entities down to their attributes exist as individual resources (nodes) connected by associations (edges). If we expand the example above to also include statements about the number of lakes and Finland's capital, we could end up with the following dataset:

As you can see, there is no traditional class/instance structure with inner fields. Finland as an entity does not have fixed attribute slots inside it for the number of lakes or its capital: everything is expressed by simply adding more associations between individual nodes. As mentioned above, everything is named with a URI, so a more realistic example would actually look like this:

The triples in this dataset can be serialized very simply as three triples each consisting of three resources, or represented as a three column tabular structure:

subject resource | predicate resource | object resource
<https://finland.fi/> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <https://datamodel/Country>
<https://finland.fi/> | <https://datamodel/hasNumberOfLakes> | "187888"^^xsd:integer
<https://finland.fi/> | <https://datamodel/hasCapital> | <https://finland.fi/Helsinki>
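Serialized in Turtle, the same three triples would look roughly like this (dm is just a prefix for the hypothetical datamodel namespace used in the table, and xsd refers to the XML Schema datatype namespace):

  @prefix dm:  <https://datamodel/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  <https://finland.fi/> a dm:Country ;              # "a" is shorthand for rdf:type
      dm:hasNumberOfLakes "187888"^^xsd:integer ;
      dm:hasCapital <https://finland.fi/Helsinki> .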

As you can see, the model consists of only atomic parts with nothing inside them. Linked data is structural: the meaning of every resource is defined only by how it is related to other resources. We can say that conceptually it is the resource <https://finland.fi/> that contains the number of lakes or its type, but on the level of data structures, there is no containment. This is in stark contrast to typical data modeling paradigms, where entities are specified as templates or containers with mandatory or optional components. As an example, UML classes are static in the sense that they cannot be expanded to cover facets that were not part of the original design. On the other hand, we can expand the resource <https://finland.fi/> with an unlimited number of facts, even conflicting ones if our intention is to compare or harmonize data between datasets. We could for example have two differing values for the number of lakes, with additional information on the counting method, time and responsible party, in order to analyze the discrepancy.

Note that triples are directional. So if our dataset or model points to an object resource somewhere on the Web, that resource has no idea that our model exists or associates itself with it. But we are naturally able to use those external resources at any position in the triples of our graph: we could even take a resource and put words in its mouth. As an example, we could take the (pretty much universally used) definition of a class from the OWL vocabulary residing at <http://www.w3.org/2002/07/owl#Class> and state novel facts about it. This is generally not recommended, but in specific use cases it can be extremely beneficial to be able to expand outside resources with new facts (on the other hand, there are more sophisticated tools offered by e.g. OWL for achieving the same ends).

In a nutshell:

spiral notepad All data and models are graphs defined as triples.

spiral notepad It is always possible to expand graphs by adding triples pointing to new resources or to state facts about the external world.

Linked Data Can Refer to Anything

You might have already guessed that this kind of graph data structure becomes cumbersome when it is used, for example, to store lists or arrays. Both are possible in RDF, but the flexibility of linked data lets us leverage the fact that the same URI can serve a large dataset, e.g. in JSON format, when requested with the appropriate Accept content type, while also acting as the identifier of the dataset and offering a relevant RDF description of it. Additionally, REST endpoint URLs that point to individual records or fragments of a dataset can be used to talk about those records in RDF while simultaneously keeping the original data structure and access methods intact. The URL-based ecosystem does not have to be aware of the semantic layer that is added on top of or parallel to it, so implementing linked data based semantics in your information management systems and flows is not disruptive by default.

Let us take an example: you might have a REST API that serves photos as JPEGs with a path such as https://archive.myserver/photos/localName. When making an HTTP request with an Accept header for image/jpeg, the URI will resolve to a JPEG, but when using an Accept header for application/ld+json, the same URI would resolve to a semantic RDF representation of the photo in JSON-LD (for example its location, time and other EXIF data, as well as provenance etc.).

What about models then? 

Everything Has an Identity

Because the above-mentioned resources are all HTTP URIs (with the exception of the literal integer value for the number of lakes), we are free to refer to them either to expand the domain of this dataset or to combine it with another dataset to cover new use cases. The killer feature of linked data is the ability to enrich data by combining it easily and inferring new facts from it. As an example, we could expand the model to add the perspective of an individual ("Matti resides in Finland") and of a class ("A country has borders"):
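In Turtle, such an expansion could look roughly like the following sketch. The resource names for Matti, residesIn, hasFeature and Borders are hypothetical, and the class-level statement is written as a plain triple rather than a formal OWL axiom:

  @prefix dm: <https://datamodel/> .

  # The perspective of an individual: Matti resides in Finland.
  <https://example.org/people/Matti> dm:residesIn <https://finland.fi/> .

  # A class-level perspective: a country has borders.
  dm:Country dm:hasFeature dm:Borders .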

A small exception to the naming rule is that the literal integer value of 187888 does not have an identity (nor do any other literal values).

All Resources Are First-Class Citizens

In traditional UML modeling, classes are the primary entities, with other elements being subservient to them. In RDF this is not the case. Referring to the example above, the resource <https://datamodel/hasCapital> is not bound to be only an association, even though here it is used as such. As it has been named with a URI, we can use it as the subject of additional triples that describe it. We could, for example, add triples that describe its meaning or metadata about its specification. So, when reading the diagrams you should always keep in mind that the sole reason some resources are represented as edges is that they appear in the predicate position of the stated triples.

For example, another triple could state that the creator of the <https://datamodel/hasCapital> resource is some organization under the <https://finland.fi/> namespace.
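A sketch of such a triple in Turtle, using the Dublin Core creator property (the organization IRI is hypothetical):

  @prefix dct: <http://purl.org/dc/terms/> .

  # The association resource itself is now the subject of a statement.
  <https://datamodel/hasCapital> dct:creator <https://finland.fi/someOrganisation> .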

Data and Model Can Coexist

From the examples above it is evident that both individuals ("instances") and classes are named the same way and can coexist in the same graph. From the plain linked data perspective such a distinction doesn't even exist: there are just resources pointing to other resources with RDF. The distinction becomes important when the data is interpreted through the lens of a knowledge description language where some of the resources have a special meaning for the processing software. The meaning of these resources becomes evident when we discuss core vocabularies and application profiles.

Relationships Are Binary

A third key takeaway is that all relationships are binary. This means that structures like the following (n-ary relationships) are not possible:

Stating this would always require two triples, each one using the same "has child" association as its predicate, meaning the association is used twice. Treating them as one therefore requires an additional grouping structure, for example when all of the child associations share some feature that we do not want to replicate for each individual association.
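A sketch with hypothetical names: first the two plain binary triples, then one possible grouping structure that carries the shared feature:

  @prefix ex:  <https://example.org/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  # Two separate binary triples using the same "has child" association.
  ex:Anna ex:hasChild ex:Bertta .
  ex:Anna ex:hasChild ex:Cecilia .

  # An additional grouping resource carrying a feature shared by both child associations.
  ex:annasCustody a ex:CustodyArrangement ;
      ex:parent ex:Anna ;
      ex:child ex:Bertta , ex:Cecilia ;
      ex:validFrom "2020-01-01"^^xsd:date .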

You're Not Modeling Records but Entities

What the resources in a linked data dataset represent depends of course entirely on the use case at hand. In general though, there is a deep philosophical distinction between how linked data and traditional approaches treat modeling. Traditionally, data modeling is done in a siloed way with an emphasis on modeling records, in other words data structures that describe a set of data for a specific use case. As an example, different information systems might hold data about an individual's income taxes, medical history etc. These data sets relate to the individual indirectly via some kind of permanent identifier, such as the Finnish Personal Identity Code. Neither the identifier nor the records are meant to represent the entities themselves, just the relevant data about them. The data is typically somewhat denormalized and serves a process or system view of the domain at hand.

On the other hand, in linked data the modelling of specific entities is often approached by assuming that the minted URIs (or IRIs) are actually names for the entities themselves.

Core Vocabularies (Ontologies)

Core Vocabularies provide a structured framework to represent knowledge as a set of concepts within a domain and the relationships between those concepts. They are used extensively to formalize a domain's knowledge in a way that can be processed by computers. Ontologies allow for sophisticated inferences and queries because they can model complex relationships between entities and can include rules for how entities are connected.

  • RDFS (RDF Schema) is a basic ontology language providing basic elements for the description of ontologies. It introduces concepts such as classes and properties, enabling rudimentary hierarchical classifications and relationships.

  • OWL (Web Ontology Language) offers more advanced features than RDFS and is capable of representing rich and complex knowledge about things, groups of things, and relations between things. OWL is highly expressive and designed for applications that need to process the content of information instead of just presenting information.

Schemas

Schemas, on the other hand, are used for data validation. They define the shape of the data, ensuring it adheres to certain rules before it is processed or integrated into systems. Schemas help maintain consistency and reliability in data across different systems.

  • SHACL (Shapes Constraint Language) is used to validate RDF graphs against a set of conditions. These conditions are defined as shapes and can be used to express constraints, such as the types of nodes, the range of values, or even the cardinality (e.g., a person must have exactly one birthdate).

While ontologies and schemas are the main categories, there are other tools and languages within the linked data ecosystem that also play vital roles, though they may not constitute a separate major category by themselves. These include:

  • SPARQL (SPARQL Protocol and RDF Query Language), which is used to query RDF data. It allows users to extract and manipulate data stored in RDF format across various sources. SPARQL can be embedded in SHACL constraints to further increase its expressivity.

  • SKOS (Simple Knowledge Organization System), which is used for representing knowledge organization systems such as thesauri and classification schemes within RDF.

Each tool or language serves specific purposes but ultimately contributes to the broader goals of linked data: enhancing interoperability, enabling sophisticated semantic querying, and ensuring data consistency across different systems. Ontologies and schemas remain the foundational categories for organizing and validating this data.

When transitioning from traditional modeling techniques like UML (Unified Modeling Language) or Entity-Relationship Diagrams (ERD) to linked data based modeling with tools like OWL, RDFS, and SHACL, practitioners encounter both conceptual and practical shifts. This chapter aims to elucidate these differences, providing a clear pathway for those accustomed to conventional data modeling paradigms to adapt to linked data methodologies effectively.

Conceptual Shifts

Graph-based vs. Class-based Structures

  • Traditional Models: Traditional data modeling, such as UML and ERD, uses a class-based approach where data is structured according to classes which define the attributes and relationships of their instances. These models create a somewhat rigid hierarchy where entities must fit into predefined classes, and interactions are limited to those defined within and between these classes.
  • Linked Data Models: Contrastingly, linked data models, utilizing primarily RDF-based technologies, adopt a graph-based approach. In these models, data is represented as a network of nodes (entities) and edges (relationships) that connect them. Each element, whether data or a conceptual entity, can be directly linked to any other, allowing for dynamic and flexible relationships without the confines of a strict class hierarchy. This structure facilitates more complex interconnections and seamless integration of diverse data sources, making it ideal for expansive and evolving data ecosystems.

From Static to Dynamic Schema Definitions

  • Traditional Models: UML and ERD typically define rigid schemas intended to structure database systems where the schema must be defined before data entry and is difficult to change.
  • Linked Data Models: OWL, RDFS etc. allow for more flexible, dynamic schema definitions that can evolve over time without disrupting existing data. They support inferencing, meaning new relationships and data types can be derived logically from existing definitions.

From Closed to Open World Assumption

  • Traditional Models: Operate under the closed world assumption where what is not explicitly stated is considered false. For example, if an ERD does not specify a relationship, it does not exist.
  • Linked Data Models: Typically adhere to the open world assumption, common in semantic web technologies, where the absence of information does not imply its negation, i.e. we cannot deduce falsity based on missing data. This approach is conducive to integrating data from multiple, evolving sources.

Entity Identification

  • Traditional Models: Entities are identified within the confines of a single system or database, often using internal identifiers (e.g., a primary key in a database). Not all entities (such as attributes of a class) are identifiable without their context (i.e. one can't define an attribute with an identity and use it in two classes).
  • Linked Data Models: Linked data models treat all elements as atomic resources that can be uniquely identified and accessed. Each resource, whether it's a piece of data or a conceptual entity, is assigned a Uniform Resource Identifier (URI). This ensures that every element in the dataset can be individually addressed and referenced, enhancing the accessibility and linkage of data across different sources.

Practical Shifts

Modeling Languages and Tools

  • Traditional Models: Use diagrammatic tools to visually represent entities, relationships, and hierarchies, often tailored for relational databases.
  • Linked Data Models: Employ declarative languages that describe data models in terms of classes, properties, and relationships that are more aligned with graph databases. These tools often focus on semantics and relationships rather than just data containment.

Data Integrity and Validation

  • Traditional Models: Data integrity is managed through constraints like foreign keys, unique constraints, and checks within the database system.
  • Linked Data Models: SHACL is used for validating RDF data against a set of conditions (data shapes), which can include cardinality, datatype constraints, and more complex logical conditions.

Interoperability and Integration

  • Traditional Models: Often siloed, requiring significant effort (e.g. ETL solutions, middleware) to ensure interoperability between disparate systems.
  • Linked Data Models: Designed for interoperability, using RDF (Resource Description Framework) as a standard model for data interchange on the Web, facilitating easier data merging and linking.

Transition Strategies

Understanding Semantic Relationships

  • Invest time in understanding how OWL and RDFS manage ontologies, focusing on how entities and relationships are semantically connected rather than just structurally mapped.

Learning New Validation Techniques

  • Learn SHACL to understand how data validation can be applied in linked data environments, which is different from constraint definition in relational databases.

Adopting a Global Identifier Mindset

  • Embrace the concept of using URIs for identifying entities globally, which involves understanding how to manage and resolve these identifiers over the web.
  • It is also worth learning how URIs differ from URNs and URLs, how they enable interoperability with other identifier schemes (such as UUIDs), what resolving identifiers means, and how namespacing can be used to scope URIs locally.

Linked Data Modeling in Practice

You might already have a clear-cut goal for modeling, or alternatively be tasked with a still loosely defined goal of improving the management of knowledge in your organization. As a data architect, you're accustomed to dealing with REST APIs, PostgreSQL databases, and UML diagrams. But the adoption of technologies like RDF, RDFS, OWL, and SHACL can elevate your data architecture strategies. Here is a short list explaining some of the most common use-cases as they would be implemented here (not exhaustive, more examples to be added in the future):

General recipe

  1. Use the FI-Platform terminology tool (https://sanastot.suomi.fi/en) to reuse existing or specify novel terminological concepts for your message specification.
  2. Use the FI-Platform modeling tool to discover and reuse, derive from or specify novel vocabulary structures for annotating your message. Existing external compatible vocabularies can be used as well. The basis for these definitions should preferably already exist in terminological form in the terminology tool.
  3. Use the FI-Platform modeling tool to discover and reuse, derive from or specify a novel application profile for the message, basing the restrictions in the message on the named resources you previously specified in the vocabulary or vocabularies.

Recipe 1: Define a Message Schema for Information Exchange

  1. Apply the general recipe up to step 3.
  2. Choose one of the use cases:
    1. If you intend to use the application profile for validating messages already in linked data format, you can use the permanent published IRI of the model as the shapes graph for an existing validator (for example the EU Interoperability Test Bed validator, accessible here: https://www.itb.ec.europa.eu/shacl/any/upload, sources here: https://github.com/ISAITB/shacl-validator)
    2. More use cases to be added...

Recipe 2: Turn Relational Data Into Linked Data or Serve Relational Data as Linked Data

  1. Apply the general recipe up to step 2.
  2. Choose one of the use cases:
    1. Use R2RML (RDB to RDF Mapping Language, https://www.w3.org/TR/r2rml/) to create a mapping from the schema to a core vocabulary, and use one or more FI-Platform core vocabularies to annotate the created data. The mapping does not have to be one-to-one; you are able to create mappings that resemble views as well (see the sketch after this recipe).
    2. Use a virtualization interface for an on-the-fly mapping of data with an R2RML mapping based on a core vocabulary to serve relational data through a SPARQL endpoint.
    3. More use cases to be added...
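A minimal R2RML sketch of the mapping idea in use case 1 above; the table, column and vocabulary names are purely hypothetical. Rows of a COUNTRY table are mapped to instances of a core vocabulary class and to one of its attributes:

  @prefix rr: <http://www.w3.org/ns/r2rml#> .
  @prefix dm: <https://iri.suomi.fi/model/example/1.0.0/> .

  <#CountryMapping> a rr:TriplesMap ;
      rr:logicalTable [ rr:tableName "COUNTRY" ] ;
      # Each row becomes an instance of the core vocabulary class dm:Country.
      rr:subjectMap [
          rr:template "https://example.org/country/{ISO_CODE}" ;
          rr:class dm:Country
      ] ;
      # The NUMBER_OF_LAKES column becomes the dm:hasNumberOfLakes attribute.
      rr:predicateObjectMap [
          rr:predicate dm:hasNumberOfLakes ;
          rr:objectMap [ rr:column "NUMBER_OF_LAKES" ]
      ] .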

For conceptual and logical models, you should create a Core Vocabulary (i.e. an OWL ontology). You can do lightweight conceptual modeling by merely specifying classes and their interrelationships with associations, or a more complex and comprehensive model by including attributes, their hierarchies and attribute/relationship-based constraints. In either case, the same OWL ontology acts both as a formally defined conceptual and a logical model; there is no need for an artificial separation of the two, which also helps to avoid inconsistencies cropping up between them. The primary advantage of basing these models on OWL is the ability to use inferencing engines to logically validate the internal consistency of the created models, which is not possible with traditional conceptual and logical models.

For relational-RDF mappings, please check the following resources:

  • https://graphdb.ontotext.com/documentation/10.6/virtualization.html
  • https://linkml.io/linkml/

Regarding database schemas, even for relational databases, conceptual and logical models are necessary to define the structure and interrelationships of data.

RDFS and OWL can also be used to document the schema of an existing relational database in RDF format. This can serve as an interoperability layer to help integrate the relational database with other systems, facilitating easier data sharing and querying across different platforms.

You should create an application profile when:

  • Defining API specifications
  • Defining message schemas

Which One to Create: a Core Vocabulary or an Application Profile?

Which model type should you start with? This naturally depends on your use-case. You might be defining a database schema, building a service that distributes information products adhering to a specific schema, trying to integrate two datasets... In general, all these and other use-cases start with the following workflow:

  1. In general the expectation is that your data needs to be typed, structured and annotated with metadata at least on some level. This is where you would traditionally employ a conceptual model and terminological or controlled vocabularies as the semantic basis for your modeling. You might have already done this part, but if not, the FI-Platform can help you arrive at a logically consistent conceptual model.
  2. Your domain's conceptual model needs to be created in a formal semantic form as a core vocabulary. The advantage compared to a traditional separate conceptual and logical model is that here they are part of the same definition and the logical soundness of the definitions can be validated. Thus, there is no risk of ending up with a logical model that would be based on a conceptually ambiguous (and potentially internally inconsistent) definition.
  3. When you've defined and validated your core vocabulary, you have a sound basis to annotate your data with. Annotating in principle consists both of typing and describing the data (i.e. adding metadata). Annotating your data allows for inferencing (deducing indirect facts from the data), logical validation (checking if the data adheres to the definitions set in the core vocabulary), harmonizing datasets, etc.
  4. Finally, if you intend to create schemas e.g. for validating data passing through an API or ensuring that a specific message payload has a valid structure and contents, you need to create an application profile. Application profiles are created based on core vocabularies, i.e. an application profile looks for data annotated with some core vocabulary concepts and then applies validation rules.

Thus, the principal difference between these two is:

  • If you want to annotate data, check its logical soundness or infer new facts from it, you need a core vocabulary. With a core vocabulary you are essentially making a specification stating "individuals that fit these criteria belong to these classes".
  • If you want to validate the data structure or do anything you'd traditionally do with a schema, you need an application profile. With an application profile you are essentially making a specification stating "graph structures matching these patterns are valid".

Core Vocabularies in a Nutshell

As mentioned, the idea of a Core Vocabulary is to semantically describe the resources (entities) you will use to describe your data. In other words, what typically ends up as conceptual model documentation or a diagram is now described by a formal model.

Core vocabularies are technically speaking ontologies based on the Web Ontology Language (OWL) which is a knowledge description language. OWL makes it possible to connect different data models (and thus data annotated with different models) and make logical inferences on which resources are equivalent, which have a subset relationship, which are complements etc. As the name implies, there is a heavy emphasis on knowledge - we are not simply labeling data but describing what it means in a machine-interpretable manner.

A key feature of OWL is the capability to infer facts from the data that are not immediately apparent - especially in large and complex datasets. This makes the distinction between core and derived data more apparent than in traditional data modeling scenarios, and helps to avoid modeling practices that would increase redundancy or the potential for inconsistencies in the data. Additionally, connecting two core vocabularies allows for inferencing between them (and their data) and thus harmonizing them. This allows, for example, revealing parts of the datasets that represent the same data despite being named or structured differently.

The OWL language has multiple profiles for different kinds of inferencing. The one currently selected for the FI-Platform (OWL 2 EL) is computationally simple, but still logically expressive enough to fulfill most modeling needs. An important reminder when doing core vocabulary modeling is to constantly ask: is the feature I am after part of a specific use case (and thus an application profile), or is it essential to the definition of these concepts?

Application Profiles in a Nutshell

Application profiles fill the need not only to validate the meaning and semantic consistency of data and specifications, but also to enforce a specific syntactic structure and contents for data.

Application profiles are based on the Shapes Constraint Language (SHACL), which does not deal with inferencing and should principally be considered a pattern matching validation language. A SHACL model can be used to find resources based on their type, name, relationships or other properties and check for various conditions. The SHACL model can also define the kind of validation messages that are produced for the checked patterns.

Following the key Semantic Web principles, SHACL validation is not based on whitelisting (deny all, permit some) like traditional closed schema definitions. Instead, SHACL works by validating the patterns we are interested in and ignoring everything else. Due to the nature of RDF data this does not cause problems, as triples that are not part of the validated patterns can simply be ignored. It is also possible to extend SHACL validation with SHACL-SPARQL or SHACL JavaScript extensions to perform a vast amount of pre/post-processing and validation of the data, though this is not currently supported by the FI-Platform nor within the scope of this document.

Core Vocabulary modeling

When modeling a core vocabulary, you are essentially creating three types of resources:

Attributes

Attributes are in principle very similar to attribute declarations in other data modeling languages. There are some differences nevertheless that you need to take into account:

  1. Attributes can be used without classes. For an attribute definition, one can specify rdfs:domain and/or rdfs:range. The domain refers to the subject in the <subject, attribute, literal value> triple, and range refers to the literal value. Basically what this means is that when such a triple is found in the data, its subject is assumed to be of the type specified by rdfs:domain, and the datatype is assumed to be of the type specified by rdfs:range.
  2. An attribute can be declared as functional, meaning that when used it will have at most one value. As an example, one could define a functional attribute called age with a domain of Person. This would then indicate that each instance of Person can have at most one literal value for their age attribute. On the other hand, if the functional declaration is not used, the same attribute (e.g. nickname) can be used to point to multiple literal values (see the sketch after this list).
  3. Attribute datatypes are by default XSD datatypes, which come with their own datatype hierarchy (see the XML Schema datatypes specification).
  4. In core vocabularies it is sometimes preferable to define attribute datatype on a very general level, for example as rdfs:Literal. This allows using the same attribute in a multitude of application profiles with the same intended semantic meaning but enforcing a context-specific precise datatype in each application profile.
  5. Attributes can have hierarchies. This is an often overlooked but useful feature for inferencing. As an example, you could create a generic attribute called Identifier that represents the group of all attributes that act as identifiers. You could then create sub-attributes, for example TIN (Tax Identification Number), HeTu (the Finnish personal identity code) and so on.
  6. Attributes can have explicit equivalence declarations (i.e. an attribute in this model is declared to be equivalent to some other attribute).
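A sketch of the points above in plain OWL/Turtle. All names are hypothetical and the snippet is hand-written for illustration, not output of the modeling tool:

  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:   <https://example.org/vocab/> .

  # A generic identifier attribute with a deliberately broad range (points 4 and 5).
  ex:identifier a owl:DatatypeProperty ;
      rdfs:range rdfs:Literal .

  # A functional sub-attribute (points 1, 2, 3 and 5): each Person has at most one value.
  ex:personalIdentityCode a owl:DatatypeProperty , owl:FunctionalProperty ;
      rdfs:subPropertyOf ex:identifier ;
      rdfs:domain ex:Person ;
      rdfs:range xsd:string .

  # An explicit equivalence to an attribute in another (hypothetical) vocabulary (point 6).
  ex:personalIdentityCode owl:equivalentProperty <https://other.example.org/vocab/hetu> .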

Associations

Associations are similarly not drastically different compared to other languages. There are some noteworthy things to consider nevertheless:

  1. Associations can be used without classes as well. The rdfs:domain and rdfs:range options can here be used to define the source and target classes for the uses of a specific association. As an example, the association hasParent might have Person as both its domain and range, meaning that all triples using this association are assumed to describe connections between instances of Person.
  2. Associations in RDF are binary, meaning that the triple <..., association, ...> will always connect two resources with the association acting as the predicate.
  3. Associations can have hierarchies similarly to attributes.
  4. Associations have flags for determining whether they are reflexive (meaning that every resource is assumed to be connected to itself by the association) and whether they are transitive (meaning that if A and B as well as B and C are connected by association X, then A is also considered to be connected to C by association X); see the sketch after this list.
  5. Associations can have explicit equivalence declarations (i.e. an association in this model is declared to be equivalent to some other association).
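A corresponding sketch for associations in plain OWL/Turtle (all names hypothetical):

  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <https://example.org/vocab/> .

  # A binary association whose domain and range are both Person (points 1 and 2).
  ex:hasParent a owl:ObjectProperty ;
      rdfs:domain ex:Person ;
      rdfs:range ex:Person .

  # An association hierarchy with a transitive super-association (points 3 and 4).
  ex:hasAncestor a owl:ObjectProperty , owl:TransitiveProperty .
  ex:hasParent rdfs:subPropertyOf ex:hasAncestor .

  # An explicit equivalence to an association in another (hypothetical) vocabulary (point 5).
  ex:hasParent owl:equivalentProperty <https://other.example.org/vocab/onVanhempi> .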

Classes

Classes form the most expressive backbone of OWL. Classes can simply utilize the rdfs:subClassOf association to create hierarchies, but typically classes also contain property restrictions - in the current FI-Platform case very simple ones. A class can state existential restrictions requiring that the members of the class contain specific attributes and/or associations. Further cardinality restrictions are not declared here, as the chosen OWL profile does not support them, and cardinality can be explicitly defined in an application profile. In order to require specific associations or attributes to be present in an instance of a class, those associations and attributes must already be defined, as they are never owned by a class, unlike in e.g. UML. They are individual definitions that are simply referred to by the class definition. This allows for situations where an extremely common definition (for example a date of birth or surname) can be defined only once in one model and then reused endlessly in all other models without ever having to be redefined.

Classes are not templates in the vein of programming or UML style classes but mathematical sets. You should think of a class definition as being the definition of a group: "instances with these features belong to this class". Following the Semantic Web principles, it is possible to declare a resource as an instance of a class even if it doesn't contain all the attribute and association declarations required by the class "membership". On the other hand, such a resource would never be automatically categorized as being a member of said class as it lacks the features required for the inference to make this classification.

Similarly to associations and attributes, classes have equivalence declarations. Additionally, classes can be declared as non-intersecting (disjoint). It is important to understand that classes being sets does not by default force them to be strictly separated in any way. From the perspective of the inference reasoner, classes for inanimate objects and people could well be overlapping, unless this is explicitly declared logically inconsistent. With a well laid out class hierarchy, simply declaring a couple of superclasses as non-intersecting will automatically make all their subclasses non-intersecting as well.
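A sketch of a class definition combining these features in plain OWL/Turtle (hypothetical names):

  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <https://example.org/vocab/> .

  # A class hierarchy plus an existential restriction: every Person has some date of birth.
  ex:Person a owl:Class ;
      rdfs:subClassOf ex:Agent ;
      rdfs:subClassOf [
          a owl:Restriction ;
          owl:onProperty ex:dateOfBirth ;
          owl:someValuesFrom rdfs:Literal
      ] .

  # An equivalence declaration and a non-intersection (disjointness) declaration.
  ex:Person owl:equivalentClass <https://other.example.org/vocab/Henkilo> .
  ex:Person owl:disjointWith ex:Building .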

Application profile modeling 

With application profiles we use a strictly separate set of terms to avoid mixing up the core vocabulary structures we are validating and the validating structures themselves. The application profile entities are called restrictions:

Attribute and association restrictions

These restrictions are tied to specific attribute and association types that are used in the data being validated. Creating a restriction for a specific core vocabulary association allows it to be reused in one or more class restrictions. In the future the functionality of the FI-Platform might be extended to cover using attribute and association restrictions individually without class restrictions, but currently this is not possible.

Common restrictions for both attributes and associations are the maximum and minimum cardinalities. There is no inherent link between the cardinalities specified in SHACL and the functional switch defined for a core vocabulary attribute, as they have different use-cases; it is nevertheless usually preferable to keep the two consistent (a functional core attribute should not be allowed or required to have a cardinality greater than 1 in an application profile). Allowed, required and default values are also common features for both restriction types.

The available features for attribute restrictions are partially dependent on the datatype of the attribute. As mentioned before, it is preferable to set the exact required datatype here and use a wider datatype in the core vocabulary attribute. For string types, maximum and minimum lengths, a regex validation pattern, and allowed languages are supported; for numerical types, minimum and maximum values are currently supported.

For association restrictions, the currently supported extra restriction is the class type requirement for the association restriction target (i.e. what type of an instance must be at the object end of the association).
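A sketch of attribute and association restrictions as SHACL property shapes (hypothetical names; the FI-Platform generates comparable structures, this only illustrates the concepts):

  @prefix sh:  <http://www.w3.org/ns/shacl#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:  <https://example.org/vocab/> .

  # An attribute restriction: exactly one name, a string of at most 100 characters.
  ex:NameRestriction a sh:PropertyShape ;
      sh:path ex:name ;
      sh:datatype xsd:string ;
      sh:minCount 1 ;
      sh:maxCount 1 ;
      sh:maxLength 100 .

  # An association restriction: the object of hasParent must be an instance of Person.
  ex:HasParentRestriction a sh:PropertyShape ;
      sh:path ex:hasParent ;
      sh:class ex:Person .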

Class restrictions

Similarly to core vocabulary classes, class restrictions also utilize a group of predefined attribute and association restrictions. Again, this allows for example the specification of some extremely reusable association and attribute restrictions, which can then be reused a multitude of times in various application profiles.

The target class definition works by default on the level of RDFS inferencing (in other words, it will validate the instances of the specified class and all its subclasses).

Class restrictions do not operate in a set-theoretical manner like core vocabulary definitions, but there is a way to implement "inheritance" between validated classes: if a class restriction utilizes another class restriction, its target class's contents are checked against both class restrictions.
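A sketch of class restrictions as SHACL node shapes, reusing the property shapes from the previous sketch (hypothetical names):

  @prefix sh: <http://www.w3.org/ns/shacl#> .
  @prefix ex: <https://example.org/vocab/> .

  # Validates every instance of ex:Person found in the data graph.
  ex:PersonRestriction a sh:NodeShape ;
      sh:targetClass ex:Person ;
      sh:property ex:NameRestriction ;
      sh:property ex:HasParentRestriction .

  # A class restriction that utilizes another one: instances of ex:Employee are
  # checked against both ex:EmployeeRestriction and ex:PersonRestriction.
  ex:EmployeeRestriction a sh:NodeShape ;
      sh:targetClass ex:Employee ;
      sh:node ex:PersonRestriction .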

General word of caution on modeling

SHACL is a very flexible language, and due to this nature it allows the creation of validation patterns that might seem legitimate but are actually unsatisfiable by any instance data. As an example, the utilization of other class restrictions might lead to a situation where an attribute can never be validated, as it is required to conform to two conflicting datatypes at the same time.

Also, whereas the OWL specification is monolithic and cannot be extended or modified, SHACL can be extended in a multitude of ways, for example by embedding SPARQL queries or JavaScript processing in SHACL constraints. The standard allows for this, but which extensions are supported naturally depends on the SHACL validator used. The FI-Platform in its current form adheres to the core ("vanilla") SHACL specification.

A final note regarding SHACL validation is that its results are also dependent on whether inferencing is executed on the data prior to validation or not. The SHACL validator by default does not know or care about OWL inferencing, and works strictly based on the triples it sees declared in the data. It is recommended that inferencing is run before validation to ensure there are no implicit facts that the SHACL validator could miss. Also, it is important to remember that the core vocabulary declarations for the instance data must be included in the data graph to be validated. The SHACL validator will not resolve anything outside the data graph, and will match patterns only based on what it sees in the data graph.
