This page is a brief introduction to the modelling paradigm used on the FI-Platform, as well as to the recommended modelling principles. Understanding the paradigm is crucial both for data modelers and for actors intending to use the models and data based on them, whereas the recommended modelling principles give additional guidance on how to model in alignment with the paradigm and avoid common pitfalls.

The FI-Platform modelling tool is able to produce core vocabularies and application profiles. While they serve different purposes, they share a common foundation. We will first go through this foundation and then dive into the specifics of these two model types and their use cases. When discussing the modelling paradigm, we sometimes contrast it with traditional technologies such as XML Schema and UML, as those are more familiar to many modelers and data architects.

In this document, the terms data and information are used interchangeably, in their colloquial sense. Knowledge in the context of this document refers to data or information that has been semantically enriched with a core vocabulary.

In this guide we have highlighted the essential key facts with spiral notepad (note) symbols.

The Linked Data Modelling Paradigm

The models published on the FI-Platform are in essence Linked Data and thus naturally compatible with the linked data ecosystem. The FI-Platform nevertheless does not expect modelers nor users of the models to migrate their information systems into Linked Data based ones, or to make a complete overhaul of their information management processes. There are multiple pathways for utilising the models, ranging from lightweight semantic layering to fully integrated solutions. In most cases you can think of Linked Data solutions as a semantic layer covering your information architecture, or as a translational layer between your domain and external actors or between individual information systems.

Linked Data is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It extends the Web to use URIs to name not just documents but also real-world objects and abstract concepts (in general, anything). The key issues Linked Data aims to solve are:

A good set of principles applicable to Linked Data in general are the FAIR principles (see more here: https://www.go-fair.org/fair-principles/).

In essence, Linked Data offers an alternative to typical data management solutions that rely on tailored and siloed data warehouses and catalogues, or on other solutions highly dependent on a specific platform, product or service provider, such as iPaaS.

The Core Idea of Linked Data

As you can imagine based on the name, Linked Data is all about references (links or "pointers") between entities (pieces of data). In principle:

spiral notepad All entities (resources in Linked Data jargon) - be they actual instance data, structures in a data model, or physical or abstract entities - are named with IRIs (Internationalised Resource Identifiers).

spiral notepad The names should (in most cases) be HTTP based IRIs, as this allows a standardized way to resolve the names (i.e. access the resource content named by the IRI).

spiral notepad When a client resolves the name, relevant and appropriate information about the resource should be provided.

spiral notepad The resources should refer (be linked) to other resources when it aids in discoverability, contextualising, validating, or otherwise improving the usability of the data.

What all the resources in a linked data dataset or a model represent depends, of course, entirely on the domain and use case in question. In general though, there is a deep philosophical distinction between how Linked Data and traditional data modelling approach the question of what is being modeled. Traditionally, data modelling has been done with a strong emphasis on modelling records, which has long roots in the way data modelling has been subservient to the technical limitations of database systems and message schemas. In Linked Data the aim is typically not to manage records of the entities in question, but to name and describe the entities directly. As an example, authorities typically create records of people for specific processes concerning e.g. citizenship, taxation or healthcare. These records are then attached to each citizen by one or more identifiers (such as a Personal Identity Code), making it possible to handle the records of a specific individual in a particular process. In Linked Data, the entities in question would be named and described semantically, leading not to a fragmented collection of separate records but to a linked and multifaceted network of these entities and of the information describing them and their interrelations. The distinction here is subtle and quite abstract, but very tangible in real world use cases.

How Things Are Named

As mentioned, all resources are named (minted in the jargon) with identifiers, which we can then use to refer to them when needed.

spiral notepad The FI-Platform gives every resource an HTTP IRI name in the form of <https://iri.suomi.fi/model/modelName/versionNumber/localName>

spiral notepad On the Web you will typically come across mentions of URIs (Uniform Resource Identifiers) and HTTP URIs more often than IRIs or HTTP IRIs. On the FI-Platform we encourage internationalisation and the usage of Unicode, and mint all resources with IRIs instead of URIs, which are restricted to just a subset of the ASCII characters.

spiral notepad IRIs are an extension of URIs, i.e. each URI is already a valid IRI.

spiral notepad IRIs can be mapped to URIs (by percent-encoding the path and Punycode-encoding the domain part) for backwards compatibility, but this is not recommended due to potential collisions. Instead, tools supporting IRIs and Unicode should always be preferred. You should generally never mint ambiguous identifiers (i.e. name one thing with an IRI and another thing with a URI-encoded version of it). The IRI will get encoded anyway when you make an HTTP request, so you should always consider the set of URIs corresponding to URI-encoded versions of an IRI as reserved and not use them for naming resources.

spiral notepad We do not call them HTTPS IRIs, despite the scheme on the FI-Platform being https, as the established naming convention does not dictate the security level of the connection. In general you should always mint HTTP-based identifiers with the https scheme and offer the resources through a secured connection, so do not mistake the names HTTP IRI and HTTP URI to mean that the insecure http protocol should be used.

In the diagram above you can see the hierarchy of the different identifier types. The identifier types based on the HTTP scheme are highlighted as they are primary in Linked Data, but the other scheme types (deprecated, current and potential upcoming ones) are of course possible (a list of IANA schemes is available here: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml). Note that the diagram is conceptually incorrect in the sense that IRIs and URIs (including their HTTP subsets) are broader than URLs, which are merely Uniform Resource Locators. In other words, URLs tell where something on the Web is located for fetching, whereas URIs and IRIs give things identifiers (name them) and, with a proper scheme (like an HTTP IRI), make the name resolvable. But if we simply look at the structure of these identifiers from a pragmatic perspective, the diagram is correct.

URNs are a special case: they are defined per RFC as a subset of URIs. They start with the scheme urn: followed by a local name or a potential sub-namespace. Well-known URN sub-namespaces include ISBNs, EANs, DOIs and UUIDs. An example of an ISBN URN is urn:isbn:0-123-456-789-123. URNs are not globally resolvable and always need a specific resolver. There is no central authority for querying ISBN or ISSN numbers, so such URNs must be resolved against a particular service; UUIDs mean nothing outside their specific context and are thus tightly coupled with the information system(s) they inhabit.

You are probably already familiar with UML or XML namespaces. HTTP IRIs work quite similarly to XML namespaces:

spiral notepad Each IRI consists of a namespace or a namespace and a local name. The idea of a namespace is to avoid collisions in naming resources, not to restrict access or visibility like in UML. Because the path part of HTTP IRIs is hierarchical, namespaces can be nested with the / (slash) separator.

spiral notepad Namespaces can only be reliably distinguished from local names by specifying them (very much in the same way as they are specified in XML). By convention, the namespace is the beginning part of the IRI up to the last # (hash) or / (slash) delimiter. Following this convention is not strictly required, but it is strongly encouraged, as it makes it possible to unambiguously deduce the innermost namespace of a resource. Explicitly naming namespaces grants us the possibility to use XML-style CURIEs (Compact URI Expressions), which look like namespaceName:localName and give a more human-readable representation of IRIs when necessary.

spiral notepad When you mint a new HTTP IRI for something, you can give it any name that adheres to RFC 3987 (https://datatracker.ietf.org/doc/rfc3987/), but as parts of the HTTP namespace are controlled (owned) by different parties, it is not a good principle to name things in namespaces owned by someone else unless specifically given permission. Your models and your data should reside under a domain and a path under your control - either directly or by delegation like on the FI-Platform (each model having its own namespace under the suomi.fi domain).

spiral notepad For details on namespace and local name naming conventions in general, please refer to the W3C recommendation at https://www.w3.org/TR/xml-names11/.
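To illustrate the CURIE mechanism mentioned above, a hypothetical namespace declaration in Turtle syntax might look as follows (the prefix label and IRIs are invented for this example):

```turtle
# Bind a prefix label to a namespace IRI.
@prefix dm: <https://iri.suomi.fi/model/exampleModel/1.0.0/> .

# The CURIE dm:Country is shorthand for the full IRI
# <https://iri.suomi.fi/model/exampleModel/1.0.0/Country>.
dm:Country a <http://www.w3.org/2002/07/owl#Class> .
```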

Life-Cycle and Versioning

When resources are minted and information is served by resolving them, we generally want that information to reflect the resource that the HTTP IRI identifies. As mentioned above, there is a conceptual difference between a URL and an HTTP IRI. A URL just points somewhere, with the path usually giving a hint of what kind of content to expect, but the identity of the content is often quite loosely coupled with the form of the URL. An HTTP IRI, on the other hand, is intended to identify something - thus an IRI for a specific resource should resolve to data about said resource and not to something completely separate, even in situations where the life-cycle of the resource has ended. This question does not only apply to instance data but also to models. Good principles for naming resources are:

spiral notepad Minting an IRI for a resource should typically indicate that the IRI is never to be used for other purposes. Since an identifier for a resource has been created, it might be used years later when the resource itself has ceased to exist. The IRI should nevertheless not be freed for naming other entities.

spiral notepad If the resource in question is modified during its life-cycle within the bounds of its identity, this should be reflected by other data describing the change and pointing to this permanent identifier.

spiral notepad If the resource in question changes and these changes are represented as snapshots of its state during its life-cycle, these snapshots can be minted with their own IRIs (for example with datetime stamp or semantic version number in the path distinguishing them from each other), as they have distinct identities (each snapshot is justifiably its own resource). Additional data is then used to represent the link between the snapshots and the permanent identifier of the resource.
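As a sketch of this principle, using invented IRIs and commonly used annotation properties (dcterms:hasVersion, owl:priorVersion), the link between a permanent identifier and its versioned snapshots could be expressed like this:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Permanent identifier for the resource; never reused for anything else.
<https://example.org/model/person>
    dcterms:hasVersion <https://example.org/model/person/1.0.0> ,
                       <https://example.org/model/person/2.0.0> .

# Each snapshot is its own resource with its own IRI.
<https://example.org/model/person/2.0.0>
    owl:priorVersion <https://example.org/model/person/1.0.0> .
```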

How Linked Data is Structured

It is crucial to understand that linked data is by nature atomic and that a model or dataset always takes the form of a graph. The lingua franca of linked data is RDF (Resource Description Framework), which allows for a very intuitive and natural way of representing information. In RDF everything is expressed as triples (3-tuples): statements consisting of three resources, with one of them potentially being an attribute value (these are literals in RDF jargon). You can think of triples as simply rows of data in a three column data structure: the first column represents the subject resource (from whose point of view the statement is made), the second column represents the predicate resource expressing what is being stated about the subject, and the third column represents the object resource or a literal value of the statement. Simplified to the extreme, "Finland is a country" is a statement in this form:

If we expand the example above to also include statements for example about the number of lakes and Finland's capital, we could end up with the following dataset:

As mentioned above, everything is named with IRIs, so a more realistic example would actually look like this:

The triples in this dataset can be serialized very simply as three triples, each consisting of three parts, or represented as a three-column tabular structure:

subject resource      | predicate resource                                | object resource or literal
<https://finland.fi/> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <https://datamodel/Country>
<https://finland.fi/> | <https://datamodel/hasNumberOfLakes>              | "187888"^^xsd:integer
<https://finland.fi/> | <https://datamodel/hasCapital>                    | <https://finland.fi/Helsinki>

As is evident from the table above, the resources do not all inhabit the same namespaces - they don't even have to be controlled by the same actor(s).
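For instance, the same three triples could be serialised in the Turtle format (one of the serialisation formats supported by the FI-Platform):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The semicolon repeats the same subject for several predicates.
<https://finland.fi/>
    rdf:type <https://datamodel/Country> ;
    <https://datamodel/hasNumberOfLakes> "187888"^^xsd:integer ;
    <https://datamodel/hasCapital> <https://finland.fi/Helsinki> .
```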

The core features of Linked Data can be condensed into the following statements:

spiral notepad All data is atomic: the resources in the graph do not possess inner structures or data. On the other hand, resolving the resources might return more RDF triples or some other content (JSON message, image, a web page, anything) - so in this sense the resources might not be "empty".

spiral notepad All data is structural: the formal semantic meaning of resources is only defined in relation to other resources. It is not necessary for each resource to be resolvable in every case, as the meaning of the resource might already be specified by triples, in which case its IRI acts as a permanent identifier for it.

spiral notepad All relationships between resources are binary: there are no n-ary relationships. A parent having two children means there must be two triples in the dataset with the same predicate, there is no way to link the children to the parent with just one triple, as the children in the object position cannot be expressed as a bag or list.
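For example, linking a parent to two children requires two separate triples sharing the same predicate (all names invented):

```turtle
@prefix ex: <https://example.org/> .

# One triple per child; no single triple can link both children.
ex:parent ex:hasChild ex:child1 .
ex:parent ex:hasChild ex:child2 .
```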

spiral notepad All resources are first-class citizens: You should not mistake the resources in the predicate position being any different from the others - one is free to make statements about them as well, and this is precisely the way many of them are typed. As an example, the graph above could be extended with a triple stating some metadata about the association hasCapital, and a general statement about the class Country:
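Such an extension might look as follows in Turtle (the typing and annotation chosen here are illustrative):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Typing and describing the predicate resource itself:
<https://datamodel/hasCapital> a owl:ObjectProperty ;
    rdfs:comment "Relates a country to its capital city."@en .

# A general statement about the class:
<https://datamodel/Country> a owl:Class .
```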

spiral notepad Data and models can coexist: in the example above you can see that there are instances (<https://finland.fi/>) and classes (<https://datamodel/Country>) together in the same dataset. This is due to the fact that both models and data are described with the same RDF language. The ability to do this is one of the big advantages of Linked Data.

spiral notepad All data is extensible: it is always possible to expand graphs by adding triples pointing to new resources or to state facts about the external world. It is for example possible to combine multiple datasets covering differing facets of the same resource - or even redundant or conflicting descriptions of it - with ease by simply concatenating them. As an example, multiple differing measurements of the same dimension of a specific object can be combined and then analysed or validated. The extensibility is due to Linked Data being fundamentally based on the open world assumption: data that is not available cannot be deduced as being invalid or false - we simply do not have it at hand (yet). Validating data for a particular use case (for example requiring that a person have a date of birth for some certificate) is separate from the use case of combining different data about a person together. The fact that a combined dataset about a person lacks his or her date of birth is not an indication that the dataset is invalid or that it does not describe a person; it merely reflects the state of affairs at a certain point in time. The dataset can always be supplemented with new triples that build a more comprehensive description of the person in question. The combined dataset can then be validated for a variety of different use cases, with each validation schema touching upon a specific facet of the entire dataset.

spiral notepad All data is highly perspective-dependent: a graph can utilise any resources in any positions in a triple, and it will only change the interpretation of the resources within this graph or any other graph utilising this graph. In other words, a Linked Data graph can make statements about entities without affecting the entities. As an example, <http://www.w3.org/2002/07/owl#Class> is a ubiquitous resource for a class definition. It could be expanded with new triples in this graph without affecting any other parties using the same resource - unless this graph was employed by said parties.

How Linked Data is Serialised

The FI-Platform supports three serialisation formats for Linked Data: Turtle, RDF/XML and JSON-LD. These three were chosen as the initial primary formats for the following reasons:

Lately, the list has been expanded with OpenAPI serialisation. If necessary, other serialisation formats can be added, too.

Graphs and Documents

As mentioned above, JSON-LD supports multiple graphs in one document. What does this mean? A graph in Linked Data is a "bucket" for a set of triples. It is used for grouping a specific set of triples together for various purposes, ranging from logical and semantic separation of data to targeted querying. Each graph (naturally) has its own IRI, making it possible to annotate it and thus describe the graph's contents and provenance as a single entity. You should not mistake graphs as having anything to do with namespaces, except that the contents of a graph are separated from the contents of other graphs.
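A minimal sketch of a JSON-LD document carrying two named graphs (all IRIs and the context mapping are invented for illustration):

```json
{
  "@context": { "name": "https://example.org/name" },
  "@graph": [
    {
      "@id": "https://example.org/graph/countries",
      "@graph": [
        { "@id": "https://finland.fi/", "name": "Finland" }
      ]
    },
    {
      "@id": "https://example.org/graph/cities",
      "@graph": [
        { "@id": "https://finland.fi/Helsinki", "name": "Helsinki" }
      ]
    }
  ]
}
```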

On the FI-Platform graphs are used as follows:

spiral notepad When creating a new model (core vocabulary or application profile), you will mint a permanent static identifier for the model. This identifier acts as the resource describing the identity of your model. All the resources you create in your model will also be minted with permanent static identifiers. These resources will reside in a graph acting as a "sandbox" or "staging area" where you are free to modify the contents between publications. The identifiers for these resources are not resolvable to anything; their content is specified explicitly by the model contents (the triples in the graph).

spiral notepad When you publish a new version of your model, the contents will be copied from the sandbox to a new graph having a versioned IRI identifier. This graph will now act as a frozen snapshot of your model from a specific point in time. The versioned graph IRI will be resolvable and will always resolve to this particular snapshot.

spiral notepad When the model contents of a specific version are resolved, the model will contain triples linking the permanent model IRI to the current versioned IRI, a pointer to the previous version's IRI, and a statement about backwards compatibility between these versions (among a lot of other metadata).

When you fetch a serialisation of the model via the versioned IRI, you will get one document containing the graph contents. You should nevertheless generally avoid interpreting a graph as a synonym for a model document; as mentioned, one document can contain multiple graphs, but it could also contain just a fragment of one or more graphs - basically any triples depending on what is needed.

Semantic And Syntactic Validity

When combining data from various sources, an obvious question arises: how can the data be concatenated together while maintaining syntactic validity? Typically, data is structured according to purpose-fit schemas (for example XML Schema or JSON Schema) which follow the closed world assumption, i.e. what is not defined in the schema is assumed not to exist or to be invalid. Thus, adding data from two or more schemas together typically results in a dataset which cannot be validated with any of the schemas. In many agile data management cases, data in a data lake or a NoSQL store might be syntactically heterogeneous and incompatible at the level of the schema languages themselves, making it impossible to combine the data directly (in many cases there are no universal one-to-one mappings between the languages, meaning information is lost or its semantics altered). In both cases, data integration is required to reshape and reinterpret the data into a coherent single dataset. This is often one of the most error prone and labour intensive tasks in data management.

The way Linked Data differs is that data pulled from various sources is always RDF, regardless of the quality of the data. A collection of plain resources, a sophisticated knowledge model, a schema definition and a relational-to-RDF mapping specification can all be simply concatenated into a single graph containing their triples. Thus the problem of syntactic validity between data sources does not really exist in the Linked Data domain. But consequently, another question arises: what about the semantic validity of the data (its meaning or logical soundness)? This is where a big contrast is drawn between the RDF based knowledge description languages and many schema languages:

spiral notepad Schema languages tie the semantic meaning of the data to the syntactic structure and require human-readable documentation for interpreting the meaning. There is for example no natural or "neutral and normalised" way in XML to model a person, apartment and the ownership relation (person owning the apartment), as this relationship between the two entities can be structured with a multitude of different tree structures.

spiral notepad RDF based knowledge description languages on the other hand guide towards a specific conceptualisation of the world due to them being based on formal logics (set theory). The structure can of course be formulated as one wishes, but the extent to which syntax affects semantics is relatively small. Referring to the previous example, the person and apartment would quite naturally be modelled as resources with the ownership being the resource connecting them. Depending on the fidelity of the model, the relationship could be unqualified (direct) or qualified (having additional information describing the qualities of the ownership relationship), etc.
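As a sketch with invented names, the unqualified and qualified variants of the ownership relation could be expressed as follows:

```turtle
@prefix ex: <https://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Unqualified (direct) relationship:
ex:person1 ex:owns ex:apartment1 .

# Qualified relationship: the ownership itself is a resource
# carrying additional information about the relation.
ex:ownership1 a ex:Ownership ;
    ex:owner ex:person1 ;
    ex:ownedProperty ex:apartment1 ;
    ex:ownershipShare "0.5"^^xsd:decimal .
```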

The fundamental difference between schema languages and RDF based knowledge description languages is that with the latter:

spiral notepad The data itself (the resources and the triples) does not inherently mean anything. As mentioned earlier, everything is determined in a constructivist manner: things are defined by their relationships to other things. So, validating that an RDF dataset adheres to the very few schematic rules that RDF requires does not yet tell us anything about its validity in any domain or for any purpose, unlike with schema languages where the context is expressed with syntax.

spiral notepad Building on the previous note: the resources in Linked Data models only receive meaning when they are structured in vocabularies where each resource is given a (typically) formally specified meaning in relation to other resources. The predominant formal way of giving meaning to resources is to use set theory (description logic to be more precise) constructs. These constructs are built into the vocabularies meant for knowledge representation (such as OWL).

spiral notepad These vocabularies that give structured meaning to data are generally referred to as ontologies.

spiral notepad In addition to these formal vocabularies, there exist a number of popular vocabularies using weak semantics. This means that the meaning of the constructs is defined loosely, requiring human interpretation and allowing a degree of flexibility or ambiguity that might better suit the interoperability of knowledge between parties by permitting a fuzzy or loosely defined, flexible interpretation of sameness for definitions.

spiral notepad Literals do not have any semantic meaning, except in a few cases where they are used to indicate e.g. the cardinality or other logical features of the structures.

spiral notepad Building on the previous two notes: the resources in Linked Data datasets only receive meaning when they are connected to resources defined in the models mentioned above.

spiral notepad In summary, Linked Data becomes knowledge only when it is described with a vocabulary suited for knowledge representation.

As an example, the resources rdf:type (specifying instances) and rdfs:subClassOf (specifying classes) have formal meaning encoded in Linked Data tools, meaning data or models containing triples with these two resources as the predicates will have formal meaning. Other, more sophisticated resource definitions exist in vocabularies such as OWL, whose semantic meaning is activated when the data or models are processed with tools capable of logical reasoning (inference) based on these resources. More complex and sophisticated vocabularies can be built on top of these vocabularies. These special resources grant tools the ability to check the logical soundness of what has been said in the data or models. This inference (calculating the logical consequences of the triples, i.e. producing new explicit facts from indirect facts already inherent in the data) takes place based on an open world assumption. This is introduced in more detail in the next section, which deals very superficially with inferencing related to the OWL profile used on the FI-Platform.
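As a minimal illustration (names invented): from the two asserted triples below, a reasoner can produce the new explicit fact that ex:Helsinki is also of type ex:Settlement.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <https://example.org/> .

ex:City rdfs:subClassOf ex:Settlement .
ex:Helsinki rdf:type ex:City .

# Entailed by RDFS semantics, not asserted:
# ex:Helsinki rdf:type ex:Settlement .
```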

There is a multitude of OWL based vocabularies for specific cases, ranging from very specific domain tailored ontologies to so called upper or top ontologies intended for completely content agnostic description of pretty much anything and primarily serving as a foundation for more specific ontologies. Some popular examples include:

While OWL is the predominant de facto language for formal knowledge representation in the Linked Data ecosystem, there are popular vocabularies with weak semantics that serve very important use cases. Some popular examples include:

An exception to these knowledge description resources is SHACL resources, which have a special meaning for SHACL validators. They are used to validate that a certain RDF graph contains specific sub-graphs. This validation is also open world based, meaning that the validation only concerns those sub-graphs (triples) that are defined; others are invisible to the validator. The reason for this is the open world, expansive nature of Linked Data: you should be able to validate a certain facet of a dataset without having to prune everything else from the data being validated.
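For illustration, a simple SHACL shape (with invented names) that validates only one facet of the data - requiring every instance of ex:Person to have exactly one date of birth - while all other triples remain invisible to it:

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <https://example.org/> .

ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:path ex:dateOfBirth ;
        sh:datatype xsd:date ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
```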

The FI-Platform Approach to Linked Data Modelling

At this point we can move onto discussing the two primary model types on the FI-Platform. You should be aware that the implementation of models on the FI-Platform is not universal, in other words it does not currently serve all the possible use cases the user base might have. Nevertheless, the platform can be expanded and refined gradually according to the needs of its user base. 

The primary aim of the tool is however fixed: it is meant for producing annotations which are used to semantically label data in the form of core vocabularies, and restrictions for validating labeled data in the form of application profiles.

When modelling a core vocabulary, you are essentially creating three types of resources: attributes, associations and classes - each with the structures provided by OWL 2 EL. With application profiles you are creating class, attribute or association restrictions provided by SHACL. For modelers with no previous background in OWL or SHACL modelling this might initially seem intuitive and straightforward (one might assume they can create models that look identical and are semantically equivalent to e.g. UML class diagrams), but this assumption does not hold. It is important that you approach the FI-Platform with the correct assumptions about what the platform is intended for. The FI-Platform is not meant, for example, for DDL-style database modelling or UML class diagram modelling. Instead, the approach with the FI-Platform is all about interoperability in the context of the European Interoperability Framework. Thus, the FI-Platform will be the correct tool for example:

spiral notepad when you need to add a knowledge layer (additional semantics, for example metadata) to your information architecture

spiral notepad when you need to manage or share knowledge (data enriched with embedded semantics)

spiral notepad when you need to harmonize data

spiral notepad when you need to create or maintain a knowledge base

spiral notepad when you need to increase the interoperability of data in your organization or between actors

spiral notepad when you want to manage data or knowledge based on open standards

Core Vocabularies (Ontologies)

The core vocabularies provide a structured framework for representing knowledge as a set of concepts within a domain and the relationships between those concepts. They are used extensively to formalize a domain's knowledge in a way that can be processed by computers. You will need core vocabularies in the following use cases (not an exhaustive nor ordered list):

spiral notepad when you need to annotate data, interfaces or models semantically

spiral notepad when you need to produce semantically interoperable specifications

spiral notepad when you need to formulate a conceptual model that you can validate for logical soundness of your definitions

spiral notepad when you need to formulate a conceptual model acting simultaneously as a logical (formal) model

spiral notepad when you need to create an application profile (application profiles must refer to core vocabularies)

spiral notepad when you need to produce metadata definitions

spiral notepad when you need to produce models that are language independent while still containing annotations for the structures in multiple languages

spiral notepad when you need to produce taxonomies or other structures for categorising data

spiral notepad when you need to do machine-based reasoning (inference) on the annotated data

Core vocabularies on the FI-Platform are based on the following RDF-based formal knowledge representation languages:

spiral notepad RDFS (RDF Schema) is a basic ontology language providing basic elements for the description of ontologies. It introduces concepts such as classes and properties, enabling rudimentary hierarchical classifications and relationships.

spiral notepad OWL (Web Ontology Language) offers substantially more advanced features compared to RDFS, and is capable of representing rich and complex knowledge about things, groups of things, and relations between things. OWL is highly expressive and designed for applications that need to process the content of information instead of just presenting information. The OWL profile currently used on the FI-Platform is OWL 2 EL. Other profiles (dialects) such as RL, QL or DL might be supported in the future.

This document is not intended to serve as a tutorial on OWL modelling, nor as explanatory documentation on the various profiles. For a more comprehensive understanding of OWL (as well as for learning more about Linked Data and the Semantic Web) we recommend the following resources:

Application Profiles (Schemas)

Application profiles, on the other hand, are specifications that enforce restrictions on data annotated with core vocabularies. You will need application profiles in the following use cases (not an exhaustive list):

spiral notepad when you need to specify a message schema for exchanging information

spiral notepad when you need to validate data

spiral notepad when you need to produce an API specification

Application profiles on the FI-Platform are based on the following RDF-based language:

spiral notepad SHACL (Shapes Constraint Language) is a W3C recommendation for validating RDF graphs against a set of conditions expressed as shapes. On the FI-Platform, application profile restrictions are realised as SHACL shapes.


Modelling Core Vocabularies

Attributes

Attributes are in principle very similar to attribute declarations in other data modelling languages. The key differences from other major languages are:

spiral notepad As mentioned earlier, attributes are first-class citizens as they have identities. An attribute is defined once and can then be reused in multiple class definitions. A class does not "own" the attributes it uses: they do not exist only within the namespace of the class, nor is their life-cycle tied to it.

spiral notepad Attributes can be used without appearing "inside" a class (unlike in e.g. UML). For an attribute definition, one can specify the class (rdfs:domain) and/or the data type (rdfs:range). The meaning of the data type is self-explanatory, but the class should not be confused with the attribute belonging to the chosen class. These definitions matter for inferencing only and have no enforced effect when inferencing is not used. Let us consider an attribute B which has class X set as its domain and integer set as its range. When a triple <A, B, C> is found, A is inferred as being of type X and C as being an integer literal. The domain and range values do not act directly as any kind of constraints; without inferencing, their role is to guide the correct use of the attribute resource.

spiral notepad Attributes can be declared as functional, meaning in essence that only one value is allowed for a resource using the attribute. As an example, a resource describing an individual might use the attribute age multiple times with different values (i.e. the data might contain two triples: <somePerson, age, "50"^^xsd:integer> and <somePerson, age, "40"^^xsd:integer>). If age is declared as functional, enabling inferencing will result in a contradiction, thus making the data invalid. Importantly, the functional declaration acts as a constraint on how the attribute can be utilised across the entire model.

spiral notepad Attributes can be declared as being equivalent to one another. This declaration means that if a property is assigned a value for an individual, any equivalent properties are inferred to have the same value for that individual as well. For example, if birthYear and yearOfBirth are equivalent properties, and it is known that an individual has a birthYear of 1990, then it can automatically be inferred that the individual's yearOfBirth is also 1990. This equivalence helps in maintaining consistency and ensuring semantic alignment between properties that are intended to represent the same concept or data point. Additionally, whatever is said of either attribute applies to the other as well. As an example, if yearOfBirth is declared as functional, this restriction is imposed also on each individual using birthYear.

spiral notepad Attributes can have hierarchies. This is an often overlooked but useful feature for inferencing. As an example, you could create a generic attribute called Identifier that represents the group of all attributes that act as identifiers. You could then create sub-attributes, for example TIN (Tax Identification Number), HeTu (the Finnish personal identity code) and so on. Additionally, these hierarchies are not restricted to trees like in UML - polyhierarchies are allowed meaning that an attribute can have more than one super-attribute.
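The behaviours described above can be sketched in plain Python (not a real RDF library; all names such as age and hasFather are illustrative). The sketch shows how domain/range declarations type resources, how a sub-attribute entails its super-attribute, and how a functional declaration turns two differing values into a contradiction:

```python
# Toy vocabulary declarations (illustrative, not actual RDF/OWL syntax):
DOMAIN = {"age": "Person"}                  # rdfs:domain: using 'age' implies the subject is a Person
RANGE = {"age": "xsd:integer"}              # rdfs:range: the object of 'age' is an integer literal
SUBPROPERTY = {"hasFather": "hasAncestor"}  # attribute/property hierarchy
FUNCTIONAL = {"age"}                        # functional: at most one value per subject

def infer(triples):
    """One pass of forward chaining: add typing and super-property triples."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in DOMAIN:
            inferred.add((s, "rdf:type", DOMAIN[p]))
        if p in RANGE:
            inferred.add((o, "rdf:type", RANGE[p]))
        if p in SUBPROPERTY:
            inferred.add((s, SUBPROPERTY[p], o))
    return inferred

def functional_conflicts(triples):
    """Detect contradictions: a functional attribute with two distinct values."""
    values, conflicts = {}, []
    for s, p, o in triples:
        if p in FUNCTIONAL:
            if (s, p) in values and values[(s, p)] != o:
                conflicts.append((s, p))
            values[(s, p)] = o
    return conflicts

data = {("somePerson", "age", "50"), ("somePerson", "hasFather", "someFather")}
result = infer(data)
# Using 'age' lets us infer that somePerson is a Person, and 'hasFather'
# entails the broader 'hasAncestor' relation:
assert ("somePerson", "rdf:type", "Person") in result
assert ("somePerson", "hasAncestor", "someFather") in result
# A second, differing 'age' value for the same subject yields a contradiction:
assert functional_conflicts(data | {("somePerson", "age", "40")}) == [("somePerson", "age")]
```

Note that without running the inference step, none of the extra triples exist: the declarations guide usage rather than constrain it.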

Some special guidance about using attributes in the FI-Platform core vocabularies:

spiral notepad Attribute datatypes are by default XSD datatypes, which come with their own datatype hierarchy (https://www.w3.org/TR/xmlschema11-2/#built-in-datatypes). In core vocabularies it is usually preferable to define the attribute datatype at a very general level, for example as rdfs:Literal. This allows using the same attribute in a multitude of application profiles with the same intended semantic meaning while enforcing a context-specific precise datatype in each application profile.

spiral notepad If you have an entity that clearly could have an identity and you want to reference it as part of your model, it is not recommended to do this with an attribute. This is called the "things not strings" principle. As an example, if your model references a city as a structured part of an address, it is not recommended to use an attribute with the city name. This approach would lose the structural semantics as well as introduce difficulties related to localisation. Using traditional code lists gets rid of the localisation issue but not of the lack of embedded semantics.

Associations

The key differences from other major languages are:

spiral notepad Associations are similarly first-class citizens and can and should primarily be reused. As with attributes, neither associations nor their life-cycles are tied to any class.

spiral notepad Associations can also be used without being tied to a class, and they likewise have domain and range definitions, but instead of a datatype, the range now expresses a class. This enables an association to label instances as being of a certain type based on the usage of the association in data. The domain and range values do not act directly as any kind of constraints; without inferencing, their role is to guide the correct use of the association resource.

spiral notepad Associations can be declared as transitive, meaning that if resource A and B as well as B and C are connected with a transitive association X, then (after inferencing) also A is connected to C with the same association. This inferential behavior is a key feature of transitive properties, helping to extend direct relationships across linked chains of resources. While transitivity helps in deriving new connections, it is not a constraint that limits data entry but rather a rule for expanding understanding and relationships within the data.

spiral notepad Associations can be declared as reflexive, meaning that whenever the association is used in a triple, the subject of the triple is inferred to be connected also to itself via this association. In essence, if A is connected to B via a reflexive association X, inference will result in a new triple <A, X, A>. Reflexivity is an important feature in specific modelling contexts.

spiral notepad Associations can be declared as being equivalent to one another, with similar logical consequences as with equivalent attributes.

spiral notepad Associations can have hierarchies. This is an often overlooked but useful feature for inferencing. As an example, you could create a generic association called hasAncestor that represents the group of all associations that connect people to their ancestors. You could then create sub-associations, for example hasFather and so on that depict a narrower definition of the ancestral relationship. Additionally, these hierarchies are not restricted to trees like in UML - polyhierarchies are allowed meaning that an association can have more than one super-association.
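The transitive and reflexive declarations above can be sketched as pure-Python closure computations (illustrative names, not a real reasoner):

```python
def transitive_closure(pairs):
    """Repeatedly add (a, c) whenever (a, b) and (b, c) are both present."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# hasAncestor declared transitive: A->B and B->C entail A->C after inference.
ancestors = transitive_closure({("A", "B"), ("B", "C")})
assert ("A", "C") in ancestors

def reflexive_closure(pairs):
    """For a reflexive association, each subject is also related to itself."""
    closure = set(pairs)
    for a, b in pairs:
        closure.add((a, a))
    return closure

# If A is connected to B via a reflexive association, <A, X, A> is inferred:
assert ("A", "A") in reflexive_closure({("A", "B")})
```

As the text notes, these closures expand the data rather than constrain it: no existing triple is ever rejected, only new ones derived.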

Some special guidance about using associations in the FI-Platform core vocabularies:

spiral notepad If you want to model qualified associations in the vein of Labeled Property Graphs or UML association classes, the primary recommended way is not to model this directly in the core vocabulary itself. Instead, create qualified associations (instances of the vocabulary association) in the data, with the desired associations and attributes, and use an application profile to ensure that the instances all have the required and/or desired qualifications. It is possible to model this also on the level of a core vocabulary by representing the qualified association with a class, but this will lead to different results in inferencing, as the classes are then not connected directly via the qualified association but via two associations with a class representing the qualified association in between.

Classes

Classes form the most important and expressive backbone of OWL. The key differences from other major languages are:

spiral notepad Classes are not templates or blueprints in the vein of OOP, DDL or other common constructs, but sets. Thus, when you define a class in the core vocabulary, you are defining the necessary and/or sufficient conditions for something to be labeled as belonging to the class (i.e. being of the "type" of the class). As an example:

spiral notepad A sufficient condition that you might model into the class Human is having a Personal Identity Code (PIN), because it is only given to a subset of humans. Thus it is enough for an instance to have a PIN for it to be labeled as being human.

spiral notepad A necessary condition that you might model into the class MarriedPerson is that they by definition must have a spouse. As you may recall from the open-world nature of Linked Data, an instance of a married person might lack information about his or her spouse because the information might not be available, but the mere declaration of the instance as a MarriedPerson has implications for what we can expect of the nature of this or any other such instance.

spiral notepad While classes can simply utilise the rdfs:subClassOf association to create hierarchies with subclassing, there is a major thing to note: being sets, classes can overlap, and inferencing makes no implicit assumption that they do not. So, unless you explicitly declare two classes disjoint, there might be instances which belong to both. This is again because we must be able to combine multiple facets of information easily, meaning that for example an instance of Human might simultaneously belong to the classes MarriedPerson, Employee and Patient.

spiral notepad The attributes and associations added into a class (more precisely: used by a class) are used to specify the bounds of a class: what kind of features are to be expected of instances of a class. Again, specifying a class this way is not about constraints in the traditional sense but for machine inferencing - for validation you must use application profiles.

spiral notepad Similarly to associations and attributes, classes have equivalence declarations, which are most typically used when combining or harmonizing datasets and when aligning and simplifying models (as an example, one information system in an organization might use Employee and another StaffMember to describe the same individuals; these could be declared equivalent).
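The set-theoretical reading of classes can be illustrated with a small sketch (plain Python; the conditions and class names such as Human and MarriedPerson are illustrative, echoing the examples above):

```python
def classify(instance_properties):
    """Return every class whose sufficient condition the instance meets."""
    types = set()
    if "pin" in instance_properties:      # having a PIN is sufficient for Human
        types.add("Human")
    if "spouse" in instance_properties:   # having a spouse suffices for MarriedPerson
        types.add("MarriedPerson")
    if "employer" in instance_properties:
        types.add("Employee")
    return types

# One individual can simultaneously belong to Human, MarriedPerson and
# Employee: classes overlap unless explicitly declared disjoint.
types = classify({"pin": "010190-123A", "spouse": "somePerson", "employer": "someOrg"})
assert types == {"Human", "MarriedPerson", "Employee"}
```

Note the open-world flip side: an instance explicitly declared a MarriedPerson but lacking spouse data is not invalid; the necessary condition only tells us what to expect of such instances, it does not reject them.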

Modelling Application Profiles

With application profiles we use a strictly separate set of terms to avoid mixing up the core vocabulary structures being validated with the validating structures themselves. The application profile entities are called restrictions:

Attribute And Association Restrictions

These restrictions are tied to specific attribute and association types that are used in the data being validated. Creating a restriction for a specific core vocabulary association allows it to be reused in one or more class restrictions. In the future the functionality of the FI-Platform might be extended to cover using attribute and association restrictions individually without class restrictions, but currently this is not possible.

Common restrictions for both attributes and associations are the maximal and minimal cardinalities: there is no inherent link between the cardinalities specified in SHACL and the functional switch defined in the core vocabulary attribute, as they have different use cases. It is nevertheless usually preferable to keep these two consistent (a functional core attribute should not be allowed or required to have a cardinality greater than 1 in an application profile). Allowed, required and default values are also common features for both restriction types.

The available features for attribute restrictions depend partially on the datatype of the attribute. As mentioned before, it is preferable to set the exact required datatype here and keep a wider datatype in the core vocabulary attribute. For string types, maximum and minimum lengths, a regex validation pattern, and allowed languages are currently supported; for numerical types, minimum and maximum values.

For association restrictions, the currently supported extra restriction is the class type requirement for the association restriction target (i.e. what type of an instance must be at the object end of the association).
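The kinds of checks an attribute restriction expresses can be sketched in plain Python (this is not SHACL itself; the parameter names mirror the features listed above and are illustrative):

```python
def validate_attribute(values, min_count=0, max_count=None, datatype=None,
                       min_length=None, max_length=None):
    """Collect violation messages for one attribute's values on one instance."""
    violations = []
    if len(values) < min_count:
        violations.append(f"expected at least {min_count} value(s), got {len(values)}")
    if max_count is not None and len(values) > max_count:
        violations.append(f"expected at most {max_count} value(s), got {len(values)}")
    for v in values:
        if datatype is not None and not isinstance(v, datatype):
            violations.append(f"{v!r} is not of the required datatype")
        if isinstance(v, str):
            if min_length is not None and len(v) < min_length:
                violations.append(f"{v!r} is shorter than {min_length}")
            if max_length is not None and len(v) > max_length:
                violations.append(f"{v!r} is longer than {max_length}")
    return violations

# Keeping a functional core attribute consistent with max_count=1 in the profile:
assert validate_attribute([1990], min_count=1, max_count=1, datatype=int) == []
assert validate_attribute([1990, 1991], max_count=1) == ["expected at most 1 value(s), got 2"]
```

Unlike the core vocabulary declarations discussed earlier, these checks reject non-conforming data outright rather than inferring new facts from it.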

Class Restrictions

Similarly to core vocabulary classes, class restrictions also utilize a group of predefined attribute and association restrictions. Again, this allows for example the specification of highly reusable association and attribute restrictions which can then be reused many times across various application profiles.

The target class definition works by default on the level of RDFS inferencing (in other words, it will validate the instances of the specified class and all its subclasses).

Class restrictions do not operate in a set-theoretical manner like core vocabulary definitions, but there is a way to implement "inheritance" in validated classes: if a class restriction utilizes another class restriction, its target class's contents are checked against both class restrictions.
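A minimal sketch of this "inheritance" via restriction reuse (plain Python, not SHACL; the restriction contents and instance shapes are illustrative):

```python
BASE_RESTRICTION = {"name"}                  # attributes required by the reused restriction
EXTENDING_RESTRICTION = {"employeeNumber"}   # extra requirements in the extending restriction

def conforms(instance, restriction, uses=None):
    """An instance must satisfy the restriction and any restriction it utilizes."""
    required = set(restriction)
    if uses:
        required |= set(uses)
    return required <= set(instance)

# An instance of the target class is checked against both restrictions:
employee = {"name": "Ada", "employeeNumber": "E-123"}
assert conforms(employee, EXTENDING_RESTRICTION, uses=BASE_RESTRICTION)
# Missing an attribute required by the reused restriction fails validation:
assert not conforms({"employeeNumber": "E-456"}, EXTENDING_RESTRICTION, uses=BASE_RESTRICTION)
```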

General Word of Caution on Modelling

SHACL is a very flexible language, and this flexibility allows the creation of validation patterns that might seem valid but are actually unsatisfiable by any instance data. As an example, the utilization of other class restrictions might lead to a situation where an attribute can never be validated because it is required to conform to two conflicting datatypes at the same time.

Also, whereas the OWL specification is monolithic and cannot be extended or modified, SHACL can be extended in a multitude of ways, for example by embedding SPARQL queries or JavaScript processing in SHACL constraints. The standard allows for this, but which extensions are supported naturally depends on the SHACL validator used. The FI-Platform in its current form adheres to the core (vanilla) SHACL specification.

A final note regarding SHACL validation: its results also depend on whether inferencing is executed on the data prior to validation. The SHACL validator by default does not know or care about OWL inferencing and works strictly based on the triples it sees declared in the data. It is recommended to run inferencing before validation to ensure there are no implicit facts that the SHACL validator could miss. Also, remember that the core vocabulary declarations for the instance data must be included in the data graph to be validated: the SHACL validator will not resolve anything outside the data graph and matches patterns only based on what it sees there.
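Why the order matters can be shown with a toy pipeline (plain Python, not an actual reasoner or SHACL engine; hasSpouse and MarriedPerson are illustrative names). A check that requires an explicit rdf:type triple fails on raw data but passes once inference has materialised the implicit type into the data graph:

```python
# rdfs:domain declaration from the core vocabulary (illustrative):
DOMAIN = {"hasSpouse": "MarriedPerson"}

def infer_types(triples):
    """Materialise rdf:type triples implied by rdfs:domain declarations."""
    out = set(triples)
    for s, p, o in triples:
        if p in DOMAIN:
            out.add((s, "rdf:type", DOMAIN[p]))
    return out

def validate(triples, target_class="MarriedPerson"):
    """A toy shape: every subject of hasSpouse must be explicitly typed."""
    subjects = {s for s, p, o in triples if p == "hasSpouse"}
    typed = {s for s, p, o in triples if p == "rdf:type" and o == target_class}
    return subjects <= typed

raw = {("somePerson", "hasSpouse", "otherPerson")}
# Without inference the implicit type triple is missing and validation fails:
assert validate(raw) is False
# After inference the type triple is present in the data graph and validation passes:
assert validate(infer_types(raw)) is True
```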