
This page is a brief introduction to the modeling paradigm used on the FI-Platform, as well as to the recommended modeling principles. Understanding the paradigm is crucial both for data modelers and for actors intending to use the models and the data based on them, whereas the recommended modeling principles give additional guidance on how to model in line with the paradigm and avoid common pitfalls.

The FI-Platform modeling tool is able to produce core vocabularies and application profiles. While they serve different purposes, they share a common foundation. We will first go through this foundation and then dive into the specifics of these two model types and their use cases.  When discussing the model paradigm, we sometimes contrast it with traditional technologies such as XML Schema and UML, as those are more familiar to many modelers and data architects.

In this guide we have highlighted the essential key facts with spiral notepad (note) symbols.

The Linked Data Modeling Paradigm

The models published on the FI-Platform are in essence Linked Data and thus naturally compatible with the linked data ecosystem. The FI-Platform nevertheless does not expect modelers nor users of the models to migrate their information systems into Linked Data based ones, or to make a complete overhaul of their information management processes. There are multiple pathways for utilising the models, ranging from lightweight semantic layering to fully integrated solutions. In most cases you can think of Linked Data solutions as a semantic layer covering your information architecture, or as a translational layer between your domain and external actors or between individual information systems.

Linked Data is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It extends the Web to use URIs to name not just documents but also real-world objects and abstract concepts (in general, anything). The key issues Linked Data aims to solve are:

  • Data Silos: Information is typically stored in isolated databases with limited ability to interoperate. Linked Data was designed to break down these barriers by linking data across different sources, allowing for more comprehensive insights and analyses.
  • Semantic Disconnect: There is a lack of a common framework that could universally describe data relationships and semantics, and a disconnect between model types and associated specifications (conceptual, logical, physical, documentation, code lists, vocabularies). Linked Data describes data with the RDF (Resource Description Framework) language, which is able to encode meaning alongside data, enhancing its machine-readability and semantic interoperability.
  • Integration Complexity: Integrating data from various sources is typically complicated and costly due to heterogeneous data formats and access mechanisms, requiring peer-to-peer ETL solutions, API integrations, integration platforms, MDM or commonly agreed message schemas. Linked Data promotes a standard way to access data (HTTP), a common data format (RDF), and a standardized query language (SPARQL), simplifying data integration.
  • Reusability and Extension: Reusing and combining data from various sources for new needs is typically limited or burdensome. Linked Data encourages reusing existing data by making finding and combining data, as well as inferring new data, straightforward.

In essence, Linked Data offers an alternative to the typical approaches to these data management issues, which rely on tailored and siloed data warehouses and catalogues or on other highly platform-, product- or service-provider-dependent solutions such as iPaaS.

The Core Idea of Linked Data

As you can imagine based on the name, Linked Data is all about references (links or "pointers") between entities (pieces of data). In principle:

spiral notepad All entities (resources in the Linked Data jargon) - be they actual instance data or structures in a data model, physical or abstract entities - are named with IRIs (Internationalised Resource Identifiers).

spiral notepad The names should (in most cases) be HTTP based IRIs, as this allows a standardized way to resolve the names (i.e. access the resource content named by the IRI).

spiral notepad When a client resolves the name, relevant and appropriate information about the resource should be provided.

spiral notepad The resources should refer (be linked) to other resources when it aids in discoverability, contextualising, validating, or otherwise improving the usability of the data.

What all the resources in a linked data dataset or a model represent, depends of course entirely on the domain and use case in question. In general though, there is a deep philosophical distinction between how Linked Data and traditional data modelling approach the question of what is being modeled. Traditionally data modelling is done with a strong emphasis on modelling records, which has long roots in the way data modeling has been subservient to the technical limitations of database systems and message schemas. In Linked Data the aim is not typically to manage records of the entities in question, but to name and describe the entities directly. As an example, authorities typically create records of people for specific processes concerning e.g. citizenship, taxation, healthcare etc. These records are then attached to each citizen by one or more identifiers (such as a Personal Identity Code), making it possible to handle the records of a specific individual in a particular process. In Linked Data, the entities in question would be named and described semantically, leading not to a fragmented collection of separate records but a linked and multifaceted network of these entities and information describing them and their interrelations. The distinction here is subtle and quite abstract but very tangible in real world use cases.

How Things Are Named

As mentioned, all resources are named (minted in the jargon) with identifiers, which we can then use to refer to them when needed.

spiral notepad The FI-Platform gives every resource an HTTP IRI name in the form of <https://iri.suomi.fi/model/modelName/versionNumber/localName>

spiral notepad On the Web you will typically come across mentions of URIs (Uniform Resource Identifier) and HTTP URIs more often than IRIs or HTTP IRIs. On the FI-Platform we encourage internationalisation and usage of Unicode, and mint all resources with IRIs instead of URIs, which are restricted to just a subset of the ASCII characters.

spiral notepad IRIs are an extension of URIs, i.e. each URI is already a valid IRI.

spiral notepad IRIs can be mapped to URIs (by percent-encoding the path and Punycode-encoding the domain part) for backwards compatibility, but this is not recommended due to potential collisions. Instead, tools supporting IRIs and Unicode should always be preferred. You should generally never mint identifiers which are ambiguous (i.e. name a thing with an IRI and something else with a URI-encoded version of it). The IRI will get encoded anyway when you are making an HTTP request, so you should always consider the set of URIs that correspond to URI-encoded versions of your IRIs as reserved and not use them for naming resources.

spiral notepad We do not call them HTTPS IRIs despite the scheme on the FI-Platform being https, as the established naming convention does not dictate the security level of the connection. In general you should always mint the HTTP based identifiers with the https scheme and offer the resources through a secured connection, so do not mistake the HTTP IRI or HTTP URI name to mean that the unsecure http protocol should be used.

In the diagram above you can see the hierarchy of different identifier types. The identifier types based on the HTTP scheme are highlighted as they are primary in Linked Data, but the other scheme types (deprecated, current and potential upcoming ones) are of course possible (the list of IANA schemes is here: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml). Note that the diagram is conceptually incorrect in the sense that IRIs and URIs (including their HTTP subsets) are broader than URLs, which are merely Uniform Resource Locators. In other words, URLs tell where something on the Web is located for fetching, whereas URIs and IRIs give things identifiers (name them) and, with a proper scheme (like an HTTP IRI), make the name resolvable. But if we simply look at the structure of these identifiers from a pragmatic perspective, the diagram is correct.

URNs are a special case: per the RFC, they are defined as a subset of URIs. They start with the scheme urn:, followed by a local name or a potential sub-namespace. Well-known URN sub-namespaces are ISBNs, EANs, DOIs and UUIDs. An example of an ISBN URN is urn:isbn:0-123-456-789-123. URNs are not globally resolvable and always need a specific resolver. There is no central authority for querying ISBN or ISSN numbers, so the URN must be resolved against a particular service, and UUIDs mean nothing outside their specific context, which tightly couples them to the information system(s) they inhabit.

You are probably already familiar with UML or XML namespaces. HTTP IRIs work quite similarly to XML namespaces:

spiral notepad Each IRI consists of a namespace or a namespace and a local name. The idea of a namespace is to avoid collisions in naming resources, not to restrict access or visibility like in UML. Because the path part of HTTP IRIs is hierarchical, namespaces can be nested with the / (slash) separator.

spiral notepad Namespaces can only be reliably distinguished from local names by explicitly declaring them (very much in the same way as they are declared in XML). By convention, the namespace is the beginning part of the IRI up to the last # (hash) or / (slash) delimiter. Following this convention is not strictly required but strongly encouraged, as it makes it possible to unambiguously deduce the innermost namespace of a resource. Explicitly naming namespaces also lets us use XML-style CURIEs (Compact URI Expressions), which look like namespaceName:localName and give a more human-readable representation of the IRIs when necessary (see the example after these notes).

spiral notepad When you mint a new HTTP IRI for something, you can give it any name that adheres to RFC 3987 (https://datatracker.ietf.org/doc/rfc3987/), but as parts of the HTTP namespace are controlled (owned) by different parties, it is not a good principle to name things in namespaces owned by someone else unless specifically given permission. Your models and your data should reside under a domain and a path under your control - either directly or by delegation like on the FI-Platform (each model having its own namespace under the suomi.fi domain).

spiral notepad For details on namespace and local name naming conventions in general, please refer to the W3C recommendation at https://www.w3.org/TR/xml-names11/.
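For example, a namespace can be declared as a prefix in Turtle and the resources in it referred to with CURIEs. A minimal sketch - the model namespace and the class name are illustrative, not actual FI-Platform resources:

    @prefix mymodel: <https://iri.suomi.fi/model/mymodel/1.0.0/> .
    @prefix owl:     <http://www.w3.org/2002/07/owl#> .

    # mymodel:Country is a CURIE that expands to <https://iri.suomi.fi/model/mymodel/1.0.0/Country>
    mymodel:Country a owl:Class .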

Life-Cycle and Versioning

When resources are minted and information is served by resolving them, we generally want that information to reflect the resource that the HTTP IRI identifies. As mentioned above, there is a conceptual difference between a URL and an HTTP IRI. A URL just points somewhere, with the path usually giving a hint of what kind of content to expect, but the identity of the content is often quite loosely coupled with the form of the URL. An HTTP IRI, on the other hand, is intended to identify something - thus an IRI for a specific resource should resolve to data about said resource and not something completely separate, even in situations where the life-cycle of the resource has ended. This question does not only apply to instance data but also to models. Good principles for naming resources are:

spiral notepad Minting an IRI for a resource should typically indicate that the IRI is never to be used for other purposes. Once an identifier for a resource has been created, it might still be referenced years later, even after the resource itself has ceased to exist. The IRI should nevertheless not be freed up for naming other entities.

spiral notepad If the resource in question is modified during its life-cycle within the bounds of its identity, this should be reflected by other data describing the change and pointing to this permanent identifier.

spiral notepad If the resource in question changes and these changes are represented as snapshots of its state during its life-cycle, these snapshots can be minted with their own IRIs (for example with datetime stamp or semantic version number in the path distinguishing them from each other), as they have distinct identities (each snapshot is justifiably its own resource). Additional data is then used to represent the link between the snapshots and the permanent identifier of the resource.
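As an illustration, snapshot IRIs can be linked to the permanent identifier with additional triples. A minimal sketch - the IRIs are illustrative and dcterms:isVersionOf is just one possible linking property, not a FI-Platform convention:

    @prefix dcterms: <http://purl.org/dc/terms/> .

    # Two snapshots, each with its own IRI, linked back to the permanent identifier of the resource
    <https://example.org/dataset/lakes/2024-01-01/> dcterms:isVersionOf <https://example.org/dataset/lakes/> .
    <https://example.org/dataset/lakes/2024-06-01/> dcterms:isVersionOf <https://example.org/dataset/lakes/> .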

How Linked Data is structured

It is crucial to understand that by nature, linked data is atomic and a model or dataset is always in the form of a graph. The lingua franca of linked data is RDF (Resource Description Framework), which allows for a very intuitive and natural way of representing information. In RDF everything is expressed as triples (3-tuples): statements consisting of three parts, where the last part can also be an attribute value (a literal in RDF jargon). You can think of triples as simply rows of data in a three-column data structure: the first column represents the subject resource (what the statement is about), the second column represents the predicate resource (the property or relationship being stated about the subject), and the third column represents the object of the statement - either another resource or a literal value. Simplified to the extreme, "Finland is a country" is a statement in this form:

If we expand the example above to also include statements for example about the number of lakes and Finland's capital, we could end up with the following dataset:

As mentioned above, everything is named with IRIs, so a more realistic example would actually look like this:

The triples in this dataset can be serialized very simply as three triples, each consisting of three parts, or represented as a three-column tabular structure:

subject resource | predicate resource | object resource
<https://finland.fi/> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <https://datamodel/Country>
<https://finland.fi/> | <https://datamodel/hasNumberOfLakes> | "187888"^^xsd:integer
<https://finland.fi/> | <https://datamodel/hasCapital> | <https://finland.fi/Helsinki>

Here are a few key takeaways:

spiral notepad All data is atomic: the resources in the graph do not possess inner structures or data. On the other hand, resolving the resources might return more RDF triples or some other content (JSON message, image, a web page, anything) - so in this sense the resources might not be "empty".

spiral notepad All data is structural: the formal semantic meaning of resources is only defined in relation to other resources. It is not necessary for each resource to be resolvable in every case, as the meaning of the resource might already be specified by the triples describing it, in which case its IRI acts as a permanent identifier for it.

spiral notepad All relationships between resources are binary - there are no n-ary relationships. A parent having two children means there must be two triples in the dataset with the same predicate, there is no way to link the children to the parent with just one triple, as the children in the object position cannot be expressed as a bag or list.

spiral notepad All resources are first-class citizens: You should not mistake the resources in the predicate position being any different from the others - one is free to make statements about them as well, and this is precisely the way many of them are typed.

spiral notepad Data and models can coexist: in the example above you can see that there are instances (<https://finland.fi/>) and classes (<https://datamodel/Country>) together in the same dataset. The ability to do this is one of the big advantages of Linked Data.

spiral notepad All data is extensible: It is always possible to expand graphs by adding triples pointing to new resources or to state facts about the external world. It is for example possible to combine multiple datasets covering different facets of the same resource - or even redundant or conflicting descriptions of it - with ease by simply concatenating them. As an example, multiple differing measurements of the same dimension of a specific object can be combined and then analysed or validated.

spiral notepad All data is highly perspective-dependent: a graph can utilise any resources in any positions in a triple, and it will only change the interpretation of the resources within this graph or any other graph utilising this graph. In other words, a Linked Data graph can make statements about entities without affecting the entities. As an example, <http://www.w3.org/2002/07/owl#Class> is a ubiquitous resource for class definitions. It could be expanded with new triples in this graph without affecting any other parties using the same resource - unless this graph was employed by said parties.

How Linked Data is Serialised

The FI-Platform currently supports three serialisation formats for Linked Data: Turtle, RDF/XML and JSON-LD. The list might be expanded in the future, but these three were chosen as the initial primary formats for the following reasons:

spiral notepad Turtle is an extremely easy to read format where the triples are expressed very explicitly. It is not very compact though and does not support multiple graphs in the same document, but as an introductory format to reading and understanding Linked Data it is extremely useful and widely supported.

spiral notepad RDF/XML is the primary serialisation format for RDF, supported by practically any tool and able to leverage the existing XML tool ecosystem. It should primarily be seen as a legacy format that is nevertheless easy to parse and manage with the vast spectrum of XML tools. RDF/XML does not support multiple graphs in one document.

spiral notepad JSON-LD (JSON for Linking Data, see https://json-ld.org/) is a novel and extremely versatile format that allows multiple graphs and is easily employed by developers already familiar with JSON. 
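As an illustration, the earlier Finland example serialised in Turtle looks like this (a minimal sketch reusing the illustrative IRIs from above):

    @prefix dm:  <https://datamodel/> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # 'a' is Turtle shorthand for rdf:type
    <https://finland.fi/> a dm:Country ;
        dm:hasNumberOfLakes "187888"^^xsd:integer ;
        dm:hasCapital <https://finland.fi/Helsinki> .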

Graphs and Documents

As mentioned above, JSON-LD supports multiple graphs in one document. What does this mean? 
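In short, an RDF dataset can contain a default graph and any number of named graphs, each identified by its own IRI, and a JSON-LD document can carry several of these graphs at once. Below is a minimal sketch of the Finland example split into two named graphs - the graph IRIs are illustrative, and since JSON does not allow comments, note here that each object with both an "@id" and an "@graph" key represents one named graph:

    {
      "@context": { "dm": "https://datamodel/" },
      "@graph": [
        {
          "@id": "https://example.org/graph/geography",
          "@graph": [
            { "@id": "https://finland.fi/", "dm:hasCapital": { "@id": "https://finland.fi/Helsinki" } }
          ]
        },
        {
          "@id": "https://example.org/graph/statistics",
          "@graph": [
            { "@id": "https://finland.fi/", "dm:hasNumberOfLakes": 187888 }
          ]
        }
      ]
    }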

Core Vocabularies (Ontologies)

Core Vocabularies provide a structured framework to represent knowledge as a set of concepts within a domain and the relationships between those concepts. They are used extensively to formalize a domain's knowledge in a way that can be processed by computers. Ontologies allow for sophisticated inferences and queries because they can model complex relationships between entities and can include rules for how entities are connected.

  • RDFS (RDF Schema) is a basic ontology language providing basic elements for the description of ontologies. It introduces concepts such as classes and properties, enabling rudimentary hierarchical classifications and relationships.

  • OWL (Web Ontology Language) offers more advanced features than RDFS and is capable of representing rich and complex knowledge about things, groups of things, and relations between things. OWL is highly expressive and designed for applications that need to process the content of information instead of just presenting information.
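As a small taste of what RDFS and OWL definitions look like in Turtle - a minimal sketch with illustrative names, not an actual FI-Platform vocabulary:

    @prefix ex:   <https://example.org/vocab/> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:Agent  a owl:Class .
    ex:Person a owl:Class ;
        rdfs:subClassOf ex:Agent .                           # RDFS: hierarchical classification

    ex:hasParent a owl:ObjectProperty ;                      # a relationship between two Persons
        rdfs:domain ex:Person ;
        rdfs:range  ex:Person .

    ex:Person owl:equivalentClass <https://schema.org/Person> .   # OWL: linking to an external vocabulary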

Schemas

Schemas, on the other hand, are used for data validation. They define the shape of the data, ensuring it adheres to certain rules before it is processed or integrated into systems. Schemas help maintain consistency and reliability in data across different systems.

  • SHACL (Shapes Constraint Language) is used to validate RDF graphs against a set of conditions. These conditions are defined as shapes and can be used to express constraints, such as the types of nodes, the range of values, or even the cardinality (e.g., a person must have exactly one birthdate).
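For instance, the "exactly one birthdate" rule mentioned above could be expressed as the following shape - a minimal sketch with illustrative names:

    @prefix ex:  <https://example.org/vocab/> .
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:PersonShape a sh:NodeShape ;
        sh:targetClass ex:Person ;          # applies to every instance of ex:Person
        sh:property [
            sh:path ex:birthDate ;
            sh:datatype xsd:date ;
            sh:minCount 1 ;                 # exactly one birthdate: at least one...
            sh:maxCount 1                   # ...and at most one
        ] .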

While ontologies and schemas are the main categories, there are other tools and languages within the linked data ecosystem that also play vital roles, though they may not constitute a separate major category by themselves. These include:

  • SPARQL (SPARQL Protocol and RDF Query Language), which is used to query RDF data. It allows users to extract and manipulate data stored in RDF format across various sources. SPARQL can be embedded in SHACL constraints to further increase its expressivity.

  • SKOS (Simple Knowledge Organization System), which is used for representing knowledge organization systems such as thesauri and classification schemes within RDF.

Each tool or language serves specific purposes but ultimately contributes to the broader goals of linked data: enhancing interoperability, enabling sophisticated semantic querying, and ensuring data consistency across different systems. Ontologies and schemas remain the foundational categories for organizing and validating this data.
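As a small illustration of SPARQL, the query below finds countries and their capitals from data annotated with the illustrative vocabulary used in the earlier Finland example:

    PREFIX dm: <https://datamodel/>

    # List all countries and their capitals, with the number of lakes where available
    SELECT ?country ?capital ?lakes
    WHERE {
      ?country a dm:Country ;
               dm:hasCapital ?capital .
      OPTIONAL { ?country dm:hasNumberOfLakes ?lakes }
    }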

When transitioning from traditional modeling techniques like UML (Unified Modeling Language) or Entity-Relationship Diagrams (ERD) to linked data based modeling with tools like OWL, RDFS, and SHACL, practitioners encounter both conceptual and practical shifts. This chapter aims to elucidate these differences, providing a clear pathway for those accustomed to conventional data modeling paradigms to adapt to linked data methodologies effectively.

Conceptual Shifts

Graph-based vs. Class-based Structures

  • Traditional Models: Traditional data modeling, such as UML and ERD, uses a class-based approach where data is structured according to classes which define the attributes and relationships of their instances. These models create a somewhat rigid hierarchy where entities must fit into predefined classes, and interactions are limited to those defined within and between these classes.
  • Linked Data Models: Contrastingly, linked data models, utilizing primarily RDF-based technologies, adopt a graph-based approach. In these models, data is represented as a network of nodes (entities) and edges (relationships) that connect them. Each element, whether data or a conceptual entity, can be directly linked to any other, allowing for dynamic and flexible relationships without the confines of a strict class hierarchy. This structure facilitates more complex interconnections and seamless integration of diverse data sources, making it ideal for expansive and evolving data ecosystems.

From Static to Dynamic Schema Definitions

  • Traditional Models: UML and ERD typically define rigid schemas intended to structure database systems where the schema must be defined before data entry and is difficult to change.
  • Linked Data Models: OWL, RDFS etc. allow for more flexible, dynamic schema definitions that can evolve over time without disrupting existing data. They support inferencing, meaning new relationships and data types can be derived logically from existing definitions.

From Closed to Open World Assumption

  • Traditional Models: Operate under the closed world assumption where what is not explicitly stated is considered false. For example, if an ERD does not specify a relationship, it does not exist.
  • Linked Data Models: Typically adhere to the open world assumption, common in semantic web technologies, where the absence of information does not imply its negation, i.e. we cannot deduce falsity based on missing data. This approach is conducive to integrating data from multiple, evolving sources.

Entity Identification

  • Traditional Models: Entities are identified within the confines of a single system or database, often using internal identifiers (e.g., a primary key in a database). Not all entities (such as attributes of a class) are identifiable without their context (i.e. one can't define an attribute with an identity and use it in two classes).
  • Linked Data Models: Linked data models treat all elements as atomic resources that can be uniquely identified and accessed. Each resource, whether it's a piece of data or a conceptual entity, is assigned a Uniform Resource Identifier (URI). This ensures that every element in the dataset can be individually addressed and referenced, enhancing the accessibility and linkage of data across different sources.

Practical Shifts

Modeling Languages and Tools

  • Traditional Models: Use diagrammatic tools to visually represent entities, relationships, and hierarchies, often tailored for relational databases.
  • Linked Data Models: Employ declarative languages that describe data models in terms of classes, properties, and relationships that are more aligned with graph databases. These tools often focus on semantics and relationships rather than just data containment.

Data Integrity and Validation

  • Traditional Models: Data integrity is managed through constraints like foreign keys, unique constraints, and checks within the database system.
  • Linked Data Models: SHACL is used for validating RDF data against a set of conditions (data shapes), which can include cardinality, datatype constraints, and more complex logical conditions.

Interoperability and Integration

  • Traditional Models: Often siloed, requiring significant effort (e.g. ETL solutions, middleware) to ensure interoperability between disparate systems.
  • Linked Data Models: Designed for interoperability, using RDF (Resource Description Framework) as a standard model for data interchange on the Web, facilitating easier data merging and linking.

Transition Strategies

Understanding Semantic Relationships

  • Invest time in understanding how OWL and RDFS manage ontologies, focusing on how entities and relationships are semantically connected rather than just structurally mapped.

Learning New Validation Techniques

  • Learn SHACL to understand how data validation can be applied in linked data environments, which is different from constraint definition in relational databases.

Adopting a Global Identifier Mindset

  • Embrace the concept of using URIs for identifying entities globally, which involves understanding how to manage and resolve these identifiers over the web.
  • It is also worth learning how URIs differ from URNs and URLs, how they enable interoperability with other identifier schemes (such as UUIDs), what resolving identifiers means, and how URIs and their namespacing can be used within a local scope.

Linked Data Modeling in Practice

You might already have a clear-cut goal for modeling, or alternatively be tasked with a still loosely defined goal of improving the management of knowledge in your organization. As a data architect, you're accustomed to dealing with REST APIs, PostgreSQL databases, and UML diagrams. But the adoption of technologies like RDF, RDFS, OWL, and SHACL can elevate your data architecture strategies. Here is a short list explaining some of the most common use-cases as they would be implemented here (not exhaustive, more examples to be added in the future):

General recipe

  1. Use the FI-Platform terminology tool (https://sanastot.suomi.fi/en) to reuse existing or specify novel terminological concepts for your message specification.
  2. Use the FI-Platform modeling tool to discover and reuse, derive from or specify novel vocabulary structures for annotating your message. Existing external compatible vocabularies can be used as well. The basis for these definitions should preferably already exist in terminological form in the terminology tool.
  3. Use the FI-Platform modeling tool to discover and reuse, derive from or specify a novel application profile for the message, basing the restrictions in the message on the named resources you previously specified in the vocabulary or vocabularies.

Recipe 1: Define a Message Schema for Information Exchange

  1. Apply the general recipe up to step 3.
  2. Choose one of the use cases:
    1. If you intend to use the application profile for validating messages already in linked data format, you can use the permanent published IRI of the model as the shapes graph for an existing validator (for example the EU Interoperability Test Bed validator, accessible here: https://www.itb.ec.europa.eu/shacl/any/upload, sources here: https://github.com/ISAITB/shacl-validator)
    2. More use cases to be added...

Recipe 2: Turn Relational Data Into Linked Data or Serve Relational Data as Linked Data

  1. Apply the general recipe up to step 2.
  2. Choose one of the use cases:
    1. Use R2RML (RDB to RDF Mapping Language, https://www.w3.org/TR/r2rml/) to create a mapping from the schema to a core vocabulary, and use one or more FI-Platform core vocabularies to annotate the created data. The mapping does not have to be one-to-one; you can also create mappings that resemble views (see the sketch after this list).
    2. Use a virtualization interface for an on-the-fly mapping of data with an R2RML mapping based on a core vocabulary to serve relational data through a SPARQL endpoint.
    3. More use cases to be added...
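A minimal R2RML sketch mapping a relational table to the illustrative vocabulary used earlier - the table, columns and IRIs are made up for the example:

    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix dm: <https://datamodel/> .

    <#CountryMapping> a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "COUNTRY" ] ;
        rr:subjectMap [
            rr:template "https://example.org/country/{CODE}" ;   # one IRI per table row
            rr:class dm:Country
        ] ;
        rr:predicateObjectMap [
            rr:predicate dm:hasNumberOfLakes ;
            rr:objectMap [ rr:column "LAKE_COUNT" ]
        ] .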


For conceptual and logical models, you should create a Core Vocabulary (i.e. an OWL ontology). You can do either lightweight conceptual modeling by merely specifying classes and their interrelationships with associations, or a more complex and comprehensive model by including attributes, their hierarchies and attribute/relationship-based constraints. In either case, the same OWL ontology acts both as a formally defined conceptual and logical model; there is no need for an artificial separation of the two. This also helps to avoid inconsistencies cropping up between the two. The primary advantage of basing these models on OWL is the ability to use inferencing engines to logically validate the internal consistency of the created models, which is not possible with traditional conceptual and logical models.

For relational-RDF mappings and related tooling, please check for example the following resources:

  • https://graphdb.ontotext.com/documentation/10.6/virtualization.html
  • https://linkml.io/linkml/

Regarding database schemas, even for relational databases, conceptual and logical models are necessary to define the structure and interrelationships of data. In the case 

RDFS and OWL can also be used to document the schema of an existing relational database in RDF format. This can serve as an interoperability layer to help integrate the relational database with other systems, facilitating easier data sharing and querying across different platforms.

You should create 

  • Defining API specifications:
  • Defining message schemas:

Which one to create: a Core Vocabulary or an Application Profile?

Which model type should you start with? This naturally depends on your use-case. You might be defining a database schema, building a service that distributes information products adhering to a specific schema, trying to integrate two datasets... In general, all these and other use-cases start with the following workflow:

  1. In general the expectation is that your data needs to be typed, structured and annotated with metadata at least on some level. This is where you would traditionally employ a conceptual model and terminological or controlled vocabularies as the semantic basis for your modeling. You might have already done this part, but if not, the FI-Platform can help you arrive at a logically consistent conceptual model.
  2. Your domain's conceptual model needs to be created in a formal semantic form as a core vocabulary. The advantage compared to a traditional separate conceptual and logical model is that here they are part of the same definition and the logical soundness of the definitions can be validated. Thus, there is no risk of ending up with a logical model that would be based on a conceptually ambiguous (and potentially internally inconsistent) definition.
  3. When you've defined and validated your core vocabulary, you have a sound basis to annotate your data with. Annotating in principle consists both of typing and describing the data (i.e. adding metadata). Annotating your data allows for inferencing (deducing indirect facts from the data), logical validation (checking if the data adheres to the definitions set in the core vocabulary), harmonizing datasets, etc.
  4. Finally, if you intend to create schemas e.g. for validating data passing through an API or ensuring that a specific message payload has a valid structure and contents, you need to create an application profile. Application profiles are created based on core vocabularies, i.e. an application profile looks for data annotated with some core vocabulary concepts and then applies validation rules.

Thus, the principal difference between these two is:

  • If you want to annotate data, check its logical soundness or infer new facts from it, you need a core vocabulary. With a core vocabulary you are essentially making a specification stating "individuals that fit these criteria belong to these classes".
  • If you want to validate the data structure or do anything you'd traditionally do with a schema, you need an application profile. With an application profile you are essentially making a specification stating "graph structures matching these patterns are valid".

Core Vocabularies in a Nutshell

As mentioned, the idea of a Core Vocabulary is to semantically describe the resources (entities) you will be using to describe your data. In other words, what typically ends up as conceptual model documentation or a diagram is now expressed as a formal model.

Core vocabularies are technically speaking ontologies based on the Web Ontology Language (OWL) which is a knowledge description language. OWL makes it possible to connect different data models (and thus data annotated with different models) and make logical inferences on which resources are equivalent, which have a subset relationship, which are complements etc. As the name implies, there is a heavy emphasis on knowledge - we are not simply labeling data but describing what it means in a machine-interpretable manner.

A key feature in OWL is the capability to infer facts from the data that are not immediately apparent - especially in large and complex datasets. This makes the distinction between core and derived data more apparent than in traditional data modeling scenarios, and helps to avoid modeling practices that would increase redundancy or the potential for inconsistencies in the data. Additionally, connecting two core vocabularies allows for inferencing between them (and their data) and thus harmonizing them. This allows, for example, revealing parts of datasets that represent the same data despite being named or structured differently.

The OWL language has multiple profiles for different kinds of inferencing. The one currently selected for the FI-Platform (OWL 2 EL) is computationally simple, but still logically expressive enough to fulfill most modeling needs. An important reminder when doing core vocabulary modeling is to constantly ask: is the feature I am after part of a specific use case (and thus an application profile), or is it essential to the definition of these concepts?

Application Profiles in a Nutshell

Application profiles fill the need not only to validate the meaning and semantic consistency of data and specifications, but also to enforce a specific syntactic structure and contents for data.

Application profiles are based on the Shapes Constraint Language (SHACL), which does not deal with inferencing and should principally be considered a pattern matching validation language. A SHACL model can be used to find resources based on their type, name, relationships or other properties and check for various conditions. The SHACL model can also define the kind of validation messages that are produced for the checked patterns.

Following the key Semantic Web principles, SHACL validation is not based on whitelisting (deny all, permit some) like traditional closed schema definitions. Instead, SHACL works by validating the patterns we are interested in and ignoring everything else. Due to the nature of RDF data this does not cause problems, as triples in the dataset that are not part of the validated patterns can simply be ignored or dropped. It is also possible to extend SHACL validation with the SHACL-SPARQL or SHACL JavaScript extensions to perform a vast amount of pre/post-processing and validation of the data, though this is not currently supported by the FI-Platform nor within the scope of this document.

Core Vocabulary modeling

When modeling a core vocabulary, you are essentially creating three types of resources:

Attributes

Attributes are in principle very similar to attribute declarations in other data modeling languages. There are some differences nevertheless that you need to take into account:

  1. Attributes can be used without classes. For an attribute definition, one can specify rdfs:domain and/or rdfs:range. The domain refers to the subject in the <subject, attribute, literal value> triple, and range refers to the literal value. Basically what this means is that when such a triple is found in the data, its subject is assumed to be of the type specified by rdfs:domain, and the datatype is assumed to be of the type specified by rdfs:range.
  2. The attribute can be declared as functional, meaning that when used it will only have at most one value. As an example, one could define a functional attribute called age with a domain of Person. This would then indicate that each instance of Person can have at most one literal value for their age attribute. On the other hand, if the functional declaration is not used, the same attribute (e.g. nickname) can be used to point to multiple literal values (see the sketch after this list).
  3. Attribute datatypes are by default XSD datatypes, which come with their own datatype hierarchy (see here).
  4. In core vocabularies it is sometimes preferable to define attribute datatype on a very general level, for example as rdfs:Literal. This allows using the same attribute in a multitude of application profiles with the same intended semantic meaning but enforcing a context-specific precise datatype in each application profile.
  5. Attributes can have hierarchies. This is an often overlooked but useful feature for inferencing. As an example, you could create a generic attribute called Identifier that represents the group of all attributes that act as identifiers. You could then create sub-attributes, for example TIN (Tax Identification Number), HeTu (the Finnish personal identity code) and so on.
  6. Attributes can have explicit equivalence declarations (i.e. an attribute in this model is declared to be equivalent to some other attribute).
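A minimal sketch of attribute definitions in Turtle - the names and namespace are illustrative, not actual FI-Platform definitions:

    @prefix ex:   <https://example.org/vocab/> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

    ex:identifier a owl:DatatypeProperty ;
        rdfs:range rdfs:Literal .                       # kept general; the exact datatype is enforced in application profiles

    ex:taxIdentificationNumber a owl:DatatypeProperty ;
        rdfs:subPropertyOf ex:identifier .              # attribute hierarchy usable for inferencing

    ex:age a owl:DatatypeProperty , owl:FunctionalProperty ;   # functional: at most one value per instance
        rdfs:domain ex:Person ;
        rdfs:range  xsd:integer .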

Associations

Associations are similarly not drastically different compared to other languages. There are some noteworthy things to consider nevertheless:

  1. Associations can be used without classes as well. The rdfs:domain and rdfs:range options can here be used to define the source and target classes for the uses of a specific association. As an example, the association hasParent might have Person as both its domain and range, meaning that all triples using this association are assumed to describe connections between instances of Person.
  2. Associations in RDF are binary, meaning that the triple <..., association, ...> will always connect two resources with the association acting as the predicate.
  3. Associations can have hierarchies similarly to attributes.
  4. Associations have flags for determining whether they are reflexive (meaning that every resource is assumed to be connected to itself by the association) and whether they are transitive (meaning that if A is connected to B and B to C by association X, then A is also connected to C by association X). See the sketch after this list.
  5. Associations can have explicit equivalence declarations (i.e. an association in this model is declared to be equivalent to some other association).
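A minimal sketch of association definitions in Turtle, with illustrative names:

    @prefix ex:   <https://example.org/vocab/> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:hasAncestor a owl:ObjectProperty , owl:TransitiveProperty ;   # A->B and B->C implies A->C
        rdfs:domain ex:Person ;
        rdfs:range  ex:Person .

    ex:hasParent a owl:ObjectProperty ;
        rdfs:subPropertyOf ex:hasAncestor ;              # association hierarchy
        rdfs:domain ex:Person ;
        rdfs:range  ex:Person .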

Classes

Classes form the most expressive backbone of OWL. Classes can simply utilize the rdfs:subClassOf association to create hierarchies, but typically classes contain property restrictions - in the current FI-Platform case really simple ones. A class can state existential restrictions requiring that the members of the class must have specific attributes and/or associations. Further cardinality restrictions are not declared here, as the chosen OWL profile does not support them, and cardinality can be explicitly defined in an application profile. In order to require specific associations or attributes to be present in an instance of a class, those associations and attributes must already be defined, as they are never owned by a class, unlike in e.g. UML. They are individual definitions that are simply referred to by the class definition. This allows for situations where an extremely common definition (for example a date of birth or surname) can be defined only once in one model and then reused endlessly in all other models without ever having to be redefined.

Classes are not templates in the vein of programming or UML style classes but mathematical sets. You should think of a class definition as being the definition of a group: "instances with these features belong to this class". Following the Semantic Web principles, it is possible to declare a resource as an instance of a class even if it doesn't contain all the attribute and association declarations required by the class "membership". On the other hand, such a resource would never be automatically categorized as being a member of said class as it lacks the features required for the inference to make this classification.

Similarly to associations and attributes, classes have equivalence declarations. Additionally, classes can be declared as non-intersecting (disjoint). It is important to understand that classes being sets does not by default force them to be strictly separated in any way. From the perspective of the inference reasoner, classes for inanimate objects and people could well be overlapping, unless their overlap is explicitly declared logically inconsistent. With a well laid out class hierarchy, simply declaring a couple of superclasses as non-intersecting will automatically make all their subclasses non-intersecting as well.
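A minimal sketch of class definitions with an existential restriction and a non-intersection (disjointness) declaration - the names are illustrative:

    @prefix ex:   <https://example.org/vocab/> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:dateOfBirth a owl:DatatypeProperty .
    ex:Agent       a owl:Class .

    ex:Person a owl:Class ;
        rdfs:subClassOf ex:Agent ;
        rdfs:subClassOf [                                # existential restriction: every Person has some date of birth
            a owl:Restriction ;
            owl:onProperty ex:dateOfBirth ;
            owl:someValuesFrom rdfs:Literal
        ] .

    ex:Organisation a owl:Class ;
        rdfs:subClassOf ex:Agent ;
        owl:disjointWith ex:Person .                     # non-intersecting: nothing can be both a Person and an Organisation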

Application profile modeling 

With application profiles we use a strictly separate set of terms to avoid mixing up the core vocabulary structures we are validating and the validating structures themselves. The application profile entities are called restrictions:

Attribute and association restrictions

These restrictions are tied to specific attribute and association types that are used in the data being validated. Creating a restriction for a specific core vocabulary association allows it to be reused in one or more class restrictions. In the future the functionality of the FI-Platform might be extended to cover using attribute and association restrictions individually without class restrictions, but currently this is not possible.

Common restrictions for both attributes and associations are the maximum and minimum cardinalities. There is no inherent link between the cardinalities specified in SHACL and the functional switch defined on a core vocabulary attribute, as they have different use-cases. It is nevertheless usually preferable to keep the two consistent (a functional core attribute should not be allowed or required to have a cardinality greater than 1 in an application profile). Allowed, required and default values are also common features for both restriction types.

The available features for attribute restrictions specifically are partially dependent on the datatype of the attribute. As mentioned before, it is preferable to set the exact required datatype here and have a wider datatype in the core vocabulary attribute. For string types, maximum and minimum lengths, a regex validation pattern and allowed languages are currently supported; for numerical types, minimum and maximum values are currently supported.

For association restrictions, the currently supported extra restriction is the class type requirement for the association restriction target (i.e. what type of an instance must be at the object end of the association).
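A minimal SHACL sketch of a class restriction combining an attribute restriction and an association restriction - the names, the pattern and the namespace are purely illustrative and not generated by the FI-Platform:

    @prefix ex:  <https://example.org/vocab/> .
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    ex:PersonProfileShape a sh:NodeShape ;
        sh:targetClass ex:Person ;                  # class restriction: validates instances of ex:Person (and its subclasses)
        sh:property [                               # attribute restriction
            sh:path ex:identifier ;
            sh:datatype xsd:string ;                # exact datatype enforced here, wider datatype in the core vocabulary
            sh:pattern "^[A-Z0-9-]+$" ;             # illustrative regex only
            sh:minCount 1 ;
            sh:maxCount 1
        ] ;
        sh:property [                               # association restriction
            sh:path ex:hasParent ;
            sh:class ex:Person ;                    # the object end must be an instance of ex:Person
            sh:maxCount 2
        ] .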

Class restrictions

Similarly to core vocabulary classes, class restrictions also utilize a group of predefined attribute and association restrictions. Again, this allows for example the specification of some extremely reusable association and attribute restrictions, which can then be reused a multitude of times in various application profiles.

The target class definition works by default on the level of RDFS inferencing (in other words, it will validate the instances of the specified class and all its subclasses).

Class restrictions don't operate in a set-theoretical manner like core vocabulary definitions, but there is a way to implement "inheritance" in validated classes. If a class restriction utilizes another class restriction, the contents of its target class are checked against both of these class restrictions.

General word of caution on modeling

SHACL is a very flexible language, and due to this nature it allows the creation of validation patterns that might seem legitimate but are actually unsatisfiable by any instance data. As an example, the utilization of other class restrictions might lead to a situation where an attribute can never be validated, as it is required to conform to two conflicting datatypes at the same time.

Also, whereas the OWL specification is monolithic and cannot be extended or modified, SHACL can be extended in a multitude of ways, for example by embedding SPARQL queries or JavaScript processing in SHACL constraints. The standard allows for this, but which extensions are supported naturally depends on the SHACL validator used. The FI-Platform in its current form adheres to the core specification (vanilla) SHACL.

A final note regarding SHACL validation is that its results are also dependent on whether inferencing is executed on the data prior to validation or not. The SHACL validator by default does not know or care about OWL inferencing, and works strictly based on the triples it sees declared in the data. It is recommended that inferencing is run before validation to ensure there are no implicit facts that the SHACL validator could miss. Also, it is important to remember that the core vocabulary declarations for the instance data must be included in the data graph to be validated. The SHACL validator will not resolve anything outside the data graph, and will match patterns only based on what it sees in the data graph.

Linked Data Toolset

This section gives you hints on tools that are useful for building your own Linked Data solutions.

Native RDF Databases

  • GraphDB by Ontotext (https://graphdb.ontotext.com/), a free closed-source database with a large set of integrations to both search and relational databases, and an inferencing engine
  • Apache Jena (https://jena.apache.org/), a free open-source Linked Data framework with a database and an inferencing engine