
The Linked Data Modeling paradigm

What are we modeling exactly on the FI-Platform?

In the digital world, how we organize and connect data can significantly influence how effectively we can use that information. Linked data knowledge representation languages (you can read this as "modeling languages"), such as OWL and RDFS, are tools that help us define and interlink data in a meaningful way across the Internet. Let's break down what these models are and what they are used for, using concrete examples.

Linked data models are frameworks used to create, structure, and link data so that both humans and machines can understand and use it efficiently. They are part of the broader technology known as the Semantic Web, which aims to make data on the web readable by machines as well as by humans.

Linked data models are instrumental in several ways:

  • Data Integration: They facilitate the combination of data from diverse sources in a coherent manner. This can range from integrating data across different libraries to creating a unified view of information that spans multiple organizations.
  • Interoperability: A fundamental benefit of using linked data models is their ability to ensure interoperability among disparate systems. This means that data structured with OWL, RDFS etc. can be shared, understood, and processed in a consistent way, regardless of the source. This capability is crucial for industries like healthcare, where data from various healthcare providers must be combined and made universally accessible and useful, or in supply chain management, where different stakeholders (manufacturers, suppliers, distributors) need to exchange information seamlessly.

  • Knowledge Management: These models help organizations manage complex information about products, services, and internal processes in a way that is easily accessible and modifiable. This structured approach supports more efficient retrieval and use of information.

  • Artificial Intelligence and Machine Learning: OWL, RDFS etc. provide a structured context for data, which is essential for training machine learning models. This structure allows AI systems to interpret data more accurately and apply learning to similar but previously unseen data. By using linked data models, organizations can ensure that their data is not only accessible and usable within their own systems but can also be easily linked to and from external systems. This creates a data ecosystem that supports richer, more connected, and more automatically processable information networks.

  • Enhancing Search Capabilities: By providing detailed metadata and defining relationships between data entities, these models significantly improve the precision and breadth of search engine results. This enriched search capability allows for more detailed queries and more relevant results.

When discussing linked data models, particularly in the context of the Semantic Web, there are two primary categories to consider: ontologies and schemas. Each serves a unique role in structuring and validating data. Let's explore these categories, specifically highlighting ontologies (like RDFS and OWL) and schemas (such as SHACL).

Ontologies

Ontologies provide a structured framework to represent knowledge as a set of concepts within a domain and the relationships between those concepts. They are used extensively to formalize a domain's knowledge in a way that can be processed by computers. Ontologies allow for sophisticated inferences and queries because they can model complex relationships between entities and can include rules for how entities are connected.

  • RDFS (RDF Schema) is a lightweight ontology language that provides the basic elements for describing ontologies. It introduces concepts such as classes and properties, enabling rudimentary hierarchical classifications and relationships.

  • OWL (Web Ontology Language) offers more advanced features than RDFS and is capable of representing rich and complex knowledge about things, groups of things, and relations between things. OWL is highly expressive and designed for applications that need to process the content of information instead of just presenting information.

Schemas

Schemas, on the other hand, are used for data validation. They define the shape of the data, ensuring it adheres to certain rules before it is processed or integrated into systems. Schemas help maintain consistency and reliability in data across different systems.

  • SHACL (Shapes Constraint Language) is used to validate RDF graphs against a set of conditions. These conditions are defined as shapes and can be used to express constraints, such as the types of nodes, the range of values, or even the cardinality (e.g., a person must have exactly one birthdate).
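The cardinality example above can be sketched in plain Python. This is not SHACL syntax, only an illustration of what a SHACL shape with minCount and maxCount of 1 on a birthDate property would express; the names and data are made up:

```python
# Sketch of what a SHACL cardinality constraint expresses (not SHACL
# syntax): a person node must have exactly one birthDate value.
# Triples are modeled as plain (subject, predicate, object) tuples.
def check_exactly_one_birthdate(graph, person):
    values = [o for (s, p, o) in graph
              if s == person and p == "birthDate"]
    return len(values) == 1

data = {
    ("alice", "birthDate", "1990-01-01"),
    ("bob", "birthDate", "1985-05-05"),
    ("bob", "birthDate", "1985-05-06"),  # second value -> violation
}
```

A real SHACL processor would report "bob" as violating the shape, while "alice" conforms.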

While ontologies and schemas are the main categories, there are other tools and languages within the linked data ecosystem that also play vital roles, though they may not constitute a separate major category by themselves. These include:

  • SPARQL (SPARQL Protocol and RDF Query Language), which is used to query RDF data. It allows users to extract and manipulate data stored in RDF format across various sources.
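As a rough intuition for how SPARQL works (this is not a SPARQL engine, just a sketch): a basic graph pattern treats terms starting with "?" as variables and matches them against the triples in a graph:

```python
# Naive sketch of SPARQL basic graph pattern matching: terms starting
# with "?" are variables, everything else must match exactly.
def match_pattern(graph, pattern):
    s, p, o = pattern
    results = []
    for triple in graph:
        binding = {}
        ok = True
        for term, value in zip((s, p, o), triple):
            if term.startswith("?"):
                binding[term] = value   # bind variable to this value
            elif term != value:
                ok = False              # constant term did not match
                break
        if ok:
            results.append(binding)
    return results

g = {("alice", "name", "Alice"), ("bob", "name", "Bob")}
```

Matching the pattern ("?who", "name", "?n") against g yields one variable binding per matching triple, which is essentially what a SPARQL SELECT query returns as its result rows.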

  • SKOS (Simple Knowledge Organization System), which is used for representing knowledge organization systems such as thesauri and classification schemes within RDF.

Each tool or language serves specific purposes but ultimately contributes to the broader goals of linked data: enhancing interoperability, enabling sophisticated semantic querying, and ensuring data consistency across different systems. Ontologies and schemas remain the foundational categories for organizing and validating this data.

...

When transitioning from traditional modeling techniques like UML (Unified Modeling Language) or Entity-Relationship Diagrams (ERD) to linked data based modeling with tools like OWL, RDFS, and SHACL, practitioners encounter both conceptual and practical shifts. This chapter aims to elucidate these differences, providing a clear pathway for those accustomed to conventional data modeling paradigms to adapt to linked data methodologies effectively.

Conceptual Shifts

From Static to Dynamic Schema Definitions:

  • Traditional Models: UML and ERD typically define rigid schemas intended to structure database systems where the schema must be defined before data entry and is difficult to change.
  • Linked Data Models: OWL, RDFS etc. allow for more flexible, dynamic schema definitions that can evolve over time without disrupting existing data. They support inferencing, meaning new relationships and data types can be derived logically from existing definitions.

From Closed to Open World Assumption:

  • Traditional Models: Operate under the closed world assumption where what is not explicitly stated is considered false. For example, if an ERD does not specify a relationship, it does not exist.
  • Linked Data Models: Typically adhere to the open world assumption, common in semantic web technologies, where the absence of information does not imply its negation, i.e. we cannot deduce falsity based on missing data. This approach is conducive to integrating data from multiple, evolving sources.
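The difference between the two assumptions can be sketched with a toy example (illustrative names only):

```python
# Closed- vs open-world reading of the same missing statement.
facts = {("alice", "knows", "bob")}

def holds_cwa(facts, triple):
    # Closed world: anything not explicitly stated is false.
    return triple in facts

def holds_owa(facts, triple):
    # Open world: a missing statement is unknown (None), not false.
    return True if triple in facts else None
```

Under the closed world assumption the absent triple ("alice", "knows", "carol") is simply false; under the open world assumption its truth value is unknown, which is what allows new sources to later assert it without contradiction.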

Entity Identification:

  • Traditional Models: Entities are identified within the confines of a single system or database, often using internal identifiers (e.g., a primary key in a database). Not all entities (such as attributes of a class) are identifiable without their context (i.e. one can't define an attribute with an identity and use it in two classes).
  • Linked Data Models:

...

  • Linked data models treat all elements as atomic resources that can be uniquely identified and accessed. Each resource, whether it's a piece of data or a conceptual entity, is assigned a Uniform Resource Identifier (URI). This ensures that every element in the dataset can be individually addressed and referenced, enhancing the accessibility and linkage of data across different sources.

Practical Shifts

...

Modeling Languages and Tools:

  • Traditional Models: Use diagrammatic tools to visually represent entities, relationships, and hierarchies, often tailored for relational databases.
  • Linked Data Models: Employ declarative languages that describe data models in terms of classes, properties, and relationships that are more aligned with graph databases. These tools often focus on semantics and relationships rather than just data containment.

Data Integrity and Validation:

  • Traditional Models: Data integrity is managed through constraints like foreign keys, unique constraints, and checks within the database system.
  • Linked Data Models: SHACL is used for validating RDF data against a set of conditions (data shapes), which can include cardinality, datatype constraints, and more complex logical conditions.

Interoperability and Integration:

  • Traditional Models: Often siloed, requiring significant effort (e.g. ETL solutions, middleware, data federation) to ensure interoperability between disparate systems.
  • Linked Data Models: Designed for interoperability, using RDF (Resource Description Framework) as a standard model for data interchange on the Web, facilitating easier data merging and linking.

Transition Strategies

Understanding Semantic Relationships:

  • Invest time in understanding how OWL and RDFS manage ontologies, focusing on how entities and relationships are semantically connected rather than just structurally mapped.

Learning New Validation Techniques:

  • Learn SHACL to understand how data validation can be applied in linked data environments, which is different from constraint definition in relational databases.

Adopting a Global Identifier Mindset:

  • Embrace the concept of using URIs for identifying entities globally, which involves understanding how to manage and resolve these identifiers over the web.
  • It is also worth learning about how URIs differ from URNs and URLs, how they enable interoperability with other identifier schemas (such as using UUIDs), what resolving identifiers means, and how URIs and their namespacing can be used to scope identifiers locally.

You're working with a graph

The RDF data model is a very general graph that can describe many kinds of data structures. Both data models and instance data are described with the same structure: triples of two nodes and an edge connecting them. RDF graphs can be represented in a simple three-column tabular form: <subject, predicate, object>. Each subject and object is an entity, or resource in linked data jargon (for example a class, an instance of a class, or a literal value), and each predicate is an entity that links them together. For example, a subclass association between classes A and B would be represented as <B, subclass, A>, or visually as two nodes in a graph linked by a subclass edge. Attribute values use the same structure, with the attribute entity acting as the edge and the literal attribute value as the object node: <A, someAttribute, "foobar">.
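The triple structure described above can be sketched with plain Python tuples. The names are shorthand; in real RDF every term would be a full URI and the subclass edge would be rdfs:subClassOf:

```python
# An RDF graph is just a set of (subject, predicate, object) triples.
graph = {
    ("B", "subclass", "A"),           # class B is a subclass of class A
    ("A", "someAttribute", "foobar"), # literal attribute value on A
}

def objects(graph, subject, predicate):
    """All object nodes reachable from `subject` via `predicate`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}
```

Note that the same set of triples carries both schema-level statements (the subclass edge) and data-level statements (the attribute value); nothing in the structure itself distinguishes the two.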

Polyhierarchies are supported

In traditional data modeling, multiple inheritance is typically not allowed, or at least severely limited, even though inheritance is often the only way to represent hierarchical structures. In linked data modeling, by contrast, building hierarchies with multiple superclasses is allowed and in some cases even necessary.
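A polyhierarchy can be sketched as a set of subclass pairs with more than one superclass per class (the class names here are invented for illustration):

```python
# "Ebook" has two superclasses -- a polyhierarchy.
subclass_of = {
    ("Ebook", "Book"),
    ("Ebook", "DigitalResource"),
    ("Book", "CreativeWork"),
}

def superclasses(cls, pairs):
    """Transitive closure of the subclass relation for `cls`."""
    direct = {sup for (sub, sup) in pairs if sub == cls}
    result = set(direct)
    for sup in direct:
        result |= superclasses(sup, pairs)
    return result
```

Computing the transitive closure like this is a simplified version of the subclass inferencing an RDFS reasoner performs: Ebook is inferred to be a CreativeWork even though that is never stated directly.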

All entities have identities

Usually some entities in data modeling languages don't have an identity, as they are inherently part of their defining entities. As an example, UML attributes are not entities that can be individually referenced; they exist only as part of the class that defines them. This means that a model might have multiple attributes with the same identifying name and meaning, but there is no technical way to straightforwardly identify these attributes as being "the same".

In RDF, every resource (entity) has a unique identifier, which allows any defined resource to be reused and thereby reduces data duplication and overlapping definitions. Both data structures and instance data share the same URI-based naming principle, which on the FI-Platform is an HTTPS IRI.

Resource identifiers can generally be minted (declared) as anything that adheres to the URI syntax (RFC 3986), but on the FI-Platform minting is controlled by enforcing a namespace under https://iri.suomi.fi/. This ensures that model resources will not accidentally collide with resources elsewhere on the web.
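A minting helper might look like the sketch below. The namespace is the FI-Platform one mentioned above, but the path layout and function are hypothetical illustrations, not the platform's actual minting scheme:

```python
import uuid

# The namespace is real (FI-Platform); the "model/<model>/<local>"
# path layout below is an illustrative assumption only.
NAMESPACE = "https://iri.suomi.fi/"

def mint_iri(model, local_name=None):
    """Mint a resource IRI under the enforced namespace.

    Falls back to a UUID when no human-readable local name is given,
    showing how UUID-based identifier schemes can interoperate with URIs.
    """
    local = local_name or str(uuid.uuid4())
    return f"{NAMESPACE}model/{model}/{local}"
```

Because every minted identifier starts with the controlled namespace, identifiers from different models cannot collide with resources minted elsewhere on the web.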

No strict separation between data and metadata

Due to the above-mentioned identifiers, it is possible to add descriptive metadata to any entity, either by stating the metadata alongside the entity itself or externally, by referring to the entity by its identifier.

No strict separation between classes and instances

So-called punning means that classes can also act as instances; there is no hard line separating them (though this doesn't make the situation ambiguous, as there are clear logical rules for deducing the state of affairs).
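Punning can be illustrated with the triple representation used earlier (names are invented, and "rdf:type" stands in for the full RDF type predicate):

```python
# Punning sketch: "Eagle" is used both as an instance (of Species)
# and as a class (Sam's type) in the same graph.
graph = {
    ("Eagle", "rdf:type", "Species"),  # Eagle as an instance
    ("Sam",   "rdf:type", "Eagle"),    # Eagle as a class
}

def used_as_class(graph, name):
    # Something is used as a class when it appears as the object
    # of a type statement.
    return any(p == "rdf:type" and o == name for (s, p, o) in graph)

def used_as_instance(graph, name):
    # ...and as an instance when it appears as the subject of one.
    return any(s == name and p == "rdf:type" for (s, p, o) in graph)
```

Here the same resource plays both roles, and which role applies in a given statement is determined by its position in the triple, not by a global class/instance divide.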

No strict separation between conceptual and logical model

...

Linked Data Modeling structures in practice


Conclusion

Linked data modeling is a powerful paradigm that shifts away from traditional structured data models towards a more flexible, web-like data structure. By understanding and implementing its fundamental concepts, organizations can harness the full potential of their data, making it more accessible, interconnected, and useful across diverse applications. This chapter lays the groundwork for those interested in exploring or transitioning to linked data modeling, providing the conceptual tools needed to navigate this complex yet rewarding field.

Which one to create: a Core Vocabulary or Application profile?

...