A schema is the definition of recurring structure in data, which can also be seen as constraints on the data.
An algorithm can decide whether a piece of data conforms to the schema or not.
What types of schema do you know?
Relational databases always rely on a schema for guaranteeing consistency.
A relational schema defines:
CREATE TABLE Person(
    taxID VARCHAR(25) NOT NULL,
    familyName VARCHAR(50) NOT NULL,
    givenName VARCHAR(50) NOT NULL,
    birthDate DATE,
    PRIMARY KEY (taxID)
);
INSERT INTO Person
VALUES
    (NULL, 'Ketchum', 'Ash', NULL);
Error: taxID cannot be NULL.
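The rejection above can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database stands in for the relational engine. The helper name try_insert and the tax ID "X-001" are made up for illustration, and the exact error message differs between database systems.

```python
import sqlite3

DDL = """
CREATE TABLE Person(
    taxID VARCHAR(25) NOT NULL,
    familyName VARCHAR(50) NOT NULL,
    givenName VARCHAR(50) NOT NULL,
    birthDate DATE,
    PRIMARY KEY (taxID)
)
"""

def try_insert(row):
    """Return None if the row is accepted, else the engine's error message.

    Hypothetical helper: each call uses a fresh, throwaway in-memory database.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(DDL)
        conn.execute("INSERT INTO Person VALUES (?, ?, ?, ?)", row)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)
    finally:
        conn.close()

# Rejected: taxID is declared NOT NULL.
print(try_insert((None, "Ketchum", "Ash", None)))
# Accepted: birthDate has no NOT NULL constraint ("X-001" is a made-up tax ID).
print(try_insert(("X-001", "Ketchum", "Ash", None)))
```

The schema is enforced by the engine itself: the application cannot store a row that violates it.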
Object-oriented models also rely on a schema to verify the executability of programs at compile time.
An object-oriented schema defines:
class Person extends Thing {
    final String taxID;
    final String familyName;
    final String givenName;
    final Date birthDate;
    public Person(String taxID) { … }
}
Error: familyName is never assigned.
In contrast, schema.org is an RDF schema that cannot generate errors.
@prefix : <http://schema.org/> .
:Person a :Class .
# (skipping taxID, etc...)
:birthDate
    :domainIncludes :Person ;
    :rangeIncludes :Date .
<ash>
    a :Person ;
    # taxID: none
    :familyName "Ketchum" ;
    :givenName "Ash" .
OK.
Schema.org is a kind of schema that only defines classes and properties.
To ensure large adoption, schema.org imposes no constraint on the structure of RDF triples that use its classes and properties.
Semantic Web practitioners favor the term 'vocabulary' over 'schema' to refer to such a model.
For example, states identify citizens with tax IDs but not all persons are citizens of some state, especially not fictional characters.
Some real persons also have multiple tax IDs.
Keeping schema.org generic has been a deliberate design choice.
It is still possible to define post-hoc constraints on RDF graphs that use schema.org's vocabulary.
Schema.org's validator and Google's Rich Results Test each apply their own schema to trigger warnings and errors.
The property "salt" is not recognized by the schema for an object of type "NutritionInformation".
Field "recipeCuisine" missing (optional).
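A post-hoc check of this kind can be sketched in a few lines of Python. This is only an illustration, not how either tool is implemented: the property sets are a small subset of the schema.org vocabulary, and the rule that a missing recipeCuisine yields a warning is an assumed tester policy.

```python
# Illustrative subset only: the actual schema.org vocabulary declares more properties.
KNOWN_PROPERTIES = {
    "NutritionInformation": {"calories", "fatContent", "sodiumContent"},
    "Recipe": {"name", "recipeCuisine", "recipeIngredient", "nutrition"},
}
# Assumed tester policy: properties whose absence triggers a warning, not an error.
RECOMMENDED = {"Recipe": {"recipeCuisine"}}

def check(node):
    """Return (severity, message) pairs for a JSON-LD-like dict with an '@type' key."""
    node_type = node["@type"]
    known = KNOWN_PROPERTIES.get(node_type, set())
    messages = []
    for prop in node:
        if prop.startswith("@"):
            continue  # JSON-LD keywords are not vocabulary properties
        if prop not in known:
            messages.append(
                ("error", f'property "{prop}" is not recognized for type "{node_type}"')
            )
    for prop in sorted(RECOMMENDED.get(node_type, set())):
        if prop not in node:
            messages.append(("warning", f'field "{prop}" missing (optional)'))
    return messages

print(check({"@type": "NutritionInformation", "calories": "240 calories", "salt": "1 g"}))
print(check({"@type": "Recipe", "name": "Rice balls"}))
```

The constraints live in the checker, not in the vocabulary: the same data remains perfectly valid RDF.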
Schema.org's validator does not only validate JSON-LD data.
It also tries to repair schema violations.
Schema.org was inspired by RDF Schema (RDFS), a minimal language to define vocabularies.
However, RDFS goes slightly beyond vocabulary-level definitions.
RDFS and schema.org were actually designed by the same persons.
RDFS includes:
@prefix rdfs: <http://www.w3.org/…> .
@prefix xsd: <http://www.w3.org/…> .
@prefix : <http://schema.org/> .
:Person rdfs:subClassOf :Thing .
:birthDate
    rdfs:domain :Person ;
    rdfs:range xsd:date .
<junichi-masuda>
    a :Person ;
    :familyName "Masuda" ;
    :givenName "Junichi" ;
    :birthDate "1968"^^xsd:gYear .
The value "1968"^^xsd:gYear is incompatible with xsd:date, the range of :birthDate.
With RDFS, however, it is not possible to check whether Junichi Masuda has a tax ID or not.
RDFS is not meant for data validation.
<junichi-masuda>
    a :Person ;
    :familyName "Masuda" ;
    :givenName "Junichi" ;
    # full date provided
    :birthDate "1968-01-12" .
Do these statements about Junichi Masuda satisfy the constraint that a birth date must be a date?
In many cases, obvious statements are not asserted.
A schema can help infer obvious statements from asserted ones.
A schema used for inference should rather be called an ontology.
<junichi-masuda>
    :birthDate "1968-01-12"^^xsd:date .
… is a necessary fact for the schema to be validated. It can be considered true.
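How a schema drives inference can be sketched with the standard RDFS domain rule: if a property has domain C, every subject using that property is an instance of C. The sketch below is an assumption-laden illustration, with triples modeled as plain Python tuples and a hypothetical function name. It shows the domain rule only, not the literal typing discussed above.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"

def infer_domain_types(triples):
    """Apply the RDFS domain rule once.

    If (p rdfs:domain C) and (s p o) are asserted, then (s rdf:type C) follows.
    Only statements that were not already asserted are returned.
    """
    domains = {(s, o) for (s, p, o) in triples if p == RDFS_DOMAIN}
    inferred = set()
    for (s, p, o) in triples:
        for (prop, cls) in domains:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    return inferred - set(triples)

graph = {
    (":birthDate", RDFS_DOMAIN, ":Person"),
    ("<junichi-masuda>", ":birthDate", "1968-01-12"),
}
# The type of <junichi-masuda> was never asserted, yet it follows from the schema.
print(infer_domain_types(graph))
```

Nothing is rejected here: the schema is used to add statements, not to flag errors.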
Still, many statements are unknown with respect to a schema.
Typically, what is unknown is assumed to be false when validating data (the closed-world assumption).
There are good reasons not to make this assumption. See for instance ontologies expressed in the Web Ontology Language.
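The two attitudes toward unknown statements are commonly called the closed-world and open-world assumptions. A minimal sketch of the difference, with triples as Python tuples and hypothetical function names:

```python
def has_tax_id_closed_world(triples, node):
    """Closed world: a statement that is not asserted counts as false."""
    return any(s == node and p == ":taxID" for (s, p, o) in triples)

def has_tax_id_open_world(triples, node):
    """Open world: a statement that is not asserted is unknown (None), not false."""
    return True if has_tax_id_closed_world(triples, node) else None

graph = {
    ("<ash>", ":familyName", "Ketchum"),
    ("<ash>", ":givenName", "Ash"),
}
print(has_tax_id_closed_world(graph, "<ash>"))  # False
print(has_tax_id_open_world(graph, "<ash>"))    # None
```

A validator needs the first reading to report a missing value; a reasoner working on incomplete Web data usually prefers the second.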
What would a proper validation schema for RDF look like?
The Shapes Constraint Language (SHACL) has been designed to declare constraints on classes and properties.
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass :Person ;
    sh:property [
        sh:path :taxID ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
<ash>
    a :Person ;
    # taxID: none
    :familyName "Ketchum" ;
    :givenName "Ash" .
Error: taxID has fewer than one value.
Try the example on the SHACL playground.
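What a SHACL processor does for sh:minCount and sh:maxCount can be approximated in plain Python. This is a rough sketch under simplifying assumptions, not a SHACL implementation: triples are tuples, the target is a single class without subclass reasoning, and the function name and message wording are made up.

```python
def check_cardinality(triples, target_class, path, min_count=0, max_count=None):
    """Return (focus_node, message) pairs for nodes violating the cardinality bounds."""
    focus_nodes = sorted(
        s for (s, p, o) in triples if p == "rdf:type" and o == target_class
    )
    violations = []
    for node in focus_nodes:
        # Value nodes: distinct objects reached from the focus node via the path.
        values = {o for (s, p, o) in triples if s == node and p == path}
        if len(values) < min_count:
            violations.append((node, f"{path} has fewer than {min_count} value(s)"))
        if max_count is not None and len(values) > max_count:
            violations.append((node, f"{path} has more than {max_count} value(s)"))
    return violations

graph = {
    ("<ash>", "rdf:type", ":Person"),
    ("<ash>", ":familyName", "Ketchum"),
    ("<ash>", ":givenName", "Ash"),
}
print(check_cardinality(graph, ":Person", ":taxID", min_count=1, max_count=1))
```

The absence of a taxID is reported as a violation because the check treats missing statements as false.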
On the Semantic Web, everything is in RDF, even SHACL validation reports.
[]
    a sh:ValidationResult ;
    sh:resultSeverity sh:Violation ;
    sh:focusNode <ash> ;
    sh:resultPath :taxID .
A SHACL schema is composed of node shapes.
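A node shape applies to focus nodes, which are either listed explicitly (sh:targetNode), selected as instances of a class (sh:targetClass), or selected as subjects or objects of a property (sh:targetSubjectsOf, sh:targetObjectsOf):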
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass :Person ;
    sh:property [
        sh:path :taxID ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
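With sh:targetClass, the focus nodes are the instances of the target class, including instances of its subclasses. The selection can be sketched as follows, assuming triples are Python tuples. The function names and the nodes <some-citizen> and <some-thing> are hypothetical.

```python
def subclasses_of(triples, cls):
    """Return cls together with all its direct and indirect subclasses."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for (s, p, o) in triples:
            if p == "rdfs:subClassOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found

def focus_nodes(triples, target_class):
    """Select the nodes typed with the target class or one of its subclasses."""
    classes = subclasses_of(triples, target_class)
    return {s for (s, p, o) in triples if p == "rdf:type" and o in classes}

graph = {
    ("ex:Citizen", "rdfs:subClassOf", ":Person"),
    ("<ash>", "rdf:type", ":Person"),
    ("<some-citizen>", "rdf:type", "ex:Citizen"),  # hypothetical node
    ("<some-thing>", "rdf:type", ":Thing"),        # hypothetical node
}
print(sorted(focus_nodes(graph, ":Person")))
```

Nodes outside the target, such as <some-thing>, are simply ignored by the shape.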
A node shape is composed of one or more property shapes.
A property shape defines a path from the focus node to value nodes.
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass :Person ;
    sh:property [
        sh:path :taxID ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
A property shape also defines constraints that must apply to all value nodes.
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass :Person ;
    sh:property [
        sh:path :taxID ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
The SHACL standard includes numerous built-in constraints.
Category | Constraints
Value type | sh:class, sh:datatype
Cardinality | sh:minCount, sh:maxCount
Number | sh:minExclusive, sh:maxExclusive, sh:minInclusive, sh:maxInclusive
String | sh:minLength, sh:maxLength, sh:pattern
Combination | sh:or, sh:and, sh:not
Recursivity | sh:node
ex:PersonShape
    a sh:NodeShape ;
    sh:targetClass :Person ;
    sh:property [
        sh:path :birthDate ;
        sh:datatype xsd:date
    ] .
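The effect of sh:datatype can be approximated as a check on each value node: the literal must carry the expected datatype and its lexical form must be well-formed. The sketch below assumes literals are modeled as (lexical form, datatype) pairs. The function name is hypothetical, and datetime.date.fromisoformat is only an approximation of the xsd:date lexical space (it does not handle time zone suffixes, for instance).

```python
from datetime import date

XSD_DATE = "xsd:date"

def conforms_to_datatype(literal, expected):
    """Check a (lexical form, datatype) pair against an expected datatype.

    A plain string literal is modeled as ("...", "xsd:string").
    """
    lexical, datatype = literal
    if datatype != expected:
        return False
    if expected == XSD_DATE:
        try:
            date.fromisoformat(lexical)  # approximate well-formedness check
        except ValueError:
            return False
    return True

print(conforms_to_datatype(("1968-01-12", "xsd:date"), XSD_DATE))    # True
print(conforms_to_datatype(("1968", "xsd:gYear"), XSD_DATE))         # False
print(conforms_to_datatype(("1968-01-12", "xsd:string"), XSD_DATE))  # False
```

Under this reading, a birth date written as a plain string does not satisfy the constraint, even though it looks like a date.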
SHACL became a W3C Recommendation in 2017.
How could Semantic Web developers validate their RDF graphs before 2017?
SPARQL can also be used as a constraint language.
# Returns the persons that have no tax ID
SELECT ?person WHERE {
    ?person a :Person .
    FILTER NOT EXISTS {
        # ~sh:minCount 1
        ?person :taxID ?id .
    }
}
# Returns the persons whose birth date is not an xsd:date
SELECT ?person WHERE {
    ?person
        a :Person ;
        :birthDate ?bdate .
    FILTER (
        # violates ~sh:datatype xsd:date
        datatype(?bdate) != xsd:date
    )
}
SHACL and SPARQL can be combined to provide an expressive validation language.
The integration, along with many features of SHACL, is illustrated in a single example.
SHACL shapes must be defined in RDF, which might be tedious to write.
The main benefit is that RDF class and property declarations can embed SHACL shapes.
ex:Citizen
    a rdfs:Class, sh:NodeShape ;
    rdfs:subClassOf :Person ;
    sh:property [
        sh:path :taxID ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
If a class is also a node shape, it implicitly targets instances of itself.
In contrast to SHACL, the Shape Expressions (ShEx) language made the choice of a distinct, more succinct syntax.
:PersonShape {
    :taxID xsd:string ;
    :familyName xsd:string ;
    :givenName xsd:string ;
    :birthDate xsd:date
}