This section is addressed to platform developers and integrators.
Get hands-on with the platform.
- 1: 🚀 Overview
- 2: Installation
- 3: Getting started
- 4: Guide to Application building
- 5: Customizing the SDaaS platform
- 5.1: Architecture overview
- 5.2: Module building
- 5.3: Driver building
1 - 🚀 Overview
SDaaS™ (Smart Data as a Service) is a software platform designed by LinkedData.Center to build Enterprise Knowledge Graphs.
The SDaaS platform empowers your applications, allowing them to integrate any Linked Data and leverage the full potential of the semantic web. It is the reference implementation of KEES (Knowledge Exchange Engine Specifications), facilitating the exchange and sharing of domain-specific knowledge.
node "User Application" as Application
cloud "Linked Data cloud" as RDFResource
usecase "Data access" as query
package "Application Data Management System" {
node "Smart Data service" as SDaaS #aliceblue;line:blue;line.dotted;text:blue
database "Knowledge Graph" as KnowledgeGraph
usecase "Data management" as maintain
}
Application -- query
KnowledgeGraph -- query
SDaaS -- maintain
KnowledgeGraph -- maintain
KnowledgeGraph <- SDaaS
RDFResource -> SDaaS
note top of SDaaS
Here is where the SDaaS platform is used
end note
The typical SDaaS user is a DevOps professional who uses the commands provided by the platform to develop the service that learns, reasons, enriches and publishes linked data within a graph store. The resulting knowledge graph is then queried by an application using SPARQL or REST APIs. SDaaS developers and system integrators can extend the platform by adding custom modules and creating new commands.
What is a Knowledge Graph?
The knowledge graph is a pivotal component within any modern Data Management Platform (DMP), consisting of:
- a graph store containing first, second and third-party data;
- a formal knowledge organization that allows linking the data;
- metadata to evaluate the data quality;
- axioms and rules computed by “reasoners” that are able to understand the meaning of the data;
- an interface to query the knowledge graph and to answer questions
What is KEES?
KEES (Knowledge Exchange Engine Specifications and Services) is an architectural design pattern that establishes specific requirements for Semantic Web Applications. Its purpose is to formally describe domain knowledge with the goal of making it tradeable and shareable. The full specifications are available on the KEES homepage.
SDaaS™ features
SDaaS provides the following features:
- compliance with W3C Semantic Web standards
- compatibility with various storage engines (e.g., Wikimedia Blazegraph, AWS Neptune)
- performance optimization for the vendor implementations
- concurrent use of many knowledge instances for a virtually unlimited capacity
- ingestion automation of Linked Data according to the VoID standard
- data provenance management automation according to the PROV ontology
- extensible deductive reasoning through configurable OWL axioms materialization
- abductive reasoning through configurable rules
- full support of the KEES specification
SDaaS is available as:
- the community edition platform (ce), released as unsupported open source on GitHub, which provides a CLI and core tools for creating a knowledge base;
- the enterprise edition platform (ee), only available in the commercial version, which extends the community edition with additional tools and interfaces
SDaaS is a valuable companion for:
- Anti-Money Laundering and Fraud detection (e.g. https://mospo.eu/)
- AI applications (e.g. https://EZClassifier.com and https://AdvisorLens)
- Identity management (e.g. W3C Verifiable Credential schema)
- Financial data processing (e.g. https://budget.g0v.it/)
- Social Network
- Recommendation engines
- Scientific applications
- … many other use cases
Have a look at the Getting Started guide to see SDaaS in action.
Who wrote the code of SDaaS
LinkedData.Center is a small technology company based in Europe, specializing in Neuro-Symbolic A.I. For more information and contacts, please visit the LinkedData.Center website.
2 - Installation
Since version 2.1.0, the deployment of the SDaaS platform has been based on Docker. The SDaaS Enterprise Edition can also be installed on a standalone host (virtual or physical). If you do not already have Docker on your computer, now is the right time to install it.
Requirements
SDaaS is distributed as a Docker image. To operate effectively, it has the following prerequisites:
- Docker Orchestrator: A capable Docker orchestrator such as Docker Compose or Kubernetes is essential. This orchestrator manages the deployment and execution of Docker images, ensuring efficient resource utilization and scalability.
- Accessible Graph Store: The SDaaS platform requires access to at least one graph store. This store serves as the repository for the graph data and must be accessible by the SDaaS platform for storing and retrieving data as needed.
Docker requirements
SDaaS requires:
- Docker version ~20.10
- Docker Compose version ~2.17
Graph store requirements
SDaaS requires read/write access to at least one RDF Graph Store that provides:
- mandatory compliance to HTTP/1.1 protocol
- mandatory compliance to SPARQL 1.1 protocol
- mandatory compliance to SPARQL 1.1 Query Language
- mandatory compliance to SPARQL 1.1 Update specifications
- optional compliance to SPARQL 1.1 Graph Store HTTP Protocol
- optional compliance to SPARQL 1.1 Service Description
The RDF Graph Store must provide an http(s) service endpoint compliant with the following minimal service description:
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
[] a sd:Service ;
    sd:supportedLanguage
        sd:SPARQL11Query,
        sd:SPARQL11Update ;
    sd:resultFormat
        <http://www.w3.org/ns/formats/RDF_XML>,
        <http://www.w3.org/ns/formats/Turtle>,
        <http://www.w3.org/ns/formats/N-Triples>,
        <http://www.w3.org/ns/formats/N-Quads>,
        <http://www.w3.org/ns/formats/SPARQL_Results_CSV> ;
    sd:feature sd:UnionDefaultGraph .
The SDaaS license includes access to the sdaas-rdfstore graph store Docker image, which is built on a customized version of the Blazegraph graph database. This version is tailored to meet the requirements for a compliant graph store listed above.
SDaaS docker image
The community edition image of SDaaS is available at https://hub.docker.com/r/linkeddatacenter/sdaas-ce
With your license you will receive a repository URL and a key that allow you to download the sdaas-ee Docker image. Before using it, you need to log in to the LinkedData.Center repository using the received key:
docker login <sdaas repository URL you received>
Customizing your SDaaS docker image
You can personalize your SDaaS instance by writing a Dockerfile based on this example:
# Base image: use sdaas-ce, or the sdaas-ee image from your licensed repository.
# You can replace "latest" with your preferred version id.
FROM linkeddatacenter/sdaas-ce:latest
## here your docker customization
then build your docker image with the command:
docker build -t sdaas <here the path of your dockerfile>
Now you can store the created Docker image in your private registry or keep it local.
Do not save your generated docker image in a public registry nor distribute it, because it contains information about your license and because this breaks the license agreement.
Enterprise Edition can use any RDF store or service compatible with the SPARQL 1.1 specifications, e.g. LinkedData.Center SGaaS, Blazegraph, Stardog, AWS Neptune, Virtuoso.
Security best practices suggest running the SDaaS platform and the RDF store in a dedicated VPN. This is not mandatory: you are free to adopt your preferred network topology.
Configuration variables
These variables are defined by the SDaaS Docker image. You can use them, but they should be considered read-only:
Platform variable | Default | Description |
---|---|---|
AGENT_ID | defined at script boot | a unique id for the SDaaS running session |
SDAAS_INSTALL_DIR | /opt/sdaas | Root of the distribution of the SDaaS platform modules |
SDAAS_WORKSPACE | /workspace | Default working directory |
SDAAS_ETC | /etc/sdaas | Where internal configuration files are located |
SDAAS_REFERENCE_DOC | https://linkeddata.center/sdaas | Base URL for command documentation (http/https) |
These variables have a global scope and can be changed at runtime in the Docker instance or inside a script:
Configuration variable | Default | Description |
---|---|---|
SDAAS_APPLICATION_ID | Community Edition | Used in HTTP protocol agent signature |
SD_LOG_PRIORITY | 5 | Default logging priority (5 corresponds to NOTICE) |
SD_ABORT_ON_FAIL | false | Abort scripts on command failure |
SD_DEFAULT_CONTEXT | commands’ context defaults | Default context for all commands, expressed as space-separated key=value pairs (e.g. sid=MYSTORE) |
STORE | http://kb:8080/sdaas/sparql | Default graph store SPARQL endpoint |
STORE_TYPE | w3c | Default store engine driver |
Each SDaaS command has a local context that can be overridden by setting the SD_DEFAULT_CONTEXT global configuration variable or through command options and operands.
For instance, after setting SD_DEFAULT_CONTEXT="sid=MYSTORE", all subsequent calls to SDaaS commands will use the sid MYSTORE instead of the default STORE.
3 - Getting started
The SDaaS Platform offers a programmatic approach to building and updating Knowledge Graphs. It includes a language and a command-line interface (CLI), offering optimized access to one or more RDF graph stores.
The subsequent chapters assume you’ve installed the SDaaS Platform and have some familiarity with the bash shell, Docker, and SPARQL. Additionally, you should possess a basic understanding of the key concepts and definitions outlined in the Knowledge Exchange Engine Specifications (KEES).
Connecting to an RDF Graph Store
SDaaS requires access to an RDF SPARQL service. Let’s launch a graph store using a public Docker image in a Docker network named myvpn:
docker network create myvpn
docker run --network myvpn --name kb -d linkeddatacenter/sdaas-rdfstore
This runs, in the background, a small, full-featured RDF Graph Store instance compliant with the SDaaS requirements.
Now you can get the SDaaS prompt with:
docker run --network myvpn --rm -ti sdaas
Your terminal will show the SDaaS command prompt as an extension of the bash shell:
____ ____ ____
/ ___|| _ \ __ _ __ _/ ___|
\___ \| | | |/ _` |/ _` \___ \
___) | |_| | (_| | (_| |___) |
|____/|____/ \__,_|\__,_|____/
Smart Data as a Service platform - Pitagora
Community Edition 4.0.0 connected to http://kb:8080/sdaas/sparql (w3c)
Copyright (C) 2018-2024 LinkedData.Center
more info at https://linkeddata.center/sdaas
sdaas >
What is happening behind the scenes?
SDaaS needs to connect to a graph store. To create such a connection, you specify a sid (store ID): an environment variable containing the URL of a SPARQL service endpoint for the graph store. By convention, the default sid is named STORE. Out of the box, the SDaaS platform is configured with the default sid STORE=http://kb:8080/sdaas/sparql, which you can change at any moment.
All SDaaS commands that require access to a graph store provide the -s <sid> option. If omitted, the SDaaS platform uses the sid named STORE.
Each sid requires a driver, specified by the variable <sid>_TYPE. For instance, the store engine driver for STORE is defined in STORE_TYPE. By default, SDaaS uses the w3c driver, which is suitable for any standard SPARQL service implementation. In addition to the standard driver, SDaaS Enterprise Edition provides some optimized drivers.
For instance, since the linkeddatacenter/sdaas-rdfstore Docker image is based on the Blazegraph engine, in Enterprise Edition you can use an optimized driver. To enable it, set STORE_TYPE=blazegraph
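The sid convention maps naturally onto bash indirect expansion. The following is a minimal sketch of that idea, not the platform's actual code: the helper names resolve_sid_url and resolve_sid_type are hypothetical, and the fallback to w3c when <sid>_TYPE is unset is an assumption based on the default driver described in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the sid convention; they are not part of SDaaS.

# Print the endpoint URL stored in the variable named by the sid (default: STORE).
resolve_sid_url() {
    local sid="${1:-STORE}"
    # ${!sid} is bash indirect expansion: the value of the variable whose name is in $sid
    printf '%s\n' "${!sid}"
}

# Print the driver stored in <sid>_TYPE, assuming w3c when the variable is unset.
resolve_sid_type() {
    local sid="${1:-STORE}"
    local type_var="${sid}_TYPE"
    printf '%s\n' "${!type_var:-w3c}"
}

STORE="http://kb:8080/sdaas/sparql"
WIKIDATA="https://query.wikidata.org/sparql"
WIKIDATA_TYPE="w3c"

resolve_sid_url              # endpoint of the default sid
resolve_sid_url WIKIDATA     # endpoint of a second sid
resolve_sid_type WIKIDATA    # driver of the second sid
```

Because a sid is just a variable name, adding a new store only requires defining two variables; no registry or configuration file is involved in this sketch.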
A first look at the platform
SDaaS provides a set of commands to introspect the platform. Try typing:
# to get the platform version:
sd view version
# to see SDaaS configuration variables:
sd view config
# to list all installed platform modules. Modules are cached on first use. The cached modules are flagged with "--cached".
sd view modules
# to see all commands exported by the view module:
sd view module view
# to download the whole SDaaS Language profile in turtle RDF serialization
sd view ontology -o turtle
See the Calling SDaaS commands section in the Application building guide to learn more about SDaaS commands.
Boot the knowledge base
Clean up a knowledge graph using the sd store erase command (be careful, this erases your default knowledge graph):
sd store erase
You can verify that the store is empty with
sd store size
KEES compliance:
The command sd kees boot (Enterprise Edition only) uses an optimized algorithm to clean up the knowledge graph and add all the metadata required by the KEES specifications.
Ingest facts
To ingest RDF data, you have some options available:
- using the sd sparql update command to execute SPARQL Update operations for server side ingestion.
- using an ETL pipeline with the sd sparql graph command to transfer and load a stream of RDF triples into a named graph in the graph store.
- using the learn module that provides some specialized shortcuts to ingest data and KEES metadata
Using SPARQL update
The sd sparql update command is executed by the SPARQL service. Therefore, the resource must be visible to the graph store server. For example, to load the entire definition of schema.org:
echo 'LOAD <https://schema.org/version/latest/schemaorg-current-http.ttl> INTO GRAPH <urn:graph:0>' | sd sparql update
Using SPARQL graph
The sd sparql graph command is executed by the SDaaS processor, which stores a stream of RDF triples (in N-Triples serialization) into a named graph inside the graph store. This command offers increased control over the resource by allowing enforcement of the resource type. SDaaS optimizes the transfer of resource triples to the graph store using the method best suited to the driver.
In an ETL process, this command implements the load stage. It is typically used in a piped command.
Some examples:
# get data from a command
sd view ontology \
| sd sparql graph urn:sdaas:tbox
# get data from a local file
sd_cat mydata.nt \
| sd sparql graph
# retrieve linked data from a remote N-Triples resource
sd_curl -s -f https://schema.org/version/latest/schemaorg-current-http.nt \
| sd sparql graph
# retrieve RDF data serialized as Turtle
sd_rapper -i turtle https://dbpedia.org/data/Milan.ttl \
| sd sparql graph https://dbpedia.org/resource/Milan
# retrieve a linked data resource with content negotiation
ld=https://dbpedia.org/resource/Lecco
sd_curl_rdf $ld \
| sd_rapper -g - $ld \
| sd sparql graph -a PUT $ld
# same as above but with KEES metadata
sd_curl_rdf $ld \
| sd_rapper -g - $ld \
| sd kees metadata -D "activity_type=Learning source=$ld trust=0.8" $ld \
| sd sparql graph -a PUT $ld
sd_curl, sd_curl_rdf, sd_rapper, and sd_cat are just wrappers for standard bash commands that trap and log errors.
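To illustrate what a wrapper of this kind might look like, here is a minimal sketch. The function name logged_run and the log format are hypothetical; the actual SDaaS wrappers may behave differently.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, write a message to stderr if it fails,
# and propagate the original exit status to the caller.
logged_run() {
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        printf 'ERROR: "%s" failed with exit status %d\n' "$*" "$status" >&2
    fi
    return "$status"
}

# In a pipeline, stdout still carries the data while diagnostics go to stderr.
logged_run printf '<urn:ex:s> <urn:ex:p> "o" .\n' | wc -l
```

Keeping diagnostics on stderr is what makes such wrappers safe to use in the middle of a pipe: the downstream command only ever sees the data stream.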
The sd sparql graph command supports two methods for graph accrual:
- -a PUT to overwrite a named graph; it creates a new named graph if needed
- -a POST (default) to append data to a named graph; it creates a new named graph if needed
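The difference between the two accrual methods can be expressed in plain SPARQL 1.1 Update. The sketch below only illustrates the semantics; it is not how the SDaaS drivers are implemented, and the helper name accrual_update is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read N-Triples on stdin and print a SPARQL Update request.
#   PUT  = drop the named graph first, then insert (overwrite)
#   POST = insert only (append)
accrual_update() {
    local method="$1" graph="$2"
    if [ "$method" = "PUT" ]; then
        printf 'DROP SILENT GRAPH <%s> ;\n' "$graph"
    fi
    printf 'INSERT DATA { GRAPH <%s> {\n' "$graph"
    cat    # copy the N-Triples stream unchanged
    printf '} }\n'
}

echo '<urn:ex:s> <urn:ex:p> "o" .' | accrual_update PUT urn:graph:example
```

In both cases the named graph is created implicitly by the INSERT DATA operation when it does not exist yet, which matches the "creates a new named graph if needed" behavior described in the list.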
The gsp driver implementation can use the SPARQL 1.1 Graph Store Protocol (GSP). To enable this support, define the driver type <sid>_TYPE=gsp and set <sid>_GSP_ENDPOINT to point to the URL of the service providing the Graph Store Protocol.
WARNING: many graph store engines limit the size of data that can be ingested using only SPARQL Update features with the default w3c driver. Whenever possible, use a driver optimized for your graph store or a GSP-capable endpoint.
Using the learn module (EE)
The learn module provides some shortcuts to load linked data into a graph store together with their KEES metadata.
Here are some examples that load RDF triples:
sd learn resource -D "graph=urn:dataset:dbpedia" https://dbpedia.org/resource/Milan
sd learn file /etc/app.config
sd learn dataset urn:dataset:dbpedia
sd learn datalake https://data.exemple.org/
WARNING: if the sd learn dataset command fails, the target named graph could be incomplete or annotated with a prov:wasInvalidatedBy property.
Querying the knowledge graph
To query the store you can use SPARQL:
cat <<-EOF | sd sparql query -o csv
SELECT ?g (COUNT (?s) AS ?subjects) WHERE {
  GRAPH ?g { ?s ?p ?o }
} GROUP BY ?g
EOF
This command prints a CSV table with all named graphs and the number of triples they contain. The -o option specifies the format you want for the result.
By default, the sd sparql query command outputs XML serialization; use the -o flag to select a different one. Additionally, the sparql module provides convenient command aliases, e.g.:
- sd sparql list to print the result of a SELECT query as CSV without header on stdout
- sd sparql rule to print the result of a SPARQL CONSTRUCT as a stream of N-Triples on stdout
For example:
echo "SELECT DISTINCT ?class WHERE { ?s a ?class} LIMIT 10" | sd sparql list
Reasoning about facts
To materialize inference in the knowledge graph you have some options available:
- using sd sparql update commands using SPARQL Update or SPARQL Query constructors
- using the sd sparql rule with constructors
- using the sd learn rule (EE) with constructors
You can use SPARQL update or a combination of SPARQL query as in the following examples:
# insert data evaluates the expression at SPARQL server side.
cat <<EOF | sd sparql update
INSERT {...}
WHERE {...}
EOF
# pipe two commands
cat <<EOF | sd sparql rule | sd sparql graph
CONSTRUCT ...
EOF
# pipe four commands adding metadata
cat <<EOF | sd sparql rule | sd kees metadata -D "trust=0.9" urn:graph:mycostructor | sd sparql graph urn:graph:mycostructor
CONSTRUCT ...
EOF
# shortcut equivalent to the above:
cat <<EOF | sd learn rule -D "trust=0.9" urn:graph:mycostructor
CONSTRUCT ...
EOF
Using plans (EE)
The plan module provides some specialized commands to run SDaaS scripts.
You can think of a plan as similar to a stored procedure in a SQL database.
For example, assume that STORE contains the following three plans:
@prefix sdaas: <http://linkeddata.center/sdaas/reference/v4#> .
<urn:myapp:cities> a sdaas:Plan; sdaas:script """
sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Milano
sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Lecco
""" .
<urn:reasoning:recursive> a sdaas:Plan ; sdaas:script """
sd learn rule 'CONSTRUCT ...'
sd learn rule http://example.org/rules/rule1.rq
""" .
<urn:test:acceptance> a sdaas:Plan ; sdaas:script """
sd_curl_sparql http://example.org/tests/test1.rq | sd sparql test
sd_curl_sparql http://example.org/tests/test2.rq | sd sparql test
""" .
Then these commands use plans to automate some common activities:
sd plan run -D "activity_type=Ingestion" urn:myapp:cities
sd plan loop -D "activity_type=Reasoning trust=0.75" urn:reasoning:recursive
sd -A plan run -D "activity_type=Publishing" urn:test:acceptance
The sd plan loop command executes a plan until there are no more changes in the knowledge base. It is useful for implementing incremental or recursive reasoning rules.
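The loop semantics can be pictured as a fixed-point iteration. The following sketch assumes that "no more changes" can be detected by comparing a size probe before and after each step; the real sd plan loop may detect changes differently. The names run_until_stable, toy_step and toy_probe are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a fixed-point loop: call a step command repeatedly
# until a size probe reports the same value before and after a step.
#   $1 = command that applies the rules once
#   $2 = command that prints the current size of the knowledge base
#   $3 = maximum number of iterations (safety guard, default 100)
run_until_stable() {
    local step="$1" probe="$2" max="${3:-100}"
    local before after i
    for ((i = 1; i <= max; i++)); do
        before="$("$probe")"
        "$step" || return 1
        after="$("$probe")"
        if [ "$before" = "$after" ]; then
            echo "$i"    # number of iterations executed
            return 0
        fi
    done
    return 2             # no fixed point within the iteration limit
}

# Toy stand-ins: each step adds one line ("triple") until the file holds 3.
store_file="$(mktemp)"
toy_probe() { wc -l < "$store_file" | tr -d ' '; }
toy_step()  { [ "$(toy_probe)" -ge 3 ] || echo fact >> "$store_file"; }

run_until_stable toy_step toy_probe
rm -f "$store_file"
```

With the platform, the probe could be a command such as sd store size and the step a plan execution; the iteration limit is a design choice of this sketch to avoid endless loops when rules keep generating new facts.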
Managing the Knowledge Base Status (EE)
You can signal the publication status of a specific knowledge base using KEES status properties.
For setting, getting, and checking the status of a specific window in the knowledge graph, use:
# prints the date of the last status changes:
sd kees date published
# test a status
sd kees is published || echo "Knowledge Graph is not Published"
sd kees is stable || echo "Knowledge Graph is not in a stable status"
Connecting to multiple RDF Graph Stores
You can direct the SDaaS platform to connect to multiple RDF store instances, using standard or optimized drivers:
AWS="http://mystore.example.com/sparql"
AWS_TYPE="neptune"
WIKIDATA="https://query.wikidata.org/sparql"
WIKIDATA_TYPE="w3c"
This allows you to import data or run reasoning using specialized SPARQL endpoints. For instance, the following example imports all cats from Wikidata into the default graph store and then lists the first five cat names:
cat <<EOF | sd sparql rule -s WIKIDATA | sd sparql graph
DESCRIBE ?item WHERE {
?item wdt:P31 wd:Q146
}
EOF
cat <<EOF | sd sparql list
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cat WHERE {
?item wdt:P31 wd:Q146; rdfs:label ?cat
FILTER( LANG(?cat)= "en")
} ORDER BY ?cat LIMIT 5
EOF
Scripting
You have the ability to create a bash script containing various commands.
Refer to the Application building guide for more info about SDaaS scripting
Quitting the platform
When you type exit, the sdaas container can be safely destroyed; the created data persists in the external store.
Free the allocated Docker resources by typing:
docker rm -f kb
docker network rm myvpn
4 - Guide to Application building
SDaaS™ is a software platform that helps to build Semantic Web Applications and Smart Data Platforms.
SDaaS assists in constructing and managing a knowledge graph that organizes linked data resources annotated with application-specific semantics.
Instead of accessing various data silos, applications utilize the knowledge graph as a reliable central repository, offering a semantic query service. This repository contains a semantic enriched replica of all the data necessary for the application. Because of the inherently distributed nature of the web and the continuous changes in data, an application using the SDaaS platform adopts the Eventual Consistency model. This model is highly popular today as it represents a reasonable trade-off between performance and complexity.
What is a semantic web application
A semantic web application is a software system designed to leverage and utilize the principles and technologies of the Semantic Web.
These applications utilize linked data, ontologies, and metadata to create richer connections between different pieces of information on the internet. They typically involve:
- Structured Data Representation: Semantic web apps use RDF (Resource Description Framework) to represent data in a structured and machine-readable format. This allows for better understanding and interpretation of relationships between different data points.
- Ontologies and Vocabularies: They employ ontologies and vocabularies (such as OWL - Web Ontology Language) to define relationships and meaning between entities, making it easier for systems to understand the context of the data.
- Data Integration and Interoperability: These applications facilitate data integration from diverse sources, enabling different systems to exchange and use information more effectively.
- Inference and Reasoning: Semantic web apps can perform logical inference and reasoning to derive new information or insights from existing data based on defined rules and relationships.
- Enhanced Search and Discovery: They enable more sophisticated search functionalities by understanding the semantics of the data, providing more relevant and contextualized results.
In summary, a semantic web application harnesses Semantic Web technologies to enable machines to comprehend and process data more intelligently, facilitating better data integration, discovery, and utilization across various platforms and domains.
What is a smart data platform
A Smart Data Platform refers to a technological infrastructure designed to collect, process, analyze, and leverage data intelligently to generate insights, make decisions, and power various applications or services. These platforms often incorporate advanced technologies such as artificial intelligence (AI), machine learning (ML), data analytics, and automation to handle vast amounts of data from diverse sources.
A Smart Data Platform typically integrates multiple functionalities, including data ingestion, storage, processing, analysis, visualization, and often includes features for data governance, security, and compliance.
What is Eventual Consistency
Eventual Consistency is a concept in distributed computing where, in a system with multiple replicas of data, changes made to the data will eventually propagate through the system and all replicas will converge to the same state. However, this convergence is not instantaneous; it occurs over time due to factors like network latency, system failures, or concurrent updates. The Knowledge Graph can be considered a semantically enriched replica of the ingested distributed data.
The typical SDaaS user is a DevOps professional who utilizes the commands provided by the platform to script the building and updating of a knowledge graph. This knowledge graph is then queried by an application using SPARQL or REST APIs. SDaaS developers and system integrators can extend the platform by adding custom modules and creating new commands.
In more detail, the typical SDaaS use case scenario is summarized by the following diagram:
cloud "Linked Data cloud" as data
usecase "SDaaS script\ndevelopment" as writesScript
usecase "smart data service\ndeployment" as managesSDaaS
usecase "application development" as developsApplication
usecase "queries\nKnowledge Graph" as usesKnowledge
usecase "installs\nSDaaS modules" as installsSDaaS
usecase "configure\nSDaaS" as configuresSDaaS
usecase "knowledge update" as updatesKnowledge
actor "App devops" as user
package "SDaaS distribution" as Distribution <<Docker image>>
node "smart data service" as SDaaS {
component "SDaaS script" as Script
package Module {
component "SDaaS Command" as Command
interface ConfigVariable
}
}
database "Knowledge Graph" as Store
node Application
user .. developsApplication
user .. writesScript
user .. managesSDaaS
Command o-> ConfigVariable : uses
writesScript .. Script
managesSDaaS -- installsSDaaS
managesSDaaS -- configuresSDaaS
configuresSDaaS .. ConfigVariable
installsSDaaS .. Module
Command .. updatesKnowledge
data . updatesKnowledge
updatesKnowledge . Store
Script o--> Command : calls
Distribution .. installsSDaaS
Application .. usesKnowledge
developsApplication .. Application
usesKnowledge .. Store
Calling SDaaS commands
The SDaaS Platform operates through a set of bash commands and functions. The general syntax to call an SDaaS command is sd <module> <name> [OPTIONS] [OPERANDS], while the syntax of an SDaaS function is sd_<name>.
The modules are bash script fragments that define a set of SDaaS functions, providing a namespace for them.
Before calling an SDaaS function, you must explicitly load its module with the sd_include <module> core function. Core functions are contained in the core module, which is loaded at startup. SDaaS commands automatically include the required modules.
SDaaS commands MAY depend on a set of context variables that you can pass using options. The global configuration variable SD_DEFAULT_CONTEXT provides a default local context used by all commands.
For instance these calls are all equivalent:
sd sparql graph urn:myapp:abox
sd sparql graph -s STORE -D "graph=urn:myapp:abox"
sd sparql graph -D "sid=STORE" -D "graph=urn:myapp:abox"
sd sparql graph -D "sid=STORE graph=urn:myapp:abox"
sd sparql graph -D "sid=OVERRDEN_BY-s graph=urn:myapp:overridden_by_operand" -s STORE urn:myapp:abox
SD_DEFAULT_CONTEXT="sid=STORE graph=urn:myapp:abox"; sd sparql graph
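The precedence suggested by these equivalent calls (command defaults, then SD_DEFAULT_CONTEXT, then -D options, then -s and operands) can be modelled as successive key=value overrides. The sketch below is only a model of that behavior, not the platform's parser; the function name merge_context is hypothetical and it requires bash 4 or later for associative arrays.

```shell
#!/usr/bin/env bash
# Hypothetical model of context resolution (requires bash 4+).
# Each argument is a string of space-separated key=value pairs;
# pairs in later arguments override those in earlier ones.
merge_context() {
    local -A ctx=()
    local spec pair key
    for spec in "$@"; do
        for pair in $spec; do          # intentional word splitting on spaces
            ctx["${pair%%=*}"]="${pair#*=}"
        done
    done
    # print the resolved context sorted by key
    for key in "${!ctx[@]}"; do
        printf '%s=%s\n' "$key" "${ctx[$key]}"
    done | sort
}

# defaults, then a -D style override, then an operand-style override
merge_context "sid=STORE" \
              "sid=MYSTORE graph=urn:myapp:overridden_by_operand" \
              "graph=urn:myapp:abox"
```

The values in this sketch must not contain spaces or glob characters, since it relies on plain word splitting.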
SDaaS scripting
The smart data service usually includes an SDaaS script and an application config file.
An SDaaS script is a normal bash script that includes the SDaaS platform with the command source $SDAAS_INSTALL_DIR/core
Usually you also create an application config file that contains the definitions of the datasets and rules used by the ingestion, reasoning and publishing plans. For instance:
#!/usr/bin/env bash
source $SDAAS_INSTALL_DIR/core
sd store erase
## loads the language profile and the application specific configurations
sd view ontology | sd sparql graph urn:tbox
sd_curl -s -f https://schema.org/version/latest/schemaorg-current-http.nt | sd sparql graph urn:tbox
## loading some facts from dbpedia
for ld in https://dbpedia.org/resource/Lecco https://dbpedia.org/resource/Milan; do
sd_curl_rdf $ld | sd_rapper -g - $ld | sd sparql graph urn:abox
done
The script MAY implement a never-ending loop, similar to this pseudo-code using the SDaaS Enterprise Edition Platform:
#!/usr/bin/env bash
source $SDAAS_INSTALL_DIR/core # Loads the SDaaS platform
while NEW_DATA_DISCOVERED ; do
# Boot and lock platform ######################
sd kees boot -L
## loads the language profile and the application specific configurations
sd -A view ontology | sd sparql graph urn:myapp:tbox
sd learn file /etc/myapp.config
## loading facts
sd learn dataset -D "activity_type=Learning trust=0.9" urn:myapp:facts
# reasoning window loop #########################
sd plan loop -D "activity_type=Reasoning trust=0.9" urn:myapp:reasonings
# publishing window ########################
sd -A plan run -D "activity_type=Publishing" urn:myapp:tests
sd kees unlock
sleep $TIME_SLOT
done
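NEW_DATA_DISCOVERED in the pseudo-code is a placeholder. One possible way to implement it, shown here only as a sketch, is to compare a checksum of a source file with the one recorded on the previous check. The function name new_data_discovered and the state file are hypothetical and not part of SDaaS.

```shell
#!/usr/bin/env bash
# Hypothetical change detector: succeeds (exit 0) when the checksum of a
# source file differs from the one saved in a state file, then updates the state.
#   $1 = file to watch
#   $2 = file where the last seen checksum is stored
new_data_discovered() {
    local source_file="$1" state_file="$2"
    local current previous=""
    current="$(cksum < "$source_file")" || return 2   # source not readable
    if [ -f "$state_file" ]; then
        previous="$(cat "$state_file")"
    fi
    if [ "$current" = "$previous" ]; then
        return 1    # nothing changed since the last check
    fi
    printf '%s\n' "$current" > "$state_file"
    return 0
}

# Example: the first check reports new data, the second one does not.
workdir="$(mktemp -d)"
echo '<urn:ex:s> <urn:ex:p> "v1" .' > "$workdir/facts.nt"
new_data_discovered "$workdir/facts.nt" "$workdir/state" && echo "new data"
new_data_discovered "$workdir/facts.nt" "$workdir/state" || echo "no changes"
rm -rf "$workdir"
```

A real agent would more likely probe remote resources (for example through HTTP cache headers) than local files; the local checksum is used here only to keep the sketch self-contained.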
Application architectures enabled by SDaaS
This chapter describes some typical architectures enabled by SDaaS.
Use case 1: autonomous agent
An ETL agent that transforms raw data into linked data:
Folder "Raw data" as Data
Folder "Linked data" as RDF
note right of RDF: RDF data according to\nan application language profile
package "ETL application" #aliceblue;line:blue;line.dotted;text:blue {
node "Autonomous Agent" as aa #white;line:blue;line.dotted;text:blue
database "graph store" as graphStore
aa -(0 graphStore : run mapping\nrules
}
Data ..> aa
aa ..> RDF
The autonomous agent uses SDaaS to upload raw data in an intermediate form to a graph store and then uses SPARQL rules to map the intermediate format into the application language profile.
Use case 2: linked data platform
The SDaaS platform is used to implement agents that transform and load raw data into a knowledge graph, performing some ontology mappings and providing a Linked Data Platform interface to applications. It is compatible with the Solid protocol.
cloud "LinkedOpen Data Cloud" as Data
package "LOD smart cache" #aliceblue;line:blue;line.dotted;text:blue {
node "Autonomous\nDiscovery Agent" as DiscoveryAgent #white;line:blue;line.dotted;text:blue
database "graph store" as DataLake
node "Linked data Proxy" as DataLakeProxy
DiscoveryAgent -(0 DataLake : writes RDF data
DataLake 0)- DataLakeProxy
}
Data ..> DiscoveryAgent
DataLakeProxy -LDP
The linked-data proxy is a standard component providing support for the VoID ontology and HTTP cache features. LinkedData.Center provides a free open source implementation that can be used out of the box or as a reference implementation for this component.
Use case 3: smart data warehouse
The typical SDaaS application architecture to build an RDF based data warehouse is the following:
cloud "1st, 2nd and 3rd-party raw data" as Data
package "data management platform" #aliceblue;line:blue;line.dotted;text:blue {
node "Fact provider" as DiscoveryAgent #white;line:blue;line.dotted;text:blue
folder "Linked-data lake" as DataLake
node "smart data service" as SDaaSApplication #white;line:blue;text:blue
database "RDF\nGraph Store" as GraphStore
DiscoveryAgent --(0 DataLake : writes RDF data
DataLake 0)-- SDaaSApplication : learn data
SDaaSApplication --(0 GraphStore : updates
note right of DiscoveryAgent
Here the application injects
its specific semantic in raw data
end note
note left of SDaaSApplication
Here KEES cycle
is implemented
end note
}
interface "SPARQL QUERY" AS SQ
GraphStore - SQ
package "application" {
node "Application backend" as Backend
node "Application frontend" as Frontend
database "application local data" as firstPartyData
Backend 0)- Frontend : calls
Backend --( SQ : queries
firstPartyData 0)-- Backend : writes
}
Data ..> DiscoveryAgent : gets raw data
Data <..... firstPartyData : copy 1st-party data
You can distinguish two distinct threads: the development of a data management platform and the development of the application. The knowledge graph built in the data platform is used by the application as the primary source of all data. The data produced by the application can be reinjected into the data management platform.
The SDaaS platform is used in the development of the data management platform, primarily in the development of the smart data service and optionally in the Autonomous Discovery Agent.
In more detail, the main components of the data platform are:
- Autonomous Discovery Agent
- it is an application-specific ETL process triggered by changes in data. This process transforms raw data into linked data annotated with information recognized by the application and stores it in a linked-data lake. Multiple Autonomous Discovery Agents may exist and operate concurrently. Each agent can access the Graph Store to identify enrichment opportunities or to detect changes in the data source.
- Linked-data lake
- it is a repository, for example an S3 bucket or a shared file system, that contains RDF files, that is, Linked Data Platform RDF Sources [LDP-RS] expressed with a known language profile. These files can be mirrors of existing web resources, mappings of databases or even private data described natively in RDF.
- smart data service
- it is a service that includes the SDaaS platform and that contains a script processing data conforming with the KEES specifications.
- RDF Graph Store
- implements the Knowledge Graph supporting the SPARQL protocol interface. LinkedData.Center provides a free, full-featured RDF graph database you can use to learn and test the SDaaS platform.
Use case 4: Smart data agent
All activities are performed by the same agent that embeds its workflow.
cloud "3rd-party data" as ldc
database "Knowledge Graph" as GraphStore
folder "Reports" as RawData
Folder "Linked data lake" as ldl
node "Smart data agent" as agent #white;line:blue;text:blue
ldl --> agent
ldl <-- agent
ldc -> agent
agent --> GraphStore
agent -> RawData
The workflow is just a definition of the activities that should be completed by the agent.
Use case 5: semantic web agency
In this architecture, multiple agents run at the same time; the agents are coordinated using knowledge graph status and locks.
cloud "3rd-party data" as ldc
database "Knowledge Graph" as GraphStore
folder "reports" as RawData
note bottom of RawData: can be used as raw data
Folder "Linked data lake" as ldl
note top of ldl: contains activity plans
package "Agency" #aliceblue;line:blue;line.dotted;text:blue {
node "Smart data agent" as Ingestor #white;line:blue;text:blue
node "Reasoning\nagent(s)" as Reasoner #white;line:blue;line.dotted;text:blue
node "Enriching\nagent(s)" as Enricher #white;line:blue;line.dotted;text:blue
node "Publishing\nagent(s)" as Publisher #white;line:blue;line.dotted;text:blue
}
ldl --> Ingestor
ldl <-- Enricher
ldc --> Enricher
Ingestor --> GraphStore
Reasoner <--> GraphStore
Publisher <-- GraphStore
Enricher <-- GraphStore
Publisher --> RawData
The workflow is just a definition of the activities that should be completed by the agents.
5 - Customizing the SDaaS platform
5.1 - Architecture overview
The SDaaS platform is utilized for creating smart data platforms as backing services, empowering your applications to leverage the full potential of the Semantic Web.
The SDaaS platform out-of-the-box contains an extensible set of modules that connect to several Knowledge Graphs through optimized driver modules.
A smart data service conforms to the KEES specifications and is realized by a customization of the SDaaS docker image. It contains one or more scripts calling a set of commands implemented by modules. The command behavior can be modified through configuration variables.
node "SDaaS platform " as SDaaSDocker <<docker image>>{
component Module {
collections "commands" as Command
collections "configuration variables" as ConfigurationVariable
}
}
node "smart data service" as SDaaSService <<docker image>>{
component "SDaaS Script" as App
}
database "Knowledge Graphs" as GraphStore
cloud "Linked Data" as Data
SDaaSService ---|> SDaaSDocker : extends
Command ..(0 Data : HTTP
Command ..(0 GraphStore : SPARQL
Command o. ConfigurationVariable
App --> Command : calls
App --> ConfigurationVariable : set/get
It is possible to add new modules to extend the power of the platform to match special needs.
Data Model
The SDaaS data model is designed around a few concepts: Configuration Variables, Functions, Commands, and Modules.
SDaaS configuration variables
An SDaaS configuration variable is a bash environment variable that the platform uses as a configuration option. Configuration variables have a default value; they can be changed statically in the Docker image or at runtime in Docker run, Docker orchestration, or in user scripts.
The following taxonomy applies to SDaaS configuration variables:
class "SDaaS Configuration variable" as ConfigVariable
class "SID Variable" as SidVariable
class "Platform Variable" as PlatformVariable <<read only>>
interface EnvironmentVariable
ConfigVariable --|> EnvironmentVariable
SidVariable --|> ConfigVariable
ConfigVariable <|- PlatformVariable
Environment Variable
It is a shell variable.
Platform variable
It is a variable defined by the SDaaS Docker image that should not be changed outside the Dockerfile.
SID Variable
It is a special configuration variable that states a graph store property. The general syntax is <sid>_<var name>. For example, the variable STORE_TYPE refers to the name of the driver module that must be used to access the graph store connected by the sid STORE. Some drivers can require or define other sid variables with their default values.
See all available configuration variables in the installation guide
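The <sid>_<var name> convention maps naturally onto bash indirect expansion. The following is a minimal sketch of that lookup, not the platform's own code: the helper name sid_var, the sid STAGING and the values assigned below are assumptions made only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SDaaS): print the value of the SID
# variable <sid>_<name>, or fail if that variable is not set.
sid_var() {
    local sid="$1" name="$2"
    local var="${sid}_${name}"
    # ${!var+x} expands to "x" only when the variable named by $var is set.
    if [[ -z "${!var+x}" ]]; then
        echo "undefined SID variable: ${var}" >&2
        return 1
    fi
    printf '%s\n' "${!var}"
}

# Example values, assumed for illustration only.
STORE_TYPE="w3c"
STAGING_TYPE="testdriver"

sid_var STORE TYPE      # prints: w3c
sid_var STAGING TYPE    # prints: testdriver
```

Keeping the sid as a plain name, rather than a URL, lets a script switch graph stores by passing a different sid while the store properties stay in the environment.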
SDaaS Functions
An SDaaS function is a bash function embedded in the platform. For example, sd_log
. A bash function accepts a fixed set of positional parameters, writes output on stdout, and returns 0 on success or an error code otherwise.
The following taxonomy applies to SDaaS functions:
Interface "Bash function" as BashFunction
Class "Sdaas function" as SdaasFunction
Class "Driver virtual function" as DriverVirtualFunction
Class "Driver function implementation" as DriverFunction
SdaasFunction --|> BashFunction
DriverFunction --|> SdaasFunction
DriverVirtualFunction --|> SdaasFunction
DriverFunction <- DriverVirtualFunction: calls
Bash Function
Is the interface of a generic function defined in the scope of a bash process.
Driver virtual function
It is a function that acts as a proxy for a driver method; its first parameter is always the sid. A driver virtual function has the syntax sd_driver_<method name> (e.g. sd_driver_load) and requires a fixed set of positional parameters.
Driver method implementation
It is a function that implements a driver virtual function for a specific graph store engine driver. A driver method implementation has the syntax sd_<driver name>_<method name> (e.g. sd_w3c_load) and expects a fixed set of positional parameters (unchecked). Driver method implementations should be called only by a driver virtual function.
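The relationship between a driver virtual function and a driver method implementation can be sketched as a name-based dispatch. This is only an illustration under stated assumptions: sketch_driver_size plays the role of a virtual function such as sd_driver_size, the demo driver is hypothetical, and the real platform may resolve and validate drivers differently.

```shell
#!/usr/bin/env bash
# Toy driver method implementation for a hypothetical "demo" driver:
# it only reports what it was asked to do.
sd_demo_size() {
    local sid="$1"
    echo "demo driver: size of ${sid}"
}

# Sketch of a driver virtual function: the first parameter is the sid,
# and <sid>_TYPE names the driver whose implementation is called.
sketch_driver_size() {
    local sid="$1"
    local type_var="${sid}_TYPE"
    local driver="${!type_var:-}"
    local impl="sd_${driver}_size"
    # Fail when the sid has no type or no matching function is defined.
    if [[ -z "$driver" ]] || ! declare -F "$impl" >/dev/null; then
        echo "no size implementation for sid ${sid}" >&2
        return 1
    fi
    "$impl" "$@"
}

STORE_TYPE="demo"            # assumed value, for illustration only
sketch_driver_size STORE     # prints: demo driver: size of STORE
```

With this shape, callers depend only on the virtual function name, so a different graph store engine can be selected by changing the <sid>_TYPE variable.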
SDaaS commands
A command is an SDaaS function that conforms to the SDaaS command requirements, for example sd_sparql_update. A command writes output on stdout, logs on stderr, and returns 0 on success or an error code otherwise.
In a script, SDaaS commands should be called through the sd function using the following syntax: sd <module name> <function name> [options] [operands]. The sd function allows flexible output error management and auto-includes all required modules. Direct calls to command functions should be made only inside module implementations.
For instance, calling sd -A sparql update executes the command function sd_sparql_update, including the sparql module and aborting the script in case of error. This is equivalent to sd_include sparql; sd_sparql_update || sd_abort.
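The equivalence stated above can be illustrated with a simplified dispatcher. Everything in this sketch is an assumption made for illustration: sd_sketch is not the real sd function, the sd_include, sd_abort and sd_sparql_* bodies are stand-ins, and the stand-in sd_abort only returns a non-zero status instead of stopping the script so that the example stays testable.

```shell
#!/usr/bin/env bash
# Stand-ins for platform functions, defined here only so the sketch runs.
sd_include()       { echo "include $1" >&2; }
sd_abort()         { echo "abort" >&2; return 2; }
sd_sparql_update() { echo "update executed"; }
sd_sparql_fail()   { return 1; }

# Simplified dispatcher (hypothetical).
# Usage: sd_sketch [-A] <module name> <function name> [args...]
sd_sketch() {
    local abort_on_error=0
    if [[ "${1:-}" == "-A" ]]; then
        abort_on_error=1
        shift
    fi
    local module="$1" command_name="$2"
    shift 2
    # Load the module, then call sd_<module>_<function>.
    sd_include "$module" || return
    local status=0
    "sd_${module}_${command_name}" "$@" || status=$?
    if (( status != 0 && abort_on_error == 1 )); then
        sd_abort
        return
    fi
    return "$status"
}

sd_sketch -A sparql update    # stdout: update executed
```

Routing calls through one entry point keeps module loading and error policy out of the individual scripts.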
The following taxonomy applies to commands:
Class Command
Class "Facts Provision" as DataProvidingCommand
Class "Ingestion Command" as IngestionCommand
Class "Query Command" as QueryCommand
Class "Learning Command" as LearningCommand
Class "Reasoning Command" as ReasoningCommand
Class "Enriching Command" as EnrichingCommand
Class "SDaaS function" as SDaaSFunction
Class "Store Command" as StoreCommand
Class "Compound Command" as CompoundCommand
Command --|>SDaaSFunction
DataProvidingCommand --|> Command
StoreCommand --|> Command
IngestionCommand --|> StoreCommand
QueryCommand --|> StoreCommand
LearningCommand -|> CompoundCommand
LearningCommand --|> DataProvidingCommand
LearningCommand --|> IngestionCommand
CompoundCommand <|- ReasoningCommand
ReasoningCommand --|> IngestionCommand
ReasoningCommand --|> QueryCommand
EnrichingCommand --|> ReasoningCommand
EnrichingCommand --|> LearningCommand
Compound Command
It is a command resulting from the composition of two or more commands, usually in a pipeline.
Facts Provision
It is a command that provides RDF triples in output.
Query Command
It is a command that can extract information from the knowledge graph.
Ingestion Command
It is a command that stores facts into a knowledge graph.
Reasoning Command
It is a command that both queries and ingests data into the same knowledge graph according to some rules.
Learning Command
It is a command that provides and ingests facts into the knowledge graph.
Enriching Command
It is a command that queries the knowledge base, discovers new data, and injects the results back into the knowledge base.
Store Command
It is a command that interacts with a knowledge base. It accepts the -s *SID* and -D "sid=*SID*" options.
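The command taxonomy can be made concrete with a small pipeline. This is a sketch under stated assumptions, not SDaaS code: the function names are hypothetical and a temporary file of N-Triples lines stands in for the knowledge graph.

```shell
#!/usr/bin/env bash
# Stand-in for a knowledge graph: a plain file of N-Triples lines.
GRAPH_FILE="$(mktemp)"

# Facts provision: writes RDF triples (N-Triples) on stdout.
provide_facts() {
    echo '<urn:example:s> <urn:example:p> "hello" .'
    echo '<urn:example:s> <urn:example:q> "world" .'
}

# Ingestion command: stores the facts read from stdin.
ingest_facts() {
    cat >> "$GRAPH_FILE"
}

# Query command: extracts information (here, just the triple count).
count_facts() {
    wc -l < "$GRAPH_FILE" | tr -d ' '
}

# Learning command: a compound command that provides and ingests facts.
learn_facts() {
    provide_facts | ingest_facts
}

learn_facts
count_facts    # prints: 2
```

Because every command reads stdin and writes stdout, compound commands are ordinary shell pipelines.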
SDaaS modules
A SDaaS module is a collection of commands and configuration variables that conforms to the module building requirements.
You can explicitly include a module content with the command sd_include
The module taxonomy is depicted in the following image:
class "SDaaS module" as Module
Abstract "Driver" as AbstractDriver
Class "Core Module" as Core
Class "Driver implementation" as DriverImplementation
Class "Command Module" as CommandModule
Class "Store module" as StoreModule
Core --|> Module
CommandModule --|> Module
AbstractDriver --|> Module
DriverImplementation <- AbstractDriver : calls
StoreModule --|> CommandModule
StoreModule --> AbstractDriver : includes
Core <- CommandModule : includes
Command Module
Modules that implement a set of related SDaaS commands. They always include the Core Module and can depend on other modules.
Core Module
A module singleton that exposes core commands and must be loaded before using any SDaaS feature.
Driver
A module singleton that exposes the abstract Driver function interface to handle connections with a generic graph store engine.
Driver implementation
Modules that implement the function interface exposed by the Abstract Driver for a specific graph store engine.
Store Module
Modules that export store commands that connect to a graph store using the functions exposed by the Driver module. A store module always includes the Driver module.
The big picture
The resulting SDaaS platform data model big picture is:
package "User Application" {
class "User Application" as Application
}
package "SDaaS platform" #aliceblue;line:blue;line.dotted;text:blue {
class Command
class Module
class "SDaaS function" as SDaaSFunction
Abstract "Driver" as Driver
class ConfigVariable
Abstract "SID Variable" as SidVariable
interface "bash function" as Function
interface EnvironmentVariable
}
package "Backing services" {
interface "Backing service" as BakingService
interface "Graph Store" as GraphStore
interface "Knowledge Graph" as KnowledgeGraph
interface "Linked Data Platform" as RDFResource
}
package "smart data service" #aliceblue;line:blue;line.dotted;text:blue {
class "SDaaS script" as Script
class "smart data Service" as SDaaS
}
KnowledgeGraph : KEES compliance
GraphStore : SPARQL endpoint
RDFResource : URL
Command --|> SDaaSFunction
SDaaSFunction --|> Function
ConfigVariable --|> EnvironmentVariable
SidVariable --|> ConfigVariable
Driver --|> Module
GraphStore --|> BakingService
RDFResource --|> BakingService
KnowledgeGraph --|> GraphStore
BakingService <|-- SDaaS
Module *-- Command : exports
Module *-- ConfigVariable : declares
Command o.. RDFResource : learns
Driver --> KnowledgeGraph : connects
Module .> Module : includes
Function --o Script
EnvironmentVariable --o Script
Script -o SDaaS : contains
SidVariable <- Driver : uses
ConfigVariable .o Command : uses
Application ..> KnowledgeGraph : access
Backing service
It is a type of software that operates on servers, handling data storage, resource publishing, and processing tasks for an application.
Graph Store
It is a backing service that supports the SPARQL protocol and, optionally, the Graph Store Protocol.
Knowledge Graph
It is a Graph Store compliant with the KEES specification.
Linked Data Platform
It is a web information resource that exposes RDF data in one of the supported serializations according to the W3C LDP specifications.
SDaaS script
It is a bash script that uses SDaaS commands.
smart data service
It is a backing service that includes the SDaaS platform and implements one or more SDaaS scripts.
User application
It is a (Semantic Web) Application that uses a knowledge graph.
5.2 - Module building
Conformance Notes
This chapter defines some conventions to extend the SDaaS platform.
Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words may not always appear in uppercase letters in this specification.
Module Requirements
An SDaaS module is a bash script file fragment that complies with the following restrictions:
- It MUST be placed in the SDAAS_INSTALL_DIR directory or in the $HOME/modules directory. Modules defined in the $HOME/modules directory take precedence over the ones defined in the SDAAS_INSTALL_DIR directory.
- Its name MUST match the regular expression ^[a-z][a-z0-9-]+$.
- The first line of the module MUST be if [[ ! -z ${__module_MODULENAME+x} ]]; then return ; else __module_MODULENAME=1 ; fi where MODULENAME is the name of the module.
- All commands defined in the module SHOULD match sd_MODULENAME_FUNCTIONNAME, where MODULENAME is the name of the module that contains the command and FUNCTIONNAME is a unique name inside the module. Rewriting existing commands is allowed but discouraged.
- All commands MUST follow the syntax conventions described below.
A module MAY contain:
- Constants (read only variables): constants SHOULD be prefixed by SDAAS_ and MUST have a unique name in the platform. Overriding default
- Configuration variables: configuration variables SHOULD be prefixed by SD_ and MUST have a unique name in the platform.
- Module commands definition.
- Module initialization: a set of bash commands that always runs on module loading.
For example, see the modules implementation in SDaaS community edition
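The requirements above can be put together in a small skeleton. This is a sketch of a hypothetical module named hello, not a module shipped with SDaaS: the constant, the configuration variable, the command and the load counter are invented for illustration, and plain source stands in for sd_include.

```shell
#!/usr/bin/env bash
# Write a hypothetical module named "hello" to a temporary directory.
module_dir="$(mktemp -d)"
cat > "${module_dir}/hello" <<'EOF'
if [[ ! -z ${__module_hello+x} ]]; then return ; else __module_hello=1 ; fi
# Constant: read-only, SDAAS_ prefix.
readonly SDAAS_HELLO_VERSION="0.1"
# Configuration variable: SD_ prefix; an existing value wins over the default.
SD_HELLO_GREETING="${SD_HELLO_GREETING:-Hello}"
# Command, named sd_MODULENAME_FUNCTIONNAME.
sd_hello_greet() {
    echo "${SD_HELLO_GREETING}, ${1:-world}"
}
# Module initialization: runs each time the module is actually loaded.
__hello_load_count=$(( ${__hello_load_count:-0} + 1 ))
EOF

# Source the fragment twice: the guard on the first line turns the second
# load into a no-op, so the readonly constant is not redefined.
source "${module_dir}/hello"
source "${module_dir}/hello"

sd_hello_greet SDaaS          # prints: Hello, SDaaS
echo "$__hello_load_count"    # prints: 1
```

The guard line is what makes repeated inclusion safe, which matters because several modules may include the same dependency.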
Command requirements
Commands MUST support the Utility Syntax Guidelines described in the Base Definitions volume of POSIX.1‐2017, Section 12.2, from section 3 to 10, inclusive.
All commands that accept options and/or operands SHOULD also accept the -D option. This option is used to define local variables, in the form of key-value pairs, that provide a context for the command.
Options SHOULD conform to the following naming conventions:
- -A
- abort if the command returns a value > 0.
- -a, -D "accrual_method=ACCRUAL METHOD"
- accrual method, the method by which items are added to a collection. The PUT and POST methods SHOULD be recognized.
- -f FILE NAME, -D "input_file=FILE NAME"
- used to refer to a local file object with a relative or absolute path
- -i INPUT FORMAT, -D "input_format=INPUT FORMAT"
- input format specification; these values SHOULD be recognized (from libraptor):
Format | Description |
---|---|
rdfxml | RDF/XML (default) |
ntriples | N-Triples |
turtle | Turtle Terse RDF Triple Language |
trig | TriG - Turtle with Named Graphs |
guess | Pick the parser to use using content type and URI |
rdfa | RDF/A via librdfa |
json | RDF/JSON (either Triples or Resource-Centric) |
nquads | N-Quads |
- -h
- prints a help description
- -o OUTPUT FORMAT, -D "output_format=OUTPUT FORMAT"
- output format specification. These values SHOULD be recognized:
- csv
- csv-h
- csv-1
- csv-f1
- boolean
- tsl
- json
- ntriples
- xml
- turtle
- rdfxml
- test
- -p PRIORITY
- used to reference a priority
level | mnemonic | explanation |
---|---|---|
2 | CRITICAL | Should be corrected immediately, but indicates failure in a primary system - fix CRITICAL problems before ALERT - an example is loss of primary ISP connection. |
3 | ERROR | Non-urgent failures - these should be relayed to developers or admins; each item must be resolved within a given time. |
4 | WARNING | Warning messages - not an error, but indicates that an error will occur if action is not taken, e.g. file system 85% full - each item must be resolved within a given time. |
5 | NOTICE | Events that are unusual but not error conditions - might be summarized in an email to developers or admins to spot potential problems - no immediate action required. |
6 | INFORMATIONAL | Normal operational messages - may be harvested for reporting, measuring throughput, etc. - no action required. |
7 | DEBUG | Info is useful to developers for debugging the app, not useful during operations. |
- -s SID, -D "sid=SID"
- connect to the Graph Store named SID (STORE by default)
- -S SIZE
- used to reference a size
Evaluation of the local command context
The process of evaluating the local context for a command is as follows:
- First, the local context hardcoded in the command implementation is evaluated.
- The hardcoded local context can be overridden by the global configuration variable SD_DEFAULT_CONTEXT.
- The resulting context can be overridden by specific command options (e.g., -s SID). Command options are evaluated left to right.
- The resulting context can be overridden by specific command operands.
For example, all these calls will have the same result:
- sd_sparql_graph: the hardcoded local context ingests data into a named graph inside the graph store connected to the sid STORE, using a generated UUID URI as the name of the named graph.
- sd_sparql_graph -s STORE $(sd_uuid): hardcoded local context overridden by specific options and operand.
- sd_sparql_graph -D "sid=STORE": hardcoded local context overridden by the -D option.
- sd_sparql_graph -D "sid=STORE" -D "graph=$(sd_uuid)": the same as above but using multiple -D options (always evaluated left to right).
- SD_DEFAULT_CONTEXT="sid=STORE"; sd_sparql_graph $(sd_uuid): hardcoded local context overridden by SD_DEFAULT_CONTEXT and operand.
- SD_DEFAULT_CONTEXT="sid=XXX"; sd_sparql_graph -s YYY -D "sid=STORE graph=ZZZZ" $(sd_uuid): a silly command call that demonstrates the overriding precedence in local context evaluation.
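The four evaluation steps can be sketched with getopts and an associative array (bash 4 or later). This is an illustration under stated assumptions rather than the platform's implementation: resolve_context and apply_context are hypothetical names, and a fixed placeholder URI replaces the generated UUID so that the output is deterministic.

```shell
#!/usr/bin/env bash
# Apply space-separated key=value pairs to the associative array "ctx"
# declared by the caller (bash dynamic scoping makes it visible here).
apply_context() {
    local pair
    for pair in $1; do
        ctx["${pair%%=*}"]="${pair#*=}"
    done
}

# Hypothetical command: resolves its local context and prints it.
resolve_context() {
    # 1. Hardcoded local context.
    local -A ctx=( [sid]="STORE" [graph]="urn:example:generated" )
    # 2. Global configuration variable.
    apply_context "${SD_DEFAULT_CONTEXT:-}"
    # 3. Command options, evaluated left to right.
    local opt OPTIND=1
    while getopts ":s:D:" opt; do
        case "$opt" in
            s) ctx[sid]="$OPTARG" ;;
            D) apply_context "$OPTARG" ;;
            *) echo "invalid option" >&2; return 1 ;;
        esac
    done
    shift $(( OPTIND - 1 ))
    # 4. Command operands.
    if (( $# > 0 )); then ctx[graph]="$1"; fi
    echo "sid=${ctx[sid]} graph=${ctx[graph]}"
}

SD_DEFAULT_CONTEXT="sid=XXX"
resolve_context -s YYY -D "sid=STORE graph=ZZZZ" urn:example:g1
# prints: sid=STORE graph=urn:example:g1
```

Each later step simply overwrites keys set by an earlier one, which is why the last writer for a given key wins.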
5.3 - Driver building
Conformance Notes
This chapter defines some conventions to extend the SDaaS platform.
Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words may not always appear in uppercase letters in this specification.
Driver Requirements
- A driver MUST be a valid SDaaS module (see Module building).
- It MUST implement the driver interface with SDaaS functions that conform to the following naming convention: sd_<driver name>_<driver method>. For example, if you want to implement a special driver for the stardog graph store, you MUST implement the sd_stardog_query function.
- All driver functions MUST be positional, with no defaults. Checking the validity of the parameters is the responsibility of the caller (i.e. the driver module).
- The following methods MUST be implemented:
method name( parameters ) | description |
---|---|
erase(sid) | erase the graph store |
load(sid, graph, accrual_method) | loads a stream of N-Triples into a named graph in a graph store according to the accrual policy |
query(sid, mime_type) | execute in a graph store a SPARQL query read from std input, requesting in output one of the supported mime types |
size(sid) | return the number of triples in a store |
update(sid) | execute in a graph store a SPARQL update request read from std input |
All method parameters are strings that MUST match the following rules:
- sid MUST match the ^[a-zA-Z]+$ regular expression and MUST point to an http(s) URL. It is assumed that sid is the name of a SID variable.
- graph MUST be a valid URI.
- mime_type MUST be a valid MIME type.
- accrual_method MUST match PUT or POST.
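Since driver method implementations do not check their parameters, the caller has to. The following is a minimal sketch of such checks, with hypothetical function names. Two points are assumptions: "points to an http(s) URL" is read here as "the variable named by the sid holds the URL", and the graph check only looks for a URI scheme, which is much looser than full URI validation.

```shell
#!/usr/bin/env bash
# Hypothetical validators for driver method parameters.

# sid: letters only, and (assumed reading) the variable named by the sid
# holds an http(s) URL.
valid_sid() {
    local sid="$1"
    [[ "$sid" =~ ^[a-zA-Z]+$ ]] || return 1
    [[ "${!sid:-}" =~ ^https?:// ]]
}

# graph: loose check that only requires a leading URI scheme.
valid_graph() {
    local scheme_re='^[a-zA-Z][a-zA-Z0-9+.-]*:'
    [[ "$1" =~ $scheme_re ]]
}

# accrual_method: PUT or POST.
valid_accrual_method() {
    [[ "$1" == "PUT" || "$1" == "POST" ]]
}

STORE="http://localhost:8080/sparql"   # assumed endpoint, for illustration only
valid_sid STORE && valid_graph "urn:example:g" && valid_accrual_method PUT \
    && echo "parameters look valid"
```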
Implementing a new graph store driver
A driver implementation is described by the following UML class diagram:
package "SDaaS Community Edition" {
interface DriverInterface
DriverInterface : + erase(sid)
DriverInterface : + load(sid, graph, accrual_method)
DriverInterface : + query(sid, mime_type)
DriverInterface : + size(sid)
DriverInterface : + update(sid)
DriverInterface : + validate(sid)
class w3c
class testdriver
}
package "SDaaS Enterprise Edition" {
class blazegraph
class gsp
class neptune
}
testdriver ..|> DriverInterface
w3c ..|> DriverInterface
blazegraph --|> w3c
gsp --|> w3c
neptune --|> w3c
There is a reference implementation of the driver interface known as the w3c driver, compliant with the W3C SPARQL 1.1 protocol specification, along with the testdriver stub driver implementation to be used in unit tests. The SDaaS Enterprise Edition offers additional drivers that are specialized versions of the w3c driver, optimized for specific graph store technologies:
- gsp driver: a w3c driver extension that uses the SPARQL 1.1 Graph Store HTTP Protocol. It defines the configuration variable <sid>_GSP_ENDPOINT that contains the HTTP address of the service providing a Graph Store Protocol interface.
- blazegraph driver: an optimized implementation for the Blazegraph graph store.
- neptune driver: an optimized implementation for the AWS Neptune service.
In commands, do not call the driver method implementation function directly. Instead, call the corresponding abstract driver module functions.
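To show the shape of the driver interface, here is a toy file-backed driver. It is a sketch under stated assumptions and not one of the drivers listed above: the name filedrv is hypothetical, each store is a file of N-Triples lines, named graphs are ignored, PUT is assumed to replace the content and POST to append to it, and query and update are only rejected because they would need a SPARQL engine. The functions are called directly here only to exercise the sketch.

```shell
#!/usr/bin/env bash
# Toy driver "filedrv" (hypothetical): one N-Triples file per sid.
FILEDRV_DIR="$(mktemp -d)"

sd_filedrv_erase() { : > "${FILEDRV_DIR}/$1"; }

sd_filedrv_load() {
    local sid="$1" graph="$2" accrual_method="$3"
    # Assumed semantics: PUT replaces the content, POST appends to it.
    if [[ "$accrual_method" == "PUT" ]]; then
        cat > "${FILEDRV_DIR}/${sid}"
    else
        cat >> "${FILEDRV_DIR}/${sid}"
    fi
}

sd_filedrv_size() {
    local file="${FILEDRV_DIR}/$1"
    if [[ -f "$file" ]]; then wc -l < "$file" | tr -d ' '; else echo 0; fi
}

# query and update need a SPARQL engine; this toy driver rejects them.
sd_filedrv_query()  { echo "query not supported" >&2; return 1; }
sd_filedrv_update() { echo "update not supported" >&2; return 1; }

printf '%s\n' '<urn:example:s> <urn:example:p> "a" .' \
    | sd_filedrv_load STORE urn:example:g POST
printf '%s\n' '<urn:example:s> <urn:example:p> "b" .' \
    | sd_filedrv_load STORE urn:example:g POST
sd_filedrv_size STORE    # prints: 2
```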