
Get hands-on with the platform.

This section provides an overview of the components that build up the SDaaS architecture.

This section is addressed to platform developers and integrators.

1 - 🚀 Overview

An overview of the SDaaS platform

SDaaS™ (Smart Data as a Service) is a software platform designed by LinkedData.Center to build Enterprise Knowledge Graphs.

The SDaaS platform empowers your applications, allowing them to integrate any Linked Data and leverage the full potential of the semantic web. It is the reference implementation of KEES (Knowledge Exchange Engine Specifications), facilitating the exchange and sharing of domain-specific knowledge.

node "User Application" as Application
cloud "Linked Data cloud" as RDFResource
    usecase "Data access" as query
package "Application Data Management System" {
    node "Smart Data service" as SDaaS #aliceblue;line:blue;line.dotted;text:blue
    database "Knowledge Graph" as KnowledgeGraph
    usecase "Data management" as maintain
}
Application -- query
KnowledgeGraph  -- query
SDaaS -- maintain
KnowledgeGraph -- maintain
KnowledgeGraph  <- SDaaS
RDFResource -> SDaaS

note top of SDaaS
Here is where the SDaaS platform is used
end note

The typical SDaaS user is a DevOps professional who utilizes the commands provided by the platform to develop the service that learns, reasons, enriches and publishes linked data within a graph store. The resulting knowledge graph is then queried by an application using SPARQL or REST APIs. SDaaS developers and system integrators can extend the platform by adding custom modules and creating new commands.

What is a Knowledge Graph?

The knowledge graph is a pivotal component within any modern Data Management Platform (DMP), consisting of:

  • a graph store containing first, second and third-party data;
  • a formal knowledge organization that allows linking the data;
  • metadata to evaluate the data quality;
  • axioms and rules computed by “reasoners” that are able to understand the meaning of the data;
  • an interface to query the knowledge graph and to answer questions

What is KEES?

KEES (Knowledge Exchange Engine Specifications and Services) is an architectural design pattern that establishes specific requirements for Semantic Web Applications. Its purpose is to formally describe domain knowledge with the goal of making it tradeable and shareable. The full specifications are available on the KEES homepage.

SDaaS™ features

SDaaS provides the following features:

  • compliance with W3C Semantic Web standards
  • compatibility with various storage engines (e.g., Wikimedia Blazegraph, AWS Neptune, etc.)
  • performance optimization for the vendor implementations
  • concurrent use of many knowledge instances for a virtually unlimited capacity
  • ingestion automation of Linked Data according to the VoID standard
  • data provenance management automation according to the PROV ontology
  • extensible deductive reasoning through configurable OWL axioms materialization
  • abductive reasoning through configurable rules
  • full support of the KEES specification

SDaaS is available as:

  • the community edition platform (ce), released as unsupported Open Source on GitHub, that provides a CLI interface and core tools for creating a knowledge base;
  • the enterprise edition platform (ee), only available in the commercial version, that extends the community edition with additional tools and interfaces

SDaaS is a precious companion for:

  • Anti-Money Laundering and Fraud detection (e.g. https://mospo.eu/)
  • AI applications (e.g. https://EZClassifier.com and https://AdvisorLens)
  • Identity management (e.g. W3C Verifiable Credential schema)
  • Financial data processing (e.g. https://budget.g0v.it/)
  • Social Network
  • Recommendation engines
  • Scientific applications
  • … many other use cases

Have a look at the Getting Started guide to see SDaaS in action.

Who wrote the code of SDaaS

LinkedData.Center is a small technology company based in Europe, specializing in Neuro-Symbolic A.I. For more information and contacts, please visit the LinkedData.Center website.

2 - Installation

How to install the SDaaS platform

Since version 2.1.0, the deployment of the SDaaS platform is based on Docker. The SDaaS Enterprise Edition can also be installed on a standalone host (virtual or physical). If you do not already have Docker on your computer, it’s the right time to install it.

Requirements

SDaaS is distributed as a Docker image. For SDaaS to operate effectively, certain prerequisites are necessary:

  1. Docker Orchestrator: A capable Docker orchestrator such as Docker Compose or Kubernetes is essential. This orchestrator manages the deployment and execution of Docker images, ensuring efficient resource utilization and scalability.
  2. Accessible Graph Store: The SDaaS platform requires access to at least one graph store. This store serves as the repository for the graph data and must be accessible by the SDaaS platform for storing and retrieving data as needed.
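
As a minimal sketch of these prerequisites, assuming plain docker commands in place of a full orchestrator, the setup used throughout this documentation can be prepared as follows:

# create a shared network, start a compliant graph store, then start the SDaaS platform
docker network create myvpn
docker run --network myvpn --name kb -d linkeddatacenter/sdaas-rdfstore
docker run --network myvpn --rm -ti linkeddatacenter/sdaas-ce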

Docker requirements

SDaaS requires:

  • Docker version ~20.10
  • Docker Compose version ~2.17

Graph store requirements

SDaaS requires read/write access to at least one RDF Graph Store.

The RDF Graph Store must provide an http(s) service endpoint compliant with the following minimal service description:

    @prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
    sd:supportedLanguage 
        sd:SPARQL11Query, 
        sd:SPARQL11Update ;
    sd:resultFormat 
        <http://www.w3.org/ns/formats/RDF_XML>, 
        <http://www.w3.org/ns/formats/Turtle>,
        <http://www.w3.org/ns/formats/N-Triples>,
        <http://www.w3.org/ns/formats/N-Quads>
    .
    sd:resultFormat
        <http://www.w3.org/ns/formats/RDF_XML>,
        <http://www.w3.org/ns/formats/Turtle>,
        <http://www.w3.org/ns/formats/N-Triples>,
        <http://www.w3.org/ns/formats/SPARQL_Results_CSV>
    .
    sd:feature sd:UnionDefaultGraph.

The SDaaS license encompasses access to the sdaas-rdfstore graph store docker image, which is constructed upon a tailored iteration of the Blazegraph graph database. This customized version of Blazegraph is engineered to align with the specifications and demands set forth for a compliant graph store within the SDaaS framework.

SDaaS docker image

The community edition image of SDaaS is available at https://hub.docker.com/r/linkeddatacenter/sdaas-ce

With your license you will receive a repository URL and a key that allow you to download the sdaas-ee docker image. Before using SDaaS, you need to log in to the LinkedData.Center repository using the received key:

docker login <sdaas repository URL you received>

Customizing your SDaaS docker image

You can personalize your SDaaS instance by writing your own Dockerfile, starting from this example:

# you can substitute "latest" with your preferred version id
FROM linkeddatacenter/sdaas-(ce|ee):latest
## here your docker customization

then build your docker image with the command:

docker build -t sdaas <here the path of your dockerfile>

Now you can store the created docker image in your private registry or keep it local.

Do not save your generated docker image in a public registry nor distribute it: it contains information about your license, and doing so breaks the license agreement.

The Enterprise Edition can use any RDF store or service compatible with the SPARQL 1.1 specifications, e.g. LinkedData.Center SGaaS, Blazegraph, Stardog, AWS Neptune, Virtuoso, etc.

Best security practices suggest running the SDaaS platform and the RDF store in a dedicated VPN, but this is not mandatory: you are free to adopt your preferred network topology.

Configuration variables

These variables are defined by the SDaaS docker image; you can use them, but they should be considered read-only:

Platform variable | Default | Description
AGENT_ID | defined at script boot | a unique id for the SDaaS running session
SDAAS_INSTALL_DIR | /opt/sdaas | Root of the distribution of the SDaaS platform modules
SDAAS_WORKSPACE | /workspace | Default working directory
SDAAS_ETC | /etc/sdaas | Where internal configuration files are located
SDAAS_REFERENCE_DOC | https://linkeddata.center/sdaas | Base URL for command documentation (http/https)

These variables have a global scope and can be changed at runtime in the docker instance or inside a script:

Configuration variable | Default | Description
SDAAS_APPLICATION_ID | Community Edition | Used in the HTTP protocol agent signature
SD_LOG_PRIORITY | 5 | Sets the default logging priority (NOTICE)
SD_ABORT_ON_FAIL | false | Abort scripts on command failure
SD_DEFAULT_CONTEXT | commands’ context defaults | allows specifying key=value couples that modify the SDaaS commands’ behavior
STORE | http://kb:8080/sdaas/sparql | Default graph store SPARQL endpoint
STORE_TYPE | w3c | Default store engine driver

Each SDaaS command has a local context that can be overridden by setting the SD_DEFAULT_CONTEXT global configuration variable or through command options and operands.

For instance, after setting SD_DEFAULT_CONTEXT="sid=MYSTORE", all subsequent calls to SDaaS commands will use the sid MYSTORE instead of the default STORE.
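
For example, a minimal sketch (the MYSTORE endpoint URL below is hypothetical):

# declare a second sid and make it the default for all subsequent commands
MYSTORE="http://kb2:8080/sdaas/sparql"
MYSTORE_TYPE="w3c"
SD_DEFAULT_CONTEXT="sid=MYSTORE"

# this now reports the size of MYSTORE instead of STORE
sd store size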

3 - Getting started

A quick tour of the SDaaS platform

The SDaaS Platform offers a programmatic approach to building and updating Knowledge Graphs. It includes a language and a command-line interface (CLI), offering optimized access to one or more RDF graph stores.

The subsequent chapters assume you’ve installed the SDaaS Platform and have some familiarity with the bash shell, Docker, and SPARQL. Additionally, you should possess a basic understanding of the key concepts and definitions outlined in the Knowledge Exchange Engine Specifications (KEES).

Connecting to an RDF Graph Store

SDaaS requires access to an RDF SPARQL service: let’s launch a graph store using a public docker image in a docker network named myvpn

docker network create myvpn
docker run --network myvpn --name kb -d linkeddatacenter/sdaas-rdfstore

This will run in the background a small, full-featured RDF Graph Store instance compliant with SDaaS requirements.

Now you can get SDaaS prompt with:

docker run --network myvpn --rm -ti sdaas 

Your terminal will show the SDaaS command prompt as an extension of the bash shell:

         ____  ____              ____  
        / ___||  _ \  __ _  __ _/ ___| 
        \___ \| | | |/ _` |/ _` \___ \ 
         ___) | |_| | (_| | (_| |___) |
        |____/|____/ \__,_|\__,_|____/ 

        Smart Data as a Service platform - Pitagora
        Community Edition 4.0.0 connected to http://kb:8080/sdaas/sparql (w3c)

        Copyright (C) 2018-2024 LinkedData.Center
        more info at https://linkeddata.center/sdaas

sdaas >

What is happening behind the scene?

SDaaS needs to connect to a graph store; to create such a connection you have to specify a sid (store ID), that is, an environment variable containing the URL of a SPARQL service endpoint for the graph store. By convention, the default sid is named STORE. The SDaaS platform comes out-of-the-box configured with a default sid STORE=http://kb:8080/sdaas/sparql that you can change at any moment.

All SDaaS commands that require access to a graph store provide the -s <sid> option. If omitted, the SDaaS platform will use the sid named STORE.

Each sid requires a driver specified by the variable <sid>_TYPE. For instance, the store engine driver for STORE is defined in STORE_TYPE. By default SDaaS uses the w3c driver, which is suitable for any standard SPARQL service implementation. In addition to the standard driver, SDaaS Enterprise Edition provides some optimized drivers.

For instance, given that the linkeddatacenter/sdaas-rdfstore Docker image is based on the Blazegraph engine, in the Enterprise Edition you have the flexibility to utilize an optimized driver. To enable it, set STORE_TYPE=blazegraph.
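
For example, a hedged sketch of switching drivers inside an Enterprise Edition session:

# switch the default sid to the optimized Blazegraph driver (Enterprise Edition only)
STORE_TYPE=blazegraph

# check the resulting configuration
sd view config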

The first look to the platform

SDaaS provides a set of commands to introspect the platform. Try typing:

# to get the platform version:
sd view version

# to see SDaaS configuration variables:
sd view config

# to list all installed platform modules. Modules are cached on first use.  The cached modules are flagged with "--cached".
sd view modules

# to see all commands exported by the view module:
sd view module view

# to download the whole SDaaS Language profile in turtle RDF serialization
sd view ontology -o turtle

See the Calling SDaaS commands section in the Application building guide to learn more about SDaaS commands.

Boot the knowledge base

Clean up a knowledge graph using the sd store erase command (be careful, this zaps your default knowledge graph):

sd store erase

You can verify that the store is empty with

sd store size

KEES compliance:

The command sd kees boot (only Enterprise Edition) uses an optimized algorithm to cleanup the knowledge graph adding all metadata required by KEES specifications.

Ingest facts

To ingest RDF data, you have some options available:

  • using the sd sparql update command to execute SPARQL Update operations for server-side ingestion;
  • using an ETL pipeline with the sd sparql graph command to transfer and load a stream of RDF triples to a named graph in the graph store;
  • using the learn module that provides some specialized shortcuts to ingest data and KEES metadata.

Using SPARQL update

The sd sparql update command is executed by the SPARQL service. Therefore, the resource must be visible to the graph store server. For example, to load the entire definition of schema.org:

echo 'LOAD <https://schema.org/version/latest/schemaorg-current-http.ttl> INTO GRAPH <urn:graph:0>' | sd sparql update 

Using SPARQL graph

The sd sparql graph command is executed by the SDaaS processor, which stores a stream of RDF triples (in nTriples serialization) into a named graph inside the graph store. This command offers increased control over the resource by allowing enforcement of the resource type. SDaaS optimizes the transfer of resource triples to the graph store using the most driver-optimized method.

In an ETL process, this command realizes the load stage. It is typically used in a piped command.

Some examples:

# get data from a command
sd view ontology \
        | sd sparql graph urn:sdaas:tbox


# get data from a local file
sd_cat mydata.nt \
        | sd sparql graph


# retrieve linked data from a ntriple remote resource
sd_curl -s -f https://schema.org/version/latest/schemaorg-current-http.nt \
        | sd sparql graph

# retrieve RDF data serialized with turtle
sd_rapper -i turtle https://dbpedia.org/data/Milan.ttl \
        | sd sparql graph https://dbpedia.org/resource/Milan


# retrieve a linked data resource with content negotiation
ld=https://dbpedia.org/resource/Lecco
sd_curl_rdf $ld \
        | sd_rapper -g - $ld \
        | sd sparql graph -a PUT $ld

# same as above but with KEES metadata
sd_curl_rdf $ld \
        | sd_rapper -g - $ld \
        | sd kees metadata -D "activity_type=Learning source=$ld trust=0.8" $ld \
        | sd sparql graph -a PUT $ld

sd_curl, sd_curl_rdf, sd_rapper, sd_cat are just wrappers for standard bash commands to trap and log errors.

The sd sparql graph command supports two methods for graph accrual:

  • -a PUT to overwrite a named graph; it creates a new named graph if needed
  • -a POST (default) to append data to a named graph; it creates a new named graph if needed.

The gsp driver implementation is capable of utilizing the SPARQL 1.1 Graph Store Protocol (GSP). To enable this support, define the driver type <sid>_TYPE=gsp and set <sid>_GSP_ENDPOINT to point to the URL of the service providing the Graph Store Protocol.
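
For example, a minimal sketch of enabling GSP support for the default sid (the GSP endpoint URL below is an assumption; check your graph store documentation for the real one):

# use the Graph Store Protocol driver for the default sid
STORE_TYPE=gsp
STORE_GSP_ENDPOINT="http://kb:8080/sdaas/gsp"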

WARNING: many graph store engines have limitations regarding the size of data ingestion when using only SPARQL update features with the default w3c driver. Whenever possible, utilize a driver optimized for your graph store or a GSP-capable endpoint.

Using the learn module (EE)

The learn module provides some shortcuts to load linked data into a graph store together with their KEES metadata.

Here are some examples that load RDF triples:

sd learn resource -D "graph=urn:dataset:dbpedia" https://dbpedia.org/resource/Milan
sd learn file /etc/app.config
sd learn dataset urn:dataset:dbpedia
sd learn datalake https://data.exemple.org/

WARNING:

If the sd learn dataset command fails, the target named graph could be incomplete, or annotated with a prov:wasInvalidatedBy property.

Querying the knowledge graph

To query the store you can use SPARQL :

cat <<-EOF | sd sparql query -o csv 
SELECT ?g (COUNT (?s) AS ?subjects) WHERE {
        GRAPH ?g {?s ?p ?o}
} GROUP BY ?g
EOF     

This command prints a CSV table with all named graphs and the number of triples they contain. The -o option specifies the format you want for the result.

The sd sparql query command, by default, outputs XML serialization. However, it allows for specification of a preferred serialization using the -o flag. Additionally, the sparql module provides convenient command aliases, e.g.:

  • sd sparql list to print a select query as csv without header on stdout
  • sd sparql rule to print the result of a SPARQL CONSTRUCT as a stream of nTriples on stdout

e.g. echo "SELECT DISTINCT ?class WHERE { ?s a ?class } LIMIT 10" | sd sparql list

Reasoning about facts

To materialize inference in the knowledge graph you have some options available:

You can use SPARQL update or a combination of piped SPARQL commands, as in the following examples:

# insert data evaluates the expression at SPARQL server side.
cat <<EOF | sd sparql update
INSERT {...} 
WHERE {...}
EOF
        

#  pipe two commands
cat <<EOF | sd sparql rule | sd sparql graph
CONSTRUCT ...
EOF
        
# pipe four commands adding metadata
cat <<EOF | sd sparql rule | sd kees metadata -D "trust=0.9" urn:graph:mycostructor | sd sparql graph urn:graph:mycostructor
CONSTRUCT ...
EOF

# same as above, using a shortcut:
cat <<EOF | sd learn rule -D "trust=0.9" urn:graph:mycostructor
CONSTRUCT ...
EOF

Using plans (EE)

The plan module provides some specialized commands to run SDaaS scripts.

You can think of a plan as similar to a stored procedure in a SQL database.

For example, assume that STORE contains the following three plans:

@prefix sdaas: <http://linkeddata.center/sdaas/reference/v4#> .

<urn:myapp:cities> a sdaas:Plan;   sdaas:script """
        sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Milano
        sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Lecco
""" .

<urn:reasoning:recursive> a sdaas:Plan ; sdaas:script """
        sd learn rule 'CONSTRUCT ...'
        sd learn rule http://example.org/rules/rule1.rq 
""" .
<urn:test:acceptance> a sdaas:Plan ; sdaas:script """
        sd_curl_sparql http://example.org/tests/test1.rq | sd sparql test
        sd_curl_sparql http://example.org/tests/test2.rq | sd sparql test
""" .

Then these commands use plans to automate some common activities:

sd plan run -D "activity_type=Ingestion" urn:myapp:cities
sd plan loop -D "activity_type=Reasoning trust=0.75" urn:reasoning:recursive
sd -A plan run -D "activity_type=Publishing" urn:test:acceptance

The sd plan loop command executes a plan until there are no more changes in the knowledge base. It is useful to implement incremental or recursive reasoning rules.

Managing the Knowledge Base Status (EE)

You can signal the publication status of a specific knowledge base using the KEES status properties.

For setting, getting, and checking the status of a specific window in the knowledge graph, use:

# prints the date of the last status changes:
sd kees date published

# test a status
sd kees is published || echo "Knowledge Graph is not Published"
sd kees is stable || echo "Knowledge Graph is not in a stable status"

Connecting to multiple RDF Graph Stores

You can direct the SDaaS platform to connect to multiple RDF store instances, using standard or optimized drivers:

AWS="http://mystore.example.com/sparql"
AWS_TYPE="neptune"
WIKIDATA="https://query.wikidata.org/sparql"
WIKIDATA_TYPE="w3c"

This allows importing or reasoning using specialized SPARQL endpoints. For instance, the following example imports all cats from Wikidata into the default graph store and then lists the first five cat names:

cat <<EOF | sd sparql rule -s WIKIDATA | sd sparql graph
DESCRIBE ?item WHERE { 
        ?item wdt:P31 wd:Q146 
} 
EOF

cat <<EOF | sd sparql list
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?cat WHERE {
        ?item wdt:P31 wd:Q146; rdfs:label ?cat
        FILTER( LANG(?cat)= "en")
} ORDER BY ?cat LIMIT 5 
EOF

Scripting

You can create a bash script containing various SDaaS commands.
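
For example, a minimal sketch of such a script, assuming it runs inside the sdaas container:

#!/usr/bin/env bash
source $SDAAS_INSTALL_DIR/core

# print the size of the default knowledge graph
sd store size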

Refer to the Application building guide for more info about SDaaS scripting

Quitting the platform

When you type exit you can safely destroy the sdaas container; the created data will persist in the external store.

Free allocated docker resources by typing:

docker rm -f kb
docker network rm myvpn

4 - Guide to Application building

Best practices to build SDaaS applications.

SDaaS™ is a software platform that helps to build Semantic Web Applications and Smart Data Platforms.

SDaaS assists in constructing and managing a knowledge graph that organizes linked data resources annotated with application-specific semantics.

Instead of accessing various data silos, applications utilize the knowledge graph as a reliable central repository, offering a semantic query service. This repository contains a semantically enriched replica of all the data necessary for the application. Because of the inherently distributed nature of the web and the continuous changes in data, an application using the SDaaS platform adopts the Eventual Consistency model. This model is highly popular today as it represents a reasonable trade-off between performance and complexity.

What is a semantic web application

A semantic web application is a software system designed to leverage and utilize the principles and technologies of the Semantic Web.

These applications utilize linked data, ontologies, and metadata to create richer connections between different pieces of information on the internet. They typically involve:

  1. Structured Data Representation: Semantic web apps use RDF (Resource Description Framework) to represent data in a structured and machine-readable format. This allows for better understanding and interpretation of relationships between different data points.

  2. Ontologies and Vocabularies: They employ ontologies and vocabularies (such as OWL - Web Ontology Language) to define relationships and meaning between entities, making it easier for systems to understand the context of the data.

  3. Data Integration and Interoperability: These applications facilitate data integration from diverse sources, enabling different systems to exchange and use information more effectively.

  4. Inference and Reasoning: Semantic web apps can perform logical inference and reasoning to derive new information or insights from existing data based on defined rules and relationships.

  5. Enhanced Search and Discovery: They enable more sophisticated search functionalities by understanding the semantics of the data, providing more relevant and contextualized results.

In summary, a semantic web application harnesses Semantic Web technologies to enable machines to comprehend and process data more intelligently, facilitating better data integration, discovery, and utilization across various platforms and domains.

What is a smart data platform

A Smart Data Platform refers to a technological infrastructure designed to collect, process, analyze, and leverage data intelligently to generate insights, make decisions, and power various applications or services. These platforms often incorporate advanced technologies such as artificial intelligence (AI), machine learning (ML), data analytics, and automation to handle vast amounts of data from diverse sources.

A Smart Data Platform typically integrates multiple functionalities, including data ingestion, storage, processing, analysis, visualization, and often includes features for data governance, security, and compliance.

What is Eventual Consistency

Eventual Consistency is a concept in distributed computing where, in a system with multiple replicas of data, changes made to the data will eventually propagate through the system and all replicas will converge to the same state. However, this convergence is not instantaneous; it occurs over time due to factors like network latency, system failures, or concurrent updates. The Knowledge Graph can be considered as a semantically enriched replica of the ingested distributed data

The typical SDaaS user is a DevOps professional who utilizes the commands provided by the platform to script the building and updating of a knowledge graph. This knowledge graph is then queried by an application using SPARQL or REST APIs. SDaaS developers and system integrators can extend the platform by adding custom modules and creating new commands.

In more detail, the typical SDaaS use case scenario is summarized by the following diagram:

cloud "Linked Data cloud" as data
usecase "SDaas script\ndevelopment" as writesScript
usecase "smart data service\ndeployment" as managesSDaaS
usecase "application development" as developsApplication
usecase "queries\nKnowledge Graph" as usesKnowledge
usecase "installs\nSDaaS modules" as installsSDaaS
usecase "configure\nSDaaS" as configuresSDaaS
usecase "knowledge update" as updatesKnowledge
actor "App devops" as user
package "SDaaS distribution" as Distribution <<Docker image>>
node "smart data service" as SDaaS {
    component "SDaaS script" as Script
    package Module {
        component "SDaaS Command" as Command
        interface ConfigVariable
    }
}
database "Knowledge Graph" as Store
node Application

user .. developsApplication
user .. writesScript
user .. managesSDaaS
Command o-> ConfigVariable : uses
writesScript .. Script

managesSDaaS -- installsSDaaS
managesSDaaS -- configuresSDaaS
configuresSDaaS .. ConfigVariable 
installsSDaaS .. Module 
Command .. updatesKnowledge 
data . updatesKnowledge
updatesKnowledge . Store

Script o--> Command : calls
Distribution .. installsSDaaS

Application .. usesKnowledge

developsApplication .. Application
usesKnowledge .. Store

Calling SDaaS commands

The SDaaS Platform operates through a set of bash commands and functions. The general syntax to call a SDaaS command is sd <module> <name> [OPTIONS] [OPERANDS], while the syntax of an SDaaS function is sd_<name>.

The modules are bash script fragments that define a set of SDaaS functions, providing a namespace for them.

Before calling an SDaaS function, you must explicitly load its module with the sd_include <module> core function. Core functions are contained in the core module that is loaded at startup. SDaaS commands automatically include the required modules.
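
For example, a minimal sketch, assuming the view module follows the standard command-to-function naming convention (direct function calls like this are normally reserved for module implementations):

# load the view module explicitly, then call one of its functions directly
sd_include view
sd_view_version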

SDaaS commands MAY depend on a set of context variables you can pass using options. The global configuration variable SD_DEFAULT_CONTEXT provides a default local context used by all commands.

For instance these calls are all equivalent:

sd sparql graph urn:myapp:abox
sd sparql graph -s STORE -D "graph=urn:myapp:abox"
sd sparql graph -D "sid=STORE" -D "graph=urn:myapp:abox"
sd sparql graph -D "sid=STORE graph=urn:myapp:abox"
sd sparql graph -D "sid=OVERRDEN_BY-s graph=urn:myapp:overridden_by_operand"  -s STORE  urn:myapp:abox
SD_DEFAULT_CONTEXT="sid=STORE graph=urn:myapp:abox"; sd sparql graph

SDaaS scripting

The smart data service usually includes an SDaaS script and an application config file. The SDaaS script is a normal bash script that includes the SDaaS platform with the command source $SDAAS_INSTALL_DIR/core

Usually you create an application config file that contains the definition of the datasets and rules used by the ingestion, reasoning and publishing plans. For instance:

#!/usr/bin/env bash
source $SDAAS_INSTALL_DIR/core

sd store erase

## loads the language profile and the application specific configurations
sd view ontology | sd sparql graph urn:tbox
sd_curl -s -f https://schema.org/version/latest/schemaorg-current-http.nt | sd sparql graph urn:tbox

## loading some facts from dbpedia
for ld in https://dbpedia.org/resource/Lecco https://dbpedia.org/resource/Milan; do
	sd_curl_rdf $ld | sd_rapper -g - $ld | sd sparql graph urn:abox
done

The script MAY implement a never-ending loop, similar to this pseudo-code using the SDaaS Enterprise Edition Platform:

#!/usr/bin/env bash
source $SDAAS_INSTALL_DIR/core # Loads the SDaaS platform
while NEW_DATA_DISCOVERED ; do 

    # Boot and lock platform ######################
    sd kees boot -L 

    ## loads the language profile and the application specific configurations
    sd -A view ontology | sd sparql graph urn:myapp:tbox
	sd learn file /etc/myapp.config

    ## loading facts
    sd learn dataset -D "activity_type=Learning trust=0.9" urn:myapp:facts

    # reasoning window loop #########################
    sd plan loop -D "activity_type=Reasoning trust=0.9" urn:myapp:reasonings
	

    # publishing window  ########################
    sd -A plan run -D "activity_type=Publishing" urn:myapp:tests

    sd kees unlock

    sleep $TIME_SLOT
done

Application architectures enabled by SDaaS

In this chapter you find some typical architectures that are enabled by SDaaS

Use case 1: autonomous agent

An ETL agent that transforms raw data into linked data:

Folder "Raw data" as Data
Folder "Linked data" as RDF

note right of RDF: RDF data according to\nan application language profile

package "ETL application" #aliceblue;line:blue;line.dotted;text:blue {
    node "Autonomous Agent" as aa #white;line:blue;line.dotted;text:blue
    database "graph store" as graphStore

    aa -(0 graphStore : run mapping\nrules
}

Data ..> aa
aa ..> RDF

The autonomous agent uses SDaaS to upload raw data in an intermediate form to a graph store and to use SPARQL rules to map the intermediate format into the application language profile.

Use case 2: linked data platform

The SDaaS platform is used to implement an agent that transforms and loads raw data into a knowledge graph, performing some ontology mappings and providing a Linked Data Platform interface to applications. It is compatible with the SOLID protocol.

cloud "LinkedOpen Data Cloud" as Data

package "LOD smart cache" #aliceblue;line:blue;line.dotted;text:blue {
    node "Autonomous\nDiscovery Agent" as DiscoveryAgent #white;line:blue;line.dotted;text:blue
    database "graph store" as DataLake
    node "Linked data Proxy" as DataLakeProxy

    
    DiscoveryAgent -(0 DataLake : writes RDF data
    DataLake 0)- DataLakeProxy 
}

Data ..> DiscoveryAgent
DataLakeProxy -LDP

The linked-data proxy is a standard component providing support for the VoID ontology and HTTP cache features. LinkedData.Center provides a free open source implementation that can be used out-of-the-box or as a reference implementation for this component.

Use case 3: smart data warehouse

The typical SDaaS application architecture to build an RDF based data warehouse is the following:

cloud "1st, 2nd and 3rd-party raw data" as Data

package "data management platform" #aliceblue;line:blue;line.dotted;text:blue {
    node "Fact provider" as DiscoveryAgent #white;line:blue;line.dotted;text:blue
    folder "Linked-data lake" as DataLake
    node "smart data service" as SDaaSApplication #white;line:blue;text:blue
    database "RDF\nGraph Store" as GraphStore
    
    DiscoveryAgent --(0 DataLake : writes RDF data
    DataLake 0)-- SDaaSApplication : learn data
    SDaaSApplication --(0 GraphStore : updates


    note right of DiscoveryAgent
    Here the application injects
    its specific semantic in raw data
    end note

    note left of SDaaSApplication
    Here KEES cycle
	is implemented
    end note
}

interface "SPARQL QUERY" AS SQ
GraphStore - SQ

package "application" {
    node "Application backend" as Backend
    node "Application frontend" as Frontend
    database "application local data" as firstPartyData

 
    Backend 0)- Frontend : calls
    Backend --( SQ : queries
    firstPartyData 0)-- Backend  : writes
}

Data ..> DiscoveryAgent : gets raw data
Data <..... firstPartyData : copy 1st-party data

You can distinguish two distinct threads: the development of a data management platform and the development of the application. The knowledge graph built in the data platform is used by the application as the primary source of all data. The data produced by the application can be reinjected into the data management platform.

The SDaaS platform is used in the development of the data management platform, primarily in the development of the smart data service and optionally in the Autonomous Discovery Agent.

More in detail, the main components of the data platform are:

Autonomous Discovery Agent
It is an application-specific ETL process triggered by changes in data. This process transforms raw data into linked data annotated with information recognized by the application and stores it in a linked-data lake. Multiple Autonomous Discovery Agents may exist and operate concurrently. Each agent can access the Graph Store to identify enrichment opportunities or to detect changes in the data source.
Linked-data lake
It is a repository, for example an S3 bucket or a shared file system, that contains RDF files, that is, Linked Data Platform RDF Sources [LDP-RS] expressed with a known language profile. These files can be mirrors of existing web resources, mappings of databases, or even private data described natively in RDF.
smart data service
It is a service that includes the SDaaS platform and contains a script processing data in conformance with the KEES specifications.
RDF Graph Store
It implements the Knowledge Graph supporting the SPARQL protocol interface. LinkedData.Center provides a free, full-featured RDF graph database you can use to learn and test the SDaaS platform.

Use case 4: Smart data agent

All activities are performed by the same agent, which embeds its workflow.

cloud "3rd-party data" as ldc
database "Knowledge Graph" as GraphStore
folder "Reports" as RawData

Folder "Linked data lake" as ldl

node "Smart data agent" as agent #white;line:blue;text:blue

ldl --> agent
ldl <-- agent
ldc -> agent

agent --> GraphStore
agent -> RawData

The workflow is just a definition of activities that should be completed by the agent.

Use case 5: semantic web agency

In this architecture, multiple agents run at the same time, coordinated using the knowledge graph status and locks.

cloud "3rd-party data" as ldc
database "Knowledge Graph" as GraphStore
folder "reports" as RawData


note bottom of RawData: can be used as raw data

Folder "Linked data lake" as ldl
note top of ldl: contains activity plans

package "Agency" #aliceblue;line:blue;line.dotted;text:blue {
    node "Smart data agent" as Ingestor #white;line:blue;text:blue
    node "Reasoning\nagent(s)" as Reasoner #white;line:blue;line.dotted;text:blue
    node "Enriching\nagent(s)" as Enricher #white;line:blue;line.dotted;text:blue
    node "Publishing\nagent(s)" as Publisher #white;line:blue;line.dotted;text:blue
}

ldl --> Ingestor
ldl <-- Enricher
ldc --> Enricher

Ingestor --> GraphStore
Reasoner <--> GraphStore
Publisher <-- GraphStore
Enricher <-- GraphStore
Publisher --> RawData

The workflow is just a definition of activities that should be completed by the agents.

5 - Customizing the SDaaS platform

Learn how to tailor the platform to your needs

5.1 - Architecture overview

The components of the SDaaS platform. Start from here.

The SDaaS platform is utilized for creating smart data platforms as backing services, empowering your applications to leverage the full potential of the Semantic Web.

The SDaaS platform out-of-the-box contains an extensible set of modules that connect to several Knowledge Graphs through optimized driver modules.

A smart data service conforms to the KEES specifications and is realized by a customization of the SDaaS docker image. It contains one or more scripts calling a set of commands implemented by modules. The command behavior can be modified through configuration variables.

node "SDaaS platform " as SDaaSDocker <<docker image>>{
    component Module {
        collections "commands" as Command
        collections "configuration variables" as ConfigurationVariable
    }
}
node "smart data service" as SDaaSService <<docker image>>{
    component "SDaaS Script" as App
}
database "Knowledge Graphs" as GraphStore
cloud "Linked Data" as Data


SDaaSService ---|> SDaaSDocker : extends
Command ..(0 Data : HTTP
Command ..(0 GraphStore : SPARQL
Command o. ConfigurationVariable
App --> Command : calls
App --> ConfigurationVariable : set/get

It is possible to add new modules to extend the power of the platform to match special needs.

Data Model

The SDaaS data model is designed around a few concepts: Configuration Variables, Functions, Commands, and Modules.

SDaaS configuration variables

A SDaaS configuration variable is a bash environment variable that the platform uses as a configuration option. Configuration variables have a default value; they can be changed statically in the Docker image or at runtime via docker run, Docker orchestration, or user scripts.
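
For example, a hedged sketch of overriding configuration variables at container start and inside a script:

# at container start: raise the logging priority to DEBUG
docker run --network myvpn --rm -ti -e SD_LOG_PRIORITY=7 sdaas

# inside a script: abort on the first command failure
SD_ABORT_ON_FAIL=true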

The following taxonomy applies to SDaaS configuration variables:

class "SDaaS Configuration variable" as ConfigVariable
class "SID Variable" as SidVariable
class "Platform Variable" as PlatformVariable <<read only>>
interface EnvironmentVariable

ConfigVariable --|> EnvironmentVariable
SidVariable --|> ConfigVariable
ConfigVariable <|- PlatformVariable

Environment Variable

It is a shell variable.

Platform variable

It is a variable defined by the SDaaS docker image that should not be changed outside the Dockerfile.

SID Variable

It is a special configuration variable that states a graph store property. The general syntax is <sid>_<var name>. For example, the variable STORE_TYPE refers to the driver module name that must be used to access the graph store connected by the sid STORE. Some drivers can require/define other sid variables with their default values.
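
For example, a minimal sketch declaring a new sid (the CATALOG endpoint URL is hypothetical):

# a new sid named CATALOG, using the standard w3c driver
CATALOG="https://example.org/sparql"
CATALOG_TYPE="w3c"

# store commands can now address it with the -s option
sd store size -s CATALOG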

See all available configuration variables in the installation guide

SDaaS Functions

An SDaaS function is a bash function embedded in the platform. For example, sd_log. A bash function accepts a fixed set of positional parameters, writes output on stdout, and returns 0 on success or an error code otherwise.

The following taxonomy applies to SDaaS functions:

Interface "Bash function" as BashFunction 
Class "Sdaas function" as SdaasFunction
Class "Driver virtual function" as DriverVirtualFunction
Class "Driver function implementation" as DriverFunction
SdaasFunction --|> BashFunction
DriverFunction --|> SdaasFunction
DriverVirtualFunction --|> SdaasFunction

DriverFunction <- DriverVirtualFunction: calls

Bash Function

Is the interface of a generic function defined in the scope of a bash process.

Driver virtual function

It is a function that acts as a proxy for a driver method; its first parameter is always the sid. A driver virtual function has the syntax sd_driver_<method name> (e.g. sd_driver_load) and requires a set of fixed positional parameters.

Driver method implementation

It is a function that implements a driver virtual function for a specific graph store engine driver. A driver method implementation has the syntax sd_<driver name>_<method name> (e.g. sd_w3c_load) and expects a set of fixed positional parameters (unchecked). Driver function implementations should be called only by a driver virtual function.

SDaaS commands

A Command is a SDaaS function that conforms to the SDaaS command requirements. For example, sd_sparql_update. A command writes output on stdout, logs on stderr, and returns 0 on success or an error code otherwise.

In a script, SDaaS commands should be called through the sd function using the following syntax: sd <module name> <function name> [options] [operands]. The sd function allows flexible output error management and auto-includes all required modules. Direct calls to the command functions should be done only inside module implementations.

For instance calling sd -A sparql update executes the command function sd_sparql_update including the sparql module and aborting the script in case of error. This is equivalent to sd_include sparql; sd_sparql_update || sd_abort

The following taxonomy applies to commands:

Class Command
Class "Facts Provision" as DataProvidingCommand
Class "Ingestion Command" as IngestionCommand
Class "Query Command" as QueryCommand
Class "Learning Command" as LearningCommand
Class "Reasoning Command" as ReasoningCommand
Class "Enriching Command" as EnrichingCommand
Class "SDaaS function" as SDaaSFunction
Class "Store Command" as StoreCommand
Class "Compound Command" as CompoundCommand

Command --|>SDaaSFunction
DataProvidingCommand --|> Command
StoreCommand --|> Command
IngestionCommand --|> StoreCommand
QueryCommand --|> StoreCommand
LearningCommand -|> CompoundCommand
LearningCommand --|> DataProvidingCommand
LearningCommand --|> IngestionCommand
CompoundCommand <|- ReasoningCommand
ReasoningCommand --|> IngestionCommand
ReasoningCommand --|> QueryCommand
EnrichingCommand --|> ReasoningCommand
EnrichingCommand --|> LearningCommand

Compound Command

It is a command resulting from the composition of two or more commands, usually in a pipeline.

Facts Provision

It is a command that provides RDF triples in output.

Query Command

It is a command that can extract information from the knowledge graph.

Ingestion Command

It is a command that stores facts into a knowledge graph.

Reasoning Command

It is a command that both queries and ingests data into the same knowledge graph according to some rules.

Learning Command

It is a command that provides and ingests facts into the knowledge graph.

Enriching Command

It is a command that queries the knowledge base, discovers new data, and injects the results back into the knowledge base.

Store Command

It is a command that interacts with a knowledge base. It accepts the -s SID and -D "sid=SID" options.

SDaaS modules

A SDaaS module is a collection of commands and configuration variables that conforms to the module building requirements.

You can explicitly include a module's content with the sd_include command.

The module taxonomy is depicted in the following image:

class "SDaaS module" as Module
Abstract "Driver" as AbstractDriver
Class "Core Module"  as Core
Class "Driver implementation" as DriverImplementation
Class "Command Module" as CommandModule
Class "Store module" as StoreModule

Core --|> Module
CommandModule --|> Module
AbstractDriver --|> Module
DriverImplementation <- AbstractDriver : calls
StoreModule --|> CommandModule
StoreModule --> AbstractDriver : includes
Core <- CommandModule : includes

Command Module

Modules that implement a set of related SDaaS commands. They always include the Core Module and can depend on other modules.

Core Module

A singleton module that exposes core commands and must be loaded before using any SDaaS feature.

Driver

A singleton module that exposes the abstract Driver function interface to handle connections with a generic graph store engine.

Driver implementation

Modules that implement the function interface exposed by the Abstract Driver for a specific graph store engine.

Store Module

Modules that export store commands that connect to a graph store using the functions exposed by the Driver module. A store module always includes the driver module.

The big picture

The resulting SDaaS platform data model big picture is:

package "User Application" { 
    class "User Application" as Application
}
package "SDaaS platform" #aliceblue;line:blue;line.dotted;text:blue { 
    class Command
    class Module
    class "SDaaS function" as SDaaSFunction
    Abstract "Driver" as Driver
    class ConfigVariable
    Abstract "SID Variable" as SidVariable

    interface "bash function" as Function
    interface EnvironmentVariable
}
package "Backing services" {
    interface "Backing service" as BakingService
    interface "Graph Store" as GraphStore
    interface "Knowledge Graph" as KnowledgeGraph
    interface "Linked Data Platform" as RDFResource
}
package "smart data service" #aliceblue;line:blue;line.dotted;text:blue {
    class "SDaaS script" as Script
    class "smart data Service" as SDaaS 
}

KnowledgeGraph : KEES compliance

GraphStore : SPARQL endpoint
RDFResource : URL 
Command --|> SDaaSFunction
SDaaSFunction --|> Function
ConfigVariable --|> EnvironmentVariable
SidVariable --|> ConfigVariable
Driver --|> Module
GraphStore --|> BakingService
RDFResource --|> BakingService
KnowledgeGraph --|> GraphStore

BakingService <|-- SDaaS

Module *-- Command : exports
Module *-- ConfigVariable : declares

Command o.. RDFResource : learns

Driver --> KnowledgeGraph : connects
Module .> Module : includes

Function --o Script
EnvironmentVariable --o Script
Script -o SDaaS : contains
SidVariable <- Driver : uses
ConfigVariable .o Command : uses

Application ..> KnowledgeGraph : access

Backing service

It is a type of software that operates on servers, handling data storage, resource publishing, and processing tasks for an application.

Graph Store

It is a backing service that provides support for the SPARQL protocol and optional support for the Graph Store Protocol.

Knowledge Graph

It is a Graph Store compliant with the KEES specification.

Linked Data Platform

It is a web informative resource that exposes RDF data in one of the supported serializations, according to the W3C LDP specifications.

SDaaS script

It is a bash script that uses SDaaS commands.

smart data service

It is a backing service that includes the SDaaS platform and implements one or more SDaaS scripts.

User application

It is a (Semantic Web) Application that uses a knowledge graph.

5.2 - Module building

how to extend the platform creating new modules

Conformance Notes

This chapter defines some conventions to extend the SDaaS platform.

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words may not always appear in uppercase letters in this specification.

Module Requirements

An SDaaS module is a bash script file fragment that complies with the following restrictions:

  • It MUST be placed in the SDAAS_INSTALL_DIR directory or in the $HOME/modules directory. Modules defined in the $HOME/modules directory take precedence over the ones defined in the SDAAS_INSTALL_DIR directory.
  • Its name MUST match the regular expression ^[a-z][a-z0-9-]+$.
  • The first line of the module MUST be if [[ ! -z ${__module_MODULENAME+x} ]]; then return ; else __module_MODULENAME=1 ; fi where MODULENAME is the name of the module.
  • all commands defined in the module SHOULD match sd_MODULENAME_FUNCTIONNAME, where MODULENAME is the name of the module that contains the command and FUNCTIONNAME is a unique name inside the module. Rewriting existing commands is allowed but discouraged.
  • all commands MUST follow the syntax conventions described below.

A module CAN contain:

  1. Constants (read-only variables): constants SHOULD be prefixed by SDAAS_ and MUST have a unique name in the platform.
  2. Configuration variables: configuration variables SHOULD be prefixed by SD_ and MUST have a unique name in the platform.
  3. Module commands definition.
  4. Module initialization: a set of bash commands that always run on module loading.

For example, see the modules implementation in SDaaS community edition
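
The following is a minimal, hypothetical skeleton of a module named hello that follows the requirements above (to be saved as $HOME/modules/hello):

if [[ ! -z ${__module_hello+x} ]]; then return ; else __module_hello=1 ; fi

# a configuration variable with a default value
SD_HELLO_GREETING="${SD_HELLO_GREETING:-Hello}"

# a command exported by this module
sd_hello_world() {
    echo "$SD_HELLO_GREETING from the hello module"
}

# module initialization: any commands placed here always run on module loading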

Command requirements

Commands MUST support the Utility Syntax Guidelines described in the Base Definitions volume of POSIX.1‐2017, Section 12.2, from section 3 to 10, inclusive.

All commands that accept options and/or operands SHOULD also accept the -D option. This option defines local variables as key=value pairs that provide a context for the command.

Options SHOULD conform to the following naming conventions:

-A
abort if the command return code is > 0.
-a ACCRUAL METHOD, -D "accrual_method=ACCRUAL METHOD"
accrual method, the method by which items are added to a collection. The PUT and POST methods SHOULD be recognized
-f FILE NAME, -D "input_file=FILE NAME"
to be used to refer to a local file object with a relative or absolute path
-i INPUT FORMAT, -D "input_format=INPUT FORMAT"
input format specification; these values SHOULD be recognized (from libraptor):
Format | Description
rdfxml | RDF/XML (default)
ntriples | N-Triples
turtle | Turtle Terse RDF Triple Language
trig | TriG - Turtle with Named Graphs
guess | Pick the parser to use using content type and URI
rdfa | RDF/A via librdfa
json | RDF/JSON (either Triples or Resource-Centric)
nquads | N-Quads
-h
prints a help description
-o OUTPUT FORMAT, -D "output_format=OUTPUT FORMAT"
output format specification. These values SHOULD be recognized:
  • csv
  • csv-h
  • csv-1
  • csv-f1
  • boolean
  • tsl
  • json
  • ntriples
  • xml
  • turtle
  • rdfxml
  • test
-p PRIORITY
to be used to reference a priority:
Level | Mnemonic | Explanation
2 | CRITICAL | Should be corrected immediately, but indicates failure in a primary system - fix CRITICAL problems before ALERT - an example is loss of primary ISP connection.
3 | ERROR | Non-urgent failures - these should be relayed to developers or admins; each item must be resolved within a given time.
4 | WARNING | Warning messages - not an error, but an indication that an error will occur if action is not taken, e.g. file system 85% full - each item must be resolved within a given time.
5 | NOTICE | Events that are unusual but not error conditions - might be summarized in an email to developers or admins to spot potential problems - no immediate action required.
6 | INFORMATIONAL | Normal operational messages - may be harvested for reporting, measuring throughput, etc. - no action required.
7 | DEBUG | Info useful to developers for debugging the app, not useful during operations.
-s SID, -D "sid=SID"
connect to the Graph Store named SID (STORE by default)
-S SIZE
to be used to reference a size

Evaluation of the local command context

The process of evaluating the local context for a command is as follows:

  1. First, the local context hardcoded in the command implementation is evaluated.
  2. The hardcoded local context can be overridden by the global configuration variable SD_DEFAULT_CONTEXT.
  3. The resulting context can be overridden by specific command options (e.g., -s SID). Command options are evaluated left to right
  4. The resulting context can be overridden by specific command operands.

For example, all these calls will have the same result:

  • sd_sparql_graph: the hardcoded local context ingests data into a named graph inside the graph store connected to the sid STORE, using a generated UUID URI as the name of the named graph.
  • sd_sparql_graph -s STORE $(sd_uuid): hardcoded local context overridden by specific options and operand.
  • sd_sparql_graph -D "sid=STORE": the hardcoded local context overridden by the -D option.
  • sd_sparql_graph -D "sid=STORE" -D "graph=$(sd_uuid)": the same as above but using multiple -D options (always evaluated left to right).
  • SD_DEFAULT_CONTEXT="sid=STORE"; sd_sparql_graph $(sd_uuid): hardcoded local context overridden by SD_DEFAULT_CONTEXT and operand.
  • SD_DEFAULT_CONTEXT="sid=XXX"; sd_sparql_graph -s YYY -D "sid=STORE graph=ZZZZ" $(sd_uuid): a silly command call that demonstrates the overriding precedence in local context evaluation.

5.3 - Driver building

how write a custom graph store driver

Conformance Notes

This chapter defines some conventions to extend the SDaaS platform.

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words may not always appear in uppercase letters in this specification.

Driver Requirements

  • a driver MUST be a valid SDaaS module (see Module building)
  • it MUST implement the driver interface with SDaaS functions that conform to the following naming convention: sd_<driver name>_<driver method>; for example, if you want to implement a driver for the Stardog graph store, you MUST implement the sd_stardog_query function
  • all driver functions MUST take positional parameters, with no defaults. The validity check of the parameters is the responsibility of the caller (i.e. the driver module)
  • the following methods MUST be implemented:

method name (parameters) | description
erase(sid) | erases the graph store
load(sid, graph, accrual_method) | loads a stream of nTriples into a named graph in a graph store according to the accrual policy
query(sid, mime_type) | executes in a graph store a SPARQL query read from std input, requesting in output one of the supported mime types
size(sid) | returns the number of triples in a store
update(sid) | executes in a graph store a SPARQL update request read from std input

All method parameters are strings that MUST match the following rules:

  • sid MUST match the ^[a-zA-Z]+$ regular expression and MUST point to an http(s) URL. It is assumed that sid is the name of a SID variable
  • graph MUST be a valid URI
  • mime_type MUST be a valid mime type
  • accrual_method MUST be either PUT or POST
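
For illustration only, a partial sketch of what a driver module for an imaginary mystore engine could look like, assuming that ${!sid} expands to the SPARQL endpoint URL and reusing the sd_curl wrapper described in the Getting started guide:

if [[ ! -z ${__module_mystore+x} ]]; then return ; else __module_mystore=1 ; fi

# query(sid, mime_type): execute a SPARQL query read from std input
sd_mystore_query() {
    local sid="$1" mime_type="$2"
    sd_curl -s -f -H "Accept: $mime_type" --data-urlencode "query@-" "${!sid}"
}

# size(sid): return the number of triples in the store
sd_mystore_size() {
    local sid="$1"
    echo 'SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }' | sd_mystore_query "$sid" "text/csv" | tail -n 1
}

# erase(sid), load(sid, graph, accrual_method) and update(sid) would follow the same pattern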

Implementing a new graph store driver

A driver implementation is described by the following UML class diagram:

package "SDaaS Community Edition" {
    interface DriverInterface
    DriverInterface : + erase(sid)
    DriverInterface : + load(sid, graph, accrual_method)
    DriverInterface : + query(sid, mime_type)
    DriverInterface : + size(sid)
    DriverInterface : + update(sid)
    DriverInterface : + validate(sid)

    class w3c
    class testdriver
}
package "SDaaS Enterprise Edition" {
    class blazegraph
    class gsp
    class neptune
}

testdriver ..|> DriverInterface
w3c ..|> DriverInterface 
blazegraph --|> w3c
gsp --|> w3c
neptune --|> w3c

There is a reference implementation of the driver interface known as the w3c driver compliant with W3C SPARQL 1.1 protocol specification, along with the testdriver stub driver implementation to be used in unit tests. The SDaaS Enterprise Edition offers additional drivers that are specialized versions of the w3c driver, optimized for specific graph store technologies:

  • gsp driver: a w3c driver extension that uses the SPARQL 1.1 Graph Store HTTP Protocol. It defines the configuration variable <sid>_GSP_ENDPOINT that contains the http address of the service providing a Graph Store Protocol interface.
  • blazegraph driver: an optimized implementation for the Blazegraph graph store.
  • neptune driver: an optimized implementation for the AWS Neptune service.

In commands, do not call the driver method implementation function directly. Instead, call the corresponding abstract driver module functions.
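
For example, a hedged sketch of this calling convention, assuming the size method maps to an abstract function named as described in the Architecture overview:

# inside a command implementation: call the abstract driver function...
sd_driver_size "STORE"    # dispatched to sd_w3c_size, sd_blazegraph_size, ... according to STORE_TYPE

# ...never the engine-specific implementation directly:
# sd_w3c_size "STORE"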