Getting started

A quick tour of the SDaaS platform

The SDaaS Platform offers a programmatic approach to building and updating Knowledge Graphs. It includes a language and a command-line interface (CLI), offering optimized access to one or more RDF graph stores.

The subsequent chapters assume you’ve installed the SDaaS Platform and have some familiarity with the bash shell, Docker, and SPARQL. Additionally, you should possess a basic understanding of the key concepts and definitions outlined in the Knowledge Exchange Engine Specifications (KEES).

Connecting to an RDF Graph Store

SDaaS requires access to an RDF SPARQL service: let's launch a graph store using a public Docker image in a Docker network named myvpn:

docker network create myvpn
docker run --network myvpn --name kb -d linkeddatacenter/sdaas-rdfstore

This runs in the background a small, full-featured RDF Graph Store instance compliant with the SDaaS requirements.
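
If you want to double-check that the store container is up before continuing, a standard Docker command will do:

docker ps --filter name=kb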

Now you can get the SDaaS prompt with:

docker run --network myvpn --rm -ti sdaas 

Your terminal will show the SDaaS command prompt as an extension of the bash shell:

         ____  ____              ____  
        / ___||  _ \  __ _  __ _/ ___| 
        \___ \| | | |/ _` |/ _` \___ \ 
         ___) | |_| | (_| | (_| |___) |
        |____/|____/ \__,_|\__,_|____/ 

        Smart Data as a Service platform - Pitagora
        Community Edition 4.0.0 connected to http://kb:8080/sdaas/sparql (w3c)

        Copyright (C) 2018-2024 LinkedData.Center
        more info at https://linkeddata.center/sdaas

sdaas >

What is happening behind the scenes?

SDaaS needs to connect to a graph store; to create such a connection you specify a sid (store ID), that is, an environment variable containing the URL of a SPARQL service endpoint for the graph store. By convention, the default sid is named STORE. The SDaaS platform comes out of the box configured with a default sid STORE=http://kb:8080/sdaas/sparql, which you can change at any moment.

All SDaaS commands that need access to a graph store provide the -s <sid> option. If omitted, the SDaaS platform uses the sid named STORE.
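
For example, the following two invocations should be equivalent (sd store size is introduced later in this tour):

sd store size
sd store size -s STORE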

Each sid requires a driver, specified by the variable <sid>_TYPE. For instance, the store engine driver for STORE is defined in STORE_TYPE. By default SDaaS uses the w3c driver, which is suitable for any standard SPARQL service implementation. In addition to the standard driver, the SDaaS Enterprise Edition provides some optimized drivers.

For instance, since the linkeddatacenter/sdaas-rdfstore Docker image is based on the Blazegraph engine, the Enterprise Edition lets you use an optimized driver. To enable it, set STORE_TYPE=blazegraph.
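
As a minimal sketch, assuming a purely hypothetical SPARQL endpoint URL, you could repoint the default sid and (Enterprise Edition only) select the optimized driver like this:

STORE="http://my-triplestore.example.org/sparql"   # hypothetical endpoint URL
STORE_TYPE="blazegraph"                            # Enterprise Edition only; defaults to w3c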

A first look at the platform

SDaaS provides a set of commands to introspect the platform. Try typing:

# to get the platform version:
sd view version

# to see SDaaS configuration variables:
sd view config

# to list all installed platform modules. Modules are cached on first use.  The cached modules are flagged with "--cached".
sd view modules

# to see all commands exported by the view module:
sd view module view

# to download the whole SDaaS Language profile in turtle RDF serialization
sd view ontology -o turtle

See the Calling SDaaS commands section in the Application building guide to learn more about SDaaS commands.

Boot the knowledge base

Clean up a knowledge graph using the sd store erase command (be careful: this zaps your default knowledge graph):

sd store erase

You can verify that the store is empty with:

sd store size
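
As a sketch, assuming sd store size prints the number of triples in the default store, you can also test for emptiness in a script:

test "$(sd store size)" -eq 0 && echo "the knowledge graph is empty"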

KEES compliance:

The sd kees boot command (Enterprise Edition only) uses an optimized algorithm to clean up the knowledge graph, adding all metadata required by the KEES specifications.
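
For example (Enterprise Edition only):

sd kees boot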

Ingest facts

To ingest RDF data, several options are available:

  • using the sd sparql update command to execute SPARQL Update operations for server-side ingestion.
  • using an ETL pipeline with the sd sparql graph command, to transfer and load a stream of RDF triples into a named graph in the graph store.
  • using the learn module, which provides some specialized shortcuts to ingest data and KEES metadata.

Using SPARQL update

The sd sparql update commands are executed by the SPARQL service; therefore, the resource to be loaded must be visible to the graph store server. For example, to load the entire definition of schema.org:

echo 'LOAD <https://schema.org/version/latest/schemaorg-current-http.ttl> INTO GRAPH <urn:graph:0>' | sd sparql update 
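
Any other SPARQL Update operation can be executed the same way; here is a sketch that inserts a single illustrative triple server side (the graph and resource names are just examples):

cat <<EOF | sd sparql update
INSERT DATA { GRAPH <urn:graph:0> { <urn:example:subject> <urn:example:predicate> "hello world" } }
EOF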

Using SPARQL graph

The sd sparql graph command is executed by the SDaaS processor, which stores a stream of RDF triples (in N-Triples serialization) into a named graph inside the graph store. This command offers increased control over the resource by allowing enforcement of the resource type. SDaaS optimizes the transfer of the resource triples to the graph store, using the most efficient method available for the configured driver.

In an ETL process, this command implements the load stage. It is typically used in a piped command.

Some examples:

# get data from a command
sd view ontology \
        | sd sparql graph urn:sdaas:tbox


# get data from a local file
sd_cat mydata.nt \
        | sd sparql graph


# retrieve linked data from a remote N-Triples resource
sd_curl -s -f https://schema.org/version/latest/schemaorg-current-http.nt \
        | sd sparql graph

# retrieve RDF data serialized as Turtle
sd_rapper -i turtle https://dbpedia.org/data/Milan.ttl \
        | sd sparql graph https://dbpedia.org/resource/Milan


# retrieve a linked data resource with content negotiation
ld=https://dbpedia.org/resource/Lecco
sd_curl_rdf $ld \
        | sd_rapper -g - $ld \
        | sd sparql graph -a PUT $ld

# same as above but with KEES metadata
sd_curl_rdf $ld \
        | sd_rapper -g - $ld \
        | sd kees metadata -D "activity_type=Learning source=$ld trust=0.8" $ld \
        | sd sparql graph -a PUT $ld

sd_curl, sd_curl_rdf, sd_rapper, and sd_cat are just wrappers around standard bash commands that trap and log errors.

The sd sparql graph command supports two methods for graph accrual:

  • -a PUT to overwrite a named graph; it creates a new named graph if needed
  • -a POST (default) to append data to a named graph; it creates a new named graph if needed.

The gsp driver implementation uses the SPARQL 1.1 Graph Store Protocol (GSP). To enable this support, set the driver type <sid>_TYPE=gsp and point <sid>_GSP_ENDPOINT to the URL of the service providing the Graph Store Protocol.
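
A minimal sketch, assuming a purely hypothetical GSP endpoint URL (check your graph store documentation for the real one):

STORE_TYPE="gsp"
STORE_GSP_ENDPOINT="http://kb:8080/sdaas/gsp"   # hypothetical URL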

WARNING: many graph store engines limit the size of data that can be ingested through plain SPARQL Update features with the default w3c driver. Whenever possible, use a driver optimized for your graph store or a GSP-capable endpoint.

Using the learn module (EE)

The learn module provides some shortcuts to load linked data into a graph store together with their KEES metadata.

Here are some examples that load RDF triples:

sd learn resource -D "graph=urn:dataset:dbpedia" https://dbpedia.org/resource/Milan
sd learn file /etc/app.config
sd learn dataset urn:dataset:dbpedia
sd learn datalake https://data.exemple.org/

WARNING:

If the sd learn dataset command fails, the target named graph could be incomplete or annotated with a prov:wasInvalidatedBy property.
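
As a sketch, assuming the invalidation is recorded as a prov:wasInvalidatedBy triple about the named graph (the exact KEES metadata layout may differ), you could check for invalidated graphs with an ASK query:

echo 'ASK { ?g <http://www.w3.org/ns/prov#wasInvalidatedBy> ?activity }' | sd sparql query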

Querying the knowledge graph

To query the store you can use SPARQL:

cat <<-EOF | sd sparql query -o csv
SELECT ?g (COUNT(?s) AS ?triples) WHERE {
        GRAPH ?g { ?s ?p ?o }
} GROUP BY ?g
EOF

This command prints a CSV table with all named graphs and the number of triples they contain. The -o option specifies the format you want for the result.

The sd sparql query command, by default, outputs XML serialization. However, it allows for specification of a preferred serialization using the -o flag. Additionally, the sparql module provides convenient command aliases, e.g.:

  • sd sparql list to print the result of a SELECT query as CSV, without header, on stdout
  • sd sparql rule to print the result of a SPARQL CONSTRUCT as a stream of N-Triples on stdout

e.g.echo "SELECT DISTINCT ?class WHERE { ?s a ?class} LIMIT 10" | sd sparql list

Reasoning about facts

To materialize inferences in the knowledge graph you have some options available.

You can use a SPARQL update, or pipe a SPARQL CONSTRUCT query (sd sparql rule) into other commands, as in the following examples:

# INSERT ... WHERE evaluates the expression on the SPARQL server side.
cat <<EOF | sd sparql update
INSERT {...} 
WHERE {...}
EOF
        

#  pipe two commands
cat <<EOF | sd sparql rule | sd sparql graph
CONSTRUCT ...
EOF
        
# pipe four commands adding metadata
cat <<EOF | sd sparql rule | sd kees metadata -D "trust=0.9" urn:graph:mycostructor | sd sparql graph urn:graph:mycostructor
CONSTRUCT ...
EOF

# same as above, using the learn rule shortcut:
cat <<EOF | sd learn rule -D "trust=0.9" urn:graph:mycostructor
CONSTRUCT ...
EOF

Using plans (EE)

The plan module provides some specialized commands to run SDaaS scripts.

You can think of a plan as similar to a stored procedure in a SQL database.

For example, assume that STORE contains the following three plans:

@prefix sdaas: <http://linkeddata.center/sdaas/reference/v4#> .

<urn:myapp:cities> a sdaas:Plan;   sdaas:script """
        sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Milano
        sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Lecco
""" .

<urn:reasoning:recursive> a sdaas:Plan ; sdaas:script """
        sd learn rule 'CONSTRUCT ...'
        sd learn rule http://example.org/rules/rule1.rq 
""" .
<urn:test:acceptance> a sdaas:Plan ; sdaas:script """
        sd_curl_sparql http://example.org/tests/test1.rq | sd sparql test
        sd_curl_sparql http://example.org/tests/test2.rq | sd sparql test
""" .

Then these commands use the plans to automate some common activities:

sd plan run -D "activity_type=Ingestion" urn:myapp:cities
sd plan loop -D "activity_type=Reasoning trust=0.75" urn:reasoning:recursive
sd -A plan run -D "activity_type=Publishing" urn:test:acceptance

The sd plan loop command executes a plan repeatedly until there are no more changes in the knowledge base. It is useful for implementing incremental or recursive reasoning rules.

Managing the Knowledge Base Status (EE)

You can signal the publication status of a specific knowledge base using the KEES status properties.

For setting, getting, and checking the status of a specific window in the knowledge graph, use:

# print the date of the last status change:
sd kees date published

# test a status
sd kees is published || echo "Knowledge Graph is not Published"
sd kees is stable || echo "Knowledge Graph is not in a stable status"

Connecting to multiple RDF Graph Stores

You can direct the SDaaS platform to connect to multiple RDF store instances, using standard or optimized drivers:

AWS="http://mystore.example.com/sparql"
AWS_TYPE="neptune"
WIKIDATA="https://query.wikidata.org/sparql"
WIKIDATA_TYPE="w3c"

This allows importing data or reasoning using specialized SPARQL endpoints. For instance, the following example imports all cats from Wikidata into the default graph store and then lists the first five cat names:

cat <<EOF | sd sparql rule -s WIKIDATA | sd sparql graph
DESCRIBE ?item WHERE { 
        ?item wdt:P31 wd:Q146 
} 
EOF

cat <<EOF | sd sparql list
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cat WHERE {
        ?item wdt:P31 wd:Q146; rdfs:label ?cat
        FILTER( LANG(?cat) = "en" )
} ORDER BY ?cat LIMIT 5
EOF

Scripting

You can create a bash script containing various SDaaS commands.

Refer to the Application building guide for more info about SDaaS scripting.
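
As a minimal sketch (the graph name is illustrative; see the guide for how scripts are actually invoked), such a script could simply chain commands already seen in this tour:

# boot an empty knowledge graph and load the SDaaS ontology into a TBox graph
set -e                                              # stop at the first failing command
sd store erase
sd view ontology | sd sparql graph urn:sdaas:tbox
sd store size                                       # print the resulting number of triples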

Quitting the platform

When you type exit, the sdaas container is safely destroyed (it was started with the --rm option), but the created data will persist in the external kb store.

Free the allocated Docker resources by typing:

docker rm -f kb
docker network rm myvpn