Getting started
The SDaaS Platform offers a programmatic approach to building and updating Knowledge Graphs. It includes a language and a command-line interface (CLI), offering optimized access to one or more RDF graph stores.
The subsequent chapters assume you’ve installed the SDaaS Platform and have some familiarity with the bash shell, Docker, and SPARQL. Additionally, you should possess a basic understanding of the key concepts and definitions outlined in the Knowledge Exchange Engine Specifications (KEES).
Connecting to an RDF Graph Store
SDaaS requires access to an RDF SPARQL service: let’s launch a graph store using a public Docker image in a Docker network named myvpn:
docker network create myvpn
docker run --network myvpn --name kb -d linkeddatacenter/sdaas-rdfstore
This runs in the background a small, full-featured RDF Graph Store instance that complies with the SDaaS requirements.
Now you can get the SDaaS prompt with:
docker run --network myvpn --rm -ti sdaas
Your terminal will show the SDaaS command prompt as an extension of the bash shell:
____ ____ ____
/ ___|| _ \ __ _ __ _/ ___|
\___ \| | | |/ _` |/ _` \___ \
___) | |_| | (_| | (_| |___) |
|____/|____/ \__,_|\__,_|____/
Smart Data as a Service platform - Pitagora
Community Edition 4.0.0 connected to http://kb:8080/sdaas/sparql (w3c)
Copyright (C) 2018-2024 LinkedData.Center
more info at https://linkeddata.center/sdaas
sdaas >
What is happening behind the scenes?
SDaaS needs to connect to a graph store. To create such a connection you specify a sid (store ID), an environment variable containing the URL of a SPARQL service endpoint for the graph store. By convention, the default sid is named STORE. The SDaaS platform comes out of the box configured with STORE=http://kb:8080/sdaas/sparql, which you can change at any moment.
All SDaaS commands that need to access a graph store provide the -s <sid> option. If it is omitted, the SDaaS platform uses the sid named STORE.
Each sid requires a driver, specified by the variable <sid>_TYPE. For instance, the store engine driver for STORE is defined in STORE_TYPE. By default SDaaS uses the w3c driver, which is suitable for any standard SPARQL service implementation. In addition to the standard driver, the SDaaS Enterprise Edition provides some optimized drivers.
For instance, since the linkeddatacenter/sdaas-rdfstore Docker image is based on the Blazegraph engine, in the Enterprise Edition you can use an optimized driver by setting STORE_TYPE=blazegraph.
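For illustration, here is a minimal sketch of how the sid variables fit together, using a hypothetical sid named MYSTORE that points at the store launched above:
# define a new sid and, optionally, its driver (defaults to w3c)
MYSTORE=http://kb:8080/sdaas/sparql
MYSTORE_TYPE=w3c
# pass the sid explicitly; without -s the default STORE sid is used
sd store size -s MYSTORE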
A first look at the platform
SDaaS provides a set of commands to introspect the platform. Try typing:
# to get the platform version:
sd view version
# to see SDaaS configuration variables:
sd view config
# to list all installed platform modules. Modules are cached on first use. The cached modules are flagged with "--cached".
sd view modules
# to see all commands exported by the [view module](/module/view):
sd view module view
# to download the whole SDaaS Language profile in turtle RDF serialization
sd view ontology -o turtle
See the Calling SDaaS commands section in the Application building guide to learn more about SDaaS commands.
Boot the knowledge base
Clean up a knowledge graph using the sd store erase command (be careful, this zaps your default knowledge graph):
sd store erase
You can verify that the store is empty with:
sd store size
KEES compliance:
The sd kees boot command (Enterprise Edition only) uses an optimized algorithm to clean up the knowledge graph, adding all metadata required by the KEES specifications.
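For instance, assuming the Enterprise Edition is installed, sd kees boot can be used in place of the plain erase shown above:
# Enterprise Edition only: reset the knowledge graph and add the KEES metadata
sd kees boot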
Ingest facts
To ingest RDF data, you have several options available:
- using the sd sparql update command to execute SPARQL Update operations for server-side ingestion;
- using an ETL pipeline with the sd sparql graph command to transfer and load a stream of RDF triples into a named graph in the graph store;
- using the learn module, which provides some specialized shortcuts to ingest data and KEES metadata.
Using SPARQL update
The sd sparql update command is executed by the SPARQL service; therefore, the resource to be loaded must be visible to the graph store server. For example, to load the entire definition of schema.org:
echo 'LOAD <https://schema.org/version/latest/schemaorg-current-http.ttl> INTO GRAPH <urn:graph:0>' | sd sparql update
Using SPARQL graph
The sd sparql graph command is executed by the SDaaS processor, which stores a stream of RDF triples (in N-Triples serialization) into a named graph inside the graph store. This command offers increased control over the resource by allowing enforcement of the resource type. SDaaS optimizes the transfer of the resource triples to the graph store using the most driver-optimized method available.
In an ETL process, this command realizes the load stage. It is typically used in a piped command.
Some examples:
# get data from a command
sd view ontology \
| sd sparql graph urn:sdaas:tbox
# get data from a local file
sd_cat mydata.nt \
| sd sparql graph
# retrieve linked data from a remote N-Triples resource
sd_curl -s -f https://schema.org/version/latest/schemaorg-current-http.nt \
| sd sparql graph
# retrieve RDF data serialized with Turtle
sd_rapper -i turtle https://dbpedia.org/data/Milan.ttl \
| sd sparql graph https://dbpedia.org/resource/Milan
# retrieve a linked data resource with content negotiation
ld=https://dbpedia.org/resource/Lecco
sd_curl_rdf $ld \
| sd_rapper -g - $ld \
| sd sparql graph -a PUT $ld
# same as above but with KEES metadata
sd_curl_rdf $ld \
| sd_rapper -g - $ld \
| sd kees metadata -D "activity_type=Learning source=$ld trust=0.8" $ld \
| sd sparql graph -a PUT $ld
sd_curl, sd_curl_rdf, sd_rapper, and sd_cat are just wrappers for standard bash commands that trap and log errors.
The sd sparql graph command supports two methods for graph accrual:
- -a PUT to overwrite the named graph; it creates a new named graph if needed;
- -a POST (default) to append data to a named graph; it creates a new named graph if needed.
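For instance, a minimal sketch contrasting the two methods (the file names and the graph name are purely illustrative):
# replace the content of a named graph with a fresh snapshot
sd_cat snapshot.nt | sd sparql graph -a PUT urn:graph:mydata
# append additional triples to the same graph (POST is the default)
sd_cat delta.nt | sd sparql graph urn:graph:mydata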
The gsp driver implementation is capable of utilizing the SPARQL 1.1 Graph Store Protocol (GSP). To enable this support, define the driver type <sid>_TYPE=gsp and set <sid>_GSP_ENDPOINT to point to the URL of the service providing the Graph Store Protocol.
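As a sketch, assuming the graph store exposes a GSP service (the endpoint URL below is hypothetical):
STORE_TYPE=gsp
# hypothetical URL of the Graph Store Protocol service
STORE_GSP_ENDPOINT=http://kb:8080/sdaas/gsp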
WARNING: many graph store engines have limitations on the size of data that can be ingested using just SPARQL Update features with the default w3c driver. Whenever possible, use a driver optimized for your graph store or a GSP-capable endpoint.
Using the learn module (EE)
The learn module provides some shortcuts to load linked data into a graph store together with their KEES metadata.
Here are some examples that load RDF triples:
sd learn resource -D "graph=urn:dataset:dbpedia" https://dbpedia.org/resource/Milan
sd learn file /etc/app.config
sd learn dataset urn:dataset:dbpedia
sd learn datalake https://data.exemple.org/
WARNING: if the sd learn dataset command fails, the target named graph could be incomplete, or annotated with a prov:wasInvalidatedBy property.
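For instance, a quick way to spot such graphs could be a query like the following (an illustrative sketch, using the sd sparql list alias described below):
cat <<EOF | sd sparql list
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?g WHERE { ?g prov:wasInvalidatedBy ?activity }
EOF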
Querying the knowledge graph
To query the store, you can use SPARQL:
cat <<-EOF | sd sparql query -o csv
SELECT ?g (COUNT (?s) AS ?subjects) WHERE {
  GRAPH ?g { ?s ?p ?o }
} GROUP BY ?g
EOF
This command prints a CSV table with all named graphs and the number of triples they contain. The -o option specifies the format you want for the result.
The sd sparql query command outputs the XML serialization by default; you can specify a preferred serialization using the -o flag. Additionally, the sparql module provides convenient command aliases, e.g.:
- sd sparql list prints the result of a SELECT query as CSV, without header, on stdout;
- sd sparql rule prints the result of a SPARQL CONSTRUCT as a stream of N-Triples on stdout.
For example:
echo "SELECT DISTINCT ?class WHERE { ?s a ?class } LIMIT 10" | sd sparql list
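Similarly, a sketch of the sd sparql rule alias (the CONSTRUCT query is purely illustrative):
echo "CONSTRUCT { ?s a ?class } WHERE { ?s a ?class } LIMIT 10" | sd sparql rule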
Reasoning about facts
To materialize inferences in the knowledge graph you have several options:
- using the sd sparql update command with SPARQL Update operations (e.g. INSERT ... WHERE);
- using the sd sparql rule command with SPARQL CONSTRUCT queries;
- using the sd learn rule command (EE) with SPARQL CONSTRUCT queries.
You can use SPARQL Update or a combination of SPARQL queries, as in the following examples:
# INSERT ... WHERE evaluates the expression on the SPARQL server side
cat <<EOF | sd sparql update
INSERT {...}
WHERE {...}
EOF
# pipe two commands
cat <<EOF | sd sparql rule | sd sparql graph
CONSTRUCT ...
EOF
# pipe four commands adding metadata
cat <<EOF | sd sparql rule | sd kees metadata -D "trust=0.9" urn:graph:mycostructor | sd sparql graph urn:graph:mycostructor
CONSTRUCT ...
EOF
# same as above, using a shortcut:
cat <<EOF | sd learn rule -D "trust=0.9" urn:graph:mycostructor
CONSTRUCT ...
EOF
Using plans (EE)
The plan module provides some specialized commands to run SDaaS scripts.
You can think of a plan as similar to a stored procedure in a SQL database.
For example, assume that STORE contains the following three plans:
@prefix sdaas: <http://linkeddata.center/sdaas/reference/v4#> .
<urn:myapp:cities> a sdaas:Plan; sdaas:script """
sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Milano
sd learn resource -D "graph=urn:dataset:dbpedia" http://dbpedia.org/resource/Lecco
""" .
<urn:reasoning:recursive> a sdaas:Plan ; sdaas:script """
sd learn rule 'CONSTRUCT ...'
sd learn rule http://example.org/rules/rule1.rq
""" .
<urn:test:acceptance> a sdaas:Plan ; sdaas:script """
sd_curl_sparql http://example.org/tests/test1.rq | sd sparql test
sd_curl_sparql http://example.org/tests/test2.rq | sd sparql test
""" .
Then these commands use the plans to automate some common activities:
sd plan run -D "activity_type=Ingestion" urn:myapp:cities
sd plan loop -D "activity_type=Reasoning trust=0.75" urn:reasoning:recursive
sd -A plan run -D "activity_type=Publishing" urn:test:acceptance
The sd plan loop command executes a plan until there are no more changes in the knowledge base. It is useful for implementing incremental or recursive reasoning rules.
Managing the Knowledge Base Status (EE)
You can signal the publication status of a specific knowledge base using the KEES status properties.
For setting, getting, and checking the status of a specific window in the knowledge graph, use:
# print the date of the last status change:
sd kees date published
# test a status
sd kees is published || echo "Knowledge Graph is not Published"
sd kees is stable || echo "Knowledge Graph is not in a stable status"
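For instance, a status check can gate other activities in a script; here is a sketch reusing the acceptance plan defined above:
# run the acceptance test plan only when the knowledge graph is in a stable status (illustrative)
sd kees is stable && sd -A plan run -D "activity_type=Publishing" urn:test:acceptance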
Connecting to multiple RDF Graph Stores
You can direct the SDaaS platform to connect to multiple RDF store instances, using standard or optimized drivers:
AWS="http://mystore.example.com/sparql"
AWS_TYPE="neptune"
WIKIDATA="https://query.wikidata.org/sparql"
WIKIDATA_TYPE="w3c"
This allows importing data or reasoning using specialized SPARQL endpoints. For instance, the following example imports all cats from Wikidata into the default graph store and then lists the first five cat names:
cat <<EOF | sd sparql rule -s WIKIDATA | sd sparql graph
DESCRIBE ?item WHERE {
?item wdt:P31 wd:Q146
}
EOF
cat <<EOF | sd sparql list
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?cat WHERE {
?item wdt:P31 wd:Q146; rdfs:label ?cat
FILTER( LANG(?cat)= "en")
} ORDER BY ?cat LIMIT 5
EOF
Scripting
You can create a bash script containing various SDaaS commands.
Refer to the Application building guide for more info about SDaaS scripting.
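For instance, here is a minimal sketch of a build script that combines commands shown earlier in this guide (assuming it runs inside the SDaaS shell, where the sd commands are available):
#!/usr/bin/env bash
set -e
# rebuild the knowledge graph from scratch
sd store erase
# ingest the schema.org ontology on the server side
echo 'LOAD <https://schema.org/version/latest/schemaorg-current-http.ttl> INTO GRAPH <urn:graph:0>' | sd sparql update
# print the resulting store size
sd store size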
Quitting the platform
When you type exit, you safely destroy the SDaaS container, but the created data will persist in the external store.
Free allocated docker resources by typing:
docker rm -f kb
docker network rm myvpn