Advanced Tutorials¶
Below are additional tutorials to perform after the Getting Started Tutorial/Demo.
These tutorials require the installation of the AllegroGraph database and the BaseX database.
If you are not familiar with semantic graph databases, please see the AllegroGraph Documentation.
Optionally, you may install MongoDB to persist the generated JSON files.
Prerequisites¶
BaseX¶
BaseX requires Java 8 to be installed on your platform.
Please download the ZIP file and extract it into your home directory.
Start the server using the Client/Server instructions. You will use the client in later parts of the tutorial.
AllegroGraph¶
Download and install the server for your platform following these instructions.
When asked for the superuser username and password use these:
user: admin
password: admin
If you use another username or password, you must edit the corresponding entries in kunteksto.conf using a text editor. See the Configuration section below for editing kunteksto.conf.
When the server is installed and running, install the Gruff GUI client for AllegroGraph. You will use this later in the tutorials.
Configuration¶
Using a text editor, edit the status entries in kunteksto.conf for [BASEX] and [ALLEGROGRAPH]. Change them from INACTIVE to ACTIVE. When completed they should look like this:
For BaseX:
[BASEX]
status: ACTIVE
host: localhost
port: 1984
dbname: Kunteksto
user: admin
password: admin
For AllegroGraph:
[ALLEGROGRAPH]
status: ACTIVE
host: localhost
port: 10035
repo: Kunteksto
user: admin
password: admin
Unless you are using MongoDB for JSON persistence, you will likely want to turn off JSON generation.
; Default data formats to create. Values are True or False.
; These can be changed in the UI before generating data.
xml: True
rdf: True
json: False
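If you want to double-check these settings programmatically, a short sketch using Python's standard configparser module can parse the same key/value layout shown above. The config text is inlined here for illustration; in practice you would point read() at your actual kunteksto.conf:

```python
import configparser

# A sample in the same shape as kunteksto.conf (inlined for illustration;
# normally you would call config.read("kunteksto.conf") instead).
sample = """
[BASEX]
status: ACTIVE
host: localhost
port: 1984
dbname: Kunteksto
user: admin
password: admin

[ALLEGROGRAPH]
status: ACTIVE
host: localhost
port: 10035
repo: Kunteksto
user: admin
password: admin
"""

config = configparser.ConfigParser()
config.read_string(sample)

# Confirm both databases are marked ACTIVE before running Kunteksto.
for section in ("BASEX", "ALLEGROGRAPH"):
    assert config[section]["status"] == "ACTIVE", f"{section} is not ACTIVE"

print(config["BASEX"]["port"])         # 1984
print(config["ALLEGROGRAPH"]["repo"])  # Kunteksto
```

Note that configparser accepts both `key: value` and `key = value` delimiters, so the colon style used in kunteksto.conf parses as-is.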
Database Checks¶
From the kunteksto directory run
python utils/db_setup.py
This Python script tests the database connections and installs the S3Model ontology and the 3.1.0 Reference Model RDF.
During execution, the script displays several lines of output in the terminal. Specifically, look for the lines AllegroGraph connections are okay. and BaseX connections are okay., as well as any lines that start with ERROR:.
Caution
If you see both okay output lines and no ERROR: lines, then all went well. Otherwise, you must troubleshoot and resolve these issues before continuing.
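If the script reports errors, a common cause is that one of the servers is simply not listening. As a first troubleshooting step, you can check basic TCP reachability yourself with the standard library; this is only a port check, not a full login test, and the ports below assume the default kunteksto.conf values:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Default ports from kunteksto.conf: BaseX (1984) and AllegroGraph (10035).
for name, port in (("BaseX", 1984), ("AllegroGraph", 10035)):
    state = "reachable" if port_open("localhost", port) else "NOT reachable"
    print(f"{name} on port {port}: {state}")
```

If a port is not reachable, restart that server and re-run db_setup.py before moving on.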
Viewing the RDF Repository¶
You can view the Kunteksto repository by using this link in a browser. Right click and open it in a new tab. Then under Explore the Repository click the View Triples link. These triples are the S3Model ontology and the S3Model 3.1.0 RDF. These triples connect all of your RDF into a graph, even when they do not have other semantics linking them.
You may also use the Gruff GUI client to explore the repository at any time. See the Franz, Inc. Learning Center for more information.
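Besides WebView and Gruff, you can also query the repository over HTTP, since AllegroGraph exposes a standard SPARQL protocol endpoint per repository. The sketch below only builds the request with the standard library; the endpoint path follows the usual repositories/<name> layout, and the host, port, and repository name are assumed to match the kunteksto.conf example above:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint, matching the kunteksto.conf example:
# host localhost, port 10035, repo Kunteksto.
ENDPOINT = "http://localhost:10035/repositories/Kunteksto"

query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

def build_request(endpoint: str, sparql: str) -> Request:
    """Build a GET request asking for SPARQL results as JSON."""
    url = endpoint + "?" + urlencode({"query": sparql})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_request(ENDPOINT, query)
print(req.full_url.split("?")[0])

# With the server running, you could execute it with urllib.request.urlopen:
# with urlopen(req) as resp:
#     print(resp.read()[:200])
```

Depending on your server configuration you may also need HTTP basic authentication with the superuser credentials.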
US Honey Production¶
The source of this data is the Kaggle project.
The dataset is available here.
Download the honeyproduction.csv dataset for this tutorial and place it in the kunteksto/example_data directory.
For those without an account on Kaggle, we have included a copy in the example_data directory.
The metadata (click on Data then on the Column Metadata tab) information is useful when filling in the database model and record tables.
You can find more metadata information about this dataset in Wrangling The Honey Production Dataset.
Follow the same step-by-step procedure outlined in the Getting Started section.
- Navigate to the directory where you installed Kunteksto.
- Be certain the virtual environment is active.
Caution
If you closed your terminal and opened a new window, you need to activate the environment again. Also, be sure that you are in the kunteksto directory.
Windows
activate <path/to/directory>
or Linux/MacOSX
source activate <path/to/directory>
For this tutorial, you will run Kunteksto in commandline mode.
kunteksto -m all -i example_data/honeyproduction.csv
Kunteksto takes a few minutes to analyze the input file and create a results database in the output directory.
The database editor opens and, just like in the previous tutorial, prompts you for model metadata, which you can collect from the links above. After you click the Save & Exit button, the column editor opens.
Caution
As you edit the data for each column, be sure to persist your changes using the Save button before advancing to the Next column.
As before, each column is presented for you to add constraints and metadata from the information you collect from the links above or from your own personal knowledge. Remember, this is your model of this data. Using the best details creates the best models.
Often we must be creative when deciding which URI to use for a Defining URL. Our suggested approach when you do not have a specific, online vocabulary or ontology is to use resources such as the metadata mentioned above.
For the state column we might use https://www.kaggle.com/jessicali9530/honey-production/data#state for the Defining URL and then copy the description from that row in the Column Metadata tab.
For additional semantics (the Predicates & Objects box) it is best to use open vocabularies when possible. This makes it easy to connect data across models. If you go to the link for open vocabularies and type “State” into the search box, you will see a list of options to choose from. A good choice here is the one from Schema.org, because it is a popular vocabulary for website markup. We now have an Object; next we need a Predicate. Since we want to indicate that this is the meaning of this item, type “meaning of” into the search box on the open vocabularies site. Notice that rdf:type is one of the first choices, and its description makes sense. If you put the two description phrases together you get: “The subject is an instance of a class” and “A state or province of a country”. The values in this column are instances (representations) of a state or province, so we have a good match.
In the Predicates & Objects enter:
rdf:type http://schema.org/State
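To see what an entry like this amounts to in the generated RDF, the sketch below expands the CURIE and formats one N-Triples statement. The subject URI here is hypothetical; Kunteksto assigns its own identifiers to model components:

```python
# Expansion table for the prefixes used in the Predicates & Objects box.
PREFIXES = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"}

def expand(curie: str) -> str:
    """Expand a prefix:localname CURIE using the table above."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

subject = "https://example.org/model/col-state"  # hypothetical identifier
predicate = expand("rdf:type")                   # the Predicate you entered
obj = "http://schema.org/State"                  # the Object you entered

triple = f"<{subject}> <{predicate}> <{obj}> ."
print(triple)
```

Every row value in the column inherits this semantic link once the RDF is loaded into AllegroGraph.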
Click the Save button, then the Next button to move to the numcol column. Looking at the metadata, you may choose to change the label to something more readable, like Colonies.
Go through each of the column definitions and complete as many data points about each column as you can that make sense. Feel free to use meaningful names as the labels.
Remember also that numeric columns need a Units designator. Some columns may be detected as Integer or Decimal even though their range of values falls outside the boundaries of those types. In that case, be sure to change the type to Float.
Columns like year are detected as integers. However, this is really a temporal value. In Kunteksto we cannot have a temporal datatype with just a year, so change the type to String and in the Predicates & Objects box use
rdf:type http://www.w3.org/2001/XMLSchema#gYear
Note
In S3Model it is possible to have all of the temporal types. The Datacentric Tools Suite provides facilities to create these datatypes.
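Because the year column is now a String, nothing stops a malformed value from slipping through, so a downstream sanity check can be useful. The sketch below applies a simplified version of the xs:gYear lexical form (an optional leading minus, four or more digits, and an optional timezone); it is an illustration, not part of Kunteksto itself:

```python
import re

# Simplified xs:gYear lexical check: optional minus, four or more digits,
# optional timezone (Z or +hh:mm / -hh:mm).
GYEAR = re.compile(r"-?\d{4,}(Z|[+-]\d{2}:\d{2})?$")

def looks_like_gyear(value: str) -> bool:
    """Return True if value resembles an xs:gYear literal."""
    return GYEAR.match(value) is not None

# Illustrative samples, not values from the actual dataset.
for sample in ("1998", "2012", "98", "1998-07"):
    print(sample, looks_like_gyear(sample))
```

Values like "98" (too few digits) and "1998-07" (a gYearMonth, not a gYear) are rejected.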
Once you have finished editing all of the columns, click the Exit button. The GUI remains on the screen while the data generation process runs. The terminal where you started Kunteksto scrolls messages about the progress.
After processing is complete, review output/honeyproduction/honeyproduction_validation_log.csv to see which files are invalid. The error messages from the validator may be a bit cryptic, but they are what we have to work with. Just like with the Demo tutorial, the errors are also included in the Semantic Graph via the RDF.
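If the log is long, a few lines of Python can summarize it. The column names used below ("file", "result") are assumptions for illustration; check the header row of your actual validation log and adjust accordingly. The log content is inlined here so the sketch is self-contained:

```python
import csv
import io

# Stand-in for open("output/honeyproduction/honeyproduction_validation_log.csv").
# The "file" and "result" column names are assumed, not confirmed.
sample_log = io.StringIO(
    "file,result\n"
    "honey-1.xml,valid\n"
    "honey-2.xml,Invalid: value out of range\n"
    "honey-3.xml,valid\n"
)

# Collect the files whose result does not start with "valid".
invalid = [
    row["file"]
    for row in csv.DictReader(sample_log)
    if not row["result"].lower().startswith("valid")
]
print(invalid)  # ['honey-2.xml']
```

This gives you a quick worklist of files to inspect rather than scanning the CSV by eye.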
The output RDF will be in the Kunteksto repository in AllegroGraph, which you can explore through the AllegroGraph WebView browser tool or using Gruff, which we highly recommend. You can also explore the XML using the BaseX GUI.
There are many written and video tutorials on using these tools. Check the AllegroGraph YouTube Channel and the BaseX Getting Started.
Global Commodity Trade Statistics¶
Warning
This dataset contains more than 8 million rows of data. If you are using the free version of AllegroGraph, then processing this file will exceed the 5-million-triple limit many times over. The file will still be processed and all of the XML files will be generated; however, most of the triples will not be stored in AllegroGraph.
The original dataset is provided at the UNdata Comtrade site.
The best source (easiest download) for this data is the Kaggle competition.
Download the dataset, extract the CSV data, and place it in the kunteksto/example_data directory.
The metadata (click on Data, then on the Column Metadata tab) may be useful when filling in the database model and record tables. However, it is somewhat incomplete. You can find more metadata about this dataset in the UNdata Glossary. There is also a knowledgebase that describes how the data was collected, with some hints on how to use it. As you can see, the metadata is neither well organized nor computable. S3Model and the related datacentric tools allow you to solve this issue with any data of interest.
After you have downloaded the dataset from Kaggle, or even a subset from the UNdata site, you are ready to proceed with the tutorial.
Follow the same step-by-step procedure outlined in the Getting Started section.
- Navigate to the directory where you installed Kunteksto.
- Be certain the virtual environment is active.
Caution
If you closed your terminal and opened a new window, you need to activate the environment again. Also, be sure that you are in the kunteksto directory.
Windows
activate <path/to/directory>
or Linux/MacOSX
source activate <path/to/directory>
For this tutorial, you start Kunteksto in commandline mode.
kunteksto -m all -i example_data/commodity_trade_statistics_data.csv
Kunteksto takes a few minutes to analyze the input file and create a results database in the output directory.
The database editor opens and just like in the previous tutorial, prompts you for model metadata which you can collect from the links above.
Caution
As you edit the data for each column, be sure to persist your changes using the Save button before advancing to the Next column.
As before, each column is presented for you to add constraints and metadata from the information you collect from the links above or from your own personal knowledge. Remember, this is your model of this data. Using the best details creates the best models.
Be sure to check the detected datatype of each column as well as its value constraints. For example, the year column is detected as an integer column; obviously this is not valid. For temporals, Kunteksto only offers date, time, and datetime options. Using the Datacentric Tools Suite would allow you to model this properly as a Year datatype column. So, using Kunteksto you must choose the most appropriate type, which in this case is String.
Often we must be creative when deciding which URI to use for a Defining URL. Our suggested approach when you do not have a specific, online vocabulary or ontology is to use resources such as the glossary mentioned above. For the year column we might use https://comtrade.un.org/db/mr/rfGlossaryList.aspx#Year for the Defining URL and then copy the description from that row in the table.
In the Predicates & Objects box we can use
skos:exactMatch http://www.w3.org/2001/XMLSchema#gYear
Go through each of the column definitions and complete as many data points about each column as you can that make sense. For example, changing the weight kg column from String to Decimal will help detect missing or invalid values. Then add ‘kg’ as the Units designator.
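The benefit of choosing Decimal here is exactly this kind of screening: any value that cannot be parsed as a decimal number stands out as missing or invalid. A minimal sketch of the idea using Python's decimal module (the sample values are illustrative, not taken from the actual dataset):

```python
from decimal import Decimal, InvalidOperation

def parse_weight(raw: str):
    """Return the weight as a Decimal, or None for missing/invalid values."""
    try:
        return Decimal(raw.strip())
    except InvalidOperation:
        return None

# Illustrative samples: two valid weights, an empty cell, and a placeholder.
for raw in ("1250.5", "7000", "", "N/A"):
    print(repr(raw), "->", parse_weight(raw))
```

In Kunteksto, declaring the column as Decimal lets the generated XML Schema perform the equivalent check automatically during validation.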
The output RDF will be in the Kunteksto repository in AllegroGraph, which you can explore through the AllegroGraph WebView browser tool or using Gruff, which we highly recommend. You can also explore the XML using the BaseX GUI.
There are many written and video tutorials on using these tools. Check the AllegroGraph YouTube Channel and the BaseX Getting Started.