Command Line Interface (CLI)

TriplyETL allows you to manually perform various tasks in a terminal application (a Command-Line Interface or CLI).

Installing dependencies

When you work on an existing TriplyETL project, you sometimes pull in changes made by your team members. Such changes are typically obtained by running the following Git command:

git pull

This command prints a list of files that were changed by your team members. If this list includes changes to the file package.json, this means that one or more dependencies were changed. In order to effectuate these changes in your local copy of the TriplyETL project, you must run the following command:

npm i

Transpiling to JavaScript

When you make changes to one or more TypeScript files, the corresponding JavaScript files will have become outdated. If you now use the TriplyETL Runner, it will use one or more outdated JavaScript files, and will not take into account your most recent changes to the TypeScript files.

In order to keep your JavaScript files up-to-date relative to your TypeScript files, you must run the following command after making changes to TypeScript files:

npm run build

If you edit your TypeScript files repeatedly, having to run this extra command may get tedious. In such cases, you can run the following command to automatically perform the transpile step in the background:

npm run dev

Notice that this prevents you from using the terminal application for new commands. It is typical to open a new terminal application window, and run the npx etl command from there.

TriplyETL Runner

The TriplyETL Runner allows you to run a local TriplyETL project in your terminal application.

We assume that you have a local TriplyETL project in which you can successfully run the npx etl command. Follow the Getting Started instructions for TriplyETL Runner if this is not yet the case.

Run the following command to run the ETL pipeline:

npx etl

This command implicitly uses the file lib/main.js, which is the transpiled JavaScript file that corresponds to the TypeScript file src/main.ts. The following command has the same behavior, but makes explicit which file is used:

npx etl lib/main.js

Some TriplyETL projects have multiple top-level scripts. In such cases, it is possible to run each of these scripts individually as follows:

npx etl lib/some-script.js

Output summary

TriplyETL Runner will start processing data. Depending on the size of the data source, the Runner may take more or less time to finish. When the Runner finishes successfully, it will print the following summary:

 ┌──────────────────────────────────────────────────────────────┐
 │ Etl:                #Error    0 | #Warning    0 | #Info    0 │
 │ #Statements         2                                        │
 │ #Records            2                                        │
 │ Started at          2023-06-18 10:05:20                      │
 │ Runtime             0 sec                                    │
 └──────────────────────────────────────────────────────────────┘

This summary includes the following information:

  • "#Error" shows the number of errors encountered. With default settings, this number is at most 1, since the Runner will immediately stop after an error occurs.

  • "#Warning" shows the number of warnings encountered. With default settings, this includes warnings emitted by the SHACL Validator.

  • "#Info" shows the number of informational messages. With default settings, this includes informational messages emitted by the SHACL Validator.

  • "#Statements" shows the number of triples or quads that was generated. This number is equal to or higher than the number of statements that is uploaded to the triple store. The reason for this is that TriplyETL processes records in parallel. If the same statement is generated for two records, the number of statements with be incremented by 2, but only 1 unique statement will be uploaded to the triple store.

  • "#Records" shows the number of records that was processed.

  • "Started at" shows the date and time at which the Runner started.

  • "Runtime" shows the wall time duration of the run.

Limit the number of records

When developing a pipeline, it is almost never necessary to process all records from the source data. Instead, it is common to run the ETL for a small number of example record, which results in quick feedback. The --head flag indicates the maximum number of records that is processed by the Runner:

npx etl --head 1
npx etl --head 10

These commands run the ETL for the first record (if one is available) and for the first 10 records (if these are available).

Specify a range of records

When developing a pipeline over a large source data collection, it is often standard practice to use the first 10 or 100 records most of the time.

The benefit of this approach is that the feedback loop between making changes and receiving feedback is short. A downside of this approach is that the ETL may be overly optimized towards these first few records. For example, if a value is missing in the first 1.000 records, then transformations that are necessary for when the value is present will not be developed initially. An alternative is to run the entire ETL, but that takes a long time.

To avoid the downsides of using --head, TriplyETL also supports the --from-record-id flag. This flag specifies the number of records that are skipped. This allows us to specify an arbitrary consecutive range of records. For example, the following processes the 1.001-st until and including the 1.010-th record:

npx etl --from-record-id 1000 --head 10

Process a specific record

When the --head flag is set to 1, the --from-record-id flag specifies the index of a single specific record that is processed. This is useful when a record is known to be problematic, for instance during debugging.

The following command runs TriplyETL for the 27th record:

npx etl --from-record-id 26 --head 1

Set a timeout

For large ETL pipelines, it is sometimes useful to specify a maximum duration for which the TriplyETL Runner is allowed to run. In such cases, the --timeout flag can be used.

The --timeout option accepts human-readable duration strings, such as '1h 30m 5s', '1hr', '1 hour', or '3hrs'.

When the indicated timeout is reached before the pipeline finishes, the TriplyETL Runner will gracefully terminate the ETL by acting as if there are no more incoming records. As a result, the Runner will upload all linked data (graphs) that was produced up to that point, and it will write a performance log.

For TriplyETLs that run in a CI/CD environment, the timeout must be set lower than the CI/CD timeout, in order for the Runner to be able to perform the termination step.

Verbose mode

When TriplyETL is run normally, the following information is displayed:

  • The number of added triples.
  • The runtime of the script.
  • An error message, if any occurred.

It is possible to also show the following additional information by specifying the --verbose flag:

  • In case of an error, the first 20 values from the last processed record.
  • In case of an error, the full stack trace.

The following example shows how the --verbose flag can be used:

npx etl --verbose

Secure verbose mode

Verbose mode may perform a reset of your current terminal session. If this happens you lose visible access to the commands that were run prior to the last TriplyETL invocation.

This destructive behavior of verbose mode can be disabled by setting the following environment variable:

export CI=true

This fixes the reset issue, but also makes the output less colorful.

TriplyETL Tools

TriplyETL Tools is a collection of small tools that can be used to run isolated tasks from your terminal application. TriplyETL Tools can be used when you are inside a TriplyETL project.

If you do not have an ETL project yet, use the TriplyETL Generator first to create one.

The following command prints an overview of the supported tools:

npx tools

The following tools are supported:

Tool Description
compare Compare the contents of two RDF files
create-token Create a new TriplyDB API Token
print-token Print the currently set TriplyDB API Token, if any
validate Validate a data file against a SHACL shapes file

For each tool, the following command prints more information on how to use it:

npx tools {name} --help

Compare

The compare tool checks whether two RDF files encode the same linked data: - If the two files contain the same data, the command succeeds and does not print any output. - If the two files do not contain the same data, the command exits with an error code, and the difference between the two files is printed.

The compare tools is invoked over the two RDF files one.ttl and two.ttl as follows:

npx tools compare one.ttl two.ttl

This tool can be used to compare two RDF files that contain multiple graphs, for example:

npx tools compare one.trig two.trig

This tool uses the graph isomorphism property as defined in the RDF 1.1 standard: link

Create TriplyDB API Token

This tool creates a new TriplyDB API Token from the command-line. This command can be used as follows:

npx tools create-token

The command will ask a couple of questions in order to create the TriplyDB API Token:

  • The hostname of the TriplyDB instance
  • The name of the token
  • Your TriplyDB account e-mail
  • Your TriplyDB account password

The command exists in case a TriplyDB API Token is already configured.

This tool prints the currently configured TriplyDB API Token, if any. This command can be used as follows:

npx tools print-token

This command is useful when there are issues with configuring a TriplyDB API Token.

Validate

This tool validates the content of one data file against the SHACL shapes in another file. The resulting SHACL validation report is printed to standard output.

The command can be used as follows:

$ npx tools validate -d data.trig -s model.trig

See this section to learn more about the SHACL validation report.