RATT transformations¶
RATT transformations are a core set of functions that are commonly used to change the content of TriplyETL Records.
RATT transformations started out as TypeScript transformations that turned out to be useful in a wide variety of TriplyETL pipelines. Triply maintains this core set of transformation functions to allow new ETLs to make use of off-the-shelf functionality that has proven useful in the past.
Overview¶
The following transformation functions are currently available:
Function | Description |
---|---|
addHashedIri() | Creates an IRI with a content-based hash as the local name. |
addIri() | Create a new IRI based on a prefix and a local name. |
addLiteral() | Create a new literal based on a lexical form and a datatype IRI or language tag. |
addSkolemIri() | Create a new IRI with a random local name, which advertises that it may be consistently replaced with blank nodes. |
addTag() | Create a language tag. |
addValue() | Create a TypeScript value. |
capitalize() | Transforms a string value to its capitalized variant. |
concat() | Combine multiple strings into a new string. |
copy() | Copy a value from an old into a new key. |
decodeHtml() | Decode HTML entities that occur in strings. |
geojsonToWkt() | Change GeoJSON strings to WKT strings. |
jpath() | Uses the JSONPath query language to select a value from the record. |
lowercase() | Change strings to their lowercase variants. |
padEnd() | Pad the end of strings. |
padStart() | Pad the start of strings. |
replace() | Replace part of a string. |
split() | Split a string into multiple substrings. |
substring() | Extract a substring from a string. |
translateAll() | Translate all string values to other values. |
translateSome() | Translate some string values to other strings. |
tryLiteral() | Create literals for which the datatype is not known beforehand. |
uppercase() | Change a string to its uppercase variant. |
wkt.addPoint() | Add a geospatial point using the Well-Known Text (WKT) format. |
wkt.project() | Change the projection of a Well-Known Text (WKT) literal from one Coordinate Reference System into another. |
addHashedIri()¶
Creates an IRI with a content-based hash as the local name.
Signature¶
This function has the following signature:
addHashedIri({ prefix, content, key })
Parameters¶
- prefix: A dynamic or static IRI.
- content: An array of static values, or a key that contains a dynamic array of values.
- key: A new key where the created IRI is stored.
When to use?¶
This function is used under the following circumstances:
- Something must be identified with an IRI.
- The thing that must be identified does not have a readily available identifier.
- The thing that must be identified does have properties whose combination of values is unique, and can therefore act as a composite identifier. This is called a composite key in database theory.
- The composed IRI is not allowed to be Skolemized, i.e. the prefix cannot be an IRI whose pathname starts with /.well-known/genid/. For that purpose, use transformation addSkolemIri() instead.
Benefits:
- The created IRIs are the same across different ETL runs over the same source data. This supports important use cases like change management / versioning and graph signing.
Downsides:
- It can take a lot of time to figure out which set of properties make every IRI unique. In database theory this process is known as composite key detection. Furthermore, keeping the list of identifying properties up-to-date over time poses a maintenance burden.
Example: lazy identifiers¶
Some source data does not include good identifiers for all data items. For example, the following source table contains first names and last names of persons, but neither property is by itself unique:
First name | Last name |
---|---|
Jane | Doe |
Jane | Smith |
John | Doe |
In such cases it may be an option to take a combination of columns, and use that combined sequence of columns for identification.
The following snippet uses the combination of the first name and last name fields (in that order) to create a locally unique hash. Together with an IRI prefix, this can be used to create globally unique IRIs:
fromJson([{ 'First name': 'John', 'Last name': 'Doe' }]),
addHashedIri({
prefix: prefix.person,
content: ['First name', 'Last name'],
key: '_person',
}),
pairs('_person',
[a, sdo.Person],
[sdo.givenName, 'First name'],
[sdo.familyName, 'Last name'],
),
This results in the following linked data:
person:70020947bea6c39cccea20d27e30fbdf
a sdo:Person;
sdo:givenName 'John';
sdo:familyName 'Doe'.
Example: dynamic IRI prefix¶
It is possible to specify a dynamic IRI prefix to addHashedIri().
The following code snippet uses a dynamic IRI prefix from the data source record:
fromJson([
{
prefix: Iri('https://triplydb.com/my-account/my-dataset/id/person/'),
name: 'John Doe',
},
{
prefix: Iri('https://triplydb.com/my-account/my-dataset/id/product/'),
name: '123',
},
]),
addHashedIri({
prefix: 'prefix',
content: ['name'],
key: '_subject',
}),
triple('_subject', a, owl.NamedIndividual),
This results in the following linked data:
prefix person: <https://triplydb.com/my-account/my-dataset/id/person/>
prefix product: <https://triplydb.com/my-account/my-dataset/id/product/>
person:76f294ac31199b65ec25048439b66f78 a owl:NamedIndividual.
product:9154deaa364b289c6b012e99f947f30e a owl:NamedIndividual.
Example: statement reification¶
The RDF standard allows individual statements to be identified by a node. This approach is called statement reification and can be used to assert metadata about statements or can represent modalities such as probability and belief.
The following snippet uses addHashedIri()
to create a unique identifier for each reified statement:
fromJson([{ id: '1', name: 'John Doe' }]),
// Step 1. Create the subject, predicate, and object terms.
addIri({ prefix: prefix.person, content: 'id', key: 'subject' }),
addIri({ prefix: prefix.def, content: str('name'), key: 'predicate' }),
addLiteral({ content: 'name', key: 'object' }),
// Step 2. Create the triple statement.
triple('subject', 'predicate', 'object'),
// Step 3. Create the reified statement.
addHashedIri({
prefix: prefix.statement,
content: ['subject', 'predicate', 'object'],
key: 'statement',
}),
pairs('statement',
[a, rdf.Statement],
[rdf.subject, 'subject'],
[rdf.predicate, 'predicate'],
[rdf.object, 'object'],
),
This results in the following linked data:
person:1 def:name 'John Doe'.
statement:549decc4c44204a907aa32b4cc9bfaba
a rdf:Statement;
rdf:subject person:1;
rdf:predicate def:name;
rdf:object 'John Doe'.
addIri()¶
Creates an IRI based on a specified local name.
Signature¶
addIri({ prefix, content, key })
Parameters¶
- prefix: Optionally, a static or dynamic IRI. This IRI appears before the local name specified in the content parameter. If the prefix parameter is absent, content must contain an absolute IRI.
- content: A string, or a key that contains a string. If the prefix parameter is specified, content specifies the IRI local name that appears after the IRI prefix. If the prefix parameter is absent, content is assumed to encode an absolute IRI.
- key: A new key where the created IRI is stored.
When to use?¶
This transformation can be used in the following two ways:
1. By using an IRI prefix and a local name.
2. By using a full absolute IRI.
Note that the composed IRI is not allowed to be Skolemized, i.e. the prefix cannot be an IRI whose pathname starts with /.well-known/genid/. For that purpose, use transformation addSkolemIri() instead.
See also¶
If the created IRI is used exactly once, it is often better to use inline function iri() instead.
Example: Prefix declaration and local name¶
The following snippet creates an IRI based on the specified IRI prefix and local name:
addIri({
prefix: prefix.person,
content: 'username',
key: '_person',
}),
triple('_person', a, sdo.Person),
This results in the following linked data assertions:
person:johndoe a sdo:Person.
The following snippet makes the same assertion, but uses assertion iri() instead of transformation addIri():
triple(iri(prefix.person, 'username'), a, sdo.Person),
Example: Absolute IRI¶
The following snippet creates the same IRI, but does not use a predefined prefix IRI:
addIri({
content: 'https://example.com/id/person/johndoe',
key: '_person',
}),
triple('_person', a, sdo.Person),
The following snippet uses assertion iri() instead of transformation addIri():
triple(iri('https://example.com/id/person/johndoe'), a, sdo.Person),
addLiteral()¶
Creates a new literal and adds it to the Record under the specified key.
This transformation can be used in the following three ways:
- If a datatype (key: datatype) is given, a typed literal is created.
- If a language tag (key: language) is given, a language-tagged string (datatype rdf:langString) is created.
- If neither a datatype nor a language tag is given, a literal with datatype xsd:string is created.
When to use¶
This transformation is typically used when:
- The same literal occurs in two or more statement assertions (function triple() or quad()). This avoids having to specify the same literal multiple times using function literal().
- The datatype or language tag is derived from the source data record.
Parameters¶
- content: A key that contains a string value, or a string specified with function str().
- datatype: Optionally, a key that stores an IRI, or a static IRI.
- language: Optionally, a language tag from the lang object, or a key that stores such a language tag.
- validate: Optionally, a single validator condition or an array of validator conditions that the literal content must satisfy. An error is thrown when a validator condition does not hold.
- key: A new key where the created literal is stored.
See also¶
If the created literal is used exactly once, it is often better to use the inline function literal() instead.
Example: Typed literal¶
The following snippet asserts a triple with a typed literal with datatype IRI xsd:date:
fromJson([{ id: '123', date: '2022-01-30' }]),
addLiteral({
content: 'date',
datatype: xsd.date,
key: '_dateCreated',
}),
triple(iri(prefix.book, 'id'), sdo.dateCreated, '_dateCreated'),
This makes the following linked data assertion:
book:123 sdo:dateCreated '2022-01-30'^^xsd:date.
Notice that the same linked data could have been asserted with the following use of the literal() assertion middleware:
fromJson([{ id: '123', date: '2022-01-30' }]),
triple(iri(prefix.book, 'id'), sdo.dateCreated, literal('date', xsd.date)),
Example: String literal¶
The following snippet asserts a triple with a string literal in the object position:
fromJson([{name: 'London'}]),
addLiteral({
content: 'name',
key: '_name',
}),
triple(iri(prefix.city, '_name'), skos.prefLabel, '_name')
This makes the following assertion:
city:London skos:prefLabel 'London'.
The literal 'London' has datatype xsd:string. This is the standard datatype IRI for string literals in the linked data languages (i.e. Turtle, TriG, and SPARQL).
Notice that the same linked data could have been asserted with the following snippet, where the string value 'London' is automatically cast into a string literal:
fromJson([{ name: 'London' }]),
triple(iri(prefix.city, 'name'), skos.prefLabel, 'name'),
Example: Language-tagged string¶
The following snippet asserts a triple with a language-tagged string in the object position:
fromJson([{ name: 'London' }]),
addLiteral({
content: 'name',
language: language.en,
key: '_name',
}),
triple(iri(prefix.city, 'name'), skos.prefLabel, '_name'),
This results in the following linked data assertion:
city:London skos:prefLabel 'London'@en.
Notice that the same linked data could have been asserted with the following use of the literal() assertion middleware:
fromJson([{ name: 'London' }]),
triple(iri(prefix.city, 'name'), skos.prefLabel, literal('name', lang['en'])),
Example: Validate usage¶
The following snippet asserts a triple with a person's email address in the object position. The email address is validated, and an error is thrown when a record contains an invalid email address:
fromJson([{ name: "John", email: 'john.appleseed@example.com' }, {name: 'NA', email: 'notAnEmail' } ]),
addLiteral({
content: 'email',
validate: isEmail(),
key: '_email',
}),
triple(iri(prefix.person, 'name'), foaf.mbox, '_email'),
This results in the following error for the second record:
ERROR (Record #2) String "notAnEmail" is not an email address.
Notice that when using only correct email addresses:
fromJson([{ name: "John", email: 'john.appleseed@example.com' }, { name: "Lisa", email: 'lisa.appleseed@example.com' } ]),
addLiteral({
content: 'email',
validate: isEmail(),
key: '_email',
}),
triple(iri(prefix.person, 'name'), foaf.mbox, '_email'),
It results in the following linked data assertions:
person:John foaf:mbox "john.appleseed@example.com"
person:Lisa foaf:mbox "lisa.appleseed@example.com"
addSkolemIri()¶
Creates a globally unique IRI that is intended to be used as a blank node identifier.
Blank nodes are nodes without identification. It is relatively difficult to work with such nodes in graph data, since they cannot be identified or dereferenced online. For this reason TriplyETL uses Skolem IRIs to denote blank nodes. This allows blank nodes to be identified and dereferenced. This Skolemization approach is part of the RDF standard.
Skolem IRIs are random IRIs whose root path starts with .well-known/genid/
. This makes it easy to distinguish them from other random IRIs that are not used to denote blank nodes.
Parameters¶
- prefix: An IRI or a key that contains an IRI whose path starts with .well-known/genid/.
- key: A new key where the created IRI is stored.
See also¶
The Skolemization section in the RDF standard explains what Skolem IRIs are and how they should be used.
Example¶
The following snippet uses a hashed IRI to create a predictable identifier for a geospatial feature, and a Skolem IRI to create an unpredictable identifier for the geometry.
The snippet includes the prefix declarations to illustrate that the path of the Skolem IRI must start with .well-known/genid/.
const base = declarePrefix('https://example.com/')
const prefix = {
feature: declarePrefix(base('id/feature/')),
skolem: declarePrefix(base('.well-known/genid/')),
}
// Etc
fromJson([{ point: 'Point(1.1 2.2)' }]),
addHashedIri({
prefix: prefix.feature,
content: 'point',
key: '_feature',
}),
addSkolemIri({
prefix: prefix.skolem,
key: '_geometry',
}),
triple('_feature', geo.hasGeometry, '_geometry'),
triple('_geometry', geo.asWKT, literal('point', geo.wktLiteral)),
This results in the following linked data assertions:
feature:22238008e490f725979118f8f2dd9b5a geo:hasGeometry skolem:0cf4b63252a0476a8afc20735aa03da6.
skolem:0cf4b63252a0476a8afc20735aa03da6 geo:asWKT 'Point(1.1 2.2)'^^geo:wktLiteral.
Notice that the feature IRI will be the same across ETL runs if the source data stays the same, but the Skolem IRI will always be different.
Since the Skolem IRIs can be identified by the start of their path (.well-known/genid/), the same linked data assertions can be displayed as follows:
feature:22238008e490f725979118f8f2dd9b5a geo:hasGeometry
[ geo:asWKT 'Point(1.1 2.2)'^^geo:wktLiteral ].
addTag()¶
This middleware creates a language tag based on a given string value.
Parameters¶
- content: A string value that encodes a language tag according to the IANA language subtag registry.
- key: A new key where the language tag is stored.
Throws¶
An error is emitted if the given string value does not follow the language tag format, or denotes a language tag that is not currently registered.
See also¶
- The language tag format is defined in the IETF BCP 47 standard (RFC 5646).
- Language tags are registered in the IANA language subtag registry.
Example¶
The following snippet creates a language tag for the Dutch language as spoken in The Netherlands, and uses it to assert a language-tagged string:
fromJson([{ label: 'Amsterdam' }]),
addTag({
content: 'nl-nl',
key: 'lang',
}),
triple(iri(prefix.city, 'label'), rdfs.label, literal('label', 'lang')),
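Assuming a suitable declaration for the prefix.city IRI prefix, this should result in a linked data assertion along the following lines:
city:Amsterdam rdfs:label 'Amsterdam'@nl-nl.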
addValue()¶
This middleware allows any value to be added to the Record.
Description¶
This middleware is useful for data that is not present in the source data record, but must be used in one or more assertions.
Parameters¶
- content: Any value that can be represented in TypeScript.
- key: A new key where the value is stored.
Example¶
The following snippet starts out with an empty source record ({}), and adds a new data key to it. The added value is an array that contains a string and a number (in that order).
This new value is used in the triple assertion, where 'data[0]' extracts the string element and 'data[1]' extracts the number element.
fromJson([{}]),
addValue({
content: ['johndoe', 22],
key: 'data',
}),
triple(iri(prefix.person, 'data[0]'), foaf.age, 'data[1]'),
This results in the following linked data assertion:
person:johndoe foaf:age 22.
Example¶
The following snippet adds a key called _start that either contains the start date as specified in the data source record, or the value 'unknown':
fromJson([
{ id: '123', start: '2022-02-12' },
{ id: '456' },
]),
ifElse({
if: 'start',
then: addLiteral({
content: 'start',
datatype: xsd.date,
key: '_start',
}),
}, {
else: addValue({
content: 'unknown',
key: '_start',
}),
}),
triple(iri(prefix.event, 'id'), sdo.startDate, '_start'),
This results in the following linked data assertions:
event:123 sdo:startDate '2022-02-12'^^xsd:date.
event:456 sdo:startDate 'unknown'.
capitalize()¶
Transforms a string value to its capitalized variant.
If the first character of a string has an uppercase variant, then that variant is used. If the first character does not have an uppercase variant -- because the character is already uppercase or is a punctuation character -- then the string remains unchanged.
This transformation can uppercase the first character in any language; the Unicode Default Case Conversion algorithm is used.
Parameters¶
- content: A key that contains a string value.
- key: A new key where the capitalized result is stored.
Example: Class IRIs¶
According to convention, classes in linked data are denoted by IRIs whose local name starts with a capital letter.
The following source data contains nice values for the type key, but they do not start with a capital letter yet.
The following snippet capitalizes the values of the type key, and uses them to create class IRIs.
fromJson([
{ id: '1', type: 'location' },
{ id: '2', type: 'person' },
]),
capitalize({
content: 'type',
key: '_type',
}),
triple(iri(prefix.id, 'id'), a, iri(prefix.def, '_type')),
This results in the following linked data assertions:
id:1 a def:Location.
id:2 a def:Person.
concat()¶
Description¶
Concatenates an array of strings into one new string.
An optionally specified separator is placed in between every two consecutive string values.
Parameters¶
- content: An array of keys that contain strings, and/or strings specified with assertion str().
- separator: Optionally, the string that is placed between every two consecutive string values.
- key: A new key where the concatenated string is stored.
Example¶
The following snippet concatenates the first and last name of a person (in that order), using a space separator.
fromJson([{ id: '1', first: 'John', last: 'Doe' }]),
concat({
content: ['first', 'last'],
separator: ' ',
key: '_name',
}),
triple(iri(prefix.person, 'id'), foaf.name, '_name'),
This results in the following linked data assertion:
person:1 foaf:name 'John Doe'.
copy()¶
Makes a plain copy from the value stored in the given key to a new key.
Parameters¶
- content: A value of any type, or a key that contains a value of any type.
- type: Optionally, the name of the TypeScript type of the value. The default value is 'string'.
- key: A new key where the plain copy is stored.
Example¶
Plain copies can be used to abbreviate long keys, especially in tree-shaped data like JSON or XML. In the following example, values stored in a long nested key are copied into a short and descriptive key. This is even more useful if the key is used many times in the script.
copy({
content: 'record[0].family[0].children.child[0].id.$text',
key: 'childId',
}),
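A follow-up assertion could then use the short key instead of the long nested one. The following is only a sketch; prefix.child and the asserted class are illustrative assumptions:
// Hypothetical use of the abbreviated key in an assertion.
triple(iri(prefix.child, 'childId'), a, sdo.Person),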
Example¶
Since plain copies introduce a new name for an existing value, the new name can be used to store extra information about the value. The following example stores an English name, if available; or a Dutch name, if available; or no name at all. This is a relatively complex example that can only be accomplished by copying the names for the encountered languages under descriptive key names.
fromJson([
{
id: '1',
names: [
{ language: 'en', value: 'London' },
{ language: 'fr', value: 'Londres' },
{ language: 'nl', value: 'Londen' },
],
},
{
id: '2',
names: [
{ language: 'fr', value: 'Paris' },
{ language: 'nl', value: 'Parijs' },
],
},
]),
forEach('names', [
_switch('language',
// Plain copy of the English label, if available.
['en', copy({ content: 'value', key: '$parent.en' })],
// Plain copy of the Dutch label, if available.
['nl', copy({ content: 'value', key: '$parent.nl' })],
),
]),
ifElse({
// Prefer an English label over a Dutch label.
if: 'en',
then: copy({ content: 'en', key: '_preferredName' }),
}, {
// If there is no English label, a Dutch label is a second-best option.
if: 'nl',
then: copy({ content: 'nl', key: '_preferredName' }),
}),
// If there is either an English or a Dutch label, assert it.
when('_preferredName', [
triple(iri(prefix.city, 'id'), rdfs.label, '_preferredName'),
]),
This results in the following linked data assertions:
city:1 rdfs:label 'London'.
city:2 rdfs:label 'Parijs'.
decodeHtml()¶
Description¶
This transformation decodes any HTML entities that appear in a given key.
The following HTML entities are common in source data:
HTML entity | Decoded |
---|---|
&amp; | & |
&gt; | > |
&lt; | < |
You do not need to use this transformation if you want to assert literals with datatype IRI rdf:HTML. HTML entities are meaningful in HTML, so there they should be preserved.
Parameters¶
- content: A key in the Record that contains string values with HTML entities.
- key: A new key where the decoded content is stored.
Example¶
The following snippet takes HTML texts from the source data and asserts them as regular text literals. Since HTML entities are meaningless in regular text, decodeHtml() is used to decode these entities.
fromJson([
{ id: '1', label: 'A&amp;B' },
{ id: '2', label: '1 &lt; 2' },
]),
decodeHtml({
content: 'label',
key: '_label',
}),
triple(iri(prefix.id, 'id'), rdfs.label, '_label'),
This results in the following linked data assertions:
id:1 rdfs:label 'A&B'.
id:2 rdfs:label '1 < 2'.
geojsonToWkt()¶
Transforms GeoJSON objects to their corresponding Well-Known Text (WKT) serialization strings.
Parameters¶
content
A key that stores a GeoJSON object.crs
Optionally, an IRI that denotes a Coordinate Reference System (CRS). You can use IRIs from theepsg
object. If absent, uses https://epsg.io/4326 as the CRS.key
A new key where the WKT serialization string is stored
GeoJSON and Well-Known Text (WKT)¶
According to the GeoJSON standard, the only Coordinate Reference System (CRS) that is allowed to be used is EPSG:4326/WGS84. In practice, source data sometimes (incorrectly) stores GeoJSON formatted data in other CRSes. An example of this is the GISCO dataset of the European Union, which uses the EPSG:3857 CRS. For cases like these, the optional crs parameter comes in handy.
See also¶
- The GeoJSON format is standardized in RFC 7946.
- The Well-Known Text (WKT) serialization format is standardized as part of ISO/IEC 13249-3:2016 standard.
Example¶
The following snippet converts GeoJSON objects that denote traffic light locations to their GeoSPARQL representation.
fromJson([
{
id: '123',
geometry: {
type: 'Point',
coordinates: [6.256, 48.480],
},
},
]),
addIri({
prefix: prefix.feature,
content: 'id',
key: '_feature',
}),
geojsonToWkt({
content: 'geometry',
crs: epsg[3857],
key: '_wkt',
}),
addHashedIri({
prefix: prefix.geometry,
content: '_wkt',
key: '_geometry'
}),
pairs('_feature',
[a, def.TrafficLight],
[geo.hasGeometry, '_geometry'],
),
pairs('_geometry',
[a, geo.Geometry],
[geo.asWKT, literal('_wkt', geo.wktLiteral)],
),
This results in the following linked data assertions:
feature:123
a def:TrafficLight;
geo:hasGeometry geometry:197e6376c2bd8192c24911f88c330606.
geometry:197e6376c2bd8192c24911f88c330606
a geo:Geometry;
geo:asWKT 'Point(6.256 48.480)'^^geo:wktLiteral.
jpath()¶
Description¶
Filters a value based on a JSON Path expression. JSON Path is a query language for JSON. For the syntax of JSON Path expressions, please visit the JSON Path documentation page.
Use cases¶
This function simplifies complex key specifications that filter specific values. It can only be used in the object position of a triple, to create a literal. The result of the function must have a fundamental type.
Parameters¶
- value: A JSON Path expression.
Example¶
The following examples will create a literal based on key value:
If key 'ISO_639-2' exists:
fromJson({
language: [
{ 'ISO_639-1': 'en', lcid: 2057, value: 'Paris' },
{ 'ISO_639-1': 'nl', 'ISO_639-2': 'nld', lcid: 1043, value: 'Parijs' },
],
}),
triple(
iri(prefix.city, '$recordId'),
rdfs.label,
literal(jpath("$.language[?(@['ISO_639-2'])].value"), language.nl)
),
If key 'ISO_639-1' is equal to 'nl':
fromJson({
language: [
{ 'ISO_639-1': 'en', lcid: 2057, value: 'Paris' },
{ 'ISO_639-1': 'nl', 'ISO_639-2': 'nld', lcid: 1043, value: 'Parijs' },
],
}),
triple(
iri(prefix.city, '$recordId'),
rdfs.label,
literal(jpath("$.language[?(@['ISO_639-1'] =='nl')].value"), language.nl)
),
If key 'lcid' is lower than 1,100:
fromJson({
language: [
{ 'ISO_639-1': 'en', lcid: 2057, value: 'Paris' },
{ 'ISO_639-1': 'nl', 'ISO_639-2': 'nld', lcid: 1043, value: 'Parijs' },
],
}),
triple(
iri(prefix.city, '$recordId'),
rdfs.label,
literal(jpath('$.language[?(@.lcid < 1100)].value'), language.nl)
),
All three examples generate the following linked data:
city:1 rdfs:label 'Parijs'@nl.
lowercase()¶
Description¶
Translates a string value to its lowercase variant.
This middleware can lowercase strings in any language; the Unicode Default Case Conversion algorithm is used.
Use cases¶
Older data formats sometimes use uppercase letters for header names or codes. The lowercase transformation middleware may be used to change such string values to lowercase.
Parameters¶
- content: A key that contains a string value.
- key: A new key where the lowercase variant is stored.
Example¶
The following snippet starts out with header values that use uppercase characters exclusively. The lowercase() transformation is used to create lowercase names that can be used to create property IRIs.
fromJson([
{ from: '1', rel: 'PARENT', to: '2' },
{ from: '2', rel: 'CHILD', to: '1' },
]),
lowercase({
content: 'rel',
key: '_relationship',
}),
triple(
iri(prefix.id, 'from'),
iri(prefix.def, '_relationship'),
iri(prefix.id, 'to'),
),
This results in the following linked data assertions:
id:1 def:parent id:2.
id:2 def:child id:1.
padEnd()¶
Description¶
Adds a given padding string zero or more times to the end of a string value, until the resulting string value is exactly a given number of characters long.
Use cases¶
This transformation is useful for identifiers that must have a fixed length and that may be suffixed with zeros.
Parameters¶
- content: A key that contains a string value. If the key contains a numeric value, that value is first cast to string.
- padString: The string that is added to the end of the string value in key content, until the resulting string has exactly targetLength characters. Can be a static string or a key.
- targetLength: The exact number of characters that the resulting string should have. The string value is copied over as-is when targetLength is smaller than or equal to the length of the string value in key content. This includes cases where targetLength is negative or zero.
- key: A new key where the padded string is stored.
Example¶
The following snippet processes identifiers of varying length, and ensures that they have the same length after suffixing '0' characters.
fromJson([
{ id: '16784' },
{ id: '129' },
]),
padEnd({
content: 'id',
padString: '0',
targetLength: 6,
key: '_id',
}),
This results in the following two Records:
[
{
"id": "16784",
"_id": "167840"
},
{
"id": "129",
"_id": "129000"
}
]
padStart()¶
Description¶
Adds a given padding string zero or more times in front of a string value, until the resulting string value is exactly a given number of characters long.
Use cases¶
This transformation is useful for identifiers that must have a fixed length and that may be prefixed with zeros.
If key content contains a numeric value, then that value is first cast to string.
Parameters¶
- content: A key that contains a string value.
- padString: The string that is added in front of the string value in key content, until the resulting string has exactly targetLength characters.
- targetLength: The exact number of characters that the resulting string should have. The string value is copied over as-is when targetLength is smaller than or equal to the length of the string value in key content. This includes cases where targetLength is negative or zero.
- key: A new key where the padded string is stored.
Example: Fixed-length identifiers¶
The following snippet processes identifiers of varying length, and ensures that they have the same length after prefixing '0' characters.
fromJson([
{ id: '16784' },
{ id: '129' },
]),
padStart({
content: 'id',
padString: '0',
targetLength: 6,
key: '_id',
}),
This results in the following two records:
[
{
"id": "16784",
"_id": "016784"
},
{
"id": "129",
"_id": "000129"
}
]
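Because numeric values in the content key are first cast to string, numeric identifiers work as well. A minimal sketch, with a hypothetical record:
fromJson([{ id: 42 }]),
padStart({
  content: 'id',
  padString: '0',
  targetLength: 6,
  key: '_id',
}),
// The resulting Record should contain "_id": "000042".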
Example: Create year literals¶
In order to create standards-conforming temporal literals, we need to pad the year component to be at least 4 decimal digits long. (This requirement is defined in the XML Schema 1.1 Part 2: Datatypes standard.)
Suppose that the source data looks as follows:
Artifact | Year |
---|---|
0001 | 612 |
0002 | 1702 |
We can ensure that all years have at least 4 decimal digits by calling the following function:
padStart({
content: 'Year',
padString: '0',
targetLength: 4,
key: '_lexicalForm',
}),
triple(
iri(prefix.id, 'Artifact'),
dct.created,
literal('_lexicalForm', xsd.gYear),
),
This makes the following linked data assertions:
id:0001 dct:created '0612'^^xsd:gYear.
id:0002 dct:created '1702'^^xsd:gYear.
replace()¶
Description¶
Performs a regular expression replacement to the given input string, and stores the result in a new key.
Parameters¶
- content: A key that contains a string value, or a static string specified with assertion str().
- from: A JavaScript Regular Expression.
- to: Optionally, a string that replaces potential matches of the Regular Expression (from). Use $1, $2, etc. to insert matches. If absent, the empty string is used.
- key: A new key where the result of the replacement is stored.
Example¶
Suppose the source data contains date/time strings, but only the date component is needed:
{
"created": "2020-01-02T00:00:00.0Z"
}
It is possible to extract only the date part (everything up to the T) in the following way:
replace({
content: 'created',
from: /^([^T]*).*$/,
to: '$1',
key: '_created',
}),
triple('_creativeWork', dct.created, literal('_created', xsd.date)),
This results in the following Record:
{
"created": "2020-01-02T00:00:00.0Z",
"_created": "2020-01-02"
}
split()¶
Description¶
Splits a string into an array of strings, and stores that array in a new key.
Whitespace handling¶
This transformation removes any trailing whitespace that remains after the strings are split. This ensures that irregular use of whitespace in the source data is taken care of automatically.
Empty values¶
This transformation removes any elements of the split string that are empty (after trimming). To keep empty entries, use the keepEmptyEntities flag.
Use cases¶
The transformation is used when:
- Tabular source data encodes multiple values inside singular cells. (Such concatenated storage inside cells is a data quality issue, because the table format cannot guarantee that the separator character does not (accidentally) occur inside individual values inside a cell. Tree-shaped source formats are able to store multiple values for the same key reliably, e.g. JSON and XML.)
- Source data contains complex string values that can be decomposed into stand-alone components with distinct meaning.
Parameters¶
- content: A key that stores a string, or a string specified with assertion str().
- separator: A string or a regular expression that is used to separate the content.
- key: A new key where the array of split strings is stored.
- keepEmptyEntities: A boolean flag indicating whether the empty values of a split string should be kept. By default, empty values are removed.
Example: Multiple values in singular table cells¶
Tabular formats are unable to store more than one value in a cell. Because of this limitation, tabular data sources sometimes encode multiple values in cells by encoding them into one string. A separator character is typically used to distinguish between these multiple values.
Suppose that the source data looks as follows:
Parent | Child |
---|---|
John | Jane, Jake , |
Kate | ,, |
The following transformation splits the cells that encode zero or more children for each parent:
split({
content: 'Child',
separator: ',',
key: 'Children',
}),
This results in the following transformed records:
{
"Parent": "John",
"Child": "Jane, Jake , ",
"Children": [ "Jane", "Jake" ]
}
{
"Parent": "Kate",
"Child": ",, ",
"Children": []
}
Notice that trailing whitespace and empty values are dealt with automatically.
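If the empty entries should be preserved, the keepEmptyEntities flag (see the Parameters above) can be set. A minimal sketch of this variant:
split({
  content: 'Child',
  separator: ',',
  // Keep empty values instead of dropping them.
  keepEmptyEntities: true,
  key: 'Children',
}),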
Since the split() transformation always results in an array of strings, we can use the term assertion iris() afterwards:
split({
content: 'children',
separator: ',',
key: '_children',
}),
triple(
iri(prefix.person, 'parent'),
sdo.children,
iris(prefix.person, '_children')
),
This results in the following linked data assertions:
person:johndoe
sdo:children
person:janedoe,
person:jakedoe.
Example: Split a complex string into components¶
The following snippet uses a regular expression to split a KIX code. (A KIX code is a standardized format for representing postal addresses in The Netherlands.)
fromJson([{ id: '1', KIX: '1231FZ13Xhs' }]),
split({
content: 'KIX',
separator: /^(\d{4}[A-Z]{2})(\d{1,5})(?:X(.{1,6}))/,
key: 'KIX_components',
}),
triple(iri(prefix.id, 'id'), sdo.postalCode, 'KIX_components[1]'),
This results in the following record:
{
"id": "1",
"KIX": "1231FZ13Xhs",
"KIX_components": ["", "1231FZ", "13", "hs", ""]
}
And in the following linked data assertion:
id:1 sdo:postalCode '1231FZ'.
substring()¶
Description¶
This middleware takes a substring from the input string and stores the result in a new key.
Parameters¶
- content: A key that stores a string value, or a string specified with assertion str().
- start: The index of the first character that is included in the substring. The first character has index 0.
- end: Optionally, the index of the first character that is excluded from the substring. If absent, the substring ends at the end of the source string.
- key: A new key where the substring is stored.
Example¶
The Library of Congress MARC format stores the type of record in position 6 of the leader. We use substring() to extract this character, and then use transformation translateAll() to map it to a corresponding class IRI:
substring({
content: 'metadata.marc:record.marc:leader.$text',
start: 6,
end: 7,
key: '_typeOfRecord',
}),
translateAll({
content: '_typeOfRecord',
table: {
a: dcm.Text,
k: dcm.StillImage,
},
key: '_class',
}),
triple('_iri', a, '_class'),
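For illustration, suppose the source record contains a MARC leader such as the following (hypothetical values; the '_iri' key is assumed to have been created earlier in the script). The substring step then stores 'a' in '_typeOfRecord', and translateAll() maps it to dcm:Text:
{
  "metadata": {
    "marc:record": {
      "marc:leader": { "$text": "01142cam a2200301 a 4500" }
    }
  }
}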
translateAll()¶
Description¶
Translates all dynamic strings from a specific key to new values of an arbitrary type T, according to a specified translation table.
Since this function translates all values, the mapped values can have any type T
; they do not need to be strings. For example, this allows strings to be translated to IRIs or to literals.
When to use?¶
This approach is used when:
- The set of source data values is small.
- The set of source data values is known ahead of time.
- The corresponding linked data terms are known ahead of time.
- The appearance of a new value is considered to be an error in the source data.
Parameters¶
- content: A key that contains a string value.
- table: A translation table from strings to values of some arbitrary type T.
- nulls: Optionally, a list of string values that are considered to denote NULL values in the source data. When a NULL value is encountered, the special value undefined is added for the target key.
- default: Optionally, a default value or a default value-determining function that is used for string values that are neither in the translation table (table) nor in the NULL values list (nulls). The function must return a value of type T. Use of a default value value is equivalent to using the following value-determining function: _ => value.
- key: A new key where the results of the translation are stored.
Example: Map source data to IRI values¶
Suppose that source data contains country names. In linked data we want to use IRIs to denote countries, so that we can link additional information. Since the list of countries that appears in the source data is not that long, we can specify a translation table from names to IRIs by hand:
change.translateAll({
content: 'country',
table: {
'Belgium': country.be,
'Germany': country.de,
'England': country.gb,
...,
},
nulls: ['Unknown'],
key: '_country',
}),
when('country', [
triple('_country', a, sdo.Country),
]),
Example: Map source data to IRI properties¶
When we relate a creative work to its creator, we sometimes know whether the creator was the actor, architect, author, etc. of the creative work. But in other cases we only know that there is a generic creator relationship. The Library of Congress Relators vocabulary allows us to express specific and generic predicates of this kind.
transform.translateAll({
table: {
'actor': rel.act,
'architect': rel.arc,
'author': rel.aut,
...,
},
default: rel.oth, // generic relator
key: '_relator',
}),
triple('_creativeWork', '_relator', '_creator'),
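The default parameter also accepts a value-determining function. A minimal sketch based on the country example above, where country.unknown is a hypothetical fallback IRI:
translateAll({
  content: 'country',
  table: {
    'Belgium': country.be,
    'Germany': country.de,
  },
  // Hypothetical fallback for country names that appear in neither table nor nulls.
  default: _ => country.unknown,
  key: '_country',
}),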
translateSome()¶
Description¶
Translates some strings, according to the specified translation table, to other strings. Strings that are not translated according to the translation table are copied over as-is.
Parameters¶
- content: A key that contains a string value.
- table: A translation table that specifies translations from strings to strings.
- key: A new key where the translated strings are stored.
Use cases¶
Source data often contains some strings that are correct and some that are incorrect. For example, if source data contains a key with city names, some of the names may be misspelled. In such cases, translateSome() can be used to translate the incorrect strings into correct ones.
A translateSome() transformation is often performed directly before a translateAll() transformation. The former ensures that all string values are correct (e.g. fixing typos in city names); the latter ensures that all strings are mapped onto IRIs (e.g. city names mapped onto city-denoting IRIs).
Example¶
The following example fixes an encoding issue that occurs in the source data:
transform.translateSome({
content: 'name',
table: {
'Frysl��n': 'Fryslân',
// Other entries for typographic fixes go here.
...,
},
key: '_name',
}),
tryLiteral()¶
Description¶
This transformation is used when string values must be mapped onto literals with varying datatype IRIs.
The datatype IRIs that could apply are specified in a list. The specified datatype IRIs are tried out from left to right. The first datatype IRI that results in a valid literal is chosen.
Parameters¶
- content: A key that contains a string value, or a string value specified with assertion str().
- datatypes: An array of two or more datatype IRIs.
- key: A new key where the created literal is stored.
Throws¶
An exception is emitted if a string value does not belong to the lexical space of any of the specified datatype IRIs.
Example¶
A literal is valid if the given string value appears in the lexical space of a specific datatype IRI. This is best explained with an example:
tryLiteral({
content: 'date',
datatypes: [xsd.date, xsd.gYearMonth, xsd.gYear],
key: '_publicationDate',
}),
Source data in key 'date' | Result in key '_publicationDate' |
---|---|
'1900-01-02' | '1900-01-02'^^xsd:date |
'1900' | '1900'^^xsd:gYear |
'02-01-1900' | An error is emitted. |
If we do not want to emit errors for string values that cannot satisfy any of the specified datatype IRIs, we may choose to include xsd.string as the last datatype IRI in the list. Do notice however that this will result in dates that cannot be compared on a timeline, since they were not transformed to an XSD date/time datatype.
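A minimal sketch of this fallback approach, based on the example above:
tryLiteral({
  content: 'date',
  datatypes: [xsd.date, xsd.gYearMonth, xsd.gYear, xsd.string],
  key: '_publicationDate',
}),
// '02-01-1900' should now result in '02-01-1900'^^xsd:string instead of an error.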
See also¶
You only need to use tryLiteral() if the datatype IRI varies from record to record. If the datatype IRI is the same for every record, then the regular assertion function literal() should be used instead.
uppercase()¶
Description¶
Translates a string value to its uppercase variant.
This middleware can uppercase strings in any language; the Unicode Default Case Conversion algorithm is used for this.
Parameters¶
- content: A key that contains a string value.
- key: A new key where the uppercase variant is stored.
Example¶
In the following example, the string in the key 'countryCode' is changed to its uppercase variant:
fromJson({ place: 'Amsterdam', countryCode: 'nl' }),
uppercase({
content: 'countryCode',
key: '_countryCode'
}),
triple(iri(prefix.id, 'place'), iri(prefix.geonames, str('countryCode')), '_countryCode')
This results in the following linked data assertion:
id:Amsterdam geonames:countryCode "NL"
wkt.addPoint()¶
Description¶
Creates a Well-Known Text (WKT) serialization string from the corresponding geospatial point.
Parameters¶
- latitude: A key or a string assertion (str()) with the latitude.
- longitude: A key or a string assertion (str()) with the longitude.
- crs: Optionally, an IRI that denotes a Coordinate Reference System (CRS). You can use IRIs from the epsg object. If absent, EPSG:4326/WGS84 is used as the CRS.
- key: A new key where the WKT string is stored.
Example¶
The following example creates a WKT literal from the geo coordinates of Amsterdam:
fromJson({ place: 'Amsterdam', lat: 52.37308, long: 4.89245 }),
wkt.addPoint({
latitude: 'lat',
longitude: 'long',
key: '_point'
}),
triple(iri(prefix.city, 'place'), geo.asWKT, '_point'),
This results in the following record of the key '_point':
{
"_point": {
"termType": "Literal",
"value": "Point (52.37308 4.89245)",
"language": "",
"datatype": {
"termType": "NamedNode",
"value": "http://www.opengis.net/ont/geosparql#wktLiteral",
"validationStatus": "canonical"
},
"validationStatus": "canonical"
}
}
And in the following linked data assertion:
city:Amsterdam geo:asWKT "Point (52.37308 4.89245)"^^geo:wktLiteral
wkt.project()¶
Description¶
Converts the projection of a Well-Known Text (WKT) literal from one Coordinate Reference System to another one.
Parameters¶
- content: An array of keys or numbers.
- key: A new key where the new projection is stored.
- fromCrs: An IRI that denotes the Coordinate Reference System (CRS) of the content.
- toCrs: Optionally, an IRI that denotes the Coordinate Reference System (CRS) to convert to. If absent, EPSG:4326/WGS84 is used as the CRS.
Example¶
The following example converts an array with latitude and longitude in the content key from Dutch grid coordinates (Rijksdriehoeks-coordinates) to WGS84 coordinates.
fromJson({ place: 'Amsterdam', lat: 121307, long: 487360 }),
wkt.project({
content: ['lat', 'long'],
key: '_coordinates',
fromCrs: epsg[28992],
toCrs: epsg[4326]
}),
This results in the following record of the key '_coordinates':
{
"_coordinates": [
52.374671935135474,
4.892803721020475
]
}
We can now use the converted result to create a WKT Point() using wkt.addPoint():
wkt.addPoint({
latitude: '_coordinates[0]',
longitude: '_coordinates[1]',
key: '_point'
}),
triple(iri(prefix.id, 'place'), geo.asWKT, '_point')
This code snippet creates the following linked data assertion:
id:Amsterdam geo:asWKT "Point (52.374671935135474 4.892803721020475)"^^geo:wktLiteral