“Oh crap, I am screwed … ChatGPT basically did what I have been working on for years effortlessly …”

This is exactly what I had in mind in 2022, while I was doing my PhD on Knowledge Graph Question Answering systems, and I believe this was a shared feeling for many researchers and developers at the time. The Semantic Web community (and people who work with knowledge representation, information extraction, question answering systems, etc.) is one of the fields that took the heaviest impact.

After a brief period of shock and panic, I wrapped up my PhD project and started trying to answer this question: “Where does my expertise fit in this new world with Large Language Models?”

This blog aims to record some of my thoughts on this question, and to demonstrate a case where I have (sort of) made some progress addressing it.

What is Semantic Technology and why is/was it important

Semantic Technology is certainly important, with the following major features:

  1. A formal representation of complex data, over which an expressive query language (e.g., SPARQL) is enabled.
  2. Unambiguous representation of entities, concepts, and links, where things are identified uniquely with Uniform Resource Identifiers (URIs).
  3. With Linked Data principles implemented, data silos that were previously isolated can be linked.

Those features, in theory, enable the following potential applications:

1. Rigorous and interconnected access/manipulation of complex data:

For example, say I have a semantic database (i.e., a Knowledge Graph, KG) for people in my neighbourhood, where their jobs, ages, house prices, etc. are properly described and interconnected. I can then write a SPARQL query that answers this question:

Find the job descriptions of my neighbours, ranked from high to low according to their house prices.

PREFIX ex:   <http://example.org/>
PREFIX schema: <http://schema.org/>

SELECT ?neighbor ?jobDesc ?house ?price
WHERE {
  # each neighbour, their job title/description, and the house they live in
  ?neighbor a ex:Neighbor ;
            schema:jobTitle|schema:description ?jobDesc ;
            ex:livesIn ?house .
  # the market price of that house
  ?house a ex:House ;
         ex:marketPrice ?price .
}
ORDER BY DESC(?price)   # highest house price first

And the actual queries can be far more complex. Before the age of LLMs, this feature was very tempting for researchers and engineers: if you want to “answer” a complex natural language question via a query, all you need to do is “translate” the question into a query, which enables a Question Answering application. We call this field “Knowledge Graph Question Answering”.

2. Precise reference to entities:

This is an important feature for scientific and engineering fields, where “semantic precision” matters a lot. A typical example we use is “Cambridge”: there are many cities named “Cambridge”. In natural language, we disambiguate via context, e.g., we say “You know, the one that looks like it’s in Harry Potter.”

In the Semantic world, identifiers are the primary reference to an entity, instead of a label (e.g., “Cambridge”). In Wikidata, there is a list of “Cambridge” entries:

https://www.wikidata.org/wiki/Q350 (“the one that looks like it’s in Harry Potter”)

https://www.wikidata.org/wiki/Q49111 (“the one where MIT is”)

https://www.wikidata.org/wiki/Q1028279 (“the one in Canada”)

With those identifiers, the ambiguity of language is more or less addressed. This matters especially when building a large-scale database, where you want references to be precise without having to interpret context.
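To make this concrete, here is a minimal query against the Wikidata Query Service that refers to the identifier wd:Q350 directly, so there is no ambiguity about which “Cambridge” is meant (I am using the population property P1082 purely as an illustration; the exact property is incidental):

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# wd:Q350 is Cambridge (UK); no context or disambiguation needed
SELECT ?population
WHERE {
  wd:Q350 wdt:P1082 ?population .   # P1082 = population
}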

3. Cross-domain/Multi-level common data grounds:

This is, in my mind, the “killer application” of Semantic Technology.

Let’s extend our neighbourhood KG example a little bit: assume we have included information on people, households, electricity, power networks, transportation, weather, everything, and they are all connected to each other.

Then I can “ask” very complex questions that rely on retrieving and integrating information across multiple domains, e.g.,

“Which households in my neighbourhood are most vulnerable to power outages during heatwaves, considering residents’ age profiles, medical equipment usage, building insulation ratings, and the topology of the local power grid?”

With such a powerful interconnected database, the questions you can “ask” are nearly unlimited. Of course, the database itself only offers a nice way of accessing and integrating the data (by the time you build a KG at this level, your data is already integrated).
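To give a flavour of what such a cross-domain query could look like, here is a sketch over the hypothetical neighbourhood KG. All the properties here (ex:insulationRating, ex:suppliedBy, ex:usesMedicalEquipment, ex:redundancyLevel, etc.) are made up for illustration; a real ontology would define them properly:

PREFIX ex: <http://example.org/neighbourhood#>

SELECT DISTINCT ?household ?insulation ?feederRedundancy
WHERE {
  # household attributes, the feeder that supplies it, and its residents
  ?household a ex:Household ;
             ex:insulationRating ?insulation ;
             ex:suppliedBy       ?feeder ;
             ex:hasResident      ?resident .
  ?feeder   ex:redundancyLevel  ?feederRedundancy .
  ?resident ex:age ?age .
  OPTIONAL { ?resident ex:usesMedicalEquipment ?equipment . }
  # keep households with an elderly resident or medical equipment in use
  FILTER(?age >= 75 || BOUND(?equipment))
}
ORDER BY ASC(?insulation) ASC(?feederRedundancy)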

What are the problems of Semantic Technology

These were all great, but from my personal experience, there are some major burdens in building up a knowledge graph.

The upfront and overhead cost is really high:

A “neighbourhood knowledge graph” where data from multiple domains is integrated sounds really great: it could support countless potential applications on top of a single common data ground. However, it is painful to build and maintain one.

A typical first step of building a Knowledge Graph is schema design, where we create abstract definitions of how things should be represented, called an “ontology”.

@prefix ex:   <http://example.org/neighbourhood#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Neighbour a owl:Class .
ex:House     a owl:Class .
ex:livesIn   a owl:ObjectProperty ; rdfs:domain ex:Neighbour ; rdfs:range ex:House .
ex:price     a owl:DatatypeProperty ; rdfs:domain ex:House ; rdfs:range xsd:decimal .

# constraint: every Neighbour lives in exactly one House
ex:Neighbour rdfs:subClassOf [
  a owl:Restriction ; owl:onProperty ex:livesIn ;
  owl:cardinality "1"^^xsd:nonNegativeInteger
] .

This is just the tip of the iceberg: for scientific or engineering domains, the ontologies can contain hundreds of concepts, relations, and logical constraints on how they connect. This is surely painful; the core problem is that it requires both domain knowledge (e.g., building, energy, chemistry, etc.) and an understanding of Semantic Technologies. Just by looking at this example, you can already tell this is a tricky job.
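To give a flavour of those constraints, here are a couple of additional axioms of the kind real ontologies accumulate by the hundreds (both are made up for this example):

@prefix ex:   <http://example.org/neighbourhood#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# nothing can be both a Neighbour and a House
ex:Neighbour owl:disjointWith ex:House .

# adjacency between houses is symmetric: if A is adjacent to B, B is adjacent to A
ex:adjacentTo a owl:ObjectProperty , owl:SymmetricProperty ;
              rdfs:domain ex:House ;
              rdfs:range  ex:House .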

Then, once you have a schema, the natural second step is instantiation, where you put actual data in. The “actual data” is often something domain-specific and “friendly”, like JSON, CSV, or a relational database. It is clear, simple, straightforward, and can be easily edited and accessed. However, now you need to stuff this nice data into some monstrously complex data schema, and make sure that data from different domains are connected. For example, you have dataset A for buildings and dataset B for roads; they are isolated in the first place, but in a Knowledge Graph you need to “link” them, meaning you need to build relations between roads and the buildings located along them.
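For example, one row of a hypothetical neighbours CSV (columns: name, house_id, market_price) would end up as instance data like the following, including a made-up ex:locatedOn link to a road entity from the other dataset:

@prefix ex:  <http://example.org/neighbourhood#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# instance data produced from one CSV row: "Alice", house_42, 350000
ex:alice    a ex:Neighbour ;
            ex:livesIn ex:house_42 .
ex:house_42 a ex:House ;
            ex:price "350000"^^xsd:decimal ;
            ex:locatedOn ex:road_7 .      # link to an entity from the road dataset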

Often you encounter the following issues:

• Dataset vs. ontology mismatch. Your CSV/JSON tables weren’t made for the ontology. Column names don’t line up, values are in different formats, units are missing or inconsistent, and the ontology may require fields your dataset simply doesn’t have.
• Ontologies don’t “snap together”. The building ontology and the road ontology might both talk about “location” or “address”, but in different ways. So linking them isn’t automatic: you first have to decide what “the same thing” means across the two models.
• Mapping is tedious and ongoing. You end up writing lots of glue rules: “this column maps to that property”, plus special cases and cleanup. Then when either the dataset or the ontology changes, you revisit and fix those rules again.
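To illustrate the last point, a single glue rule in R2RML (the W3C relational-to-RDF mapping language) already looks like this; the table and column names are hypothetical, and a real mapping file contains many such rules:

@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix ex:  <http://example.org/neighbourhood#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# map each row of the "houses" table to an ex:House with its price
<#HouseMapping> a rr:TriplesMap ;
  rr:logicalTable [ rr:tableName "houses" ] ;
  rr:subjectMap [
    rr:template "http://example.org/neighbourhood#house_{house_id}" ;
    rr:class ex:House
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:price ;
    rr:objectMap [ rr:column "market_price" ; rr:datatype xsd:decimal ]
  ] .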

As a result, a Knowledge Graph does “promise” you a bright future, but the upfront and ongoing cost of getting there is very high.

Why I doubt whether Semantic Technology fits this era

What research I think is valuable in this era

What solid examples we are working on at the moment

