“Oh crap, I am screwed … ChatGPT basically did what I have been working on for years effortlessly …”

This is exactly what I had in mind in 2022, while I was doing my PhD on Knowledge Graph Question Answering systems, and I believe this was a shared feeling for many researchers and developers at the time. The Semantic Web community (and people who work with knowledge representation, information extraction, question answering systems, etc.) is one of the fields that took the heaviest impact.

After a brief period of shock and panic, I wrapped up my PhD project and started trying to answer this question: “Where does my expertise fit in this new world with Large Language Models?”

This blog aims to record some of my thoughts on this question, and to demonstrate a case where I have (sort of) made some progress addressing it.

What is Semantic Technology and why is/was it important

Semantic Technology is certainly important, with the following major features:

  1. A formal representation of complex data, over which an expressive query language (e.g., SPARQL) is enabled.
  2. Unambiguous representation of entities, concepts, and links, where things are identified uniquely with Uniform Resource Identifiers (URIs).
  3. With Linked Data principles implemented, data silos that were previously isolated can be linked.

Those features, in theory, enable the following potential applications:

1. Rigorous and interconnected access/manipulation of complex data:

For example, say I have a semantic database (i.e., a Knowledge Graph, KG) for people in my neighbourhood, where their jobs, ages, house prices, etc. are properly described and interconnected. I can then write a SPARQL query that answers this question:

Find the job descriptions of my neighbours, ranked from high to low according to their house prices.

PREFIX ex:   <http://example.org/>
PREFIX schema: <http://schema.org/>

SELECT ?neighbor ?jobDesc ?house ?price
WHERE {
  # each neighbour, their job title/description, and the house they live in
  ?neighbor a ex:Neighbor ;
            schema:jobTitle|schema:description ?jobDesc ;
            ex:livesIn ?house .
  # the market price of that house
  ?house a ex:House ;
         ex:marketPrice ?price .
}
ORDER BY DESC(?price)   # highest house price first

And the actual queries can be far more complex. Before the age of LLMs, this feature was very tempting for researchers and engineers: if you want to “answer” a complex natural language question via a query, all you need to do is “translate” the question into a query, which enables a Question Answering application. We call this field “Knowledge Graph Question Answering”.

2. Precise reference to entities:

This is an important feature for scientific and engineering fields, where “semantic precision” matters a lot. A typical example we use is “Cambridge”: there are many cities named “Cambridge”. In natural language, we disambiguate via context, e.g., we say “You know, the one that looks like it’s in Harry Potter.”

In the Semantic world, identifiers are the primary reference to an entity, instead of a label (e.g., “Cambridge”). In Wikidata, there is a list of “Cambridge” entries:

https://www.wikidata.org/wiki/Q350 (“the one that looks like it’s in Harry Potter”)

https://www.wikidata.org/wiki/Q49111 (“the one where MIT is”)

https://www.wikidata.org/wiki/Q1028279 (“the one in Canada”)

With those identifiers, the ambiguity of language is more or less addressed. This matters especially when building a large-scale database, where you want references to be precise without having to interpret context.
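To make this concrete, here is a minimal query against the Wikidata Query Service that refers to the identifier wd:Q350 directly, so there is no ambiguity about which “Cambridge” is meant (I am using the population property P1082 purely as an illustration; the exact property is incidental):

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# wd:Q350 is Cambridge (UK); no context or disambiguation needed
SELECT ?population
WHERE {
  wd:Q350 wdt:P1082 ?population .   # P1082 = population
}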

3. Cross-domain/Multi-level common data grounds:

This is, in my mind, the “killer application” of Semantic Technology.

Let’s extend our neighbourhood KG example a little bit: assume we have included information on people, households, electricity, power networks, transportation, weather, everything, and they are all connected to each other.

Then I can “ask” very complex questions that rely on retrieving and integrating information across multiple domains, e.g.,

“Which households in my neighbourhood are most vulnerable to power outages during heatwaves, considering residents’ age profiles, medical equipment usage, building insulation ratings, and the topology of the local power grid?”

With such a powerful interconnected database, the questions you can “ask” are nearly unlimited. Of course, the database itself only offers a nice way of accessing and integrating the data (by the time you build a KG at this level, your data is already integrated).
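To give a flavour of what such a cross-domain query could look like, here is a sketch over the hypothetical neighbourhood KG. All the properties here (ex:insulationRating, ex:suppliedBy, ex:usesMedicalEquipment, ex:redundancyLevel, etc.) are made up for illustration; a real ontology would define them properly:

PREFIX ex: <http://example.org/neighbourhood#>

SELECT DISTINCT ?household ?insulation ?feederRedundancy
WHERE {
  # household attributes, the feeder that supplies it, and its residents
  ?household a ex:Household ;
             ex:insulationRating ?insulation ;
             ex:suppliedBy       ?feeder ;
             ex:hasResident      ?resident .
  ?feeder   ex:redundancyLevel  ?feederRedundancy .
  ?resident ex:age ?age .
  OPTIONAL { ?resident ex:usesMedicalEquipment ?equipment . }
  # keep households with an elderly resident or medical equipment in use
  FILTER(?age >= 75 || BOUND(?equipment))
}
ORDER BY ASC(?insulation) ASC(?feederRedundancy)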

What are the problems of Semantic Technology

These were all great, but from my personal experience, there are some major burdens in building up a knowledge graph.

The upfront and overhead cost is really high:

A “neighbourhood knowledge graph” where data from multiple domains is integrated sounds really great: it could support countless potential applications on top of a single common data ground. However, it is painful to build and maintain one.

A typical first step of building a Knowledge Graph is schema design, where we create abstract definitions of how things should be represented, called an “ontology”.

@prefix ex:   <http://example.org/neighbourhood#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Neighbour a owl:Class .
ex:House     a owl:Class .
ex:livesIn   a owl:ObjectProperty ; rdfs:domain ex:Neighbour ; rdfs:range ex:House .
ex:price     a owl:DatatypeProperty ; rdfs:domain ex:House ; rdfs:range xsd:decimal .

# constraint: every Neighbour lives in exactly one House
ex:Neighbour rdfs:subClassOf [
  a owl:Restriction ; owl:onProperty ex:livesIn ;
  owl:cardinality "1"^^xsd:nonNegativeInteger
] .

This is just the tip of the iceberg: for scientific or engineering domains, the ontologies can contain hundreds of concepts, relations, and logical constraints on how they connect. This is surely painful; the core problem is that it requires both domain knowledge (e.g., building, energy, chemistry, etc.) and an understanding of Semantic Technologies. Just by looking at this example, you can already tell this is a tricky job.
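To give a flavour of those constraints, here are a couple of additional axioms of the kind real ontologies accumulate by the hundreds (both are made up for this example):

@prefix ex:   <http://example.org/neighbourhood#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# nothing can be both a Neighbour and a House
ex:Neighbour owl:disjointWith ex:House .

# adjacency between houses is symmetric: if A is adjacent to B, B is adjacent to A
ex:adjacentTo a owl:ObjectProperty , owl:SymmetricProperty ;
              rdfs:domain ex:House ;
              rdfs:range  ex:House .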

Then, once you have a schema, the natural second step is instantiation, where you put actual data in. The “actual data” is often something domain-specific and “friendly”, like JSON, CSV, or a relational database. It is clear, simple, straightforward, and can be easily edited and accessed. However, now you need to stuff this nice data into some monstrously complex data schema, and make sure that data from different domains are connected. For example, you have dataset A for buildings and dataset B for roads; they are isolated in the first place, but in a Knowledge Graph you need to “link” them, meaning you need to build relations between roads and the buildings located along them.
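For example, one row of a hypothetical neighbours CSV (columns: name, house_id, market_price) would end up as instance data like the following, including a made-up ex:locatedOn link to a road entity from the other dataset:

@prefix ex:  <http://example.org/neighbourhood#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# instance data produced from one CSV row: "Alice", house_42, 350000
ex:alice    a ex:Neighbour ;
            ex:livesIn ex:house_42 .
ex:house_42 a ex:House ;
            ex:price "350000"^^xsd:decimal ;
            ex:locatedOn ex:road_7 .      # link to an entity from the road dataset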

Often you encounter the following issues:

• Dataset vs. ontology mismatch. Your CSV/JSON tables weren’t made for the ontology. Column names don’t line up, values are in different formats, units are missing or inconsistent, and the ontology may require fields your dataset simply doesn’t have.
• Ontologies don’t “snap together”. The building ontology and the road ontology might both talk about “location” or “address”, but in different ways. So linking them isn’t automatic: you first have to decide what “the same thing” means across the two models.
• Mapping is tedious and ongoing. You end up writing lots of glue rules: “this column maps to that property”, plus special cases and cleanup. Then when either the dataset or the ontology changes, you revisit and fix those rules again.
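To illustrate the last point, a single glue rule in R2RML (the W3C relational-to-RDF mapping language) already looks like this; the table and column names are hypothetical, and a real mapping file contains many such rules:

@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix ex:  <http://example.org/neighbourhood#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# map each row of the "houses" table to an ex:House with its price
<#HouseMapping> a rr:TriplesMap ;
  rr:logicalTable [ rr:tableName "houses" ] ;
  rr:subjectMap [
    rr:template "http://example.org/neighbourhood#house_{house_id}" ;
    rr:class ex:House
  ] ;
  rr:predicateObjectMap [
    rr:predicate ex:price ;
    rr:objectMap [ rr:column "market_price" ; rr:datatype xsd:decimal ]
  ] .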

As a result, a Knowledge Graph does “promise” you a bright future, but the upfront and ongoing cost of getting there is very high.

Why I doubt whether Semantic Technology fits this era

What research I think is valuable in this era

What solid examples we are working on at the moment

