Knowledge Graph Construction

Graphlet AI is a data engineering, data science and artificial intelligence consultancy specializing in knowledge graph construction, also known as property graph construction. We build data pipelines that transform raw data into clean data for your graph database. We transform and refine raw data on your data lake to build large networks of millions, billions or even trillions of nodes and edges that model entire business domains and solve complex problems with global footprints.
We love big data and large networks. We use big data tools to scale data pipelines that go beyond traditional ETL and entity resolution, using artificial intelligence - graph machine learning - to construct a high-fidelity network model of your business domain that maps directly to solutions to your business problems. It lets you run the queries that answer the questions vexing you and drive the features your customers demand. Using a modern graph database, your data science and machine learning teams can then efficiently mine this refined graph to solve your most pressing data science problems.

We build property graph factories

Knowledge Graph Construction Architecture

Property graph factories build property graphs in 4 steps...

Step 1) Extract, Transform, Load Datasets to a Common Format

The first step in building an enterprise knowledge graph is to build the raw node and edge lists making up your network by combining multiple structured datasets from your organization into a common schema. In larger domains this will warrant a full-blown ontology, but starting with a smaller network built around one use case is a good strategy.
We Extract, Transform, Load (ETL) multiple large and small datasets from different sources with different formats into a common property graph schema using tools like Python, Spark, Databricks or Snowflake. How much ETL is needed varies by industry, from relatively little in cybersecurity applications to a significant amount in business graph applications like KYC / AML / financial compliance. A well-defined graph model with fewer node and edge types makes it easy to access, query, analyze and model your business domain in a graph database such as Neo4j, TigerGraph, ArangoDB or Neptune.
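To make this concrete, here is a minimal PySpark sketch of the idea, assuming hypothetical bronze tables of companies and officers on a data lake; the paths, table names and columns are illustrative, not a prescribed layout. Every source dataset is projected into a common node schema (id, type, properties) and edge schema (src, dst, type, properties):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("kg-etl").getOrCreate()

# Hypothetical source table of companies - your columns will differ
companies = spark.read.parquet("s3://my-lake/bronze/companies/")

# Project each source dataset into a common node schema: id, type, properties
company_nodes = companies.select(
    F.concat(F.lit("company/"), F.col("company_id")).alias("id"),
    F.lit("Company").alias("type"),
    F.col("legal_name").alias("name"),
    F.col("country"),
)

# Edges get the same treatment: src, dst, type, properties
officer_edges = spark.read.parquet("s3://my-lake/bronze/officers/").select(
    F.concat(F.lit("person/"), F.col("person_id")).alias("src"),
    F.concat(F.lit("company/"), F.col("company_id")).alias("dst"),
    F.lit("OFFICER_OF").alias("type"),
    F.col("role"),
)

# Write clean node and edge lists for loading into the graph database
company_nodes.write.mode("overwrite").parquet("s3://my-lake/silver/nodes/company/")
officer_edges.write.mode("overwrite").parquet("s3://my-lake/silver/edges/officer_of/")
```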
Why not load your raw data directly into a graph database and do ETL inside it? Graph databases aren't ETL platforms; they are not designed for it. Python-based tools are. Modern ETL increasingly involves machine learning techniques rather than simple transformations. Graph databases are typically built on top of the Java Virtual Machine (JVM) or C++. Ask your data engineers how productive they will be doing ETL in Python versus Java or C++. Python shines at ETL. The JVM and C++ shine at interactively querying clean graph data.

Adding Unstructured Data

Once a core knowledge graph built from structured datasets is established, it is time to bring in unstructured datasets to extend that network into a larger knowledge base using Natural Language Processing (NLP). Starting with unstructured data can be much more difficult - there is no anchor on which to peg the entities [nodes] and relationships [edges] you extract from text.
Raw Data in Bronze Tables
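As a sketch of what this looks like in practice, the snippet below uses spaCy to pull candidate entities out of free text; the model and example sentence are assumptions, and real pipelines typically add custom NER, coreference resolution and entity linking on top:

```python
import spacy

# Model choice is an assumption - production systems often use custom models
nlp = spacy.load("en_core_web_sm")

text = "Acme Holdings Ltd. acquired Widget Incorporation Services in 2019."
doc = nlp(text)

# Candidate nodes: named entities recognized in the text
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. a company name tagged as ORG

# Each candidate is then linked against nodes already in the structured
# core graph - the anchor that makes unstructured data tractable.
```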

Entity Resolution (ER): Node and Edge Deduplication

Entity Resolution (ER) is the process of identifying and merging duplicate nodes and splitting up mistakenly merged nodes. In a similar manner, edges can also be merged or split. This matters because it is difficult to detect patterns in a network when an entity is fragmented into disconnected duplicates, each participating in different, important relationships.
Clean Data in Silver Tables

Manual Block and Match

Traditional entity resolution involves two phases: blocking and matching. Querying data as part of exploratory data analysis (EDA) reveals strategies to match records. This was traditionally done by hand, and it still can be for a limited number of small datasets.
The next step is to compare records for matching. This presents a problem: the naive approach of comparing every node with every other node requires n(n-1)/2, or roughly n²/2, comparisons, where n is the number of nodes. This quickly gets out of hand with millions or billions of nodes! Blocking is a strategy to prune the set of comparisons down to groups that are more manageable.
Raw Data in Bronze Tables
Clean Data in Silver Tables
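A toy example makes the two phases and the payoff of blocking concrete; the records, blocking key and matching rule below are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

# Toy records - fields, blocking key and match rule are all assumptions
records = [
    {"id": 1, "name": "Jane Smith", "zip": "94103"},
    {"id": 2, "name": "Jane Smyth", "zip": "94103"},
    {"id": 3, "name": "John Doe", "zip": "10001"},
]

# Naive matching compares all n(n-1)/2 pairs - intractable at scale
naive_pairs = list(combinations(records, 2))

# Blocking: only compare records that share a cheap key, here the zip code
blocks = defaultdict(list)
for record in records:
    blocks[record["zip"]].append(record)

candidate_pairs = [
    pair for block in blocks.values() for pair in combinations(block, 2)
]
print(len(naive_pairs), len(candidate_pairs))  # 3 pairs pruned to 1

# Matching: apply a hand-written similarity rule to candidate pairs only
def match(a, b):
    return a["name"][:4].lower() == b["name"][:4].lower()

duplicates = [(a["id"], b["id"]) for a, b in candidate_pairs if match(a, b)]
print(duplicates)  # [(1, 2)]
```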

Automatic Deduplication

When dealing with big data, especially when there are a number of datasets large and small, the traditional entity resolution model of manual blocking and matching starts to break down. It is cumbersome and takes too much developer time. What is needed is a generic approach to entity resolution.
Recent developments in Large Language Models [LLMs] like ChatGPT and Graph Neural Networks (GNNs) allow us to ETL nodes and edges into XML-like text, sentence-encode them using a large language model, and then combine them based on semantic inferences made by the LLM in combination with the network topology. LLMs have seen many documents similar to the nodes' text representations on the world wide web, and if we provide a few clues... they provide state-of-the-art entity resolution for both the blocking and matching stages!
Manual blocking and matching across numerous datasets is a cumbersome and expensive activity. Advances in AI - representation learning and an architecture from Google called Grale - make a generic entity resolution (ER) system possible. This system is configurable to work across multiple datasets by embedding records using large language models (LLMs) such as GPT-3 or ChatGPT, tuned specifically for the entity matching task.
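The sketch below illustrates the embedding half of this idea, assuming the sentence-transformers library and an off-the-shelf encoder; it is not Grale itself, and the node fields and model name are assumptions. Records are serialized into XML-like text, embedded, and compared by cosine similarity to produce candidate duplicates:

```python
from sentence_transformers import SentenceTransformer, util

# Serialize each node's properties into an XML-like text document
def to_text(node):
    fields = "".join(f"<{k}>{v}</{k}>" for k, v in node.items())
    return f"<node>{fields}</node>"

nodes = [
    {"name": "Acme Holdings Ltd", "country": "GB"},
    {"name": "ACME Holdings Limited", "country": "United Kingdom"},
    {"name": "Widget Co", "country": "US"},
]

# Model choice is an assumption - any sentence encoder can stand in here,
# ideally one fine-tuned for the entity matching task
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([to_text(n) for n in nodes], convert_to_tensor=True)

# Cosine similarity over embeddings yields candidate duplicates (blocking);
# a tuned matcher - plus the network topology - makes the final call
scores = util.cos_sim(embeddings, embeddings)
print(scores[0][1] > scores[0][2])  # the two Acme records score more alike
```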
Raw Data in Bronze Tables

Step 4) Pattern Matching: Network Motif Search

While it is nice to think that a single version of your data can be encoded in a graph, the reality is that you must decide which business logic to encode in the representation you choose for your knowledge graph.
Multiple chain indirect ownership is a way of tracking Ultimate Beneficial Ownership (UBO)
A business graph risk motif showing how an incorporation services company was purchased and used to create fake companies to launder money.
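As an illustration of motif search against a graph database, here is a sketch using the Neo4j Python driver; the connection details, labels, relationship types and ownership threshold are all assumptions, but the Cypher pattern shows how a multiple-chain indirect ownership motif like the one above becomes a query:

```python
from neo4j import GraphDatabase

# URI, credentials, labels and relationship types are assumptions
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# A motif is just a graph pattern: indirect ownership through a chain of
# 2 to 5 OWNS edges, each above a hypothetical 25% ownership threshold
UBO_MOTIF = """
MATCH path = (p:Person)-[:OWNS*2..5]->(c:Company)
WHERE ALL(r IN relationships(path) WHERE r.share >= 0.25)
RETURN p.name AS owner, c.name AS company, length(path) AS hops
"""

with driver.session() as session:
    for record in session.run(UBO_MOTIF):
        print(record["owner"], "indirectly owns", record["company"],
              "through", record["hops"], "hops")

driver.close()
```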

How can a graph model my data?

Property graphs can represent data from any database you throw at them. They are objects and their relationships, just like the world around us. Objects have properties, just as a baseball has a width. Relationships can have properties as well, just as a pitch from a pitcher to a batter has a speed.
Property graph model
You may be used to thinking of graphs as simple mathematical concepts, but property graphs have different types of nodes and edges, which makes them much more powerful for data science and artificial intelligence than simple graphs.
Simple vs property graphs
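A few lines of networkx, with illustrative names and values, show the model: typed nodes and edges, each carrying arbitrary properties:

```python
import networkx as nx

# A property graph sketch - node and edge types live in attributes
# alongside any other properties (names and values are illustrative)
G = nx.MultiDiGraph()

# Typed nodes with properties, like a baseball having a width
G.add_node("pitcher_1", type="Pitcher", team="Astros")
G.add_node("batter_1", type="Batter", team="Angels")
G.add_node("ball_1", type="Baseball", width_inches=2.9)

# Typed edges with properties, like a pitch having a speed
G.add_edge("pitcher_1", "batter_1", type="PITCHED_TO", speed_mph=95.4)

for src, dst, props in G.edges(data=True):
    print(src, props["type"], dst, props)
```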

What is a property graph, knowledge graph and triple store?

A property graph is a set of objects representing nodes [also known as vertices] and edges [also known as links or relationships], where both nodes and edges can carry properties.
Property graphs vs RDF Triples. Both are knowledge graphs.

What is graph machine learning (graph ML)? What is a GNN?

I'll let you in on a secret that is driving the popularity of enterprise knowledge graphs, property graphs, graph databases and Graph Neural Networks (GNNs): MOST DATA IS GRAPH DATA. To compose the single table that yields the vectors, matrices and tensors we load into GPUs to drive machine learning algorithms, several tables have usually been combined [squashed] into one. There's a problem with this... it is a lossy process. We threw away the relationships. Knowledge graphs modeled using graph machine learning (Graph ML) and graph neural networks (GNNs) can learn to build more powerful models because their representation matches the structure of the data's entities and relationships.
Networks vs sequence representations. Networks preserve inherent relational bias of data
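As a minimal sketch of what "matching the structure of the data" means in code, here is a two-layer graph convolutional network in PyTorch Geometric; the toy graph and dimensions are assumptions. Note that the edges ride along in edge_index instead of being squashed away in a join:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph: 4 nodes with 8-dim features; edges are kept, not flattened away
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]], dtype=torch.long)
x = torch.randn(4, 8)
data = Data(x=x, edge_index=edge_index)

class TwoLayerGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        # Each layer mixes a node's features with its neighbors' features,
        # so the relationships directly shape the learned representation
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

model = TwoLayerGCN(in_dim=8, hidden_dim=16, out_dim=2)
out = model(data.x, data.edge_index)  # one embedding per node
```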

I use a certain tool or platform. Can you help me?

We can build knowledge graphs on any platform, but a few tools are particularly up our alley for creating business value with graphs and networks:

Our Principal Consultant, Russell Jurney

My name is Russell Jurney. I work at the intersection of big data, large networks (property graphs and knowledge graphs), representation learning with Graph Neural Networks (GNNs), Natural Language Processing (NLP) and Understanding (NLU), model explainability using network visualization, and vector search for information retrieval. I am a startup product and engineering executive focused on building products driven by billion-node+ networks. I have worked at cool places like Ning, LinkedIn and Hortonworks. I co-founded Deep Discovery to use networks, GNNs and visualizations to build an explainable risk score for KYC / AML.

I am a four-time O'Reilly author with 122 citations on Google Scholar for being the first to write about “agile data science” - agile development as applied to data science and machine learning. I am an applied researcher and product manager with 17 years of experience building and shipping data-driven products.

I am currently fascinated by knowledge graph / property graph construction, graph representation learning, graph neural networks (GNNs), and NLP/NLU techniques such as information extraction, named entity recognition (NER), coreference resolution, fact extraction, and entity linking. I do network science and machine learning - so I get stuff done :) Check out my network science portfolio, my blog and my O'Reilly Radar posts.