OpraDB is graph database that uses OPRA query language and is written in F#. Main purpose of this project is to show the expressive power of OPRA queries.
Ability to express complex path properties in a modular way and with a minimum effort, is where this query language really excels.
Note: Although this is a work in progress, usable version is already available.
- OpraDB
-
Install dotnet core
-
Install mono
-
Set
FrameworkPathOverride
to mono4.6.*
libs:export FrameworkPathOverride=/usr/lib/mono/4.6.2-api
-
(Optional) You can also add above line to your
.bashrc
:echo 'export FrameworkPathOverride=/usr/lib/mono/4.6.2-api' >> ~/.bashrc
-
git clone https://github.com/mateuszlewko/OpraDB.git
-
cd OpraDB/src/opraDB.Shell
-
dotnet build
Run console client with (assuming you're in
opraDB.Shell
folder):mono bin/Debug/net462/opraDB.Shell.exe
Quick start with example graph:
mono bin/Debug/net462/opraDB.Shell.exe -g ../../examples/simple-cycle/graph.json -f json
How to edit in Visual Studio Code
- Install mono
- Install F#
- Add Ionide plugin to VS Code:
- launch VS Code Quick Open (Ctrl+P)
- paste:
ext install Ionide.Ionide-fsharp
Read this sections to get a brief understanding of Opra QL and OpraDB.
Let's start by defining OpraDB graph. Graph consists of directed edges and nodes. Each node and edge can have a set of properties. Property has a:
- key
string
- value of one of the following types:
string
,int
,bool
,float
.
Undirected graphs can be represented by adding two directed edges for each undirected edge, in both directions.
Example graph in json
format:
{
"nodes": [
{ "_id": 1, "crowd": 2, "start": 1 }
, { "_id": 2, "crowd": 10 }
],
"directed_edges": [
{ "_from": 1, "_to": 2, "dist": 3 }
, { "_from": 2, "_to": 1, "dist": 3 }
]
}
Note: All nodes must have field _id
with positive integer value. Edges
must have _from
and _to
properties, which point to node IDs already defined
in nodes
object (see example above).
Query has following syntax:
MATCH NODES <list of matched (returned) nodes or node properties>
PATHS <list of matched paths>
SUCH THAT <path constraints>
WHERE <regular constraints>
HAVING <arithmetic constraints>
MATCH
, NODES
, PATHS
, SUCH THAT
, WHERE
and HAVING
are most used
keywords and they all must be uppercase. When creating a query you should always
specify at least one matched node or node property.
Note: Returning matched paths is currently not supported.
-
List of matched nodes (
MATCH NODES n1, n2, ..., property1(n1)
):Example:
MATCH NODES a, b, someNode, property(someNode), name(someNode) ...
By specifying node or node's property in matched nodes, we're essentially defining an alias to this node, which we can use in latter constraints.
Usually we want to return node's property like
name(x)
. Providing just an aliasx
will return IDs of all nodes that matchedx
's constraints. -
MATCH ... PATHS p1, p2, ..., pn
: Behaviour for matching paths is similar to nodes, however current version doesn't returning paths.
path1 : beginNode -> endNode
Path constraints are here to specify that there is path called path1
from node
beginNode
to endNode
. Both nodes could be bound earlier in matched nodes list,
but they don't have to be. In case a node wasn't bound earlier, it will be existentially quantified. This rule applies for all occurrences of nodes or paths not bound in MATCH ...
list.
Multiple path constraints can be specified in query by separating them with comma.
Example:
MATCH NODES name(a), name(b), name(c)
SUCH THAT p: a -> b, q: a -> c
Regular constraint specifies how correct paths should look like. It consists of
node constraints and operators: |
, *
. Each node constraint describes one of the nodes on a given path.
Example node constrains:
(access_level(people) > 3 OR position(people) = "CTO")
this constraint will be true if current node on pathpeople
has propertyaccess_level
that is greater than 3 or propertyposition
is "CTO"
Current node is currently checked node when traversing a path.
(distance(cities, 'cities) < 10)
Apostrophe before node's name ('cities
) is current node's successor when traversing a path. Node will satisfy this constraint, if outgoing edge (to successor) has propertydistance
and its values is less than 10.
Regular constraint can just be a sequence of node constraints, like this:
(name(friends) = "Bol")(name(friends) = "Alice")(name(friends) = "Mateusz")
Path friends
satisfies this constraint as long as it contains three nodes which
are connected by some edges, and name
of first node is Bob, second Alice and
third Mateusz.
However, regular constraints can act as regular expression on paths, where instead
of letters and digits we have node constraints. This is where we can make use of
union |
and Kleene-star *
operators.
Following query returns pairs of cities in Poland and USA for which there exists flight path:
MATCH NODES x, y
SUCH THAT p: x -> y
WHERE (country(p) = "Poland").*(country(p) = "USA"),
(flight(p, 'p) IS NOT NULL)*.
Expression (<node constraint>)*
will be satisfied by sequence of nodes on a path,
of any length (possibly zero). Dot .
means that any node will match this
constraint.
In above query we have two regular constraints.
First checks that path begins and ends in appropriate countries, however, we can
flight through any country (hence .*
). Second, ensures that there exists a flight
connection between consecutive nodes.
Note: Dot at the end (match any) is necessary, because when checking the constraint
for last node on a path, we don't need to ensure that there is a property flight
on edge outgoing from last node (there may not be).
In next example we want to travel from Poland to USA only through specific countries:
MATCH NODES x, y
SUCH THAT p: x -> y
WHERE (country(p) = "Poland")((country(p) = "Canada") | (country(p) = "Germany"))*
(country(p) = "USA"), (flight(p, 'p) IS NOT NULL)*.
In order to constrain arithmetic properties of paths, you can use following
keywords: SUM
, MAX
, MIN
. They will perform aggregated operation on nodes or
edges. Then they can be compared to other aggregated operations or constants inside
HAVING
clause.
Example:
MATCH NODES x, y
SUCH THAT p: x -> y
WHERE (country(p) = "Poland").*(country(p) = "USA"),
(flight(p, 'p) IS NOT NULL)*.
HAVING SUM (dist(p)) < 8000, MAX (duration(p, 'p)) <= 6
This will constrain length of trip from Poland to USA, to be shorter than 8000 km and make sure none of the flights exceeds 6 hours.
Following query ensures that return trip will be shorter than the first one:
MATCH NODES x, y
SUCH THAT p: x -> y, q: y -> x
WHERE (country(p) = "Poland").*(country(p) = "USA"),
(flight(p, 'p) IS NOT NULL)*.
HAVING SUM (dist(p)) < 8000, SUM (dist(q)) < SUM (dist(p))
You may have already noticed that some constraints are duplicated and make queries more obscure. Let expressions
allow to modularize and greatly simplify a query while maintaining it's expressive power.
Let
syntax is as follows:
LET <name> <list of bound nodes or paths> =
<let body> IN
LET ... = ... IN
MATCH ...
Let body
can be one of:
-
Value expression or node constraint:
LET duration x = 60 * distance(x) / speedLimit(x) IN LET isAirport x = type(x) = "airport" IN
Which can be used as part of other constraints:
... MATCH ... WHERE (isAirport(p)).* HAVING SUM (duration(p)) < 10
-
Regular constraint or arithmetic constraint:
LET route p = (isAirport(p) AND (flight(p, 'p) IS NOT NULL))*. IN
LET inRange p = SUM (dist(p)) < 100 IN
MATCH ...SUCH THAT path: ...
WHERE route(path)
HAVING inRange(p)
- Query:
LET shop x = type(x) = "shop" IN
LET shopsNearHome s =
MATCH NODES s
SUCH THAT p: s -> home
WHERE (shop(p)).*(address(p) = "My home location")
HAVING SUM (distance(p)) < 4 IN
MATCH NODES address(shop)
SUCH THAT p: work -> shop
WHERE (address(p) = "My work location").*(shopsNearHome(p))
HAVING SUM (distnace(p)) < 4
Above query returns location of all shops near home and work.
Note: All let
expressions must come before match
query, and they can't be mutually recursive.
There is no expression for counting, however we can easily define ontology that will return number of nodes on paths.
LET realNode n = IF _id(n) IS NOT NULL
THEN 1
ELSE 0 IN
LET count p = SUM (realNode(p)) IN
MATCH ...
HAVIN count(p) < 10;
Note: Label _id
on nodes and labels _from
and _to
on edges are
required in graph description, but they can also be used in queries.
They're all integers. Special internal sink node doesn't have any of those
labels.
As arithmetic constraints can be quite complicated and they can be applied on graphs with positive and negative cycles, OpraDB uses following algorithm to check whether paths satisfy these constraints:
- First all paths that satisfy regular constraints are found.
- For each path, we retrieve all simple cycles (using Johnson's algorithm) and calculate delta value of each property (that exists in arithmetic constrains).
- With all aggregated values for every property and delta of property values for each cycle, we construct a set of linear inequalities, that includes variables representing number of times to traverse a given cycle.
- We solve this set of inequalities using Z3 solver to find out how many times to traverse each cycle. If no positive solutions where found it means that path doesn't satisfy arithmetic constraints.
All examples can be found here.
In these examples we'll use following graph:
This graph in json
format can be found here.
Every node has a label indicating level of crowdedness. All edges have
specified distance. Nodes 1
and 7
have a start
label.
-
Get all nodes.
MATCH NODES x SUCH THAT p: x -> y
Set of nodes is selected for paths, so there needs to be at least one path to return anything.
-
Plan a route that avoids crowded places, and starts from node with label
start
.LET crowded x = MATCH NODES x SUCH THAT q: x -> y WHERE .*(crowd(q) >= 10) HAVING SUM (dist(q, 'q)) <= 10 IN MATCH NODES x, y SUCH THAT p: x -> y WHERE (crowded(p) = false)*, (start(p) IS NOT NULL).* ;
Expected result:
x y 7 9 7 8 7 5 7 6 7 10 -
Traverse cycle multiple times to satisfy arithmetic constraint.
This query shows OpraDB's ability to efficiently find non trivial paths that satisfy complex arithmetic constraints.
MATCH NODES x, y SUCH THAT p: x -> y WHERE (start(p) IS NOT NULL).* HAVING (10 + SUM (crowd(p))) >= SUM (dist(p, 'p) * 5), SUM (dist(p, 'p)) > 10000000;
Expected result:
x y 7 9 7 8
Next graph is simple cycle with label a = 1
or b = 1
on each node.
-
Paths where number of nodes labeled
a
is greater by two.MATCH NODES x, y SUCH THAT p : x -> y HAVING SUM (a(p)) = (2 + SUM (b(p)));
Expected result:
x y 7 3 1 4 2 3 1 2 -
Paths where number of nodes labeled
a
is two times number ofb
nodes.MATCH NODES x, y SUCH THAT p : x -> y HAVING SUM (a(p)) = (2 * SUM (b(p)));
Expected result:
x y 2 4 7 2 -
Constrain path length.
LET realNode n = IF _id(n) IS NOT NULL THEN 1 ELSE 0 IN LET count p = SUM (realNode(p)) IN MATCH NODES x, y SUCH THAT p : x -> y HAVING SUM (a(p)) = (2 + SUM (b(p))) , count(p) = 4 ;
Comparison with Gremlin (Apache TinkerPop)
Now, we'll compare these graph query languages on air routes dataset (contains description of the data).
Note: This graph is in graphml format, which is also supported by OpraDB (apart from json).
-
All flight connections outgoing from Wroclaw (WRO) airport:
OpraQL:
LET r q = labelE(q, 'q) = "route" IN MATCH NODES code(a), code(b) SUCH THAT p: a->b WHERE (code(p) = "WRO").*, (r(p)). ;
Gremlin:
g.V().has('code', 'WRO').out('route').path().by('code')
-
Connections from any airport in Poland to any airport in Germany that go through Munich.
OpraQL:
LET route q = labelE(q, 'q) = "route" IN MATCH NODES code(a), code(b) SUCH THAT p: a -> b WHERE (country(p) = "PL")(city(p) = "Munich")(country(p) = "DE"), (route(p))(route(p)). ;
Gremlin:
g.V().has('airport','country', 'PL').as('pl').out('route') \ .has('airport', 'city', 'Munich').out('route').has('country', 'DE') \ .as('de').select('pl', 'de').by('code').by('code')
- Returning paths.
- Finding shortest paths.
- Time and memory optimizations.
- Improve query execution time complexity for multiple paths
- Results visualization with web client.
If you have any thoughts or request, feel free to create an issue or add a pull request. Feedback is welcome.