GLL4Graph

This is the implementation of the GLL-based context-free path querying (CFPQ) algorithm. It is based on a high-performance GLL parsing algorithm implemented in Iguana project. Then, it was modificated to support graph-structured input data. Proposed solution solves both reachability and all paths problems for all pairs and multiple sources cases.

Performance

The proposed solution has been evaluated on several real-world graphs for both the all pairs and the multiple sources scenarios. The evaluation shows that the proposed solution is more than 45 times faster than the previous solution for Neo4j and is comparable, in some scenarios and cases, with the linear algebra based solution.

Machine configuration: PC with Ubuntu 20.04, Intel Core i7-6700 3.40GHz CPU, DDR4 64Gb RAM.

Enviroment configuration:

Java HotSpot(TM) 64-Bit server virtual machine (build 15.0.2+7-27, mixed mode, sharing).
JVM heap configuration: 55Gb both xms and xmx.
Neo4j 4.0.3 is used. Almost all configurations are default except two:
- total off-heap transaction memory (tx_state_max_off_heap_memory parameter) is set to 24Gb
- pagecache_warmup_enabled is set to true.

Graphs

The graph data is selected from CFPQ_Data dataset. There are two types of graph:

Graphs related to RDF analysis problems
Graphs related to static code analysis problems

A detailed description of the graphs is listed below.

RDF analysis

Graph name	\|V\|	\|E\|	#subClassOf	#type	#broaderTransitive
Core	1 323	2 752	178	0	0
Pathways	6 238	12 363	3 117	3 118	0
Go hierarchy	45 007	490 109	490 109	0	0
Enzyme	48 815	86 543	8 163	14 989	8 156
Eclass_514en	239 111	360 248	90 962	72 517	0
Geospecies	450 609	2 201 532	0	89 065	20 867
Go	582 929	1 437 437	94 514	226 481	0
Taxonomy	5 728 398	14 922 125	2 112 637	2 508 635	0

Static code analysis

Graph name	\|V\|	\|E\|	#a	#d
Apache	1 721 418	1 510 411	362 799	1 147 612
Block	3 423 234	2 951 393	669 238	2 282 155
Fs	4 177 416	3 609 373	824 430	2 784 943
Ipc	3 401 022	2 931 498	664 151	2 267 347
Lib	3 401 355	2 931 880	664 311	2 267 569
Mm	2 538 243	2 191 079	498 918	1 692 161
Net	4 039 470	3 500 141	807 162	2 692 979
Postgre	5 203 419	4 678 543	1 209 597	3 468 946
Security	3 479 982	3 003 326	683 339	2 319 987
Sound	3 528 861	3 049 732	697 159	2 352 573
Init	2 446 224	2 112 809	481 994	1 630 815
Arch	3 448 422	2 970 242	6 712 95	2 298 947
Crypto	3 464 970	2 988 387	678 408	2 309 979
Drivers	4 273 803	3 707 769	858 568	2 849 201
Kernel	11 254 434	9 484 213	1 981 258	7 502 955

Field-sensitive points-to analysis

Graph name	\|V\|	\|E\|
sunflow	15 464	15 957
lusearch	15 774	14 994
luindex	18 532	17 375
avrora	24 690	25 196
eclipse	41 383	40 200
h2	44 717	56 683
pmd	54 444	59 329
xalan	58 476	62 758
batik	60 175	63 089
fop	86 183	83 016
tomcat	111 327	110 884
jython	191 895	260 034
tradebeans	439 693	466 969
tradesoap	440 680	468 263

Grammars

All queries used in evaluation are variants of same-generation query. The inverse of an x relation and the respective edge is denoted as x_r.

Grammars used for RDF graphs:

G₁

S -> subClassOf_r S subClassOf | subClassOf_r subClassOf 
     | type_r S type | type_r type

G₂

S -> subClassOf_r S subClassOf | subClassOf

Geo

S -> broaderTransitive S broaderTransitive_r
     | broaderTransitive broaderTransitive_r

Grammar used for static code analysis graphs:

PointsTo

M -> d_r V d
V -> (M? a_r)* M? (a M?)*

Grammar template used for field-sensitive points-to analysis:

PointsTo -> (assign | load_f Alias store_f )* alloc
Alias -> PointsTo FlowsTo
FlowsTo -> alloc_r (assign_r | store_f_r Alias load_f_r)*

TermiFor all fields f the terminals load_f, store_f are generated

Results

The results of the all pairs reachability queries evaluation on graphs related to RDF analysis are listed below.

The sign ’–’ in cells means that the respective query is not applicable to the graph, so time is not measured.

Graph name	G₁		G₂		Geo
Graph name	time (sec)	#answer	time (sec)	#answer	time (sec)	#answer
Core	0,02	204	0,01	214	–	–
Pathways	0,07	884	0,04	3117	–	–
Go hierarchy	3,68	588 976	5,42	738 937	–	–
Enzyme	0,22	396	0,17	8163	5,7	14 267 542
Eclass	1,5	90 994	0,97	96 163	–	–
Geospecies	2,89	85	2,65	0	145,8	226 669 749
Go	5,56	640 316	4,24	659 501	–	–
Taxonomy	45,47	151 706	36,06	2 112 637	–	–

The evaluation results for single source CFPQ for graphs related to RDF analysis and G₁, G₂, Geo grammars respectively in reachability and all paths scenarious:

The results for graphs related to static code analysis are compared to results of Azimov’s CFPQ algorithm based on matrix operations. The implementation from CFPQ_PyAlgo was taken as the implementation of the matrix CFPQ algorithm. This library contains the implementation for both scenarios, all pairs reachability and single source reachability. To perform matrix operations pygraphblas is used. Pygraphblas is a python wrapper over the SuiteSparse library, which based on the GraphBLAS framework.

The results of the all pairs reachability queries evaluation on graphs related to static code analysis are listed below.

The sign ’–’ in cells means that the respective query and graph require a considerable amount of memory during algorithm execution that leads to unpredictable time to get the result.

Graph name	PointsTo
Graph name	Neo4j time (sec)	GraphBLAS time (sec)	#answer
Apache	–	536,7	92 806 768
Block	113,01	123,88	5 351 409
Fs	167,73	105,72	9 646 475
Ipc	109,43	79,52	5 249 389
Lib	111,09	121,79	5 276 303
Mm	77,92	84,15	3 990 305
Net	160,64	206,29	8 833 403
Postgre	–	969,88	90 661 446
Security	115,75	181,7	5 593 387
Sound	120,14	133,64	6 085 269
Init	87,25	45,84	3 783 769
Arch	130,77	119,92	5 339 563
Crypto	128,8	122,09	5 428 237
Drivers	371,18	279,39	18 825 025
Kernel	614,047	378,05	16 747 731

The evaluation results for single source CFPQ for graphs related to static code analysis and pointsTo grammar in reachability and all paths scenarious:

The results of the all pairs reachability queries evaluation on graphs related to field-sensitive points-to analysis. The results are compared to results of .NET GLL implementation.

Graph name	Reachability				All paths
Graph name	#answer	Neo4j time (sec)	In memory time (sec)	.NET time (sec)	Neo4j time (sec)	In memory time (sec)	.NET time (sec)
avrora	21 532	5.022	2.168	5.022	7.879	5.042	2.984
batik	45 968	12.659	5.656	2.087	30.282	30.137	6.349
eclipse	21 830	5.712	2.223	1.245	14.095	13.418	3.557
fop	76 615	19.400	9.394	4.649	53.721	45.088	15.353
h2	92 038	19.643	13.507	19.065	30.414	23.375	108.554
jython	561 720	OOM	OOM	1283.518	OOM	OOM	OOM
luindex	9 677	2.121	0.874	0.438	3.782	2.326	2.361
lusearch	9 242	1.417	0.547	0.285	2.481	1.614	1.419
pmd	60 518	32.193	23.034	45.487	46.522	36.269	OOT
sunflow	16 354	0.878	0.397	0.340	2.074	1.854	1.733
tomcat	82 424	58.637	26.187	3.872	124.182	88.235	24.168
tradebeans	696 316	OOT	OOM	134.641	OOT	OOM	OOM
tradesoap	698 567	OOT	OOM	134.321	OOT	OOM	OOM
xalan	52 382	24.140	10.991	4.490	39.168	34.247	62.832

Download and build

The project is build with Maven.

git clone https://github.com/JetBrains-Research/GLL4Graph
cd GLL4Graph
mvn compile

Usage

To replicate experiments:

make the directory to save results

mkdir results

run the following command with arguments

mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="aguments"

print help

mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="-h"

usage: GraphBenchmark [-h] -d <dataset name> -gm <path> -gp <path> -gs <storage type> -m <number> -p <problem type> -S <s=value1 a=value2> -w <number>
 -d,--dataset <dataset name>            The name of the dataset, an important component of the file name with the results
 -gm,--grammar <path>                   Path to JSON file contains context-free grammar
 -gp,--graph <path>                     Path to directory contains files nodes.csv and edges.csv
 -gs,--graph_storage <storage type>     Graph storage type, allowed values: NEO4J, IN_MEMORY
 -h,--help                              Print help message
 -m,--measurement_iterations <number>   Number of measurement iterations
 -p,--problem <problem type>            Benchmarking algorithm, allowed values: REACHABILITY, ALL_PATHS
 -S,--scenario <s=value1 a=value2>      Benchmarking scenario and its argument, 's' property allowed values: ALL_PAIRS, MULTIPLE_SOURCES, 'a' property
                                        contains number of nodes if 's' equals ALL_PAIRS or path to file with vertices chunks if 's' equals
                                        MULTIPLE_SOURCES
 -w,--warmup_iterations <number>        Number of warm-up iterations

Example

Here is an example which can be run right after project is downloaded and results directory is created without any additional preparation.

To run all pairs reachability algorithm on Core graph on G₁ grammar use the following command:

mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="-d Core -gm test/resources/grammars/graph/g1/grammar.json -gp data/core -gs NEO4J -p REACHABILITY -S -s ALL_PAIRS -a $(( $(cat "data/core/nodes.csv" | wc -l)-1 )) -w 1 -m 10"

Data

To get more graph data examples use Python script:

graph_loader.py --graph {graph_name} --relationships {comma_separated_relationships_list}

For example:

graph_loader.py --graph core --relationships subClassOf,type

License

This project is licensed under OpenBSD License. License text can be found in the license file.

Name		Name	Last commit message	Last commit date
Latest commit History 2,502 Commits
.github/workflows		.github/workflows
META-INF		META-INF
data/core		data/core
docs/pictures		docs/pictures
src		src
test		test
.gitignore		.gitignore
.travis.yml		.travis.yml
Dockerfile		Dockerfile
LICENSE.md		LICENSE.md
README.md		README.md
build.properties		build.properties
entrypoint.sh		entrypoint.sh
graph_loader.py		graph_loader.py
pom.xml		pom.xml

License

FormalLanguageConstrainedPathQuerying/GLL4Graph

Folders and files

Latest commit

History

Repository files navigation

GLL4Graph

Performance

Graphs

Grammars

Results

Download and build

Usage

Example

Data

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages