ICPR 2016 - Graph Distance Contest

Measuring the dissimilarity between graphs is a key step in data analysis based on graph representations. This contest focusses on the definition and the computation of general dissimilarity measures through two challenges:

Challenge 1: computation of the exact or an approximate graph edit distance
Challenge 2: computation of a dissimilarity measure for graph classification

Different datasets are considered with symbolic or numerical attributes on both nodes and edges. Methods submitted to Challenge 1 can also be submitted to Challenge 2.

This contest is supported by the Technical Committee on Graph-Based Representations in Pattern Recognition (TC 15) of the International Association of Pattern Recognition (IAPR)

News

April 2017	Results of Challenge 1 submitted to a journal
Dec. 2016	ICPR / results
Sept. 2016	~~Submission deadline~~

Submissions

Number of methods
- Challenge 1: 8
- Challenge 2: 2

Goal

Compute the exact or an approximate GED, for several pairs of graphs of several datasets, under the following constraints:

running time limited to 30 seconds maximum for each pair of graphs
edit costs are imposed for each dataset

Datasets are divided into two groups:

tiny graphs (less than 10 nodes): selected such that the exact GED is known for all pairs of graphs
small graphs (less than 30 nodes): selected such that the exact GED is known for some pairs of graphs
larger graphs (less that 70 nodes): the exact GED is unknown

Evaluation

Methods are compared according to

running time
closeness to reference distances

Given two graphs, the reference distance can be:

the exact GED computed by the A*-based method [1], if available (depends on the size of the graphs), or
the smallest distance computed by methods tested in this challenge:
- obtained by the organizers with A*-based method [1], Beam-search method [1], Hungarian-based bipartite GED [2]
- obtained by the participants

More information about evaluation.

Participation

This challenge is open to any method based on the computation of edit paths:

exact methods
methods based on heuristics
linear and quadratic programming approaches
algorithms improving computational time, e.g. multi-threaded methods (using at most 4 threads)

Datasets

Three groups of datasets are considered.

Dataset	#graphs	mean #nodes	mean degree	min #nodes	max #nodes	node attributes	edge attributes	exact GED
Alkane	150	8.9	1.8	1	10	chemical symbol	valence (bound type)	99.5467%
Acyclic	185	8.2	1.8	3	11			85.7595%
PAH	94	20.7	20.4	10	28			no
MAO	68	18.4	2.1	11	27			no
CMU	111	30		30	30	-	distance	no

For a finer analysis, each dataset of the second group is decomposed into several subsets of graphs having the same number of nodes, and one subset of graphs having different number of nodes. Each subset is considered as a dataset.

Dataset	#subsets	#graphs / subset	mean #nodes	min #nodes	max #nodes	node attributes	edge attributes	exact GED
MUTA	7	10	10	10	10	chemical symbol	valence (bound type)	no
			20	20	20
			30	30	30
			40	40	40
			50	50	50
			60	60	60
			70	70	70
			10, 20, ..., 70	10	70
GREC	4	10	5	5	5	x,y coordinate and type	type	90% (100% soon)
			10	10	10
			15	15	15
			20	20	20
			5, 10, 15, ..., 20	5	20

Other attributes can be found in the datasets. Attributes listed here are those used in the challenge. See Prepare your method for the expressions of the costs using these attributes.

About these datasets

All datasets are taken from existing benchmarks:

PAH, MAO, Alkane, Acyclic are part of the GREYC's Chemistry dataset.
GREC, MUTA and CMU are part of GDR4GED [3]. GREC and MUTA are subsets of the IAM Graph Database Repository, and CMU is a set of Delaunay graphs computed from the House image sequence of the CMU/VASC Image Database.

Visit links for more information.

Download datasets adapted to the challenge: gdc-c1.tar.gz
Graph format: Graph eXchange Language (GXL)

Evaluation

For each dataset D (or subset), each set of cost parameters C and each method M, the distance is computed for all pairs of graphs. Note that for CMU, each graph is compared with graphs representing the same image rotated at 10, 20, 30, 40, 50, 60, 70, 80, and 90 degrees only (see README file in datasets).

Then, methods are compared according to the following metrics [3] computed for each triplet (D,C,M):

mean deviation of the computed distances from the reference distances
number of times the reference distance is found
mean running time
time/deviation score

Since running time is compared, your method needs to be executed on our computer.

Prepare your method

Given two graphs g1.gxl and g2.gxl, each encoding a graph, your program must:

load the graphs and their attributes from the GXL files
compute the distance based on edit operations
print the following information on the standard output: opt_sol;time;node_match
- opt_sol is an integer: 0 (not found), 1 (found), 2 (don't know)
- time (in seconds) is the total running time of steps 1 and 2, printed with 8 digits after the comma
- node_match is a vector encoding substitutions and removals of nodes of g1
  - it is indexed by nodes of g1 in the order defined in the GXL file (attribute id)
  - for each node i of g1
    - node_match_i = j if i is substituted by the node of g2 with index j, or
    - node_match_i = -1 if i is removed

Step 1 or 2 must include the computation of the costs. Moreover, it cannot run more than 30 seconds, and no more than 4 threads can be used.

Since all graphs are simple, the full edit path can be easily reconstructed from this vector without any ambiguity. Its total cost defines the value of the distance computed by your method. It cannot be less than the exact GED.

We are in charge of the computation of the full edit path and the associated distance from node_match.

Example

Your program computes a distance in 0.5 seconds between a graph g1 with 4 nodes and a graph g2 with 5 nodes, with operations on nodes given by:

substitutions 1->2, 2->1, 4->4
removal: node 3 of g1
insertions: nodes 4 and 5 of g2

and induced operations on edges. Your method does not know if the distance is exact.

Your program must print: 2;0.50000000;2 1 -1 4 on the standard output.

Other operations are computed from this last information and from the graphs.

Programs

We accept command-line programs for Linux 64 bits (16GB RAM, 4 Quad-Core AMD Opteron F 8350 2GHz):

binaries, executable JAR files
source code in C, C++, Java
MATLAB scripts (with/out MEX files, compiled or not)
other languages may be possible on demand

There are 3 types of pairs of attributes on nodes and edges, so you have to prepare 3 programs, as described below.

1. for symbolic graphs (Alkane, Acyclic, PAH, MAO, MUTA)

6 inputs	node costs	c_node,sub, c_node,del/ins
	edge costs	c_edge,sub, c_edge,del/ins
	graphs as GXL files	g1.gxl, g2.gxl

myexec_symbolic c_node,sub c_node,del/ins c_edge,sub c_edge,del/ins g1.gxl g2.gxl

with costs computed by myexec_symbolic as follows:

dataset	node attribute	node costs	edge attribute	edge costs
Alkane	chemical symbol (label) `l`	`c_sub(u,v)=0` if `l(u)=l(v)`, `c_node,sub` else `c_del/ins(u)=c_node,del/ins` for all node `u`	valence (bound type) `t`	`c_sub(a,b)=0` if `t(a)=t(b)`, `c_edge,sub` else `c_del/ins(a)=c_edge,del/ins` for all edge `a`
PAH
MAO
Acyclic
MUTA

several values of parameters c_node,sub, c_node,del/ins, c_edge,sub, c_edge,del/ins are tested.

2. for GREC dataset

2 inputs	graphs as GXL files	g1.gxl, g2.gxl

myexec_grec g1.gxl g2.gxl

with costs computed by myexec_grec as follows:

GREC dataset	used attributes	costs
nodes	(x,y) coordinates `pos`, and type (label) `l`	`c_sub(u,v)`	`0.5*d_euc(pos(u),pos(v))` if `l(u)=l(v)` (same labels)
		`c_sub(u,v)`	`90` else (different labels)
		`c_ins/del`	`45`
edges	type0 (string: line or arc)`t` frequency (label) `f`	`c_sub(a,b)`	`0` if `f(a)=f(b)=1 and t(a)=t(b)` `15` if `f(a)=f(b)=1 and t(a)≠t(b)`
		`c_sub(a,b)`	`0` if `f(a)=f(b)=2` `7.5` if `f(a)≠f(b)`
		`c_ins/del(a)`	`7.5*f(a)`

where d_euc is the Euclidean distance.

3. for CMU dataset

2 inputs	graphs as GXL files	g1.gxl, g2.gxl

myexec_cmu g1.gxl g2.gxl

with costs computed by myexec_cmu as follows:

Dataset	node attributes	fixed node costs	edge attributes	edge costs
CMU	constant integer 1	`c_sub=0` `c_ins/del=inf`	Euclidean length `L`	`c_sub(a_i,a_j)=0.5\|L(a_i)-L(a_j)\|` `c_del/ins(a)=0.5*dist(a)`

Participants and methods

8 methods were submitted:

Bipartite GED with random walks and Hungarian algorithm adapted to error-correcting bipartite matching (LSAPE)
- Approximate Graph Edit Distance Guided by Bipartite Matching of Bags of Walks
  B. Gaüzère, S. Bougleux, K. Riesen, L. Brun
  S+SSPR 2014 [paper]
- Linear Sum Assignment with Edition
  S. Bougleux, L. Brun
  Research Report, Normandie Univ, 2016 [paper]
GED as a quadratic assignment problem approximated by IPFP (QAPE)
- Graph Edit Distance as a Quadratic Program
  S. Bougleux, B. Gaüzère, L. Brun
  ICPR 2016 [to appear]
- Graph Edit Distance as a Quadratic Assignment Problem
  S. Bougleux, L. Brun, V. Carletti, P. Foggia, B. Gaüzère, M. Vento
  Pattern Recognition Letters 87, 2017 [paper]
GED as a binary linear programming problem solved by CPLEX (F2)
Parallel version of F2 via a CPLEX parameter dedicated to number of threads (F24threads)
Relaxed version of F2 (F2LP)
- Graph edit distance: a new binary linear programming formulation
  J. Lerouge, Z. Abu-Aisheh, R. Raveaux, P. Héroux, S. Adam
  Research Report [paper]
- Exact Graph Edit Distance Computation Using a Binary Linear Program
  J. Lerouge, Z. Abu-Aisheh, R. Raveaux, P. Héroux, S. Adam
  S+SSPR 2016 [paper]
GED solved by depth-first-based branch-and-bound strategy (DF)
Exploration of one branch in the search tree of DF (DFUB)
- An Exact Graph Edit Distance Algorithm for Solving Pattern Recognition Problems
  Z. Abu-Aisheh, R. Raveaux, J.-Y. Ramel and P. Martineau
  ICPRAM 2015 [paper]
Parallel version of DF via a load balancing strategy (PDFS)
- A Parallel Graph Edit Distance Algorithm
  Z. Abu-Aisheh, R. Raveaux, J.-Y. Ramel and P. Martineau
  Research Report, 2017 [paper]

Beam search (BS), proposed by M. Neuhaus, K. Riesen and H. Bunke [1], was also tested

Results

Following quantities are reported here:

mean deviation from the reference distances (in %)
% of reached reference distances
mean computational time (in seconds)

More details are given in Evaluation and Prepare your method sections, and in the presentation given at ICPR. Also, a deeper analysis has been recently submitted to a journal.

For symbolic datasets, 3 cost settings have been used:

c_node,sub=2, c_node,del/ins=4, c_edge,sub=c_edge,ins/del=1
c_node,sub=2, c_node,del/ins=4, c_edge,sub=1, c_edge,ins/del=2
c_node,sub=6, c_node,del/ins=2, c_edge,sub=3, c_edge,ins/del=1

	Alkane
	cost parameter 1			cost parameter 2			cost parameter 3
	Dev	%	Time	Dev	%	Time	Dev	%	Time
BS	16.623	59.448	0.133	20	59.835	0.133	18.258	59.835	0.201
LSAPE	65.654	5.431	0.013	75.340	5.817	0.014	73.766	5.817	0.014
QAPE	18.939	54.715	0.029	23.264	54.546	0.028	21.482	54.546	0.028
F2	0	100	0.833	0	100	0.835	0	100	0.835
F24threads	0	100	0.918	0	100	0.931	0	100	0.922
F2LP	38.507	29.746	0.809	46.632	29.933	0.808	44.728	29.928	0.808
DF	0	99.613	0.705	0	100	0.706	0	100	0.703
DFUB	19.776	52.546	0.035	23.928	52.933	0.035	21.979	52.933	0.035
PDFS	0	100	0.517	0	100	0.493	0	100	0.526

	Acyclic
	cost 1			cost 2			cost 3
	Deviation	%	Time	Deviation	%	Time	Deviation	%	Time
BS	21.682	19.394	0.127	34.016	13.464	0.127	22.304	20.042	0.128
LSAPE	38.437	9.677	0.010	55.823	7.865	0.010	40.136	10.283	0.010
QAPE	9.390	64.340	0.024	13.732	57.165	0.023	10.141	57.759	0.024
F2	0	100	0.792	0	100	0.815	0	100	0.790
F24threads	0	100	0.839	0	100	0.865	0	100	0.837
F2LP	14.813	56.484	0.781	25.266	42.575	0.788	11.825	58.156	0.780
DF	0.045	99.782	2.949	0.080	99.647	3.351	0.056	99.674	4.320
DFUB	33.796	8.764	0.036	45.938	7.402	0.036	36.059	9.280	0.036
PDFS	0.007	99.967	1.976	0.019	99.922	2.359	0.019	99.895	2.359

	PAH
	cost 1			cost 2			cost 3
	Deviation	%	Time	Deviation	%	Time	Deviation	%	Time
BS	49.484	4.741	0.786	62.790	4.538	0.779	59.830	4.730	0.777
LSAPE	93.919	1.063	0.141	98.665	1.063	0.144	97.237	1.063	0.142
QAPE	41.361	9.042	0.196	54.656	9.325	0.191	51.627	9.574	0.189
F2	13.541	61.736	18.782	17.882	61.407	18.721	15.315	62.426	18.594
F24threads	1.294	84.019	11.428	3.278	83.080	11.511	1.426	83.997	11.408
F2LP	92.706	0.882	2.709	65.105	2.206	2.709	96.530	1.063	2.709
DF	33.827	16.772	27.446	45.019	16.455	27.435	42.137	16.862	27.441
DFUB	56.308	3.225	0.100	70.246	3.066	0.100	67.414	3.225	0.100
PDFS	31.023	20.348	26.498	41.550	20.133	26.484	38.656	20.484	26.484

	MUTA
	MUTA-10			MUTA-20			MUTA-30			MUTA-40			MUTA-50			MUTA-60			MUTA-70			MUTA-MIX
	Dev	Time	Dev	Dev	Time	Dev	Dev	Time	Dev	Dev	Time	Dev	Dev	Time	Dev	Dev	time	Dev	Dev	Time	Dev	Dev	Time	Dev
BS	13.769	0.15		49.054	0.56		48.628	1.54		59.997	3.42		55.078	7.21		45.563	14.94		69.847	28.42		26.347	11.69
LSAPE	54.456	0.02		86.870	0.11		91.132	0.46		89.868	1.47		84.990	4.25		88.332	11.69		86.487	27.76		45.623	4.98
QAPE	11.270	0.04		25.971	0.17		27.498	0.49		22.670	1.40		5.214	4.34		3.335	11.55		2.490	27.09		3.719	4.92
F2	0	0.61		0.547	6.24		11.179	25.72		18.591	27.50		26.893	30		43.395	34.14		56.530	42.20		9.375	22.89
F24threads	0	0.61		0	3.57		1.369	24.70		1.842	27.40		6.539	29.91		11.144	34.13		26.961	42.10		1.802	21.51
F2LP	15.186	0.57		47.308	1.38		72.450	3.09		65.052	7.70		46.954	16.51		47.900	30.36		57.691	42.08		18.559	14.15
DF	0	1.70		42.278	27.05		47.624	27.08		58.678	27.10		57.334	27.17		48.201	27.23		42.855	27.37		25.721	27.17
DFUB	28.554	0.04		64.687	0.09		66.615	0.14		71.280	0.22		70.072	0.29		65.286	0.37		57	0.50		30.948	0.27
PDFS	0	1.24		25.716	26.65		37.903	27.08		50.340	27.10		47.937	27.17		38.746	27.23		36.300	27.37		22.442	26.96

	MAO
	cost 1			cost 2			cost 3
	Deviation	%	Time	Deviation	%	Time	Deviation	%	Time
BS	5.980	72.577	0.559	9.427	67.236	0.562	7.880	74.372	0.537
LSAPE	30.593	19.355	0.097	45.677	16.976	0.094	76.581	1.384	0.086
QAPE	2.017	77.573	0.128	2.779	79.044	0.123	5.921	72.534	0.126
F2	0	100	1.065	0	100	1.268	0	100	0.979
F24threads	0	100	1.047	0	100	1.513	0	100	0.990
F2LP	1.242	89.100	1.050	0	100	1.268	2.330	86.029	0.977
DF	0.349	98.421	6.996	0.763	97.945	8.858	0.877	96.799	5.513
DFUB	11.254	56.444	0.083	16.157	56.293	0.083	13.902	59.580	0.083
PDFS	1.589	95.436	7.368	2.784	94.442	8.806	2.001	94.809	5.793

	CMU
	Deviation	%	Time
BS	54.636	28.484	29.235
F2	36.469	50.909	26.128
F24threads	0	50.909	24.708
F2LP	36.625	50.757	24.516
DF	39.313	40.454	25.773
DFUB	37.504	41.212	0.657
PDFS	38.641	40.606	24.788

	GREC
	GREC-5			GREC-10			GREC-15			GREC-20			GREC-MIX
	Deviation	%	Time	Deviation	%	Time	Deviation	%	Time	Deviation	%	Time	Deviation	%	Time
BS	2.907	71	0.085	1.291	61	0.413	1.955	42	0.833	3.265	32	1.601	1.900	39	0.622
F2	0	100	0.411	0	100	0.449	0	100	1.752	0	100	1.401	0	100	0.493
F24threads	0	100	0.420	0	100	0.488	0	100	1.240	0	100	1.335	0	100	0.519
F2LP	0	100	0.409	0.678	77	0.445	1.364	52	0.662	1.072	47	0.872	0.751	69	0.468
DF	0	100	0.099	0	100	5.308	1.310	47	16.503	2.071	39	21.848	0.486	83	7.685
DFUB	11.152	41	0.051	4.832	44	0.088	9.941	18	0.141	12.444	27	0.191	8.172	26	0.107
PDFS	0	100	0.155	0	100	3.677	1.210	48	16.691	1.556	43	22.149	0.467	85	7.150

Goal

Given a dataset decomposed into a train set, a validation set and a test set, determine the class of the graphs in the test set with a usual k-NN classifier (with k=3, decision at majority, reject if 3 different classes)

To this, you have to train your measure according to the train set and the validation set. The final output corresponds to a matrix of dissimilarities including all pairwise dissimilarities of graphs contained in the three sets (train, validation and test). In order to avoid any ambiguity in the interpretation of the matrix, please respect carrefully the order of the graphs provided in a separate file given with each datasets.

Several datasets are considered:

four existing datasets (already used in Challenge 1)
a new dataset of graphs encoding monoterpenes (main components of essential oils)

Evaluation

Methods are analyzed and compared, for each dataset, according to:

recognition rate
confusion matrix

You submit your results (dissimilarity matrix and computational time) and we perform the classification.

Participation

This challenge is open to any method which computes a dissimilarity between attributed graphs.

Contrary to Challenge 1, methods do not need to compute a distance based on edit paths. For such a distance, costs are not constrained in this challenge, and more generally, parameters can be optimized to obtain the best classification score. Also, running time and number of threads are not limited.

See How to participate for more informations

Datasets

Dataset	#graphs	#graphs train, validation, test	mean #nodes	mean degree	min #nodes	max #nodes	node attributes	#class
PAH	94	23, 23, 48	20.7	20.4	10	28	any attribute defined in the graph files can be used	2
MAO	68	17, 17, 34	18.4	2.1	11	27		2
MUTA	4337	1500, 500, 2337	30.3			417		2
GREC	1100	286, 286, 528	11.5			25		22
MONOTERP	to be announced							10

Since datasets are already known with classes (except for MONOTERP), 10% of graphs will be removed in each dataset, and decomposition of each dataset into train/validation/test does not correspond to original ones

About these datasets

PAH and MAO are part of GREYC's Chemistry dataset.
GREC and MUTA are part of IAM Graph Database Repository
MONOTERP is a new dataset (monoterpenes) available for the participants of the contest (no class is provided for graphs in the test set)
it will be included in GREYC's Chemistry dataset after the contest

Visit links for more information.

Downloading datasets needs to be inscribed to the challenge
Graph format: Graph eXchange Language (GXL)

How to participate

This challenge is open to any method that computes a dissimilarity measure between attributed graphs.

inscriptions before July 12
download datasets and compute dissimilarity matricies
submit your results before August 12, as a compressed archive (zip, tgz, ...) containing:
- a README file containing:
  - names of the participants
  - name of the method
  - an abstract (maximum 200 words) for contest report and website
- a folder for each dataset (having the name of the dataset), each containing:
  - a text file diss.dat containing the dissimilarity matrix M (each line of the matrix being on one line of the file), each line i and each column j correspond to a graph of the dataset, they are indexed according to their order in files train.cxl, then valid.cxl and then test.cxl, values are separated by space(s)
  - a file time.txt containing the time taken to compute this matrix
submit your extended abstract

When a method proposed for this challenge is also proposed for Challenge 1, two separate submissions must be done. In particular for Challenge 2, your abstract must detail the expression of the costs.

Important dates

submission deadlines
registration open	13 June
datasets available	15 June
registration closed	12 July
Challenge-1 Challenge-2 abstract	15 July 12 August 15 August
results	December

Inscriptions / Submissions

Register by sending an email to containing:
- the names of the participants
- the title of your method
- selected challenge(s), and if Challenge 1 is selected, the type of programs
Download datasets
- Challenge 1: from this website (datasets) for preparing your programs
- Challenge 2: from the link given after inscription (datasets)
Submit your results (see Challenge 1 or Challenge 2)
- Challenge 1: your three programs
- Challenge 2: your classification results
Submit your extended abstract (1 A4 page)

You can register/submit to only one challenge, to both challenges at the same time, or to one challenge and then to the other one. Register and submit your programs for Challenge 1 as soon as possible.

Any method must be described in an extended abstract (1 A4 page), containing details for Challenge 1, Challenge 2, or both.

New methods can be submitted (4-6 pages in ICPR style), reviewed, and published in ICPR proceedings (more details soon).

Scientific committee

Luc Brun	Normandie Univ, ENSICAEN, France
Xiaoyi Jiang	Univ Münster, Germany
Jean-Yves Ramel	Univ Tours, LI Tours, France

Organizing committee

Zeina Abu-Aisheh	Univ Tours, LI Tours, France
Sébastien Bougleux	Normandie Univ, UNICAEN, France
Benoit Gaüzère	Normandie Univ, INSA Rouen, LITIS, France

M. Neuhaus, K. Riesen and H. Bunke. Fast Suboptimal Algorithms for the Computation of Graph Edit Distance. Structural, Syntactic, and Statistical Pattern Recognition. LNCS Vol. 4109, pp 163-172, 2006
K. Riesen, M. Neuhaus, H. Bunke. Bipartite graph matching for computing the edit distance of graphs. Graph-Based Representations in Pattern Recognition. LNCS Vol. 4538, p. 1-12, 2007
Z. Abu-Aisheh, R. Raveaux, J.-Y. Ramel. A Graph Database Repository and Performance Evaluation Metrics for Graph Edit Distance. Graph-Based Representations in Pattern Recognition. LNCS Vol. 9069, p. 138-147, 2015

ICPR 2016 - Graph Distance Contest

News

Submissions

Challenge 1 GED computation

Goal

Evaluation

Participation

Datasets

About these datasets

Evaluation

Prepare your method

Example

Programs

1. for symbolic graphs (Alkane, Acyclic, PAH, MAO, MUTA)

2. for GREC dataset

3. for CMU dataset

Participants and methods

Results

Challenge 2 Graph classification

Goal

Evaluation

Participation

Datasets

About these datasets

How to participate

Dates and inscriptions

Important dates

Inscriptions / Submissions

Publications

Organizers

Scientific committee

Organizing committee

References