Graph Dataset
We briefly introduce the dataset format of DeepRobust through self-contained examples. In essence, DeepRobust-Graph provides the following main features:
Clean (Unattacked) Graphs for Node Classification
Graphs are ubiquitous data structures describing pairwise relations between entities.
A single clean graph in DeepRobust is described by an instance of deeprobust.graph.data.Dataset, which holds the following attributes by default:
data.adj: Graph adjacency matrix in scipy.sparse.csr_matrix format with shape [num_nodes, num_nodes]
data.features: Node feature matrix with shape [num_nodes, num_node_features]
data.labels: Target to train against (may have arbitrary shape), e.g., node-level targets of shape [num_nodes, *]
data.idx_train: Array of training node indices
data.idx_val: Array of validation node indices
data.idx_test: Array of test node indices
By default, the loaded deeprobust.graph.data.Dataset selects the largest connected component of the graph, but users can choose other settings by passing different parameters.
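To make "largest connected component" concrete, here is a minimal pure-Python sketch on a toy edge list. The helper name `largest_connected_component` is hypothetical and this is not DeepRobust's own code, which operates on sparse matrices; it only illustrates which nodes survive the filtering.

```python
from collections import deque


def largest_connected_component(num_nodes, edges):
    """Return the sorted node ids of the largest connected component.

    `edges` is an iterable of undirected (u, v) pairs.
    """
    # Build an undirected adjacency list.
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    seen = [False] * num_nodes
    best = []
    for start in range(num_nodes):
        if seen[start]:
            continue
        # Breadth-first search collects one whole component.
        component = [start]
        seen[start] = True
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if not seen[nxt]:
                    seen[nxt] = True
                    component.append(nxt)
                    queue.append(nxt)
        # Keep the first component found among those of maximal size.
        if len(component) > len(best):
            best = component
    return sorted(best)


# Toy graph: nodes 0-3 form one component, 4-5 another, 6 is isolated.
toy_edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
lcc_nodes = largest_connected_component(7, toy_edges)
print(lcc_nodes)  # [0, 1, 2, 3]
```

Nodes outside the returned list (here 4, 5 and 6) would be dropped, so the node count of a loaded graph can be smaller than that of the raw dataset.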
Currently DeepRobust supports the following datasets: Cora, Cora-ML, Citeseer, Pubmed, Polblogs, ACM, BlogCatalog, Flickr, UAI.
More details about the datasets can be found here.
By default, the data splits are generated by deeprobust.graph.utils.get_train_val_test, which randomly splits the data into 10%/10%/80% for training/validation/test. You can also generate splits yourself with deeprobust.graph.utils.get_train_val_test or deeprobust.graph.utils.get_train_val_test_gcn.
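As a rough illustration of what a seeded random 10%/10%/80% split looks like, the following sketch uses only the Python standard library. `random_split` is a hypothetical helper written for this page, not the DeepRobust function, and it ignores any label stratification a library routine may apply.

```python
import random


def random_split(num_nodes, val_size=0.1, test_size=0.8, seed=None):
    """Shuffle node ids and cut them into train/val/test index lists."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    indices = list(range(num_nodes))
    rng.shuffle(indices)

    n_test = int(round(num_nodes * test_size))
    n_val = int(round(num_nodes * val_size))
    n_train = num_nodes - n_val - n_test  # the remainder is used for training

    idx_train = sorted(indices[:n_train])
    idx_val = sorted(indices[n_train:n_train + n_val])
    idx_test = sorted(indices[n_train + n_val:])
    return idx_train, idx_val, idx_test


idx_train, idx_val, idx_test = random_split(1000, seed=15)
print(len(idx_train), len(idx_val), len(idx_test))  # 100 100 800
```

Calling the helper again with the same seed returns the same split, while a different seed returns a different one. This is why results on randomly split data are only comparable when the seed is fixed.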
Note that this class accepts a setting parameter, which can be chosen from ["nettack", "gcn", "prognn"]:
setting="nettack": use the largest connected component of the graph, with 10%/10%/80% data splits;
setting="gcn": use the full graph, with 20 nodes per class for training, 500 nodes for validation and 1000 nodes for testing (randomly chosen);
setting="prognn": use the largest connected component, with the data splits provided by ProGNN (10%/10%/80%).
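The "gcn" split described above (a fixed number of training nodes per class, then fixed-size validation and test sets) can be sketched as follows. This is an illustration under assumptions rather than the library's implementation: `gcn_style_split` is a hypothetical name, the labels are synthetic, and the real routine may draw its samples differently.

```python
import random
from collections import defaultdict


def gcn_style_split(labels, per_class=20, num_val=500, num_test=1000, seed=None):
    """Pick `per_class` training nodes per class, then random val/test nodes."""
    rng = random.Random(seed)

    # Group node ids by class label.
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)

    # rng.sample raises ValueError if a class has fewer than `per_class` nodes.
    idx_train = []
    for label in sorted(by_class):
        idx_train.extend(rng.sample(by_class[label], per_class))

    # Validation and test nodes are drawn from the nodes not used for training.
    train_set = set(idx_train)
    remaining = [n for n in range(len(labels)) if n not in train_set]
    rng.shuffle(remaining)
    idx_val = remaining[:num_val]
    idx_test = remaining[num_val:num_val + num_test]
    return sorted(idx_train), sorted(idx_val), sorted(idx_test)


# Synthetic labels: 3000 nodes, 7 classes assigned round-robin.
labels = [i % 7 for i in range(3000)]
idx_train, idx_val, idx_test = gcn_style_split(labels, seed=15)
print(len(idx_train), len(idx_val), len(idx_test))  # 140 500 1000
```

Unlike the percentage-based splits, the training set here grows with the number of classes rather than with the number of nodes, and any nodes left over after the three sets are filled stay unused.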
Note
The 'nettack' and 'gcn' settings do not provide fixed splits, i.e., different random seeds return different data splits.
Note
To use the full graph, use the 'gcn' setting.
The following example shows how to load DeepRobust datasets:
from deeprobust.graph.data import Dataset
from deeprobust.graph.utils import get_train_val_test
# load the Cora dataset
data = Dataset(root='/tmp/', name='cora', seed=15)
adj, features, labels = data.adj, data.features, data.labels
idx_train, idx_val, idx_test = data.idx_train, data.idx_val, data.idx_test
# you can also split the data yourself
idx_train, idx_val, idx_test = get_train_val_test(adj.shape[0], val_size=0.1, test_size=0.8)
# load the ACM dataset
data = Dataset(root='/tmp/', name='acm', seed=15)
DeepRobust also provides access to Amazon and Coauthor datasets loaded from PyTorch Geometric: Amazon-Computers, Amazon-Photo, Coauthor-CS, Coauthor-Physics.
Users can also easily create their own datasets by creating a class with the following attributes: data.adj, data.features, data.labels, data.idx_train, data.idx_val, data.idx_test.
Attacked Graphs for Node Classification
DeepRobust provides attacked graphs perturbed by metattack and nettack. The graphs were attacked using the authors' TensorFlow implementation, on a random split generated with seed 15. The download link can be found in the ProGNN code, and the performance of various GNNs on these graphs is reported in the ProGNN paper. The attacked graphs are instances of deeprobust.graph.data.PrePtbDataset with only one attribute, adj. Hence, deeprobust.graph.data.PrePtbDataset is often used together with deeprobust.graph.data.Dataset to obtain node features and labels.
For metattack, DeepRobust provides attacked graphs for Cora, Citeseer, Polblogs and Pubmed, and the perturbation rate can be chosen from [0.05, 0.1, 0.15, 0.2, 0.25].
from deeprobust.graph.data import Dataset, PrePtbDataset
# Either setting='prognn' or seed=15 gives the ProGNN splits,
# since the attacked graphs were generated under seed 15
# data = Dataset(root='/tmp/', name='cora', seed=15)
data = Dataset(root='/tmp/', name='cora', setting='prognn')
adj, features, labels = data.adj, data.features, data.labels
idx_train, idx_val, idx_test = data.idx_train, data.idx_val, data.idx_test
# Load the metattack-perturbed graph
perturbed_data = PrePtbDataset(root='/tmp/',
                               name='cora',
                               attack_method='meta',
                               ptb_rate=0.05)
perturbed_adj = perturbed_data.adj
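One way to sanity-check a perturbed graph is to count how many undirected edges differ from the clean graph. The sketch below assumes that the perturbation rate means "flipped edges divided by clean edges", which this page does not state explicitly. It uses small dense 0/1 lists instead of SciPy matrices, and the helper names are hypothetical.

```python
def undirected_edges(adj_rows):
    """Collect undirected edges (u, v) with u < v from a dense 0/1 adjacency."""
    edges = set()
    for u, row in enumerate(adj_rows):
        for v, value in enumerate(row):
            if value and u < v:
                edges.add((u, v))
    return edges


def perturbation_rate(clean_adj, perturbed_adj):
    """Fraction of flipped edges relative to the number of clean edges."""
    clean = undirected_edges(clean_adj)
    perturbed = undirected_edges(perturbed_adj)
    flipped = clean ^ perturbed  # edges added or removed by the attack
    return len(flipped) / len(clean)


# Clean toy graph with 4 edges: 0-1, 0-2, 1-2, 2-3.
clean_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
# The "attack" adds one edge (0-3) and removes nothing.
perturbed_adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]
print(perturbation_rate(clean_adj, perturbed_adj))  # 0.25
```

For the small SciPy sparse matrices used above, a dense copy obtained with `adj.toarray()` could be passed to the same helpers. For large graphs, comparing the sparse matrices directly would be preferable.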
For nettack, DeepRobust provides attacked graphs for Cora, Citeseer, Polblogs and Pubmed, and ptb_rate indicates the number of perturbations made on each node. It can be chosen from [1.0, 2.0, 3.0, 4.0, 5.0].
from deeprobust.graph.data import Dataset, PrePtbDataset
# data = Dataset(root='/tmp/', name='cora', seed=15)  # the attacked graphs were generated under seed 15
data = Dataset(root='/tmp/', name='cora', setting='prognn')
adj, features, labels = data.adj, data.features, data.labels
idx_train, idx_val, idx_test = data.idx_train, data.idx_val, data.idx_test
# Load the nettack-perturbed graph
perturbed_data = PrePtbDataset(root='/tmp/', name='cora',
                               attack_method='nettack',
                               ptb_rate=3.0)  # here ptb_rate is the number of perturbations per node
perturbed_adj = perturbed_data.adj
# evaluate on the nodes targeted by nettack
idx_test = perturbed_data.target_nodes
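Replacing idx_test with the target nodes means that accuracy is measured only on the nodes nettack attacked. The toy sketch below shows the difference between accuracy on all nodes and accuracy on a target subset. The predictions, labels and the helper `accuracy_on` are made up for illustration and do not come from any DeepRobust model.

```python
def accuracy_on(indices, predictions, labels):
    """Fraction of nodes in `indices` whose prediction matches the label."""
    indices = list(indices)
    if not indices:
        raise ValueError("indices must be non-empty")
    correct = sum(1 for i in indices if predictions[i] == labels[i])
    return correct / len(indices)


# Made-up labels and predictions for six nodes.
labels = [0, 1, 1, 0, 2, 2]
predictions = [0, 1, 0, 0, 2, 1]
target_nodes = [2, 3, 5]  # pretend these are the attacked nodes

overall = accuracy_on(range(len(labels)), predictions, labels)  # 4 of 6 correct
targeted = accuracy_on(target_nodes, predictions, labels)       # 1 of 3 correct
print(overall, targeted)
```

A model can look robust on the full test set while failing on most of the attacked nodes, which is why the targeted evaluation is reported separately.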
Converting Graph Data between DeepRobust and PyTorch Geometric
Given the popularity of PyTorch Geometric in the graph representation learning community, we also provide tools for converting data between DeepRobust and PyTorch Geometric. We can use deeprobust.graph.data.Dpr2Pyg to convert DeepRobust data to PyTorch Geometric and deeprobust.graph.data.Pyg2Dpr to convert PyTorch Geometric data to DeepRobust.
For example, we can first create an instance of the Dataset class and convert it to the PyTorch Geometric data format:
from deeprobust.graph.data import Dataset, Dpr2Pyg, Pyg2Dpr
data = Dataset(root='/tmp/', name='cora') # load clean graph
pyg_data = Dpr2Pyg(data) # convert dpr to pyg
print(pyg_data)
print(pyg_data[0])
dpr_data = Pyg2Dpr(pyg_data) # convert pyg to dpr
print(dpr_data.adj)
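The main structural difference between the two formats is how connectivity is stored: DeepRobust keeps an adjacency matrix, while PyTorch Geometric keeps an edge_index of shape [2, num_edges]. The following pure-Python sketch shows that correspondence on a toy graph. It is not the code of Dpr2Pyg or Pyg2Dpr, and both helper names are hypothetical.

```python
def adjacency_to_edge_index(adj_rows):
    """Return [sources, targets] for every nonzero entry of a dense adjacency."""
    sources, targets = [], []
    for u, row in enumerate(adj_rows):
        for v, value in enumerate(row):
            if value:
                sources.append(u)
                targets.append(v)
    return [sources, targets]


def edge_index_to_adjacency(edge_index, num_nodes):
    """Rebuild a dense 0/1 adjacency from [sources, targets]."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in zip(*edge_index):
        adj[u][v] = 1
    return adj


# Path graph 0 - 1 - 2, stored symmetrically.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
edge_index = adjacency_to_edge_index(adj)
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(edge_index_to_adjacency(edge_index, 3) == adj)  # True
```

Each undirected edge appears twice in edge_index, once per direction, mirroring the symmetric adjacency matrix.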
Load OGB Datasets
Open Graph Benchmark (OGB) provides various benchmark datasets. DeepRobust now provides an interface to convert the OGB dataset format (PyG data format) to the DeepRobust format.
from ogb.nodeproppred import PygNodePropPredDataset
from deeprobust.graph.data import Pyg2Dpr
pyg_data = PygNodePropPredDataset(name='ogbn-arxiv')
dpr_data = Pyg2Dpr(pyg_data) # convert pyg to dpr
Load PyTorch Geometric Amazon and Coauthor Datasets
DeepRobust also provides access to the Amazon and Coauthor datasets from PyTorch Geometric, i.e., Amazon-Computers, Amazon-Photo, Coauthor-CS and Coauthor-Physics. Specifically, users can access them through deeprobust.graph.data.AmazonPyg and deeprobust.graph.data.CoauthorPyg.
For example, we can directly load the Amazon datasets from DeepRobust in PyG format as follows:
from deeprobust.graph.data import AmazonPyg
computers = AmazonPyg(root='/tmp', name='computers')
print(computers)
print(computers[0])
photo = AmazonPyg(root='/tmp', name='photo')
print(photo)
print(photo[0])
Similarly, we can also load the Coauthor datasets:
from deeprobust.graph.data import CoauthorPyg
cs = CoauthorPyg(root='/tmp', name='cs')
print(cs)
print(cs[0])
physics = CoauthorPyg(root='/tmp', name='physics')
print(physics)
print(physics[0])