Graph Dataset

We briefly introduce the dataset format of DeepRobust through self-contained examples. In essence, DeepRobust-Graph provides the following main features:

Clean (Unattacked) Graphs for Node Classification

Graphs are ubiquitous data structures describing pairwise relations between entities. A single clean graph in DeepRobust is described by an instance of deeprobust.graph.data.Dataset, which holds the following attributes by default:

  • data.adj: Graph adjacency matrix in scipy.sparse.csr_matrix format with shape [num_nodes, num_nodes]
  • data.features: Node feature matrix with shape [num_nodes, num_node_features]
  • data.labels: Target to train against (may have arbitrary shape), e.g., node-level targets of shape [num_nodes, *]
  • data.idx_train: Array of training node indices
  • data.idx_val: Array of validation node indices
  • data.idx_test: Array of test node indices

By default, the loaded deeprobust.graph.data.Dataset selects the largest connected component of the graph, but users can choose a different behavior by passing different parameters.
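To make "largest connected component" concrete, here is a minimal sketch that extracts it from a sparse adjacency matrix with SciPy. The helper name largest_connected_component and the toy graph are illustrative assumptions; this is not DeepRobust's own implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components


def largest_connected_component(adj):
    """Return the subgraph induced by the largest component and the kept node ids.

    Hypothetical helper for illustration only.
    """
    _, component_ids = connected_components(adj, directed=False)
    sizes = np.bincount(component_ids)
    keep = np.flatnonzero(component_ids == np.argmax(sizes))
    return adj[keep][:, keep], keep


# Toy graph: nodes 0-1-2 form a path, nodes 3-4 form a separate edge.
rows = np.array([0, 1, 1, 2, 3, 4])
cols = np.array([1, 0, 2, 1, 4, 3])
adj = sp.csr_matrix((np.ones(6), (rows, cols)), shape=(5, 5))

lcc_adj, kept_nodes = largest_connected_component(adj)
print(kept_nodes)     # ids of the nodes that survive
print(lcc_adj.shape)  # adjacency restricted to those nodes
```

When a graph is reduced this way, the feature matrix and the label array have to be indexed with the same kept node ids so that all three stay aligned.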

Currently DeepRobust supports the following datasets: Cora, Cora-ML, Citeseer, Pubmed, Polblogs, ACM, BlogCatalog, Flickr, UAI. More details about the datasets can be found here.

By default, the data splits are generated by deeprobust.graph.utils.get_train_val_test, which randomly splits the nodes into 10%/10%/80% for training/validation/test. You can also generate splits yourself with deeprobust.graph.utils.get_train_val_test or deeprobust.graph.utils.get_train_val_test_gcn. Note that the Dataset class also accepts a setting parameter, which can be chosen from ["nettack", "gcn", "prognn"]:

  • setting="nettack": use the largest connected component of the graph, with 10%/10%/80% data splits;
  • setting="gcn": use the full graph, with 20 nodes per class for training, 500 nodes for validation and 1000 nodes for testing (randomly chosen);
  • setting="prognn": use the largest connected component, with the data splits provided by ProGNN (10%/10%/80%).

Note

The ‘nettack’ and ‘gcn’ settings do not provide fixed splits, i.e., different random seeds return different data splits.

Note

If you want to use the full graph, please use the ‘gcn’ setting.
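The effect of the random seed on a 10%/10%/80% split can be sketched with NumPy alone. The function random_split below is a hypothetical stand-in written for illustration. It is not deeprobust.graph.utils.get_train_val_test, and it does not try to balance classes across the splits.

```python
import numpy as np


def random_split(num_nodes, val_size=0.1, test_size=0.8, seed=None):
    """Shuffle node indices and cut them into train/val/test blocks.

    Hypothetical helper: an unstratified sketch, not the library's code.
    """
    rng = np.random.RandomState(seed)
    perm = rng.permutation(num_nodes)
    n_val = int(round(val_size * num_nodes))
    n_test = int(round(test_size * num_nodes))
    n_train = num_nodes - n_val - n_test
    idx_train = perm[:n_train]
    idx_val = perm[n_train:n_train + n_val]
    idx_test = perm[n_train + n_val:]
    return idx_train, idx_val, idx_test


train_a, val_a, test_a = random_split(1000, seed=15)
train_b, val_b, test_b = random_split(1000, seed=16)

print(len(train_a), len(val_a), len(test_a))  # split sizes
print(np.array_equal(train_a, train_b))       # do two seeds give the same split?
```

Fixing the seed makes a split reproducible, which is why comparisons against pre-generated attacked graphs need the same seed (or a fixed split) as the one used when the attack was run.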

The following example shows how to load DeepRobust datasets:

from deeprobust.graph.data import Dataset
from deeprobust.graph.utils import get_train_val_test
# load the cora dataset
data = Dataset(root='/tmp/', name='cora', seed=15)
adj, features, labels = data.adj, data.features, data.labels
idx_train, idx_val, idx_test = data.idx_train, data.idx_val, data.idx_test
# you can also split the data yourself
idx_train, idx_val, idx_test = get_train_val_test(adj.shape[0], val_size=0.1, test_size=0.8)

# load the acm dataset
data = Dataset(root='/tmp/', name='acm', seed=15)

DeepRobust also provides access to Amazon and Coauthor datasets loaded from PyTorch Geometric: Amazon-Computers, Amazon-Photo, Coauthor-CS, Coauthor-Physics.

Users can also easily create their own datasets by defining a class with the following attributes: data.adj, data.features, data.labels, data.idx_train, data.idx_val, data.idx_test.
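A minimal sketch of such a user-defined container is shown below. It uses the index attribute names that the loading examples in this section read (idx_train, idx_val, idx_test). The class name ToyGraphData, the dataclass form, and the dense feature matrix are illustrative assumptions, since the text does not prescribe them.

```python
from dataclasses import dataclass

import numpy as np
import scipy.sparse as sp


@dataclass
class ToyGraphData:
    """Hypothetical container exposing the attributes listed above."""
    adj: sp.csr_matrix
    features: np.ndarray
    labels: np.ndarray
    idx_train: np.ndarray
    idx_val: np.ndarray
    idx_test: np.ndarray


# A 4-node cycle 0-1-2-3-0 stored as a symmetric sparse adjacency matrix.
rows = np.array([0, 1, 1, 2, 2, 3, 3, 0])
cols = np.array([1, 0, 2, 1, 3, 2, 0, 3])
adj = sp.csr_matrix((np.ones(8), (rows, cols)), shape=(4, 4))

data = ToyGraphData(
    adj=adj,
    features=np.eye(4),             # one-hot features, dense for brevity
    labels=np.array([0, 0, 1, 1]),
    idx_train=np.array([0, 2]),
    idx_val=np.array([1]),
    idx_test=np.array([3]),
)

# Basic consistency checks worth running on any hand-built dataset.
num_nodes = data.adj.shape[0]
assert data.features.shape[0] == num_nodes
assert data.labels.shape[0] == num_nodes
assert (data.adj != data.adj.T).nnz == 0  # undirected graph: symmetric adjacency
print(num_nodes, data.adj.nnz // 2)       # number of nodes and undirected edges
```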

Attacked Graphs for Node Classification

DeepRobust provides attacked graphs perturbed by metattack and nettack. The graphs were attacked using the authors’ TensorFlow implementations, on a random split generated with seed 15. The download link can be found in the ProGNN code, and the performance of various GNNs on these graphs can be found in the ProGNN paper. The attacked graphs are instances of deeprobust.graph.data.PrePtbDataset, whose main attribute is the perturbed adjacency matrix adj. Hence, deeprobust.graph.data.PrePtbDataset is usually used together with deeprobust.graph.data.Dataset, which supplies the node features and labels.

For metattack, DeepRobust provides attacked graphs for Cora, Citeseer, Polblogs and Pubmed, and the perturbation rate can be chosen from [0.05, 0.1, 0.15, 0.2, 0.25].

from deeprobust.graph.data import Dataset, PrePtbDataset
# The attacked graphs were generated under seed 15, so load the clean data with
# either setting='prognn' or seed=15 to get the ProGNN splits
data = Dataset(root='/tmp/', name='cora', setting='prognn')
# data = Dataset(root='/tmp/', name='cora', seed=15) # alternative way to get the same splits
adj, features, labels = data.adj, data.features, data.labels
idx_train, idx_val, idx_test = data.idx_train, data.idx_val, data.idx_test
# load the graph attacked by metattack
perturbed_data = PrePtbDataset(root='/tmp/',
                               name='cora',
                               attack_method='meta',
                               ptb_rate=0.05)
perturbed_adj = perturbed_data.adj
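One way to sanity-check a perturbed graph is to count how many edges differ from the clean graph. The sketch below does this on synthetic matrices. The helpers count_edge_flips and symmetric_adj are hypothetical, and reading the perturbation rate as "changed edges divided by clean edges" is an assumption, since the text above does not define the rate.

```python
import numpy as np
import scipy.sparse as sp


def count_edge_flips(clean_adj, perturbed_adj):
    """Count undirected edges that differ between two symmetric 0/1 matrices."""
    diff = abs(clean_adj - perturbed_adj)
    return int(diff.sum() // 2)  # every undirected edge is stored twice


def symmetric_adj(edges, num_nodes):
    """Build a symmetric CSR adjacency matrix from a list of (u, v) pairs."""
    rows, cols = zip(*edges)
    upper = sp.coo_matrix((np.ones(len(edges)), (rows, cols)),
                          shape=(num_nodes, num_nodes))
    return (upper + upper.T).tocsr()


clean = symmetric_adj([(0, 1), (1, 2), (2, 3), (3, 4)], num_nodes=5)
# Stand-in for an attacked graph: edge (3, 4) removed, edge (0, 4) added.
perturbed = symmetric_adj([(0, 1), (1, 2), (2, 3), (0, 4)], num_nodes=5)

flips = count_edge_flips(clean, perturbed)
num_clean_edges = clean.nnz // 2
print(flips, flips / num_clean_edges)  # changed edges and their share of the clean graph
```

With real data, the same check would compare the clean adj against perturbed_adj, provided both were loaded with matching settings so that they share the same node set.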

For nettack, DeepRobust provides attacked graphs for Cora, Citeseer, Polblogs and Pubmed. Here ptb_rate indicates the number of perturbations made on each target node, and it can be chosen from [1.0, 2.0, 3.0, 4.0, 5.0].

from deeprobust.graph.data import Dataset, PrePtbDataset
# data = Dataset(root='/tmp/', name='cora', seed=15) # the attacked graphs were generated under seed 15
data = Dataset(root='/tmp/', name='cora', setting='prognn')
adj, features, labels = data.adj, data.features, data.labels
idx_train, idx_val, idx_test = data.idx_train, data.idx_val, data.idx_test
# load the graph attacked by nettack
perturbed_data = PrePtbDataset(root='/tmp/',
                               name='cora',
                               attack_method='nettack',
                               ptb_rate=3.0) # here ptb_rate is the number of perturbations per target node
perturbed_adj = perturbed_data.adj
# evaluate on the attacked target nodes
idx_test = perturbed_data.target_nodes

Converting Graph Data between DeepRobust and PyTorch Geometric

Given the popularity of PyTorch Geometric in the graph representation learning community, we also provide tools for converting data between DeepRobust and PyTorch Geometric. Use deeprobust.graph.data.Dpr2Pyg to convert DeepRobust data to PyTorch Geometric, and deeprobust.graph.data.Pyg2Dpr to convert PyTorch Geometric data to DeepRobust. For example, we can first create an instance of the Dataset class and then convert it to the PyTorch Geometric data format.

from deeprobust.graph.data import Dataset, Dpr2Pyg, Pyg2Dpr
data = Dataset(root='/tmp/', name='cora') # load clean graph
pyg_data = Dpr2Pyg(data) # convert dpr to pyg
print(pyg_data)
print(pyg_data[0])
dpr_data = Pyg2Dpr(pyg_data) # convert pyg to dpr
print(dpr_data.adj)
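The two formats differ mainly in how connectivity is stored: DeepRobust keeps a sparse adjacency matrix, while PyTorch Geometric keeps an edge_index array of shape [2, num_edges]. The sketch below shows that correspondence with NumPy and SciPy only. The helpers adj_to_edge_index and edge_index_to_adj are hypothetical illustrations, not the code behind Dpr2Pyg or Pyg2Dpr, and they use NumPy arrays where PyTorch Geometric would use tensors.

```python
import numpy as np
import scipy.sparse as sp


def adj_to_edge_index(adj):
    """CSR adjacency -> integer array of shape [2, number of stored entries]."""
    coo = adj.tocoo()
    return np.vstack([coo.row, coo.col]).astype(np.int64)


def edge_index_to_adj(edge_index, num_nodes):
    """Edge list of shape [2, num_edges] -> unweighted CSR adjacency."""
    values = np.ones(edge_index.shape[1])
    return sp.csr_matrix((values, (edge_index[0], edge_index[1])),
                         shape=(num_nodes, num_nodes))


# Path graph 0-1-2, with each undirected edge stored in both directions.
rows = np.array([0, 1, 1, 2])
cols = np.array([1, 0, 2, 1])
adj = sp.csr_matrix((np.ones(4), (rows, cols)), shape=(3, 3))

edge_index = adj_to_edge_index(adj)
restored = edge_index_to_adj(edge_index, num_nodes=3)
print(edge_index.shape)            # two rows: source nodes and target nodes
print((adj != restored).nnz == 0)  # does the round trip preserve the graph?
```

Because an undirected edge appears once per direction in both representations, the number of columns in edge_index equals the number of stored entries in the adjacency matrix.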

Load OGB Datasets

Open Graph Benchmark (OGB) provides a variety of benchmark datasets. DeepRobust now provides an interface for converting the OGB dataset format (PyG data format) to the DeepRobust format.

from ogb.nodeproppred import PygNodePropPredDataset
from deeprobust.graph.data import Pyg2Dpr
pyg_data = PygNodePropPredDataset(name='ogbn-arxiv')
dpr_data = Pyg2Dpr(pyg_data) # convert pyg to dpr

Load PyTorch Geometric Amazon and Coauthor Datasets

DeepRobust also provides access to the Amazon and Coauthor datasets from PyTorch Geometric, i.e., Amazon-Computers, Amazon-Photo, Coauthor-CS and Coauthor-Physics. Specifically, users can access them through deeprobust.graph.data.AmazonPyg and deeprobust.graph.data.CoauthorPyg. For example, we can load the Amazon datasets directly from DeepRobust in PyG format as follows:

from deeprobust.graph.data import AmazonPyg
computers = AmazonPyg(root='/tmp', name='computers')
print(computers)
print(computers[0])
photo = AmazonPyg(root='/tmp', name='photo')
print(photo)
print(photo[0])

Similarly, we can load the Coauthor datasets:

from deeprobust.graph.data import CoauthorPyg
cs = CoauthorPyg(root='/tmp', name='cs')
print(cs)
print(cs[0])
physics = CoauthorPyg(root='/tmp', name='physics')
print(physics)
print(physics[0])