Complete results and additional material for the article “Driven PCTBagging: Seeking greater discriminating capacity for the same level of interpretability”

2024-03-20

This page contains the additional material related to the work presented in the article:

Jesús M. Pérez, Olatz Arbelaitz and Javier Muguerza. "Driven PCTBagging: Seeking greater discriminating capacity for the same level of interpretability". XX Conference of the Spanish Association for Artificial Intelligence (CAEPIA'24).

All the tables of results can be downloaded as an OpenDocument Spreadsheet (ODS) file.

Table of Contents

1. Datasets characteristics

Table 1: Description of imbalanced datasets

2. Subsample numbers by data set to achieve the selected coverage value

Table 2: Subsample amounts for imbalanced data sets

3. Results

3.1. Discriminating capacity

Figure 1: Average AUC values for the 33 datasets

Table 3: AUC values for all algorithms over 33 datasets

Figure 2: Average balanced accuracy values for the 33 datasets

Table 4: Balanced Accuracy values for all algorithms over 33 datasets

Figure 3: Average True Positive Rate values for the 33 datasets

Table 5: True Positive Rate values for all algorithms over 33 datasets

3.2. Structural complexity

Figure 4: Average values of the number of internal nodes for the 33 datasets

Table 6: Internal Nodes values for the algorithms with explaining capacity over 33 datasets

Table 7: Average values of the number of internal nodes of all the trees of the ensembles over 33 datasets

3.3. Computational cost

Figure 5: Average construction time values for the 33 datasets

Table 8: Elapsed Time Training Rate values for all algorithms over 33 datasets

1. Datasets characteristics

This section contains the table with the characteristics of the 33 datasets from the KEEL repository used in this study. We present the datasets from the second (Imbalanced) context.

Table 1: Description of imbalanced datasets

Data set        #Atts.  #Examples  Imbalance  Size of Min. Class  Size of Maj. Class
Abalone19       8       4174       0.77%      32                  4142
Yeast6          8       1484       2.49%      37                  1447
Yeast5          8       1484       2.96%      44                  1440
Yeast4          8       1484       3.43%      51                  1433
Yeast2vs8       8       482        4.15%      20                  462
Glass5          9       214        4.2%       9                   205
Abalone9vs18    8       731        5.65%      41                  690
Glass4          9       214        6.07%      13                  201
Ecoli4          7       336        6.74%      23                  313
Glass2          9       214        8.78%      19                  195
Vowel0          13      988        9.01%      89                  899
Page-blocks0    10      5472       10.23%     560                 4912
Ecoli3          7       336        10.88%     37                  299
Yeast3          8       1484       10.98%     163                 1321
Glass6          9       214        13.55%     29                  185
Segment0        19      2308       14.26%     329                 1979
Ecoli2          7       336        15.48%     52                  284
New-thyroid1    5       215        16.28%     35                  180
New-thyroid2    5       215        16.89%     36                  179
Ecoli1          7       336        22.92%     77                  259
Vehicle0        18      846        23.64%     200                 646
Glass0123vs456  9       214        23.83%     51                  163
Haberman        3       306        27.42%     84                  222
Vehicle1        18      846        28.37%     240                 606
Vehicle2        18      846        28.37%     240                 606
Vehicle3        18      846        28.37%     240                 606
Yeast1          8       1484       28.91%     429                 1055
Glass0          9       214        32.71%     70                  144
Iris0           4       150        33.33%     50                  100
Pima            8       768        34.84%     268                 500
Ecoli0vs1       7       220        35%        77                  143
Wisconsin       9       683        35%        239                 444
Glass1          9       214        35.51%     76                  138
Mean            9.39    919.94     17.61%     120                 799.94
Median          8       482        15.48%     52                  444

2. Subsample numbers by data set to achieve the selected coverage value

The table in this section shows the number of subsamples computed for each data set to reach a 99% coverage value.

Table 2: Subsample amounts for imbalanced data sets


                Original        Training sample                           Subsample set
Data set        Size    %Min    Size    Min. Class Size  Maj. Class Size  Size    Number
Abalone19       4174    0.77    3340    26               3314             52      585
Yeast6          1484    2.49    1188    30               1158             60      176
Yeast5          1484    2.96    1189    36               1153             72      146
Yeast4          1484    3.43    1188    41               1147             82      127
Yeast2vs8       482     4.15    387     17               370              34      98
Glass5          214     4.2     173     8                165              16      93
Abalone9vs18    731     5.65    586     34               552              68      73
Glass4          214     6.07    172     11               161              22      66
Ecoli4          336     6.74    270     19               251              38      59
Glass2          214     8.78    173     16               157              32      43
Vowel0          988     9.01    792     72               720              144     44
Page-blocks0    5472    10.23   4378    448              3930             896     39
Ecoli3          336     10.88   270     30               240              60      35
Yeast3          1484    10.98   1188    131              1057             262     35
Glass6          214     13.55   173     24               149              48      27
Segment0        2308    14.26   1848    264              1584             528     26
Ecoli2          336     15.48   270     42               228              84      23
New-thyroid1    215     16.28   173     29               144              58      21
New-thyroid2    215     16.89   173     30               143              60      20
Ecoli1          336     22.92   270     62               208              124     14
Vehicle0        846     23.64   677     160              517              320     13
Glass0123vs456  214     23.83   172     41               131              82      13
Haberman        306     27.42   246     68               178              136     10
Vehicle1        846     28.37   678     193              485              386     10
Vehicle2        846     28.37   678     193              485              386     10
Vehicle3        846     28.37   678     193              485              386     10
Yeast1          1484    28.91   1188    344              844              688     9
Glass0          214     32.71   172     56               116              112     7
Iris0           150     33.33   121     40               81               80      7
Pima            768     34.84   616     215              401              430     6
Ecoli0vs1       220     35      177     62               115              124     6
Wisconsin       683     35      548     192              356              384     6
Glass1          214     35.51   172     61               111              122     6
Mean            919.94  17.61   737.09  96.61            640.48           193.21  56
Median          482     15.48   387     42               356              84      23
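
The relation between the class sizes of the training sample and the number of subsamples in Table 2 can be illustrated with a short sketch. This page does not state the formula used, so the following is an assumption: each subsample is class-balanced (its size in Table 2 is twice the minority class size), so a given majority-class example enters a subsample with probability r = (minority size) / (majority size), and the number of subsamples is the smallest n for which 1 - (1 - r)^n reaches the coverage value. The function name subsamples_for_coverage is hypothetical.

```python
import math


def subsamples_for_coverage(min_size: int, maj_size: int, coverage: float = 0.99) -> int:
    """Hypothetical helper: smallest number of subsamples reaching `coverage`.

    Assumption (not stated on this page): every class-balanced subsample draws
    `min_size` majority-class examples, so one majority example is included
    with probability r = min_size / maj_size.
    """
    r = min_size / maj_size
    if r >= 1.0:
        return 1  # one subsample already contains the whole majority class
    # An example is missed by all n subsamples with probability (1 - r)**n;
    # find the smallest integer n with 1 - (1 - r)**n >= coverage.
    return math.ceil(math.log(1.0 - coverage) / math.log(1.0 - r))


# Training-sample class sizes (minority, majority) taken from Table 2
print(subsamples_for_coverage(26, 3314))  # Abalone19
print(subsamples_for_coverage(61, 111))   # Glass1
```

Under this assumption the sketch gives 585 for Abalone19 and 6 for Glass1, which agrees with the "Number" column of Table 2 for those rows. The article itself remains the authoritative description of how the values were computed.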

3. Results

This section includes the complete figures of the average values and the full tables of the results related to the algorithms compared in the study (C4.5, CTC, Bagging and Driven PCTBagging for 6 different criteria and with 11 consolidation percentages) for the discriminating capacity, structural complexity, and computational cost measures.
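
As a reading aid for two of the discriminating capacity measures, the sketch below computes the True Positive Rate and the balanced accuracy from confusion-matrix counts. It uses the standard textbook definitions, which this page does not spell out, and it assumes that the minority class is treated as the positive class. The function names and the example counts are hypothetical and are not taken from the results.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Fraction of positive (assumed: minority-class) examples correctly classified."""
    return tp / (tp + fn)


def true_negative_rate(tn: int, fp: int) -> float:
    """Fraction of negative (assumed: majority-class) examples correctly classified."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the per-class recalls, so both classes weigh the same."""
    return 0.5 * (true_positive_rate(tp, fn) + true_negative_rate(tn, fp))


# Made-up counts for an imbalanced test fold: 10 positives, 90 negatives
tp, fn, tn, fp = 8, 2, 80, 10
print(true_positive_rate(tp, fn))
print(balanced_accuracy(tp, fn, tn, fp))
```

Because balanced accuracy averages the recall of each class, a classifier that always predicts the majority class scores 0.5 regardless of the imbalance ratio, which is why it is a common choice for data sets like those in Table 1.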

3.1. Discriminating capacity



Figure 1: Average AUC values for the 33 datasets



Table 3: AUC values for all algorithms over 33 datasets

This table can be downloaded as an OpenDocument Spreadsheet (ODS) file.



Figure 2: Average balanced accuracy values for the 33 datasets



Table 4: Balanced Accuracy values for all algorithms over 33 datasets

This table can be downloaded as an OpenDocument Spreadsheet (ODS) file.



Figure 3: Average True Positive Rate values for the 33 datasets



Table 5: True Positive Rate values for all algorithms over 33 datasets

This table can be downloaded as an OpenDocument Spreadsheet (ODS) file.



3.2. Structural complexity



Figure 4: Average values of the number of internal nodes for the 33 datasets



Table 6: Internal Nodes values for the algorithms with explaining capacity over 33 datasets

This table can be downloaded as an OpenDocument Spreadsheet (ODS) file.



Table 7: Average values of the number of internal nodes of all the trees of the ensembles over 33 datasets

This table can be downloaded as an OpenDocument Spreadsheet (ODS) file.



3.3. Computational cost



Figure 5: Average construction time values for the 33 datasets



Table 8: Elapsed Time Training Rate values for all algorithms over 33 datasets

This table can be downloaded as an OpenDocument Spreadsheet (ODS) file.