Complete results and additional material for the article “PCTBagging: From Inner Ensembles to Ensembles. A trade-off between Discriminating Capacity and Interpretability”

2021-12-01

 

This page contains the full tables related to the work presented in the article:

Igor Ibarguren, Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz and Ainhoa Yera.
"PCTBagging: From Inner Ensembles to Ensembles. A trade-off between Discriminating Capacity and Interpretability". Information Sciences (2022), Vol. 583, pp. 219-238.

First, we present the table with the characteristics for the 96 datasets used in this study, divided into three contexts.

Then, for each of the evaluation measures, we include the full tables of the results related to the different proposed consolidation percentages of PCTBagging, Bagging, CTC and C4.5.

 

All the tables of results can be downloaded as an Excel document or as a CSV file.

 

Content

1. Datasets characteristics

2. Subsample numbers by data set to achieve the selected coverage value

3. Results for the discriminating capacity, structural complexity, and computational cost measures

 

Index of Tables

Table 1. Description of standard datasets.

Table 2. Description of imbalanced datasets.

Table 3. Subsample numbers for standard data sets.

Table 4. Subsample amounts for imbalanced data sets.

Table 5. AUC values for all algorithms over 96 datasets.

Table 6. Number of Internal Nodes values for all algorithms over 96 datasets.

Table 7. Time values for all algorithms over 96 datasets.

 

1. Datasets characteristics

This section contains the tables with the characteristics of the 96 datasets from the KEEL repository used in this study. We first present the datasets from the first (Standard) context and then those from the second (Imbalanced) context. The SMOTE-preprocessed datasets, which form the third context, have the same characteristics as the datasets in Table 2, except that the minority class has been oversampled until it matches the size of the majority class.
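The characteristics of a SMOTE-preprocessed dataset can be derived from its Table 2 row with simple arithmetic. The following is a minimal sketch of that derivation; the class and function names are hypothetical and only illustrate the description above, they are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImbalancedDataset:
    name: str
    minority: int  # "Size of Min. Class" in Table 2
    majority: int  # "Size of Maj. Class" in Table 2


def smote_characteristics(d: ImbalancedDataset) -> dict:
    """Characteristics after oversampling the minority class up to the majority size."""
    new_minority = d.majority          # minority grows until it matches the majority
    total = new_minority + d.majority  # the majority class is left untouched
    return {
        "name": d.name,
        "examples": total,
        "minority": new_minority,
        "majority": d.majority,
        "imbalance_pct": 100.0 * new_minority / total,
    }


# Abalone19 row of Table 2: 32 minority and 4142 majority examples.
abalone19 = ImbalancedDataset("Abalone19", minority=32, majority=4142)
print(smote_characteristics(abalone19))
```

For Abalone19 this yields 8284 examples, with both classes at 4142 examples (50% each).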

 


Table 1. Description of standard datasets.

| Data set | #Atts. | #Examples | #Classes | %min | %maj | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|---|---|
| lymphography | 18 | 148 | 4 | 1.36% | 54.73% | 2 | 81 |
| ecoli | 7 | 336 | 8 | 0.6% | 42.56% | 2 | 143 |
| car | 6 | 1728 | 4 | 3.77% | 70.03% | 65 | 1210 |
| nursery | 8 | 1296 | 5 | 0.08% | 33.34% | 1 | 432 |
| cleveland | 13 | 297 | 5 | 4.38% | 53.88% | 13 | 160 |
| zoo | 17 | 101 | 7 | 3.97% | 40.6% | 4 | 41 |
| glass | 9 | 214 | 6 | 4.21% | 35.52% | 9 | 76 |
| flare | 10 | 1066 | 6 | 4.04% | 31.06% | 43 | 331 |
| abalone | 8 | 418 | 22 | 0.24% | 16.51% | 1 | 69 |
| balance | 4 | 625 | 3 | 7.84% | 46.08% | 49 | 288 |
| dermatology | 33 | 358 | 6 | 5.59% | 31.01% | 20 | 111 |
| hepatitis | 19 | 80 | 2 | 16.25% | 83.75% | 13 | 67 |
| newthyroid | 5 | 215 | 3 | 13.96% | 69.77% | 30 | 150 |
| haberman | 3 | 306 | 2 | 26.48% | 73.53% | 81 | 225 |
| breast | 9 | 277 | 2 | 29.25% | 70.76% | 81 | 196 |
| german | 20 | 1000 | 2 | 30% | 70% | 300 | 700 |
| wisconsin | 9 | 630 | 2 | 34.61% | 65.4% | 218 | 412 |
| contraceptive | 9 | 1473 | 3 | 22.61% | 42.71% | 333 | 629 |
| tictactoe | 9 | 958 | 2 | 34.66% | 65.35% | 332 | 626 |
| pima | 8 | 768 | 2 | 34.9% | 65.11% | 268 | 500 |
| magic | 10 | 1902 | 2 | 35.13% | 64.88% | 668 | 1234 |
| wine | 13 | 178 | 3 | 26.97% | 39.89% | 48 | 71 |
| bupa | 6 | 345 | 2 | 42.03% | 57.98% | 145 | 200 |
| heart | 13 | 270 | 2 | 44.45% | 55.56% | 120 | 150 |
| australian | 14 | 690 | 2 | 44.5% | 55.51% | 307 | 383 |
| crx | 15 | 653 | 2 | 45.33% | 54.68% | 296 | 357 |
| vehicle | 18 | 846 | 4 | 23.53% | 25.77% | 199 | 218 |
| penbased | 16 | 1100 | 10 | 9.55% | 10.46% | 105 | 115 |
| ring | 20 | 740 | 2 | 49.6% | 50.41% | 367 | 373 |
| iris | 4 | 150 | 3 | 33.34% | 33.34% | 50 | 50 |
| Mean | 11.77 | 638.93 | 4.27 | 21% | 50% | 139 | 319.93 |
| Median | 9.5 | 521.5 | 3 | 23% | 54% | 73 | 209 |

 

 

Table 2. Description of imbalanced datasets.

| Data set | #Atts. | #Examples | Imbalance | Size of Min. Class | Size of Maj. Class |
|---|---|---|---|---|---|
| Abalone19 | 8 | 4174 | 0.77% | 32 | 4142 |
| Yeast6 | 8 | 1484 | 2.49% | 37 | 1447 |
| Yeast5 | 8 | 1484 | 2.96% | 44 | 1440 |
| Yeast4 | 8 | 1484 | 3.43% | 51 | 1433 |
| Yeast2vs8 | 8 | 482 | 4.15% | 20 | 462 |
| Glass5 | 9 | 214 | 4.2% | 9 | 205 |
| Abalone9vs18 | 8 | 731 | 5.65% | 41 | 690 |
| Glass4 | 9 | 214 | 6.07% | 13 | 201 |
| Ecoli4 | 7 | 336 | 6.74% | 23 | 313 |
| Glass2 | 9 | 214 | 8.78% | 19 | 195 |
| Vowel0 | 13 | 988 | 9.01% | 89 | 899 |
| Page-blocks0 | 10 | 5472 | 10.23% | 560 | 4912 |
| Ecoli3 | 7 | 336 | 10.88% | 37 | 299 |
| Yeast3 | 8 | 1484 | 10.98% | 163 | 1321 |
| Glass6 | 9 | 214 | 13.55% | 29 | 185 |
| Segment0 | 19 | 2308 | 14.26% | 329 | 1979 |
| Ecoli2 | 7 | 336 | 15.48% | 52 | 284 |
| New-thyroid1 | 5 | 215 | 16.28% | 35 | 180 |
| New-thyroid2 | 5 | 215 | 16.89% | 36 | 179 |
| Ecoli1 | 7 | 336 | 22.92% | 77 | 259 |
| Vehicle0 | 18 | 846 | 23.64% | 200 | 646 |
| Glass0123vs456 | 9 | 214 | 23.83% | 51 | 163 |
| Haberman | 3 | 306 | 27.42% | 84 | 222 |
| Vehicle1 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle2 | 18 | 846 | 28.37% | 240 | 606 |
| Vehicle3 | 18 | 846 | 28.37% | 240 | 606 |
| Yeast1 | 8 | 1484 | 28.91% | 429 | 1055 |
| Glass0 | 9 | 214 | 32.71% | 70 | 144 |
| Iris0 | 4 | 150 | 33.33% | 50 | 100 |
| Pima | 8 | 768 | 34.84% | 268 | 500 |
| Ecoli0vs1 | 7 | 220 | 35% | 77 | 143 |
| Wisconsin | 9 | 683 | 35% | 239 | 444 |
| Glass1 | 9 | 214 | 35.51% | 76 | 138 |
| Mean | 9.39 | 919.94 | 17.61% | 120 | 799.94 |
| Median | 8 | 482 | 15.48% | 52 | 444 |

 


 

2. Subsample numbers by data set to achieve the selected coverage value

The tables in this section show the number of subsamples computed for each data set to reach a coverage value of 99%. Table 3 refers to the standard data sets and Table 4 to the imbalanced data sets.

For the imbalanced data sets preprocessed with SMOTE, only the total number of examples and the size of the minority class differ from the data sets without preprocessing: the minority class has been oversampled with SMOTE until it has the same size as the majority class.
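This page does not state how the number of subsamples is derived from the coverage value. The sketch below assumes that coverage is the probability that a given majority-class example of the training sample appears in at least one subsample, and that each subsample draws the same number of examples from every class (subsample size divided by the number of classes). Under these assumptions the values in the "Number" column can be reproduced for the rows checked here; the function name and the formula itself are illustrative assumptions, not the authors' stated method.

```python
import math


def subsamples_for_coverage(per_class_size: int, majority_size: int,
                            coverage: float = 0.99) -> int:
    """Smallest n with 1 - (1 - p)^n >= coverage, where p = per_class_size / majority_size.

    Assumption: p is the chance that one majority-class example is drawn
    into a single class-balanced subsample.
    """
    p = per_class_size / majority_size
    if p >= 1.0:
        return 1  # a single subsample already contains the whole majority class
    return math.ceil(math.log(1.0 - coverage) / math.log(1.0 - p))


# (data set, per-class subsample size, majority class size in the training sample)
rows = [
    ("Abalone19", 26, 3314),  # Table 4: subsample size 52, two classes
    ("Yeast6", 30, 1158),     # Table 4: subsample size 60, two classes
    ("Glass1", 61, 111),      # Table 4: subsample size 122, two classes
    ("ecoli", 6, 115),        # Table 3: subsample size 48, eight classes
    ("car", 53, 969),         # Table 3: subsample size 212, four classes
]
for name, per_class, majority in rows:
    print(name, subsamples_for_coverage(per_class, majority))
```

With the default coverage of 0.99 these five rows give 585, 176, 6, 86 and 82 subsamples, matching the corresponding entries in Tables 3 and 4.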

 

Table 3. Subsample numbers for standard data sets.

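| Data set | Original size | #Class | %Min | Training sample size | Training min. class size | Training maj. class size | Subsample size | Number of subsamples |
|---|---|---|---|---|---|---|---|---|
| lymphography | 148 | 4 | 1.36% | 119 | 2 | 66 | 12 | 99 |
| ecoli | 336 | 8 | 0.6% | 269 | 2 | 115 | 48 | 86 |
| car | 1728 | 4 | 3.77% | 1383 | 53 | 969 | 212 | 82 |
| nursery | 1296 | 5 | 0.08% | 1037 | 1 | 346 | 105 | 74 |
| cleveland | 297 | 5 | 4.38% | 238 | 11 | 129 | 55 | 52 |
| zoo | 101 | 7 | 3.97% | 81 | 4 | 33 | 28 | 36 |
| glass | 214 | 6 | 4.21% | 172 | 8 | 62 | 48 | 34 |
| flare | 1066 | 6 | 4.04% | 853 | 35 | 265 | 210 | 33 |
| abalone | 418 | 22 | 0.24% | 335 | 1 | 56 | 154 | 35 |
| balance | 625 | 3 | 7.84% | 500 | 40 | 231 | 120 | 25 |
| dermatology | 358 | 6 | 5.59% | 287 | 17 | 89 | 102 | 22 |
| hepatitis | 80 | 2 | 16.25% | 64 | 11 | 54 | 22 | 21 |
| newthyroid | 215 | 3 | 13.96% | 172 | 24 | 120 | 72 | 21 |
| haberman | 306 | 2 | 26.48% | 245 | 65 | 181 | 130 | 11 |
| breast | 277 | 2 | 29.25% | 222 | 65 | 158 | 130 | 9 |
| german | 1000 | 2 | 30% | 800 | 240 | 560 | 480 | 9 |
| wisconsin | 630 | 2 | 34.61% | 504 | 175 | 330 | 350 | 7 |
| contraceptive | 1473 | 3 | 22.61% | 1179 | 267 | 504 | 801 | 7 |
| tictactoe | 958 | 2 | 34.66% | 767 | 266 | 502 | 532 | 7 |
| pima | 768 | 2 | 34.9% | 615 | 215 | 401 | 430 | 6 |
| magic | 1902 | 2 | 35.13% | 1522 | 535 | 988 | 1070 | 6 |
| wine | 178 | 3 | 26.97% | 143 | 39 | 58 | 117 | 5 |
| bupa | 345 | 2 | 42.03% | 276 | 116 | 160 | 232 | 4 |
| heart | 270 | 2 | 44.45% | 216 | 96 | 120 | 192 | 3 |
| australian | 690 | 2 | 44.5% | 552 | 246 | 307 | 492 | 3 |
| crx | 653 | 2 | 45.33% | 523 | 238 | 286 | 476 | 3 |
| vehicle | 846 | 4 | 23.53% | 677 | 160 | 175 | 640 | 3 |
| penbased | 1100 | 10 | 9.55% | 880 | 84 | 92 | 840 | 3 |
| ring | 740 | 2 | 49.6% | 592 | 294 | 299 | 588 | 3 |
| iris[1] | 150 | 3 | 33.34% | 120 | 40 | 40 | 66 | 6 |
| Mean | 638.94 | 4.27 | 22% | 511.44 | 111.67 | 256.54 | 291.8 | 24 |
| Median | 521.5 | 3 | 23.07% | 417.5 | 59 | 167.5 | 173 | 9 |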

 

Table 4. Subsample amounts for imbalanced data sets.

| Data set | Original size | %Min | Training sample size | Training min. class size | Training maj. class size | Subsample size | Number of subsamples |
|---|---|---|---|---|---|---|---|
| Abalone19 | 4174 | 0.77 | 3340 | 26 | 3314 | 52 | 585 |
| Yeast6 | 1484 | 2.49 | 1188 | 30 | 1158 | 60 | 176 |
| Yeast5 | 1484 | 2.96 | 1189 | 36 | 1153 | 72 | 146 |
| Yeast4 | 1484 | 3.43 | 1188 | 41 | 1147 | 82 | 127 |
| Yeast2vs8 | 482 | 4.15 | 387 | 17 | 370 | 34 | 98 |
| Glass5 | 214 | 4.2 | 173 | 8 | 165 | 16 | 93 |
| Abalone9vs18 | 731 | 5.65 | 586 | 34 | 552 | 68 | 73 |
| Glass4 | 214 | 6.07 | 172 | 11 | 161 | 22 | 66 |
| Ecoli4 | 336 | 6.74 | 270 | 19 | 251 | 38 | 59 |
| Glass2 | 214 | 8.78 | 173 | 16 | 157 | 32 | 43 |
| Vowel0 | 988 | 9.01 | 792 | 72 | 720 | 144 | 44 |
| Page-blocks0 | 5472 | 10.23 | 4378 | 448 | 3930 | 896 | 39 |
| Ecoli3 | 336 | 10.88 | 270 | 30 | 240 | 60 | 35 |
| Yeast3 | 1484 | 10.98 | 1188 | 131 | 1057 | 262 | 35 |
| Glass6 | 214 | 13.55 | 173 | 24 | 149 | 48 | 27 |
| Segment0 | 2308 | 14.26 | 1848 | 264 | 1584 | 528 | 26 |
| Ecoli2 | 336 | 15.48 | 270 | 42 | 228 | 84 | 23 |
| New-thyroid1 | 215 | 16.28 | 173 | 29 | 144 | 58 | 21 |
| New-thyroid2 | 215 | 16.89 | 173 | 30 | 143 | 60 | 20 |
| Ecoli1 | 336 | 22.92 | 270 | 62 | 208 | 124 | 14 |
| Vehicle0 | 846 | 23.64 | 677 | 160 | 517 | 320 | 13 |
| Glass0123vs456 | 214 | 23.83 | 172 | 41 | 131 | 82 | 13 |
| Haberman | 306 | 27.42 | 246 | 68 | 178 | 136 | 10 |
| Vehicle1 | 846 | 28.37 | 678 | 193 | 485 | 386 | 10 |
| Vehicle2 | 846 | 28.37 | 678 | 193 | 485 | 386 | 10 |
| Vehicle3 | 846 | 28.37 | 678 | 193 | 485 | 386 | 10 |
| Yeast1 | 1484 | 28.91 | 1188 | 344 | 844 | 688 | 9 |
| Glass0 | 214 | 32.71 | 172 | 56 | 116 | 112 | 7 |
| Iris0 | 150 | 33.33 | 121 | 40 | 81 | 80 | 7 |
| Pima | 768 | 34.84 | 616 | 215 | 401 | 430 | 6 |
| Ecoli0vs1 | 220 | 35 | 177 | 62 | 115 | 124 | 6 |
| Wisconsin | 683 | 35 | 548 | 192 | 356 | 384 | 6 |
| Glass1 | 214 | 35.51 | 172 | 61 | 111 | 122 | 6 |
| Mean | 919.94 | 17.61 | 737.09 | 96.61 | 640.48 | 193.21 | 56 |
| Median | 482 | 15.48 | 387 | 42 | 356 | 84 | 23 |

 

3. Results for the discriminating capacity, structural complexity, and computational cost measures.

This section includes the full tables of results for the algorithms compared in the study (PCTBagging with 11 consolidation percentages, Bagging, CTC, and C4.5) for the three performance metrics used: AUC, Number of Internal Nodes, and Time. Numbers in bold indicate the best value for each dataset. In these tables C4.5 is treated as the reference for all algorithms: cells with a gray background indicate algorithms performing better than C4.5.
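The bold and gray markings described above can be recomputed from the numeric values of a row. The following is a minimal sketch, assuming that higher is better for AUC and lower is better for Number of Internal Nodes and Time, and that tied best values are all marked; the function name and both assumptions are illustrative and not taken from the article.

```python
def highlight(row: dict, reference: str = "C4.5", higher_is_better: bool = True):
    """Return (algorithms with the best value, algorithms beating the reference)."""
    ref = row[reference]
    best_value = max(row.values()) if higher_is_better else min(row.values())
    best = [name for name, v in row.items() if v == best_value]
    if higher_is_better:
        beats_ref = [n for n, v in row.items() if n != reference and v > ref]
    else:
        beats_ref = [n for n, v in row.items() if n != reference and v < ref]
    return best, beats_ref


# AUC values of the lymphography row in Table 5.
lymphography = {
    "CTC": 0.7755,
    "PCTBagging 100%": 0.8032, "PCTBagging 90%": 0.7998,
    "PCTBagging 80%": 0.7941, "PCTBagging 70%": 0.7984,
    "PCTBagging 60%": 0.8049, "PCTBagging 50%": 0.8260,
    "PCTBagging 40%": 0.8282, "PCTBagging 30%": 0.8547,
    "PCTBagging 20%": 0.8598, "PCTBagging 10%": 0.8640,
    "PCTBagging 0%": 0.8646,
    "Bagging": 0.8646,
    "C4.5": 0.8193,
}
best, beats_c45 = highlight(lymphography)
print("best:", best)
print("better than C4.5:", beats_c45)
```

For Number of Internal Nodes or Time, the same function would be called with `higher_is_better=False`.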

 

3.1 Results for the AUC measure

 

Table 5. AUC values for all algorithms over 96 datasets.

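| Context | Data set | CTC | PCTBagging 100% | PCTBagging 90% | PCTBagging 80% | PCTBagging 70% | PCTBagging 60% | PCTBagging 50% | PCTBagging 40% | PCTBagging 30% | PCTBagging 20% | PCTBagging 10% | PCTBagging 0% | Bagging | C4.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Standard | lymphography | .7755 | .8032 | .7998 | .7941 | .7984 | .8049 | .8260 | .8282 | .8547 | .8598 | .8640 | .8646 | .8646 | .8193 |
| 1. Standard | ecoli | .8942 | .8884 | .8894 | .8903 | .8935 | .8985 | .9057 | .9066 | .9119 | .9198 | .9391 | .9392 | .9392 | .8780 |
| 1. Standard | car | .9468 | .9450 | .9472 | .9452 | .9421 | .9388 | .9358 | .9339 | .9307 | .9343 | .9366 | .9459 | .9459 | .9681 |
| 1. Standard | nursery | .9646 | .9570 | .9559 | .9544 | .9519 | .9498 | .9479 | .9457 | .9431 | .9446 | .9464 | .9455 | .9455 | .9610 |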

1.Standard cleveland

.6668