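Supervised by: Chinese Academy of Sciences
Sponsored by: Chinese Society of Optimization, Overall Planning and Economical Mathematics
   Institutes of Science and Development, Chinese Academy of Sciences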

Chinese Journal of Management Science, 2016, Vol. 24, Issue (6): 124-131. doi: 10.16381/j.cnki.issn1003-207x.2016.06.015

• Articles

A Study of a Semi-Supervised Co-Training Model for Customer Credit Scoring

XIAO Jin1, XUE Shu-tian1, HUANG Jing2, XIE Ling1, GU Xin1,3

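  1. Business School, Sichuan University, Chengdu, Sichuan 610064, China;
    2. School of Public Administration, Sichuan University, Chengdu, Sichuan 610064, China;
    3. Soft Science Institute, Sichuan University, Chengdu, Sichuan 610064, China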
  • Received: 2015-02-13 Revised: 2015-05-12 Online: 2016-06-20 Published: 2016-07-05
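  • Corresponding author: HUANG Jing (1978-), female (Han ethnicity), born in Dazhu, Sichuan; lecturer and postdoctoral researcher at the School of Public Administration, Sichuan University; research interest: quantitative research in public administration. E-mail: totojh@scu.edu.cn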
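  • Supported by:

    National Natural Science Foundation of China (71471124, 71571126); Sichuan Province Youth Fund (2015RZ0056); Sichuan Province Social Science Planning Project (SC14C019); Sichuan University Outstanding Young Scholars Fund (2013SCU04A08); Sichuan University Young Academic Talent Fund for Philosophy and Social Sciences (skqx201607); Innovation Team Program of the Sichuan Provincial Department of Education (13TD0040)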

A Semi-Supervised Co-Training Model for Customer Credit Scoring

XIAO Jin1, XUE Shu-tian1, HUANG Jing2, XIE Ling1, GU Xin1,3

  1. Business School, Sichuan University, Chengdu 610064, China;
    2. School of Public Administration, Sichuan University, Chengdu 610064, China;
    3. Soft Science Institute, Sichuan University, Chengdu 610064, China
  • Received:2015-02-13 Revised:2015-05-12 Online:2016-06-20 Published:2016-07-05

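Abstract (translated from the Chinese): In many real credit scoring problems, labeling samples requires substantial manpower, money, and material resources, so only a small number of labeled samples are usually available for training a classification model, and the large number of unlabeled customer samples in the database are discarded. To address this problem, this study introduces semi-supervised learning and combines it with the random subspace method (Random Subspace, RSS) from multiple classifier ensembles to build RSSCI, an RSS-based semi-supervised co-training model for class-imbalanced settings. The model has three main stages: 1) train a number of base classifiers with the RSS method; 2) selectively label the most suitable samples from the large unlabeled data set and add them to the original training set; 3) train the classification model on the final training set and classify the test samples. An empirical analysis on three customer credit scoring data sets shows that RSSCI outperforms not only commonly used supervised ensemble credit scoring models but also some existing semi-supervised co-training credit scoring models.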
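The first stage (training base classifiers by the random subspace method) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the nearest-centroid base learner, the function names (`draw_subspaces`, `fit_centroids`, `train_rss`, `predict`), the subspace size, and the toy data are all assumptions chosen for brevity. The only idea taken from the abstract is that each base classifier is trained on a randomly drawn subset of the features.

```python
import random

def draw_subspaces(n_features, n_classifiers, subspace_size, seed=0):
    """Draw one random feature subset (without replacement) per base classifier."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_classifiers)]

def fit_centroids(X, y, features):
    """Stand-in base learner (assumption): per-class mean restricted to `features`."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
    return centroids

def train_rss(X, y, n_classifiers=5, subspace_size=2, seed=0):
    """Train one base classifier per random feature subspace."""
    subspaces = draw_subspaces(len(X[0]), n_classifiers, subspace_size, seed)
    return [(feats, fit_centroids(X, y, feats)) for feats in subspaces]

def predict(model, x):
    """Assign x to the class whose centroid is nearest within the model's subspace."""
    features, centroids = model
    point = [x[f] for f in features]

    def sq_dist(centroid):
        return sum((p - q) ** 2 for p, q in zip(point, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Toy labeled set (hypothetical): 4 features, label 0 or 1.
X = [[0.0, 0.1, 0.2, 0.0], [0.1, 0.0, 0.1, 0.2],
     [1.0, 0.9, 1.1, 1.0], [0.9, 1.0, 1.0, 0.8]]
y = [0, 0, 1, 1]
models = train_rss(X, y)
predictions = [predict(m, [1.0, 1.0, 1.0, 1.0]) for m in models]
```

Because every base classifier sees a different feature subset, the classifiers tend to make different errors, which is what makes their agreement on an unlabeled sample informative in the later stages.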

Keywords (translated from the Chinese): credit scoring, class imbalance, semi-supervised, co-training, RSS

Abstract: Customer credit scoring is one of the most important issues in customer relationship management (CRM). In many real credit scoring problems, labeling samples costs considerable manpower, money, and material resources, so only a few labeled samples are available to train the classification model and the many unlabeled customer samples are discarded. Furthermore, because current customer credit scoring problems are class-imbalanced, a single classification model has difficulty classifying the whole sample space accurately. To solve these two problems, this study introduces semi-supervised learning and combines it with the random subspace (RSS) method from multiple classifier ensembles, proposing RSSCI, an RSS-based semi-supervised co-training model for class imbalance. The model has three phases: 1) Train a number of base classifiers by RSS; 2) Selectively label the most appropriate samples in U, a set containing many samples without class labels. First, the 3 base classifiers with the best performance are selected to classify the samples in U; the samples for which they predict the same class are put into the candidate set, and the label confidence of each candidate is calculated. Then, to account for the class imbalance of the training data, the candidate set is divided into positive and negative subsets, and the samples with higher confidence are selected from the two subsets according to the ratio of the two classes in the original training set and added to the original training set; 3) Train the classification model on the final training set and classify the test set.
Empirical analysis is conducted on three credit scoring datasets (German, Australia, and UK-thomas; all three have imbalanced class distributions, and German and Australia come from the public UCI database). The results show that the RSSCI model outperforms commonly used supervised ensemble credit scoring models and some existing semi-supervised co-training credit scoring models, demonstrating the advantage of its selective sample-labeling mechanism. CRM involves many customer classification problems similar to customer credit scoring, such as customer churn prediction and customer targeting. The model proposed in this study can therefore also be applied to those problems and is expected to achieve satisfactory classification performance.
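The selective labeling step of phase 2 can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code. The abstract does not say how label confidence is computed or how many samples are added per round, so the sketch assumes that confidence is the mean predicted probability of the agreed class across the 3 classifiers and that a hypothetical parameter `n_select` fixes the number of samples added. The function name and the 0.5 voting threshold are likewise assumptions.

```python
def select_pseudo_labeled(probas, train_labels, n_select):
    """Pick unlabeled samples to add to the training set.

    probas[k][i]  : P(class 1) for unlabeled sample i from classifier k (3 classifiers)
    train_labels  : 0/1 labels of the original training set
    n_select      : hypothetical number of samples to add (not given in the abstract)
    Returns a list of (sample_index, assigned_label).
    """
    positives, negatives = [], []
    for i in range(len(probas[0])):
        p = [clf[i] for clf in probas]
        votes = [1 if v >= 0.5 else 0 for v in p]
        if len(set(votes)) != 1:
            continue  # classifiers disagree, so the sample is not a candidate
        label = votes[0]
        mean_p = sum(p) / len(p)
        # Assumed confidence: mean probability of the agreed class.
        confidence = mean_p if label == 1 else 1.0 - mean_p
        (positives if label == 1 else negatives).append((confidence, i))

    # Keep the class ratio of the original training set among the added samples.
    pos_ratio = sum(train_labels) / len(train_labels)
    n_pos = round(n_select * pos_ratio)
    n_neg = n_select - n_pos

    positives.sort(reverse=True)  # highest confidence first
    negatives.sort(reverse=True)
    chosen = [(i, 1) for _, i in positives[:n_pos]]
    chosen += [(i, 0) for _, i in negatives[:n_neg]]
    return chosen

# Hypothetical probabilities from 3 classifiers on 6 unlabeled samples.
probas = [
    [0.90, 0.80, 0.20, 0.10, 0.60, 0.30],
    [0.95, 0.70, 0.10, 0.20, 0.40, 0.35],
    [0.85, 0.90, 0.30, 0.15, 0.70, 0.10],
]
train_labels = [1, 0, 0, 0]  # original training set: 25% class 1
chosen = select_pseudo_labeled(probas, train_labels, n_select=4)
```

In this toy run, sample 4 is skipped because the classifiers disagree on it, and the four selected samples keep the 1:3 class ratio of the original training set: one class-1 sample and three class-0 samples.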

Key words: credit scoring, class imbalance, semi-supervised, co-training, RSS
