Data mining refers to the computational process of exploring and analyzing large amounts of data in order to discover useful information[1,2]. The ultimate goal of data mining is to deliver predictive models applicable to new data; such models are becoming increasingly important to the actuarial profession. There are four main types of data mining tasks: association rule learning, clustering, classification, and regression. There are two types of data: labelled and unlabelled. Labelled data has a specially designated attribute, and the aim is to use the given data to predict the value of that attribute for new data. Unlabelled data has no such designated attribute. The first two data mining tasks, association rule learning and clustering, work with unlabelled data and are known as unsupervised learning. The last two, classification and regression, work with labelled data and are known as supervised learning.
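To make the distinction concrete, the following minimal sketch contrasts one supervised task (regression on labelled pairs) with one unsupervised task (clustering of unlabelled values). The function names, the toy numbers, and the choice of a least-squares line and a one-dimensional two-means procedure are assumptions made purely for illustration; they are not taken from the project itself.

```python
from statistics import mean

def fit_line(xs, ys):
    """Supervised (regression): least-squares line through labelled pairs (x, y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def two_means(xs, iterations=20):
    """Unsupervised (clustering): split unlabelled values into two groups (1-D k-means, k=2)."""
    centers = [min(xs), max(xs)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical toy data, invented for illustration only.
ages = [25, 35, 45, 55, 65]
claim_costs = [1.0, 1.5, 2.0, 2.5, 3.0]      # the designated (label) attribute
slope, intercept = fit_line(ages, claim_costs)
predicted_cost = intercept + slope * 50       # prediction for a new, unseen record

sums_insured = [10, 12, 11, 95, 100, 105]     # no label attribute
centers, groups = two_means(sums_insured)
```

In the first half the label attribute drives the fit and the model is then applied to a new record; in the second half no attribute is singled out and the procedure only groups similar records together.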
In this research project, we examine and evaluate existing tools and approaches in data mining, with the aim of demonstrating their usefulness for analyzing data in actuarial science and insurance. In particular, we focus on tools and methods that will allow actuaries to perform predictive analytics effectively in the following three areas: claims tracking and monitoring in life insurance, understanding policyholder behavior in general insurance, and model efficiency for variable annuity products.
Claims tracking and monitoring is an important risk management function for a life insurer. Death claims are generally the largest cash flow item affecting its balance sheet and income statement. Yet several insurers do not have a systematic claims tracking and monitoring process, leaving some important questions unanswered: (1) For a particular period, are death claims statistically deviating from expectations? (2) Can we efficiently identify natural clusters of policies where actual death claims are statistically different from expectations? (3) Do these significant deviations represent a single occurrence or a trend? In seeking answers to these questions, we will use data mining techniques to conduct an empirical investigation based on a large dataset provided by Prudential, a major life insurer, that contains all life insurance claims data since 2000. The results of such a monitoring process will give insurers guidance for further risk management actions.
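As an illustration of question (1), a basic actual-to-expected check can be sketched as follows. This is only one possible approach, not the method used in the project: it assumes that the number of death claims in a period follows a Poisson distribution with mean equal to the expected count and uses a normal approximation. The function name and the claim counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def claims_deviation(actual, expected):
    """Two-sided test of actual versus expected death-claim counts.

    Assumption: counts ~ Poisson(expected), so the variance equals the
    expected count; the normal approximation is used for the p-value.
    """
    ae_ratio = actual / expected
    z = (actual - expected) / sqrt(expected)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return ae_ratio, z, p_value

# Hypothetical counts for one monitoring period.
ae_ratio, z, p_value = claims_deviation(actual=130, expected=100.0)
```

A small p-value flags the period for review. Repeating the check over successive periods is one simple way to begin separating a single occurrence from a trend, as in question (3).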
While understanding policyholder behavior is important in several lines of insurance, we will focus on the automobile insurance market because of the data available to us. In this market, it is commonly believed that a higher premium adjustment is typically applied in the period after a policy claim is made; this can affect a policyholder’s decision on whether to switch companies, even after controlling for significant characteristics. This effect is relatively challenging for insurers to investigate empirically, because the analysis requires following policyholders as they switch between companies and capturing the behavioral pattern of each switch. Our Singapore dataset is well suited to this type of analysis: it contains detailed, micro-level automobile insurance records from 45 insurance companies in Singapore, held in three files covering a 9-year period. The policy file has over 5 million records, the claims file has under a million records, and the payment file has over 4 million records. By analyzing micro-level, individual data with data mining tools and techniques, we avoid the problems that classical actuarial approaches encounter with aggregated data, and we can better understand, visualize, and extract hidden information that would otherwise be difficult to obtain.
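The kind of follow-up this analysis requires can be sketched on a handful of toy records. The record layout (policyholder identifier, policy year, company, claim indicator) and all values below are hypothetical and do not reflect the structure of the actual files. The sketch only compares raw switching rates after a claim and after no claim; it does not control for policyholder characteristics.

```python
from collections import defaultdict

# Hypothetical toy records: (policyholder_id, policy_year, company, made_claim).
policies = [
    ("A", 2001, "X", True),  ("A", 2002, "Y", False),
    ("B", 2001, "X", False), ("B", 2002, "X", False),
    ("C", 2001, "Z", True),  ("C", 2002, "Z", True), ("C", 2003, "X", False),
    ("D", 2001, "Y", False), ("D", 2002, "Z", False),
]

def switch_rates(records):
    """Share of renewals with a change of insurer, split by prior-year claim status."""
    history = defaultdict(list)
    for holder, year, company, claimed in records:
        history[holder].append((year, company, claimed))
    counts = {True: [0, 0], False: [0, 0]}   # claimed -> [switches, renewals]
    for rows in history.values():
        rows.sort()                          # chronological order per policyholder
        # Pair each policy year with the next one on record
        # (gaps between years are not checked in this sketch).
        for (_, prev_company, prev_claimed), (_, company, _) in zip(rows, rows[1:]):
            counts[prev_claimed][1] += 1
            counts[prev_claimed][0] += company != prev_company
    return {claimed: s / n for claimed, (s, n) in counts.items() if n}

rates = switch_rates(policies)   # keys: True (claim in prior year), False (no claim)
```

With real data, the same linkage would feed a model of the switching decision that includes policyholder and vehicle characteristics as controls.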
In the past decade, the rapid growth of variable annuities has posed great challenges to insurance companies, which rely on Monte Carlo simulation to value the complex guarantees embedded in variable annuity contracts. However, Monte Carlo simulation is extremely computationally demanding when valuing large portfolios of variable annuities, and valuing such portfolios in a timely manner is important for hedging and financial reporting. In a series of recent papers[5,6], we have shown that metamodeling approaches are promising solutions to these computational issues. Metamodeling approaches have two major interrelated components: an experimental design method, which selects a small set of contracts to value, and a metamodel, which is fitted to those values and used to estimate the values of the remaining contracts. Here we will explore data mining techniques to create efficient experimental designs and metamodels.
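The two components can be illustrated with a deliberately simple one-dimensional sketch. Everything in it is an assumption made for illustration: `expensive_value` is a hypothetical stand-in for a Monte Carlo valuation, the design simply takes evenly spaced contracts from the sorted portfolio, and piecewise-linear interpolation is swapped in as the metamodel. The designs and metamodels studied in [5,6] are more elaborate.

```python
from bisect import bisect_left
from math import exp

def expensive_value(account_value):
    """Hypothetical stand-in for a costly Monte Carlo valuation of one contract."""
    return 100.0 * exp(-account_value / 100.0)

def select_representatives(portfolio, step):
    """Experimental design: every `step`-th contract of the sorted portfolio, plus the last."""
    ordered = sorted(portfolio)
    reps = ordered[::step]
    if reps[-1] != ordered[-1]:
        reps.append(ordered[-1])
    return reps

def build_metamodel(reps, values):
    """Metamodel: piecewise-linear interpolation through the representative values."""
    def predict(x):
        # Index of the right end of the interval containing x, clamped to a valid range.
        i = min(max(bisect_left(reps, x), 1), len(reps) - 1)
        x0, x1 = reps[i - 1], reps[i]
        y0, y1 = values[i - 1], values[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return predict

# 100 hypothetical contracts described by a single attribute (account value).
portfolio = [float(x) for x in range(50, 150)]
reps = select_representatives(portfolio, step=10)
rep_values = [expensive_value(x) for x in reps]   # the only expensive valuations
predict = build_metamodel(reps, rep_values)
estimated_total = sum(predict(x) for x in portfolio)
```

Only the representative contracts are passed to the expensive valuation; every other contract is valued through the cheap metamodel, which is where the computational saving comes from.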
[1] Max Bramer. Principles of Data Mining. Springer, New York, NY, 2nd edition, 2013.
[2] Pawel Cichosz. Data Mining Algorithms: Explained Using R. Wiley, Hoboken, NJ, 2015.
[3] Jeyaraj Vadiveloo, Gao Niu, Emiliano A. Valdez, and Guojun Gan. Unlocking reserve assumptions using retrospective analysis. Accepted for publication, Actuarial Sciences and Quantitative Finance: ICASQF, Cartagena, Colombia, 2016 (Springer Proceedings in Mathematics and Statistics).
[4] Jason Campbell, Michael Chan, Kate Li, Louis Lombardi, Lucian Lombardi, Marianne Purushotham, and Anand Rao. Modeling of policyholder behavior for life insurance and annuity products: A survey and literature review. Society of Actuaries, 2014.
[5] Guojun Gan and Emiliano A. Valdez. Regression modeling for the valuation of large variable annuity portfolios. Accepted for publication, North American Actuarial Journal, June 2017.
[6] Guojun Gan and Emiliano A. Valdez. An empirical comparison of some experimental designs for the valuation of large variable annuity portfolios. Dependence Modeling, Volume 4, Issue 1 (December 2016).