Data Mining

Data mining refers to the computational process of exploring and analyzing large amounts of data in order to discover useful information [14, 15, 6, 3, 7, 4, 5, 1]. There are four main types of data mining tasks: association rule learning, clustering, classification, and regression. We have identified these tasks as useful in each of the research strands discussed in this research proposal. Data come in two types: labelled and unlabelled. Labelled data has a specially designated attribute, and the aim is to use the given data to predict the value of that attribute for new data. Unlabelled data has no such designated attribute. The first two tasks, association rule learning and clustering, work with unlabelled data and are known as unsupervised learning. The last two, classification and regression, work with labelled data and are known as supervised learning.

Association rule learning is concerned with finding interesting relationships and correlations among the values of variables [3, 1]. A typical application is to uncover customer purchase behavior from market basket transaction data, where the primary aim is usually to find associations and correlations among the items purchased by consumers. Data clustering refers to the process of dividing a set of items into homogeneous groups, or clusters, such that items in the same cluster are similar to each other and items from different clusters are distinct [12, 11]. Over the past six decades, researchers from different fields of study have developed many clustering algorithms, most of which are formulated as optimization problems.
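To make the market basket application concrete, the following minimal sketch computes the two standard measures used to rank association rules: support (the fraction of transactions containing an itemset) and confidence (the estimated conditional probability of the consequent given the antecedent). The transactions, the function names, and the thresholds are hypothetical and chosen only for illustration; the sketch enumerates single-item rules by brute force and is not an implementation of any particular algorithm from the cited references.

```python
from itertools import combinations

# Hypothetical market basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Brute-force search for rules {a} -> {b} that clear both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


for lhs, rhs, s, c in pair_rules(transactions):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

On this toy data only the rules linking bread and butter clear both thresholds. Brute-force enumeration is workable only for a handful of items; practical methods prune the search space using the support threshold.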

Classification is a type of supervised learning in which the designated attribute (i.e., the label) is categorical. For example, a property and casualty (P&C) insurer may want to classify a policyholder as being at high, medium, or low risk of filing a claim. There are several ways to classify data [3, 1], including nearest neighbor matching, classification rules, and classification trees. Regression, also known as numerical prediction, is another type of supervised learning, in which the designated attribute is numerical [9, 3].
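As an illustration of nearest neighbor matching, the sketch below assigns a risk label to a new policyholder by majority vote among the k closest labelled examples. The two features (driver age and number of prior claims), the training records, and the function name knn_classify are hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter

# Hypothetical labelled data: (driver age, prior claims) -> risk label.
training = [
    ((22, 3), "high"),
    ((25, 2), "high"),
    ((40, 1), "medium"),
    ((45, 1), "medium"),
    ((55, 0), "low"),
    ((60, 0), "low"),
]


def knn_classify(x, training, k=3):
    """Label `x` by majority vote among its k nearest labelled examples."""
    # Sort labelled examples by Euclidean distance to x and keep the k closest.
    nearest = sorted(training, key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((24, 2), training))
print(knn_classify((58, 0), training))
```

Because Euclidean distance is sensitive to the units of each attribute, features would normally be rescaled to comparable ranges before distances are computed; the sketch omits this step for brevity.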

Another technique identified as useful for our project is text mining, a field of study that focuses on analyzing unstructured text information [16, 2]. It consists of two main phases: first, preprocess the unstructured text data by transforming it into a standard numerical form; second, apply statistical analysis to extract useful patterns and concepts from the preprocessed data. Techniques for text mining are similar to those for data mining; for example, the k-means algorithm, which is well known in data mining, can be used to partition a collection of documents. Text mining has been applied to insurance problems. Kolyshkina and van Rooyen [13] used it to predict insurance claim costs based on unstructured text data. Francis and Flynn [8] applied text mining to an "accident description" dataset to extract new variables used to predict the likelihood of attorney involvement during the claims process, as well as the severity of claims. Despite this considerable potential, research articles on the actuarial applications of text mining remain scarce.
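The two phases of text mining, and the use of k-means to partition documents, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the claim descriptions are invented, the numerical form is a plain bag-of-words count vector, and the initial centroids are fixed by hand (the hypothetical init_indices parameter) so that the run is deterministic. Practical pipelines typically add steps such as stop-word removal and term weighting, and use a randomized initialization.

```python
from collections import Counter

# Hypothetical claim descriptions (illustrative data only).
docs = [
    "rear end collision on highway",
    "collision with parked car on highway",
    "water damage from burst pipe",
    "burst pipe flooded basement water damage",
]


def bag_of_words(docs):
    """Phase 1: turn each document into a vector of word counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, max_iter=100):
    """Phase 2: k-means with centroids initialized at the given documents."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vocab, vectors = bag_of_words(docs)
labels = kmeans(vectors, init_indices=[0, 2])
print(labels)
```

With these four documents, the two collision descriptions fall into one cluster and the two water damage descriptions into the other, since documents in the same group share most of their vocabulary.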

[1] Charu C. Aggarwal. Data Mining: The Textbook. Springer, New York, NY, 2015.
[2] Charu C. Aggarwal and Chengxiang Zhai, editors. Mining Text Data. Springer, New York, NY, 2012.
[3] Max Bramer. Principles of Data Mining. Springer, New York, NY, 2nd edition, 2013.
[4] Min Chen, Shiwen Mao, Yin Zhang, and Victor C. M. Leung. Big Data: Related Technologies, Challenges and Future Prospects. Springer, New York, NY, 2014.
[5] Pawel Cichosz. Data Mining Algorithms: Explained Using R. Wiley, Hoboken, NJ, 2015.
[6] Robson L.F. Cordeiro, C. Faloutsos, and C.T. Junior. Data Mining in Large Sets of Complex Data. Springer Briefs in Computer Science. Springer, New York, NY, 2013.
[7] Jared Dean. Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. Wiley, Hoboken, NJ, 2014.
[8] Louise Francis and Matt Flynn. Text mining handbook. Casualty Actuarial Society E-Forum, Spring 2010.
[9] Edward W. Frees. Regression Modeling with Actuarial and Financial Applications. Cambridge University Press, 2009.
[10] Edward W. Frees, Richard A. Derrig, and Glenn Meyers. Predictive Modeling Applications in Actuarial Science, Volume 2: Case Studies in Insurance. Cambridge University Press, Cambridge, U.K., 2016.
[11] Guojun Gan. Data Clustering in C++: An Object-Oriented Approach. Data Mining and Knowledge Discovery Series. Chapman & Hall/CRC Press, Boca Raton, FL, 2011.
[12] Guojun Gan, Chaoqun Ma, and Jianhong Wu. Data Clustering: Theory, Algorithms, and Applications. SIAM Press, Philadelphia, PA, 2007.
[13] Inna Kolyshkina and Marcel van Rooyen. Text mining for insurance claim cost prediction. Institute of Actuaries of Australia, October 2005.
[14] David L. Olson and Dursun Delen. Advanced Data Mining Techniques. Springer, New York, NY, 2008.
[15] Bruce Ratner. Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. CRC Press, Boca Raton, FL, 2nd edition, 2011.
[16] Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, and Fred J. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, New York, NY, 2005.