Data were collected from hospital discharge records in the California State Inpatient Databases (SID), Healthcare Cost and Utilization Project \cite{hcupnet2003utilization}, Agency for Healthcare Research and Quality. The data track all hospital admissions at the individual level, with a maximum of 15 diagnoses recorded per admission. Each diagnosis is represented as an ICD-9-CM code. For the experimental evaluation, all pediatric patient data between January 2009 and December 2011 were used.

With over 15,000 ICD-9-CM codes, using all codes as values of a categorical feature would be highly challenging for a learning algorithm. Therefore, we transformed the diagnoses into binary features, with the occurrence of a diagnosis considered a positive value and its absence a negative value. With this step, we transformed 15 features (the maximum number of diagnoses per admission) into over 15,000 binary-valued features. Next, we excluded features with fewer than 50 positive values. This transformation resulted in 851 ICD-9-CM codes (input features) responsible for readmission within 30 days, as suggested in \cite{stiglic2014readmission}. As output, we used the 50 most frequent diagnoses observed at the patient's next readmission, also coded as binary features. If the patient was not readmitted, all 50 features were coded as 0. Thus, the MLC problem considered in this research was highly sparse and high-dimensional in both the input and the output space.
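The binarization and frequency filter described above can be sketched as follows. This is a minimal illustration, not the preprocessing code used in the study: the function name `binarize_diagnoses`, the list-of-lists input format, and the example codes are assumptions made for the sketch.

```python
from collections import Counter


def binarize_diagnoses(admissions, min_positive=50):
    """Turn per-admission diagnosis lists into a binary feature matrix.

    admissions: list of lists of ICD-9-CM code strings (hypothetical format).
    min_positive: codes with fewer positive values than this are dropped.
    Returns (kept_codes, matrix), where matrix[i][j] is 1 if admission i
    contains kept_codes[j] and 0 otherwise.
    """
    # Count each code at most once per admission.
    counts = Counter(code for codes in admissions for code in set(codes))
    kept_codes = sorted(c for c, n in counts.items() if n >= min_positive)
    column = {code: j for j, code in enumerate(kept_codes)}

    matrix = []
    for codes in admissions:
        row = [0] * len(kept_codes)
        for code in codes:
            j = column.get(code)
            if j is not None:  # code survived the frequency filter
                row[j] = 1
        matrix.append(row)
    return kept_codes, matrix
```

With `min_positive=50`, as in the text, only codes that occur in at least 50 admissions become input features.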

\subsubsection{Experimental Setup}

In this research, we wanted to examine:
\begin{itemize}
\item[(1)] Can structuring the output space in the form of hierarchies (expert-provided or data-derived) improve performance over models that use only the “flat” multi-label output space?
\item[(2)] Do data-derived hierarchies yield satisfactory results when compared with the expert-provided hierarchy?
\end{itemize}

To answer question (1), we compare the performance of PCTs in the multi-label setting (which consider only the “flat” output space) with that of PCTs in the hierarchical multi-label setting (which can exploit the information that experts or the data provide through hierarchies on the output space).

For question (2), we compare the performance of the hierarchical multi-label classification models. We employed CCS as the expert-provided hierarchy, while we constructed the data-derived hierarchies with balanced k-means (with $k$ ranging from 2 to 5) and hierarchical clustering.

The comparison of the methods was performed using the CLUS system (\url{}) for predictive clustering. For each experiment, we constructed single-tree models. F-test pruning was used to reduce overfitting and optimize the predictive performance of the produced models \cite{vens2008decision}. The exact Fisher test was used to check whether a given split/test in an internal node of the tree results in a statistically significant reduction in variance. If there is no such split/test, the node is converted to a leaf. The significance level was selected from the values 0.125, 0.1, 0.05, 0.01, 0.005 and 0.001 using internal 3-fold cross-validation to optimize predictive performance.

The balanced k-means clustering method used for deriving the label hierarchies requires the number of clusters $k$ to be specified. For this parameter, four different values (2, 3, 4 and 5) were considered. For hierarchical clustering, we used the single-linkage and complete-linkage algorithms.
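To illustrate how a label hierarchy can be derived from data with single or complete linkage, the following sketch clusters labels agglomeratively. It is only an illustration under stated assumptions: the text does not specify the distance between labels, so the Jaccard distance between the sets of examples carrying each label is assumed here, and the names `jaccard_distance` and `agglomerate` are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two labels, each given as a set of example ids."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerate(label_examples, linkage="single"):
    """Build a binary label hierarchy by agglomerative clustering.

    label_examples: dict mapping label -> set of example ids with that label.
    linkage: "single" (closest pair of members) or "complete" (farthest pair).
    Returns a nested tuple whose leaves are the original labels.
    """
    combine = min if linkage == "single" else max
    # Each cluster is a pair (subtree, list of member labels).
    clusters = [(label, [label]) for label in sorted(label_examples)]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = combine(
                jaccard_distance(label_examples[x], label_examples[y])
                for x in clusters[i][1]
                for y in clusters[j][1]
            )
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [merged]
    return clusters[0][0]
```

The nested tuple returned by `agglomerate` plays the role of the data-derived hierarchy; its leaves are the original labels, which matches the evaluation on leaf labels only.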


The performance of the predictive models is evaluated on the labels (diagnoses) that are leaves in the target hierarchy. This way, we measure the influence of including the hierarchies in the learning process on the predictive performance of the models. In our experiments, we used six example-based evaluation measures (Hamming loss, accuracy, precision, recall, F1 score and subset accuracy), six label-based evaluation measures (micro precision, micro recall, micro F1, macro precision, macro recall and macro F1), and four ranking-based evaluation measures (one-error, coverage, ranking loss and average precision). Note that the example-based and label-based measures require predictions stating whether a given label is present or not (binary 1/0 predictions). However, most predictive models produce a numerical confidence (or probability) value for each label. A label is predicted as present if this value exceeds a pre-defined threshold $\tau$. Therefore, the selection of $\tau$ directly influences the performance of the predictive model. To obtain the best-performing model, we applied a threshold calibration method that chooses the threshold (Equation~\ref{tau}) minimizing the difference in label cardinality between the training data and the predictions for the test data \cite{Read09classifier}.

\begin{equation}\label{tau}
\tau = \underset{\tau \in \{0.00, 0.05, \ldots, 1.00\}}{\arg\min} \left| LabelCard(E^{train}) - LabelCard(H_{\tau}(E^{test})) \right|
\end{equation}

\noindent where $E^{train}$ is the training set and $H_{\tau}$ is a classifier that has made predictions for the test set $E^{test}$ under threshold $\tau$. The true labels of the test set are not used when calculating the threshold.
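The calibration rule can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names `label_cardinality` and `calibrate_threshold` and the list-of-lists input format are assumptions, and ties between candidate thresholds are resolved here in favour of the smallest one, which the text does not specify.

```python
def label_cardinality(label_matrix):
    """Average number of positive labels per example."""
    if not label_matrix:
        return 0.0
    return sum(sum(row) for row in label_matrix) / len(label_matrix)


def calibrate_threshold(train_labels, test_scores):
    """Pick tau from {0.00, 0.05, ..., 1.00} by matching label cardinality.

    train_labels: binary label matrix of the training set.
    test_scores: per-label confidence values predicted for the test set.
    Only the predicted scores of the test set are used, never its true labels.
    """
    target = label_cardinality(train_labels)
    best_tau, best_gap = None, None
    for k in range(21):
        tau = k / 20
        # A label is predicted as present if its score exceeds tau.
        predicted = [[1 if s > tau else 0 for s in row] for row in test_scores]
        gap = abs(target - label_cardinality(predicted))
        if best_gap is None or gap < best_gap:
            best_tau, best_gap = tau, gap
    return best_tau
```

For example, if the training set has on average 1.5 labels per example, the function returns the first grid value at which the thresholded test predictions also average as close as possible to 1.5 labels per example.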

To avoid misleading conclusions caused by overfitting, the data were separated into training and test sets. The training set consisted of all pediatric data from the 2009--2010 period, while the test set consisted of the 2011 data. All performance figures reported for the final models in this paper are based on the test set. All experiments were performed on a server with an Intel Xeon processor at 2.5 GHz and 64 GB of RAM, running the Windows Server 2012 R2 operating system.
