ISSN: 2378-315X BBIJ

Biometrics & Biostatistics International Journal
Research Article
Volume 4 Issue 5 - 2016
Studying Microarray Gene Expression Data of Schizophrenic Patients for Derivation of a Diagnostic Signature through the Aid of Machine Learning
Marianthi Logotheti1,2, Eleftherios Pilalis2,3, Nikolaos Venizelos1, Fragiskos Kolisis4, Aristotelis Chatziioannou2,3*
1Neuropsychiatric Research Laboratory, Faculty of Medicine and Health, School of Health and Medical Sciences, Örebro University, Örebro, Sweden
2 Metabolic Engineering and Bioinformatics Group, Institute of Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, Athens, Greece
3 e-NIOS Applications PC, Athens, Greece
4 Laboratory of Biotechnology, School of Chemical Engineering, National Technical University of Athens, Athens, Greece
Received: July 19, 2016 | Published: September 29, 2016
*Corresponding author: Aristotelis Chatziioannou, Metabolic Engineering and Bioinformatics Group, Institute of Biology, Medicinal Chemistry and Biotechnology, National Hellenic Research Foundation, 48 Vassileos Constantinou ave, 11635, Athens, Greece, Tel: +30 210 7273751; Email:
Citation: Logotheti M, Pilalis E, Venizelos N, Kolisis F, Chatziioannou A (2016) Studying Microarray Gene Expression Data of Schizophrenic Patients for Derivation of A Diagnostic Signature Through The Aid of Machine Learning. Biom Biostat Int J 4(5): 00106. DOI: 10.15406/bbij.2016.04.00106

Abstract

Schizophrenia is a complex psychiatric disease that is affected by multiple genes, some of which could be used as biomarkers for specific diagnosis of the disease. In this work, we explore the power of machine learning methodologies for predicting schizophrenia, through the derivation of a biomarker gene signature for robust diagnostic classification purposes. Postmortem brain gene expression from the anterior prefrontal cortex of schizophrenia patients were used as training data for the construction of the classifiers. Several machine learning algorithms, such as support vector machines, random forests, and extremely randomized trees classifiers were developed and their performance was tested. After applying the feature selection method of support vector machines recursive feature elimination a 21-gene model was derived. Using these genes for developing classification models, the random forests algorithm outperformed all examined algorithms achieving an area under the curve of 0.98 and sensitivity of 0.89, discriminating schizophrenia from healthy control samples with high efficiency. The 21-gene model that was derived. from the feature selection is suggested for classifying schizophrenic patients, as it was successfully applied on an independent dataset of postmortem brain samples from the superior temporal cortex, and resulted in a classification model that achieved an area under the curve score of 0.91. Additionally, the functional analysis of the statisticallysignificant genes indicated many mechanisms related to the immune system.

Keywords: Classification; Schizophrenia; Machine learning; Gene expression; Microarray studies; Support vector machines; Adaboost

Abbreviations

AUC: Area Under the Curve; CV: Cross-Validation; GO: Gene Ontology; Extra Trees: Extremely Randomized trees; kNN: k-Nearest Neighbor; RF: Random Forests; RMA: Robust Multi-array Analysis; ROC: Receiver Operating Characteristic; SVM: Support Vector Machines; SVM-RFE: Support Vector Machines Recursive Feature Elimination; SZ: Schizophrenia

Introduction

Schizophrenia (SZ) is a serious psychiatric disease, with a complex genetic basis that affects around 1% of the population worldwide. The symptoms of the disease are divided into positive, negative and cognitive symptoms. Positive symptoms include hallucinations, delusions as well as disorganised speech and behaviour. Negative symptoms include anhedonia, social withdrawal, and lack of motivation and energy. Finally, cognitive symptoms involve cognitive dysfunctions of patients suffering from SZ. Pharmacological treatment of the disease mostly deals with the positive, psychotic symptoms of the disease, but does not improve cognitive and social dysfunction. Moreover, the etiology of SZ predicates upon a combination of genetic and environmental factors, probably in early life, that affect neurogenesis and neuronal plasticity [1]. DNA microarray technologies enabling genome-wide gene expression profiling have been intensely exploited in the last decade, in order to promote the elucidation of the underlying biological mechanisms of SZ [2-5]. These studies, through the high dimensional data that they yield, can prove to be very useful for the generation of diagnostic biomarker signatures in the management of SZ. The usefulness of these data is based on the fact that they may reveal several genes that act synergistically. Probably, the genes that present these synergistic effects with other genes cannot be associated with SZ on their own. The importance of the development of classification models in SZ is great as, at the moment, the diagnosis of the disease is based exclusively on the evaluation of the clinical symptoms after they have manifested. Despite much research effort, some of the most crucial questions regarding SZ have not been answered. The heterogeneity and the multi-factorial background of SZ suggest the study of this disease through statistical methods for the identification of patterns in the data. Differentially expressed genes occurring from microarray experiments can be utilized as classifying biomarkers gain and can reveal underlying genetic factors in relation to important psychiatric diseases, such as SZ [6].

Classification includes two main methodological models: the supervised and the unsupervised model. In unsupervised learning, the instances are unlabeled and the aim is to discover useful classes [7]. Supervised learning includes instances with known labels. In this study, supervised methods are used [6]. In supervised learning, the classes are first defined and then the aim is to build a classifier that can separate samples among the defined classes citation [8]. The discrimination of the classes in this study is based on the gene expression profiles of the samples.

Algorithms

Support vector machines (SVM)

In SVM, good separation of classes is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class. The instances that are on the boundaries of the margin and determine the position and the orientation of the hyperplane are called the support vectors [9]. SVM have some mathematical attributes that make them advantageous for gene expression classification, such as their ability to deal with large feature spaces and their ability to recognise outliers [10].

Extremely randomized trees (Extra Trees)

The Extra Trees classifier belongs to the tree classifier algorithms and is extremely randomized. Its difference from other tree algorithms lies in the way it is built. At the point where the algorithm seeks the the most discriminative thresholds for the separation of the samples of a node into two groups, random thresholds are drawn for each of the randomly-selected features. Then, the best randomly-generated threshold is chosen as the splitting rule[11].

Random forests (RF)

The RF classification algorithm is based on an ensemble of classification trees. Each classification tree is developed with bootstrap sampling of the data and for each split a random subset of the variables is used. RF uses two approaches: bagging or bootstrap aggregation that combines unstable learners and random variable selection for building the tree. No pruning is applied on the trees, in order to achieve low-bias trees. Additionally, bagging and random variable selection create trees with low correlation. As a classification method in microarray studies, it gives good performances even with noisy predictive variables and for this reason, it doesn’t need gene pre-selection. Finally, good performance is not so dependent on fine-tuning the parameters of the algorithm [12].

 Nearest neighbors

The k-nearest neighbors (kNN) algorithm is one of the most widely used and simplest methods among machine learning classification algorithms. In the training process, the kNN algorithm classifies an unlabeled instance based on the most common label of its k-neighbors in the training set. The distance metric that is used for the identification of the nearest neighbors affects the performance of the classifier [13-15].

 Adaboost

This classification algorithm boosts the performance of a simple classifier by combining a set of weak classifiers to a stronger learning algorithm. In this way, the weak classifiers have to perform only a little better than a random guessing, but the final combined classifier usually results in a good performance. In order to boost a weak classifier, it is forced to solve a series of learning problems. After every learning round, the examples are weighted and the importance of the ones that were falsely classified by the previous weak classifier is increased [16].

Evaluation

In this specific study, cross-validation (CV) has been used as an evaluation method of the classifier. In n-fold CV, the training set is divided into n subsets. One after the other, one subset is used as a test subset for the trained classifier and the remaining n-1 subsets are used as the training subset [17]. The performance of the different classification algorithms is evaluated through receiver operating characteristic (ROC) curves [18]. In binary classification the outcomes can be labeled either as positive or negative. The true positive (also known as sensitivity or recall) rate refers to the proportion of positive samples that are correctly predicted as positive, whereas the false positive (also known as 1-specificity) rate refers to the proportion of negative examples that are incorrectly predicted as positive. The Y axis of the ROC curve represents the true positive rate and the X axis represents the false positive rate. The upper-left corner of the plot is the “ideal” point, as the true positive rate equals 1 and the false positive rate equals 0. After constructing a ROC curve for each classifier, the area under the curve (AUC), defined as the area between the ROC curve and the X axis, is used for the prediction performance of each classifier. In this study, the ROC curve for each classifier is estimated using a 10-fold CV procedure and we compare the mean AUC occurring from each curve. A larger AUC usually means a better classifier [19]. Other metrics for evaluating the performance of a classifier are precision, sensitivity and accuracy. Precision is the ability of the classifier to not label a sample that is negative as positive. As mentioned before, sensitivity equals to the proportion of positive samples that are correctly predicted as positive and, finally, accuracy is the number of correct predictions made divided by the total number of predictions made [20].

Feature selection

Feature selection can prove to be very important, as it can reveal subsets of informative genes that can discriminate schizophrenics from healthy control subjects. There are three main feature selection methods: filter, wrapper, and embedded methods. Filter methods filter out features that, based on statistical methods, are not informative. Filter feature selection is performed before applying classification (e.g. Fisher criterion score). Wrapper methods (e.g. stepwise forward selection and stepwise backward selection) search for optimal feature subsets, and utilize a classifier in order to evaluate the predictive power of the feature subsets. Compared to the filter methods, wrapper methods are usually more computationally demanding; but, they also provide more accurate results [21]. The embedded methods select features while building a model. Embedded techniques are more computationally efficient than wrapper methods. An example of embedded methods is support vector recursive feature elimination (SVM-RFE), which is also used in this study. SVM-RFE is based on an iterative method of setting aside the feature with the lowest weight for each prediction method, until the optimal subset of genes is left [22].

The aim of this study is to test if the microarray gene expression data from a postmortem brain dataset contain enough information for the classification of SZ. For this reason several classification algorithms have been tested and their performance has been evaluated.

Materials and Methods

Data preprocessing and analysis

A dataset that includes brain postmortem gene expression data of 28 schizophrenic and 23 healthy control subjects, derived from Broadmann area 10 (anterior prefrontal cortex), accessible at NCBI GEO database [23] with the accession number GSE 17612, was analyzed using the Bioconductor package ‘affy’ [24] through the R programming system [25]. Gene expression profiles were generated using the Affymetrix HG-U133 Plus 2.0 GeneChip. In this study, the robust multi-array analysis (RMA) method was used, which performs background correction on the Affymetrix perfect match data, applies quantile normalization and then performs summarization of the probe set information using median polish [26]. The limma (moderated t-test) Bioconductor package of R has been used towards the identification of differentially expressed genes among the two classes [27]. Transcripts were characterized as differentially expressed if their unadjusted p-value was less than 0.01. The differentially expressed genes were used as an input for the pathway analysis and gene prioritization as well as for the classification task.

Pathway analysis and gene prioritization

The differentially expressed genes were imported into the Bioinfominer web tool (available online: www.bioinfominer.com) for functional analysis based on established statistical tests and using different ontology databases, namely Gene Ontology (GO) [28], Reactome [29], Human Phenotype Ontology [30] and MGI Mammalian Phenotype Ontology [31]. In this way, significant biological mechanisms associated to the input data were revealed. The next part of the analysis include the identification and the prioritization of master regulatory genes, which represent hub nodes in the GO tree structure. These genes play a central role, as they are related to many distinct, cross-talking GO terms [32].

Classification algorithms and parameter optimization

In this study, the following classification techniques for the discrimination of the two classes (SZ and healthy controls) have been used: SVM, Extra Trees, RF, AdaBoost, and kNN classification algorithms. The classification algorithms come with a set of parameters. In this study, the parameters of the utilized classifiers were optimized with a CV grid search. Using this exhaustive search for each classifier, this method selects those parameters that maximize the mean AUC score of the CV [33]. For all the classification models developed in this paper, parameter optimization has been performed. All of the machine learning methods were implemented in scikit-learn [34].

Feature selection

The differentially expressed genes were used as the dataset for SVM-RFE method, in order to filter out the optimum informative feature set [35]. Generally, SVM-RFE selects the minimum informative subset of features that separates classes, by progressively removing features that are not informative. This procedure has many rounds. At each round one gene is eliminated and an SVM classifier is trained based on the rest of the genes. That procedure is recursively repeated on the pruned sets until the number of features that present the best performance according to CV is reached [36]. The reason for using SVM-RFE is that we aim at developing a sensitive and specific classification algorithm, based on realistic clinical biomarkers, assaying a small number of genes [37].

Data collection and classification of the independent test cohort

The second dataset (NCBI GEO accession number: GSE 21935) was used as an independent group of samples in order to examine if the final genes occurring from the feature selection can be used as biomarkers in SZ. For this reason, the 21-gene model was used as an input for testing if those genes can discriminate SZ samples from healthy control samples on this independent dataset. The classification task was applied on the normalized gene expression values of the dataset, also resulting from RMA. The dataset included samples from the Brodmann Area 22 (superior temporal cortex) of 23 schizophrenic patients and 19 healthy controls.

Model evaluation

ROC curve was mainly used in this study as a metric to evaluate the output quality of each classification model, created from the 10-fold CV. Each of the 10 different splits of the dataset generated by the 10-fold CV results in a curve. Taking all of these curves, the mean AUC of each classifier is calculated. A classifier with larger mean AUC is considered to have better performance. Other metrics were also used as evaluation criteria for the performance of the classifiers, including accuracy, precision, and sensitivity.

Results

Differentially expressed genes

Applying the criteria described above (see Data preprocessing and analysis in Materials and Methods), the microarray output showed that 164 genes were differentially expressed in schizophrenic patients compared to the healthy controls (Supplementary Table 1).

Pathway analysis and gene prioritization

The differentially expressed genes of (Supplementary Table 1) were submitted to the Bioinfominer web application for the elucidation of the overrepresented GO terms, Reactome pathways, Human Phenotype Ontology terms, and MGI Mammalian Phenotype Ontology terms. The full results are presented in (Supplementary Tables 2-5). The hub genes resulting from the gene prioritization corresponding to GO enrichment analysis are presented in (Supplementary Table 6).

RF

Extra Trees

kNN

Adaboost

SVM

Parameter

Optimal Value

Parameter

Optimal Value

Parameter

Optimal value

Parameter

Optimal value

Parameter

Optimal value

Criterion

gini

Criterion:

gini

N_neighbors

5

N_estimators

500

kernel

linear

Max_features

3.16

Max_features

3.16

P

1

Learning rate

1

C

1000

N_estimators

10

N_estimators

10

weights

uniform

Table 1: Exhaustive grid search results for developed classification algorithms based on the 164 differentially expressed genes of the training dataset. Possible combinations of the parameter values are evaluated and the best combination is presented for each tested classification algorithm (RF, Extra Trees, kNN, AdaBoost, SVM).

RF parameters: Criterion: the function to measure the quality of a split; gini corresponds to the Gini impurity.
Max_features: the number of features to consider when looking for the best split.
N_estimators: the number of trees in the forest.
Extra Trees parameters: Criterion. Max_features N_estimators as described for RF.
kNN parameters: N_neighbors: number of neighbors to use.
P: power parameter for the Minkowski metric; p=1 is equivalent to using manhattan distance.
Weights: weight function used in prediction; in uniform weights all points in each neighborhood are weighted equally.
AdaBoost parameters: N_estimators: the maximum number of estimators at which boosting is terminated. Learning rate: Learning rate at which the contribution of each classifier shrinks.
SVM parameters: Kernel: specifies the kernel type to be used in the algorithm.
C: penalty parameter C of the error term.

Parameter optimization    

A grid search was performed for the classifiers that used the expression values of the differentially expressed genes of the study. The parameters of each developed classification algorithm that were subjected to grid search through CV, as well as their final values that optimize their corresponding classifiers are presented in (Table 1). A grid search was also applied for the classifiers that were developed based on the genes that occurred from the feature selection (Supplementary Table 7) and for the classifiers developed for the testing dataset (Supplementary Table 8).

Classifier performance, feature selection

Classification algorithms, based on three datasets were developed. The first dataset contained the gene expression values of the differentially expressed genes from the training data (GSE 17612), the second dataset contained gene expression values of the 21 genes that occurred after the feature selection and the third dataset included the 21 genes also obtained from the SVM-RFE, with their corresponding gene expression values obtained from the independent test data (GSE 21935).

More specifically, in the context of the classification task: SZ vs healthy controls, classification algorithms that used the differentially expressed genes as input were compared. According to the mean AUC of the ROC curve, the SZ samples can be distinguished from healthy controls on the basis of gene expression. Figures 1 and 2 show the ROC curves response of 10-fold CV for the developed classifiers that used the differentially expressed genes and the genes resulting from the feature selection as an input, respectively. More specifically, (Figures 1a-1e) compare the classification techniques of SVM, Extra Trees classifiers, RF, kNN, and AdaBoost. The AdaBoost method, yielding a CV AUC of 0.95, generally outperforms the other tested classification techniques. Mean precision, accuracy, and sensitivity of each developed classifier are also presented in (Supplementary Table 9). In this study, the SVM-RFE with stratified CV was used in order to find the ranks and the optimal number of features for classifying SZ. Among the 164 differentially expressed genes, the maximal classification performance was achieved with 21 genes (Table 2). Then, each classifier incorporated genes that occurred from SVM-RFE. The 21-gene model for each classifier was validated using 10-fold CV. The final achieved mean AUC scores of all tested classification methods using the 21-gene model as input are presented in (Figures 2a-2e). The best performance was achieved by the RF classifier, which achieved a mean AUC of 0.98. The 21-gene model was also used for developing classifiers on an independent dataset, based on the gene expression values obtained from the analysis of this specific dataset. The performance metrics of these classifiers are also presented in (Supplementary Table 9). The RF classifier resulted in the best mean AUC score of 0.91.

Gene Symbol

Gene Title

p-value

ARHGAP25

Rho GTPase activating protein 25

0.006071

GHR

growth hormone receptor

0.007136

CCL3

C-C motif chemokine ligand 3

0.006522

RPS6KA2

ribosomal protein S6 kinase A2

0.003058

CRYBG3

crystallin beta-gamma domain containing 3

0.002981

COX4I1

cytochrome c oxidase subunit 4I1

0.001857

KDM3A

lysine demethylase 3A

0.004227

LOC728613

programmed cell death 6 pseudogene

0.002835

CCNA2

cyclin A2

0.006075

S100A8

S100 calcium binding protein A8

0.000122

COX19

COX19 cytochrome c oxidase assembly factor

0.000137

MIR210HG

MIR210 host gene

0.002779

LOC100134317

hypothetical LOC100134317

0.009156

LONRF3

LON peptidase N-terminal domain and ring finger 3

0.006143

GSTM3

glutathione S-transferase mu 3 (brain)

0.005909

VCPIP1

valosin containing protein (p97)/p47 complex interacting protein 1

0.006035

GJB2

gap junction protein beta 2

0.000687

LCOR

ligand dependent nuclear receptor corepressor

0.001645

MRS2

MRS2, magnesium transporter

0.0075

NAA38

N(alpha)-acetyltransferase 38, NatC auxiliary subunit

0.004166

ANKRD37

ankyrin repeat domain 37

0.006384

Table 2: Genes that occurred after the SVM-RFE feature selection and could discriminate the postmortem samples of SZ and healthy control subjects based on the differentially expressed genes of the GSE17612 dataset.

Figure 1: ROC curve analysis for the evaluation of the classification of SZ versus healthy controls after testing different classification algorithms: (a) SVM; (b) Extra Tree Classifiers; (c) RF; (d) AdaBoost; (e) kNN. The figure shows the ROC response of different classifiers, created from 10-fold CV. With the help of the ten occurring curves from each classification algorithm, the mean AUC for each algorithm is also calculated and presented in the figures as the crooked line in every case (Mean ROC).
Figure 2: ROC curve analysis for the evaluation of the classification of SZ versus healthy controls after the SVM-RFE feature selection. The figures depict the evaluation results of the ROC analysis with five different algorithms: (a) SVM; (b) Extra Trees Classifiers; (c) RF; (d) Adaboost and (e) kNN. The figure shows the ROC response of different classification algorithms, created from 10-fold CV. With the help of the ten occurring curves from each classification algorithm, the mean AUC for each algorithm is also calculated and presented in the figures as the crooked line in every case (Mean ROC).

Discussion

The top ranked biological processes resulting from the Bioinfominer tool includes calcium mediated signaling (CCL3, ALMS1, LAT2) (Supplementary Table 2). The Ca2+ signaling pathway is a major component of the mechanisms that regulate neuronal excitability, information processing, and cognition. Differences in gene transcription related to calcium signaling can prove to be very important, as they may lead to alterations in the neuronal signaling. Abnormalities of the Ca2+ signaling pathway have been related to the development of SZ as well as of bipolar disorder [38]. In addition there are findings that suggest that calcium is capable of inducing structural and cognitive deficits observed in SZ [39]. The Reactome pathway analysis (Supplementary Table 3) resulted in FCGR activation (SYK, HCK, FCGR3A), classical antibody-mediated complement activation (C1QC, C1QB), complement cascade (CD59, C1QC, C1QB), and Fcgamma receptor dependent phagocytosis (SYK, HCK, DOCK1, FCGR3A), which are all clustered to innate immune system. The MGI Mammalian Phenotype Ontology analysis resulted in abnormal neutrophil morphology (S100A9, ITGA), abnormal neutrophil physiology (SYK, HCK, ITGAM, S100A9), and abnormal lymphatic vessel morphology (VEGFA, SYK, GJB2), which are all related to immune system phenotype (Supplementary Table 4). Finally, as shown in (Supplementary Table 5), the Human Phenotype Ontology analysis results in abnormalities related to the immune system, such as decreased serum complement C4b (C1QB and C1QC), Hashimoto thyroiditis (CIQB, CIQC) and increased antibody level in the blood (DSP, FAM13A and SAMHD1). These findings are in accordance to other schizophrenic studies. In a postmortem study of schizophrenic patients, the immune-related pathway has been reported to be involved in the pathology of SZ. In the same study, arachidonic acid cascade markers were found to have increased [40]. A gene expression study on peripheral blood mononuclear cells identified differentially expressed genes related to the immune pathways in schizophrenic patients [41]. Another SZ study of microarray data on Broadmann Area 22 reports a decrease of neuroinflammation related pathways, which may result to cognitive impairment and progression of SZ disease [42]. Finally, the enrichment analysis of MGI Mammalian Phenotype Ontology terms (Supplementary Table 4) revealed another important term, namely abnormal central nervous system synaptic transmission (LZTS1, TRIB2, TNC, CSPG5). Many published SZ studies suggest there is an altered expression of presynaptic proteins. Anatomical and functional synaptic abnormalities probably contribute to the pathology and symptomatology of the disease, but synaptic disturbances are most likely to be a part of a complex network of events leading to the expression of the disease [43].

The Reactome analysis also resulted in signaling by retinoic acid including the genes CYP26B1, ALDH1A3 CRABP1, and ALDH1A1 (Supplementary Table 3). Retinoid dysfunction may be also involved in the pathophysiology of SZ. CYP26B1 and ALDH1A3 as well as other genes involved in the synthesis and transportation of retinoic acid are implicated in SZ [40]. The functional analysis also detected an integrin-mediated signaling pathway among the GO terms represented by the genes SYK, HCK, DOCK1, and ITGAM (Supplementary Table 2). The antipsychotic agent penfluridol has been reported to act through inhibition of the integrin signaling [44].

Using supervised methods, we concluded that SZ can be classified by postmortem gene expression, even without applying any feature selection method, achieving an AUC score of 0.95 (Figure 1d) and sensitivity of 0.96 (Supplementary Table 9) with the use of the AdaBoost algorithm. Other classification algorithms also performed well, such as SVM with AUC 0.93 (Figure1a), accuracy 0.94, precision 0.97, and sensitivity 0.93 (Supplementary Table 9). SVM-RFE feature selection concluded to 21 genes and with their gene expression values a RF classifier was developed with 0.98 AUC score (Figure 2c). Additionally, the good performances of the classification models after applying the 21-gene model on the testing set supports the generalization of the 21-gene model to a test dataset, independent from the samples included in the model construction, with final AUC performance of 0.91 and 0.85 sensitivity, achieved by the RF classification model (Supplementary Table 9).

The 21 genes after the SVM-RFE feature selection (Table 2) reported in this study could be considered a candidate biomarker set for the diagnosis of SZ, to serve as a starting point for its further validation. As shown in (Supplementary Table 1), the genes rendering from the feature selection do not present high fold changes. There are other studies supporting that top-ranked genes may lose essential information specifically for classification purposes because of the fact that they are usually highly correlated [7, 45]. Therefore, we considered that it is reasonable for the feature selection algorithms to identify statistically significant genes with small fold changes as predictors for classification. Among the 21 genes, S100A8 and CCL3 genes have been previously associated to SZ [46,47]. The S100A8 gene encodes a member of the S100 protein family. S100 proteins are involved in many cellular processes, such as cell cycle progression and differentiation. This specific protein acts as cytokine [provided by RefSeq]. The S100A8 gene also presents the greatest fold change among the upregulated genes of the study (Supplementary Table 1). The S100A8 gene dimerizes with the S100A9 gene, which was also shown to be differentially expressed in this study (Supplementary Table 1). This dimerization forms calprotectin, which is involved in innate immunity and inflammation. S100A8 is also reported to be upregulated at the protein level in another schizophrenic study [47]. The CCL3 gene encodes a small inducible cytokine. Through binding to CCR1, CCR4 and CCR5 receptors, it participates in inflammatory responses [provided by RefSeq]. Chemokines are associated to neurobiological mechanisms, such as neurogenesis regulation or neurotransmitter-like effects, probably implicated in psychiatric disorders. It has been reported that many chemokines, including CXCL8 (IL-8), CCL2, CCL3 and CCL5, have been non-specifically associated to psychiatric diseases [48]. All the aforementioned genes of the 21-gene model were also included in the hub genes list (Supplementary Table 6) resulted from the gene prioritization.

It is also worth mentioning that genes RPS6KA2 and CCNA2 are involved in the Reactome pathway of (R-HSA-2559582), which is also known as senescence messaging secretome. Oxidative stress can induce DNA damage, and the persistent DNA damage may be a senescence-associated secretory phenotype initiator [49]. Generally, cellular senescence and apoptosis (programmed cell death) are ways to control DNA damage and exacerbation of those processes has been previously related SZ [50]. Another hub gene included in the differentially expressed gene list is SYK (Supplementary Table 6), which is a member of the non-receptor type tyrosine protein kinases family. The encoded protein participates in coupling activated immunoreceptors to downstream signaling events that facilitate various cellular responses, such as proliferation, differentiation, and phagocytosis [provided by RefSeq]. It is considered to modulate epithelial cell growth. SYK was also identified to be upregulated in a study that examines the involvement of the immune system in the etiology of SZ [40]. Finally, the differentially expressed hub gene VEGFA encodes a vascular endothelial growth factor A. This growth factor is involved in neurotrophy and neurogenesis, both possibly implicated in the pathophysiology of SZ. However, a study on the Hann Chinese population found no significant associations between different haplotypes of VEGFA and the risk of SZ [51].

Conclusion

This study revealed 164 genes that were statistically significant. Furthermore, among the differentially expressed genes, CCL3, S100A8, SYK and VEGFA have been previously implicated in SZ and other psychiatric diseases. The main identified statistically significant ontological terms of interest that have been previously related to SZ are immune-related mechanisms. Other interesting mechanisms have also been found to be overrepresented, such as central nervous system synaptic transmission, integrin mediated signaling and retinoic acid signaling. In this study, the importance of the integrated feature selection and classification algorithm for the prediction of classes and for the identification of significant genes have been revealed once more [52]. In summary, RF after the feature selection method of SVM-RFE outperformed the other tested classification methods with an AUC score of 0.98. The feature selection resulted to 21 genes that could discriminate schizophrenic and healthy control samples in two different independent datasets of postmortem brain samples obtained from two different brain regions.

References

  1. Kahn RS, Sommer IE, Murray RM, Meyer Lindenberg A, Weinberger DR, et al. (2015) Schizophrenia. Nature Reviews Disease Primers 1: 15067.
  2. Middleton FA, Mirnics K, Pierri JN, Lewis DA, Levitt P (2002) Gene expression profiling reveals alterations of specific metabolic pathways in schizophrenia. J Neurosci 22(7): 2718-2729.
  3. Iwamoto K, Kato T (2006) Gene expression profiling in schizophrenia and related mental disorders. Neuroscientist 12 (4): 349-361.
  4. Dean B, Keriakous D, Scarr E, Thomas EA (2007) Gene expression profiling in Brodmann's area 46 from subjects with schizophrenia. Aust N Z J Psychiatry 41(4): 308-320.
  5. Cattane N, Minelli A, Milanesi E, Maj C, Bignotti S, et al. (2015) Altered gene expression in schizophrenia: findings from transcriptional signatures in fibroblasts and blood. PLoS One 10(2): e0116686.
  6. Kotsiantis SB (2007) Supervised Machine Learning: A Review of Classification Techniques. Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies. Emerging Artificial Intelligence Applications in Computer Engineering, Greece.
  7. Takahashi M, Hayashi H, Watanabe Y, Sawamura K, Fukui N, et al. (2010) Diagnostic classification of schizophrenia by neural network analysis of blood-based gene expression signatures. Schizophr Res 119(1-3): 210-218.
  8. Tarca AL, Romero R, Draghici S (2006) Analysis of microarray experiments of gene expression profiling. Am J Obstet Gynecol 195(2): 373-388.
  9. Struyf J, Dobrin S, Page D (2008) Combining gene expression, demographic and clinical data in modeling disease: a case study of bipolar disorder and schizophrenia. BMC Genomics 9: 531.
  10. Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, et al. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A 97(1): 262-267.
  11. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Machine Learning 63(1): 3-42.
  12. Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7(3): 1-13.
  13. Sarkar M, Leong TY (2000) Application of K-nearest neighbours algorithm on breast cancer diagnosis problem. Proc AMIA Symp 759-63.
  14. Weinberger KQ, Saul LK (2009) Distance Metric Learning for Large Margin nearest Neighbour Classification. Journal of Machine Learning Research 10: 207-244.
  15. Asyali MH, Colak D, Demirkaya O, Inan MS (2006) Gene Expression Profile Classification: A Review. Current Bioinformatics 1(1): 55-73.
  16. Mozos OM, Stachniss C, Burgard W (2005) Supervised Learning of Places from Range Data using Adaboost. IEEE International Conference on Robotics and Automation 1730-1735.
  17. Hsu CW, Chang CC, Lin CJ (2003) A Practical Guide to Support Vector Classification. Department of Computer Science, National Taiwan University pp. 1-16.
  18. Bradley AP (1997) The use of the area under the Roc curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7): 1145-1159.
  19. Hajian Tilaki K (2013) Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Caspian J Intern Med 4(2): 627-635.
  20. Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4): 427-437.
  21. Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19): 2507-2517.
  22. Lin X, Yang F, Zhou L, Yin P, Kong H, et al. (2012) A support vector machine-recursive feature elimination feature selection method based on artificial contrast variables and mutual information. J Chromatogr B Analyt Technol Biomed Life Sci 910: 149-155.
  23. Clough E, Barrett T (2016) The Gene Expression Omnibus Database. Methods Mol Biol 1418: 93-110.
  24. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10): R80.
  25. Team RDC (2010) R A language and environment for statistical computing. R Foundation for Statistical Computing: Vienna, Austria
  26. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2): 249-264.
  27. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, et al. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43(7): e47.
  28. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1): 25-29.
  29. D'Eustachio P (2011) Reactome knowledgebase of human biological pathways and processes. Methods Mol Biol 694: 49-61.
  30. Groza T, Kohler S, Moldenhauer D, Vasilevsky N, Baynam G, et al. (2015) The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease. Am J Hum Genet 97(1): 111-124.
  31. Smith CL, Eppig JT (2012) The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mamm Genome 23(9-10): 653-668.
  32. Koutsandreas T, Pilalis E, Vlachavas EI, Koczan D, Klippel S, et al. (2015) Making sense of the biological complexity through the platform-driven unification of the analytical and visualization tasks. IEEE 15th International Conference on Bioinformatics and Bioengineering (BIBE) 1-6.
  33. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13: 281-305.
  34. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, et al. (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825-2830.
  35. Maldonado S, Weber R (2009) A wrapper method for feature selection using Support Vector Machines. Information Sciences 179(13): 2208-2217.
  36. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn 46: 389-422.
  37. Clelland CL, Read LL, Panek LJ, Nadrich RH, Bancroft C, et al. (2013) Utilization of never-medicated bipolar disorder patients towards development and validation of a peripheral biomarker profile. PLoS One 8(6): e69082.
  38. Berridge MJ (2014) Calcium signalling and psychiatric disease: bipolar disorder and schizophrenia. Cell Tissue Res 357 (2): 477-492.
  39. Lidow MS (2003) Calcium signaling dysfunction in schizophrenia: a unifying approach. Brain Res Brain Res Rev 43(1): 70-84.
  40. Huang KC, Yang KC, Lin H, Tsao Tsun Hui T, Lee WK, et al. (2013) Analysis of schizophrenia and hepatocellular carcinoma genetic network with corresponding modularity and pathways: novel insights to the immune system. BMC Genomics 14 Suppl 5: S10.
  41. Gardiner EJ, Cairns MJ, Liu B, Beveridge NJ, Carr V, et al. (2013) Gene expression analysis reveals schizophrenia-associated dysregulation of immune pathways in peripheral blood mononuclear cells. J Psychiatr Res 47(4): 425-437.
  42. Rao JS, Kim HW, Harry GJ, Rapoport SI, Reese EA (2013) Increased neuroinflammatory and arachidonic acid cascade markers, and reduced synaptic proteins, in the postmortem frontal cortex from schizophrenia patients. Schizophr Res 147(1): 24-31.
  43. Faludi G, Mirnics K (2011) Synaptic changes in the brain of subjects with schizophrenia. Int J Dev Neurosci 29(3): 305-309.
  44. Ranjan A, Gupta P, Srivastava SK (2016) Penfluridol: An Antipsychotic Agent Suppresses Metastatic Tumor Growth in Triple-Negative Breast Cancer by Inhibiting Integrin Signaling Axis. Cancer Res 76(4): 877-890.
  45. Liu B, Cui Q, Jiang T, Ma S (2004) A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics 5: 136.
  46. Xu J, Sun J, Chen J, Wang L, Li A, et al. (2012) RNA-Seq analysis implicates dysregulation of the immune system in schizophrenia. BMC Genomics 13 Suppl 8: S2.
  47. Perez-Santiago J, Diez Alarcia R, Callado LF, Zhang JX, Chana G, et al. (2012) A combined analysis of microarray gene expression studies of the human prefrontal cortex identifies genes implicated in schizophrenia. J Psychiatr Res 46 (11): 1464-1474.
  48. Stuart MJ, Baune BT (2014) Chemokines and chemokine receptors in mood disorders, schizophrenia, and cognitive impairment: a systematic review of biomarker studies. Neurosci Biobehav Rev 42: 93-115.
  49. Rodier F, Coppe JP, Patil CK, Hoeijmakers WA, Munoz DP, et al. (2009) Persistent DNA damage signalling triggers senescence-associated inflammatory cytokine secretion. Nat Cell Biol 11(8): 973-979.
  50. Papanastasiou E, Gaughran F, Smith S (2011) Schizophrenia as segmental progeria. Journal of the Royal Society of Medicine 104(11): 475-484.
  51. Gao K, Wang Q, Zhang Y, Wang D, Fu Y, et al. (2015) Association study of VEGFA polymorphisms with schizophrenia in Han Chinese population. Neurosci Lett 590: 121-125.
  52. Pirooznia M, Yang JY, Yang MQ, Deng Y (2008) A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics 9 Suppl 1: S13.

Supplementary Tables

Gene Symbol Gene Title Fold Change (log2) p-value
S100A8 S100 calcium binding protein A8 1.826497 0.000122
C1QB complement component 1, q subcomponent, B chain 0.731081 0.005022
S100A9 S100 calcium binding protein A9 0.662098 0.004514
C1QC complement component 1, q subcomponent, C chain 0.643863 0.008943
FCGR3A Fc fragment of IgG receptor IIIa 0.557102 0.005952
BAG3 BCL2 associated athanogene 3 0.537188 0.006337
SLC16A3 solute carrier family 16 member 3 0.482482 0.007186
FCGR3B Fc fragment of IgG receptor IIIb 0.446586 0.005075
ALOX5AP arachidonate 5-lipoxygenase activating protein 0.389708 0.005349
RNASET2 ribonuclease T2 0.370092 0.003478
ANKRD37 ankyrin repeat domain 37 0.352631 0.006384
HK2 hexokinase 2 0.348767 0.001683
VEGFA vascular endothelial growth factor A 0.33662 0.005811
HCK HCK proto-oncogene, Src family tyrosine kinase 0.335132 0.00693
DOCK8 dedicator of cytokinesis 8 0.333469 0.00656
APBB1IP amyloid beta precursor protein binding family B member 1 interacting protein 0.320001 0.008019
DIS3L2 DIS3 like 3'-5' exoribonuclease 2 0.281186 0.009462
CD59 CD59 molecule 0.279477 0.009546
SAMHD1 SAM domain and HD domain 1 0.273887 0.004751
ACVRL1 activin A receptor like type 1 0.269788 0.003338
LAT2 linker for activation of T-cells family member 2 0.242039 0.002806
ITGAM integrin subunit alpha M 0.240918 0.005904
SYK spleen tyrosine kinase 0.240093 0.00638
P4HA1 prolyl 4-hydroxylase subunit alpha 1 0.237544 0.00403
MIR100HG mir-100-let-7a-2 cluster host gene 0.230178 0.006184
MIR210HG MIR210 host gene 0.215512 0.002779
KDM3A lysine demethylase 3A 0.213788 0.004227
GSTM3 glutathione S-transferase mu 3 (brain) 0.212229 0.005909
LONRF3 LON peptidase N-terminal domain and ring finger 3 0.208657 0.006143
CRYBG3 crystallin beta-gamma domain containing 3 0.20283 0.002981
LCOR ligand dependent nuclear receptor corepressor 0.19837 0.001645
AKAP12 A-kinase anchoring protein 12 0.196526 0.001368
ARID5B AT-rich interaction domain 5B 0.194599 0.00905
CCDC69 coiled-coil domain containing 69 0.192349 0.001352
SMAD7 SMAD family member 7 0.189792 0.003705
IGF1R insulin like growth factor 1 receptor 0.186403 0.009458
APOC2 apolipoprotein C-II 0.183114 0.002289
NAA38 N(alpha)-acetyltransferase 38, NatC auxiliary subunit 0.182993 0.004166
PAPD5 PAP associated domain containing 5 0.182046 0.008159
JAKMIP2 janus kinase and microtubule interacting protein 2 0.178467 0.005588
MRS2 MRS2, magnesium transporter 0.177144 0.0075
GHR growth hormone receptor 0.17617 0.007136
C2CD2 C2 calcium-dependent domain containing 2 0.170099 0.00975
   
COX19 COX19 cytochrome c oxidase 0.16423 0.000137
  assembly factor    
SNRNP48 small nuclear ribonucleoprotein U11/U12 subunit 48 0.163574 0.009062
MOB1A MOB kinase activator 1A 0.163389 0.007832
ARHGAP25 Rho GTPase activating protein 25 0.161316 0.006071
COX4I1 cytochrome c oxidase subunit 4I1 0.155805 0.001857
ABCD4 ATP binding cassette subfamily D member 4 0.15033 0.006105
   
ST3GAL1 ST3 beta-galactoside alpha-2,3-sialyltransferase 1 0.147061 0.009269
RPS6KA2 ribosomal protein S6 kinase A2 0.146662 0.003058
SOAT1 sterol O-acyltransferase 1 0.140692 0.006413
ZC3H18 zinc finger CCCH-type containing 18 0.138364 0.002404
MAX MYC associated factor X 0.13176 0.008971
RGS19 regulator of G-protein signaling 19 0.131705 0.005579
DKK1 dickkopf WNT signaling pathway inhibitor 1 0.131382 0.006626
   
GTF2A2 general transcription factor IIA 2 0.125946 0.008314
HSBP1L1 heat shock factor binding protein 1-like 1 0.12215 0.005242
C1GALT1C1 C1GALT1 specific chaperone 1 0.121886 0.007076
CASP6 caspase 6 0.121064 0.00282
SLC4A2 solute carrier family 4 member 2 0.118597 0.006411
VCPIP1 valosin containing protein (p97)/p47 complex interacting protein 1 0.114991 0.006035
CCNA2 cyclin A2 0.106498 0.006075
FBXO42 F-box protein 42 0.102273 0.007903
LOC284009 hypothetical LOC284009 0.099299 0.0068
ZNF202 zinc finger protein 202 0.099196 0.005258
KHDC1 KH homology domain containing 1 0.093737 0.008083
RPS6KB1 ribosomal protein S6 kinase B1 0.091089 0.007656
TMEM184C transmembrane protein 184C 0.090477 0.003114
FHAD1 forkhead-associated (FHA) phosphopeptide binding domain 1 -0.08689 0.005764
PRKRIP1 PRKR interacting protein 1 (IL11 inducible) -0.08776 0.002776
DOCK1 dedicator of cytokinesis 1 -0.09194 0.00703
FAM13A family with sequence similarity 13 member A -0.09411 0.006757
CHMP6 charged multivesicular body protein 6 -0.0953 0.009337
ZNF777 zinc finger protein 777 -0.09585 0.007735
FAM204A family with sequence similarity 204 member A -0.09614 0.009096
LOC440867 uncharacterized LOC440867 -0.10242 0.003157
ANAPC13 anaphase promoting complex subunit 13 -0.10469 0.009113
CORIN corin, serine peptidase -0.10562 0.003811
TMEM239 transmembrane protein 239 -0.10666 0.008018
VIPR2 vasoactive intestinal peptide receptor 2 -0.10844 0.004582
TEX264 testis expressed 264 -0.10958 0.001421
ZNF18 zinc finger protein 18 -0.1108 0.003814
HSD17B8 hydroxysteroid (17-beta) dehydrogenase 8 -0.11094 0.008904
JMJD4 jumonji domain containing 4 -0.113 0.005253
BROX BRO1 domain and CAAX motif containing -0.11408 0.003452
CCS copper chaperone for superoxide dismutase -0.11545 0.000723
ATP1A4 ATPase Na+/K+ transporting subunit alpha 4 -0.11584 0.006439
EXOSC9 exosome component 9 -0.11602 0.001765
LOC100506459 uncharacterized LOC100506459 -0.11984 0.003974
TSEN15 tRNA splicing endonuclease subunit 15 -0.12003 0.007396
DMKN dermokine -0.12071 0.004637
C3orf35 chromosome 3 open reading frame 35 -0.12101 0.001778
WDR86 WD repeat domain 86 -0.12271 0.008758
SMURF1 SMAD specific E3 ubiquitin protein ligase 1 -0.12448 0.00463
FAM173A family with sequence similarity 173 member A -0.12475 0.008634
LOC730098 hypothetical LOC730098 -0.12539 0.002728
IFT27 intraflagellar transport 27 -0.12605 0.004584
SAPCD1 suppressor APC domain containing 1 -0.12885 0.004477
LINC00900 long intergenic non-protein coding RNA 900 -0.13079 0.007456
LRRN4CL LRRN4 C-terminal like -0.13079 0.002182
JAG1 jagged 1 -0.13119 0.003896
C17orf97 chromosome 17 open reading frame 97 -0.13121 0.006676
PNLDC1 PARN like, ribonuclease domain containing 1 -0.13189 0.001077
CHCHD5 coiled-coil-helix-coiled-coil-helix domain containing 5 -0.1319 0.009292
PEX10 peroxisomal biogenesis factor 10 -0.13306 0.00123
PRR5L proline rich 5 like -0.13592 0.004642
PARGP1 poly(ADP-ribose) glycohydrolase pseudogene 1 -0.13691 0.008385
ZFP2 ZFP2 zinc finger protein -0.13968 0.00489
ADGRA1 adhesion G protein-coupled receptor A1 -0.14208 0.006006
IGLV1-44 immunoglobulin lambda variable 1-44 -0.14244 0.009018
MSANTD1 Myb/SANT DNA binding domain containing 1 -0.14359 0.009775
WSB1 WD repeat and SOCS box containing 1 -0.14615 0.000747
LZTS1 leucine zipper, putative tumor suppressor 1 -0.14794 0.007089
ALMS1 ALMS1, centrosome and basal body associated protein -0.14839 0.00191
SOHLH2 spermatogenesis and oogenesis specific basic helix-loop-helix 2 -0.14951 0.002074
ACOX3 acyl-CoA oxidase 3, pristanoyl -0.14973 0.006074
ICA1 islet cell autoantigen 1 -0.15208 0.000761
AMFR autocrine motility factor receptor -0.15799 0.005319
ALDH1A3 aldehyde dehydrogenase 1 -0.15804 0.003569
  family member A3    
PCDHA1 protocadherin alpha 1 -0.16169 0.008645
COL12A1 collagen type XII alpha 1 -0.1625 0.006727
FMOD fibromodulin -0.16822 0.00679
LOC440896 hypothetical LOC440896 -0.16924 0.006958
PXDN peroxidasin -0.17006 0.002642
ADAMTS8 ADAM metallopeptidase with thrombospondin type 1 motif 8 -0.1713 0.003075
LYRM4 LYR motif containing 4 -0.18083 0.000795
CSPG5 chondroitin sulfate proteoglycan 5 -0.1824 0.008227
MTG2 mitochondrial ribosome-associated GTPase 2 -0.18278 0.008865
SPAG6 sperm associated antigen 6 -0.18278 0.003408
IGFBP6 insulin like growth factor binding protein 6 -0.1859 0.004757
PDCD6 programmed cell death 6 -0.1912 0.000818
SLC22A8 solute carrier family 22 member 8 -0.19273 0.007573
LOC100134317 hypothetical LOC100134317 -0.19749 0.009156
SETD9 SET domain containing 9 -0.20602 0.008609
FLRT1 fibronectin leucine rich transmembrane protein 1 -0.20607 0.001075
GPCPD1 Glycerophosphocholine -0.2067 0.006461
  phosphodiesterase 1    
SLC6A20 solute carrier family 6 member 20 -0.20902 0.002204
PEX7 peroxisomal biogenesis factor 7 -0.21003 0.001087
TRIB2 tribbles pseudokinase 2 -0.21164 0.001142
RASGRP1 RAS guanyl releasing protein 1 -0.21276 0.005429
TNC tenascin C -0.21351 0.002824
AHSA2 AHA1, activator of heat shock 90kDa protein ATPase homolog 2 (yeast) -0.22234 0.007083
ADTRP androgen dependent TFPI regulating protein -0.23107 0.008219
AGA aspartylglucosaminidase -0.24003 0.006235
MPPED2 metallophosphoesterase domain containing 2 -0.24746 0.001065
CA4 carbonic anhydrase 4 -0.24836 0.003805
ARMCX4 armadillo repeat containing, X-linked 4 -0.25182 0.000947
SPON2 spondin 2 -0.25468 0.003923
C1orf95 chromosome 1 open reading frame 95 -0.27384 0.009582
ECM2 extracellular matrix protein 2 -0.28447 0.006839
TYRP1 tyrosinase-related protein 1 -0.29618 0.003082
PHLDB2 pleckstrin homology like domain family B member 2 -0.32016 0.009663
ALDH1A1 aldehyde dehydrogenase 1 family member A1 -0.32543 0.007148
CYP26B1 cytochrome P450 family 26 subfamily B member 1 -0.32578 0.000915
CRABP1 cellular retinoic acid binding protein 1 -0.37849 0.003868
SLC13A4 solute carrier family 13 member 4 -0.38768 0.001843
CCL3 C-C motif chemokine ligand 3 -0.4024 0.006522
FRZB frizzled-related protein -0.51562 0.001186
FRMPD2 FERM and PDZ domain containing 2 -0.52363 0.001204
GJB2 gap junction protein beta 2 -0.53054 0.000687
LOC728613 programmed cell death 6 pseudogene -0.66246 0.002835
DSP desmoplakin -0.70051 0.009817
OGN osteoglycin -0.80467 0.002176

Supplementary Table 1:The list of 164 differentially expressed genes identified after comparing the gene expression of healthy control and SZ samples and applying a p-value cut-off ≤ 0.01.

Rank

Term id

Term Definition

Enrichment

Hypergeometric p-value

Corrected p-value

1

GO:0046324

regulation of glucose import

2/7

2.27E-03

2.90E-03

2

GO:0001816

cytokine production

3/26

2.57E-03

7.00E-03

3

GO:0009060

aerobic respiration

3/28

3.18E-03

8.80E-03

4

GO:0042339

keratan sulfate metabolic process

3/32

4.67E-03

0.0117

5

GO:0016558

protein import into peroxisome matrix

2/10

4.76E-03

0.0174

6

GO:0030593

neutrophil chemotaxis

4/65

5.08E-03

0.0196

7

GO:0030199

collagen fibril organization

3/38

7.58E-03

0.0217

8

GO:0038083

peptidyl-tyrosine autophosphorylation

3/39

8.15E-03

0.0244

9

GO:0030514

negative regulation of BMP signaling pathway

3/42

1.00E-02

0.0271

10

GO:0038096

Fc-gamma receptor signaling pathway involved in phagocytosis

4/82

0.0119

0.0291

11

GO:0007229

integrin-mediated signaling pathway

4/84

0.0124

0.0341

12

GO:0071407

cellular response to organic cyclic compound

4/86

0.0134

0.0373

13

GO:0019722

calcium-mediated signaling

3/51

0.0169

0.0423

14

GO:0032760

positive regulation of tumor necrosis factor production

3/48

0.0144

0.0438

15

GO:0019370

leukotriene biosynthetic process

2/21

0.0206

0.0477

16

GO:0051090

regulation of sequence-specific DNA binding transcription factor activity

2/22

0.0225

0.0498

Supplementary Table 2: Overrepresented GO terms occurring from the enrichment analysis of the differentially expressed genes (category Biological Process). The ranking of statisticallysignificant terms is according to the corrected p-value.

Rank

Term id

Term Definition

Enrichment

Hypergeometric p-value

Corrected p-value

1

R-HSA-5365859

RA biosynthesis pathway

4/22

3.22E-05

3.10E-03

2

R-HSA-5362517

Signaling by Retinoic Acid

4/42

4.31E-04

9.00E-03

3

R-HSA-2029481

FCGR activation

3/19

5.19E-04

0.0123

4

R-HSA-354192

Integrin alphaIIb beta3 signaling

3/25

1.19E-03

0.0171

5

R-HSA-2022854

Keratan sulfate biosynthesis

3/27

1.49E-03

0.0186

6

R-HSA-1638074

Keratan sulfate/keratin metabolism

3/33

2.68E-03

0.0259

7

R-HSA-76009

Platelet Aggregation (Plug Formation)

3/34

2.92E-03

0.0283

8

R-HSA-3560782

Diseases associated with glycosaminoglycan metabolism

3/38

4.01E-03

0.0336

9

R-HSA-2022857

Keratan sulfate degradation

2/12

4.41E-03

0.0362

10

R-HSA-173623

Classical antibody-mediated complement activation

2/15

6.90E-03

0.0431

11

R-HSA-166658

Complement cascade

3/47

7.30E-03

0.0469

12

R-HSA-2029480

Fcgamma receptor (FCGR) dependent phagocytosis

4/91

7.45E-03

0.048

Supplementary Table 3: Overrepresented Reactome pathways occurring from the enrichment analysis of the differentially expressed genes. The ranking of statistically significant terms is according to the corrected p-value.

Rank

Term id

Term Definition

Enrichment

Hypergeometric p-value

Corrected p-value

1

MP:0010458

pulmonary trunk hypoplasia

2/3

4.07E-04

2.90E-03

2

MP:0000380

small hair follicles

2/5

1.34E-03

6.30E-03

3

MP:0011090

perinatal lethality, incomplete penetrance

9/226

1.52E-03

7.60E-03

4

MP:0010505

abnormal T wave

2/6

1.99E-03

0.0115

5

MP:0001879

abnormal lymphatic vessel morphology

3/24

2.69E-03

0.0152

6

MP:0001614

abnormal blood vessel morphology

6/142

6.77E-03

0.0176

7

MP:0001177

atelectasis

4/67

7.98E-03

0.019

8

MP:0000008

increased white adipose tissue amount

3/38

9.94E-03

0.0216

9

MP:0001261

obese

4/73

0.0107

0.0268

10

MP:0002828

abnormal renal glomerular capsule morphology

2/14

0.0113

0.0284

11

MP:0010239

decreased skeletal muscle weight

2/15

0.013

0.034

12

MP:0004938

dilated vasculature

2/17

0.0166

0.0365

13

MP:0002106

abnormal muscle physiology

4/84

0.0172

0.0365

14

MP:0002206

abnormal CNS synaptic transmission

4/87

0.0193

0.0373

15

MP:0005065

abnormal neutrophil morphology

2/18

0.0185

0.0397

16

MP:0009050

dilated proximal convoluted tubules

2/19

0.0205

0.0462

17

MP:0002463

abnormal neutrophil physiology

4/94

0.0249

0.0473

Supplementary Table 4: Overrepresented MGI Mammalian Phenotype Ontology terms occurring from the enrichment analysis of the differentially expressed genes. The ranking of statistically significant terms is according to the corrected p-value.

Rank

Term id

Term Definition

Enrichment

Hypergeometric p-value

Corrected p-value

1

HP:0200120

Chronic active hepatitis

3/11

2.00E-04

2.50E-03

2

HP:0001394

Cirrhosis

6/104

1.02E-03

3.90E-03

3

HP:0002138

Subarachnoid hemorrhage

2/5

1.16E-03

6.80E-03

4

HP:0045044

Decreased serum complement C4b

2/9

4.06E-03

8.40E-03

5

HP:0000872

Hashimoto thyroiditis

2/11

6.12E-03

0.0104

6

HP:0010702

Increased antibody level in blood

3/35

6.53E-03

0.0157

7

HP:0000992

Cutaneous photosensitivity

4/73

8.45E-03

0.0158

8

HP:0002910

Elevated hepatic transaminases

6/155

7.37E-03

0.0162

9

HP:0002092

Pulmonary hypertension

5/114

8.46E-03

0.0217

10

HP:0000979

Purpura

3/47

0.0147

0.0272

11

HP:0100729

Large face

2/20

0.0198

0.0277

12

HP:0002206

Pulmonary fibrosis

3/48

0.0155

0.0288

13

HP:0001635

Congestive heart failure

6/197

0.0217

0.0303

14

HP:0001808

Fragile nails

2/22

0.0238

0.0305

15

HP:0000311

Round face

4/98

0.0227

0.032

16

HP:0010982

Polygenic inheritance

2/23

0.0258

0.0353

17

HP:0011344

Severe global developmental delay

4/103

0.0266

0.0403

18

HP:0002633

Vasculitis

3/59

0.0268

0.0434

19

HP:0001369

Arthritis

4/104

0.0275

0.0447

20

HP:0006519

Alveolar cell carcinoma

2/25

0.0302

0.0467

21

HP:0002922

Increased CSF protein

2/26

0.0325

0.0476

22

HP:0009891

Underdeveloped supraorbital ridges

2/62

0.0304

0.049

Supplementary Table 5: Overrepresented Human Phenotype Ontology terms occurring from the enrichment analysis of the differentially expressed genes. The ranking of statistically significant terms is according to the corrected p-value.

Rank

Gene Symbol

Clusters

Enriched Clusters

Interactors

Associated Drugs

1

CCL3

4

4

0

1

2

SYK

4

4

2

10

3

RASGRP1

2

2

0

0

4

VEGFA

2

2

0

24

5

SMAD7

2

2

1

0

6

RPS6KB1

2

2

2

2

7

S100A9

2

2

2

0

8

S100A8

2

2

3

0

9

FMOD

2

1

0

0

10

HCK

2

2

0

5

Supplementary Table 6: Hub genes ranking according to ontological clusters amount. Hub genes according to ontological clusters amount. Bioinfominer reveals 10 genes as hub nodes of the enriched GO graph..

RF

Extra Trees

kNN

Adaboost

SVM

Parameter

Optimal Value

Parameter

Optimal Value

Parameter

Optimal Value

Parameter

Optimal Value

Parameter

Optimal Value

Criterion

gini

Criterion:

gini

N_neighbors

5

N_estimators

100

kernel

linear

Max_features

7.07

Max_features

5.5

p

1

Learning rate

1

C

200

N_estimators

50

N_estimators

30

weights

uniform

Supplementary Table 7: Exhaustive grid search results for the developed classification algorithms (RF, Extra Trees, kNN, Adaboost, SVM) based on genes that occurred from SVM-RFE feature selection. Possible combinations of parameter values are evaluated and the optimal value for each tested classification algorithm is presented.

RF parameters:Criterion: the function to measure the quality of a split; gini corresponds to the Gini impurity.
Max_features: the number of features to consider when looking for the best split;
N_estimators: the number of trees in the forest.
Extra Trees parameters: Criterion. Max_features. N_estimators as described in RF.
kNN parameters: N_neighbors: number of neighbors to use by default for k_neighbors queries.
P: power parameter for the Minkowski metric; p=1 is equivalent to using manhattan_distance.
Weights: weight function used in prediction; in uniform weights all points in each neighborhood are weighted equally.
AdaBoost parameters: N_estimators: the maximum number of estimators at which boosting is terminated. Learning rate: Learning rate at which the contribution of each classifier shrinks.
SVM parameters: Kernel: specifies the kernel type to be used in the algorithm.
C: penalty parameter C of the error term.

RF

Extra Trees

kNN

Adaboost

SVM

Parameter

Optimal Value

Parameter

Optimal Value

Parameter

Optimal Value

Parameter

Optimal Value

Parameter

Optimal Value

Criterion

gini

Criterion:

gini

N_neighbors

5

N_estimators

200

kernel

linear

Max_features

3.87

Max_features

10

p

1

Learning rate

1

C

1000

N_estimators

15

N_estimators

10

weights

uniform

Supplementary Table 8: Exhaustive grid search results for the construction of the classification models of gene expression from the independent test dataset GSE 21935, based on the 21 gene subset of the differentially expressed genes, after SVM-RFE feature selection on the GSE 17162 dataset. All the possible combinations of parameter values are evaluated and the best combination is presented for each tested classification algorithm.

RF parameters: Criterion: the function to measure the quality of a split; gini corresponds to the Gini impurity.
Max_features: the number of features to consider when looking for the best split.
N_estimators: the number of trees in the forest.
Extra Trees parameters: Criterion, Max_features, N_estimators.
kNN parameters: N_neighbors: number of neighbors to use by default for k_neighbors queries.
P: power parameter for the Minkowski metric; p=1 is equivalent to using manhattan distance.
Weights: weight function used in prediction; in uniform weights all points in each neighborhood are weighted equally.
Adaboost parameters: N_estimators: the maximum number of estimators at which boosting is terminated. Learning rate: Learning rate at which the contribution of each classifier shrinks.
SVM parameters: Kernel: specifies the kernel type to be used in the algorithm.
C: penalty parameter C of the error term.

GSE 17612 (no feature selection)

ROC_AUC

Accuracy

Precision

Sensitivity

kNN

0.88

0.85

0.84

0.93

SVM

0.93

0.94

0.97

0.93

RF

0.9

0.77

0.84

0.81

Extra Trees

0.85

0.78

0.89

0.75

AdaBoost

0.95

0.92

0.91

0.96

GSE 17612 (with feature selection)

ROC_AUC

Accuracy

Precision

Sensitivity

kNN

0.93

0.9

0.9

0.93

SVM

0.95

0.94

0.97

0.93

RF

0.98

0.83

0.93

0.89

Extra Trees

0.94

0.88

0.87

0.84

AdaBoost

0.93

0.78

0.8

0.77

GSE 21935

ROC_AUC

Accuracy

Precision

Sensitivity

kNN

0.72

0.63

0.79

0.58

SVM

0.82

0.76

0.87

0.74

RF

0.91

0.76

0.83

0.85

Extra Trees

0.76

0.6

0.76

0.65

Adaboost

0.9

0.81

0.75

0.86

Supplementary Table 9: Mean performance estimation values of different classification algorithms after applying 10-fold CV. Italics indicate the highest value corresponding to each performance metric.

© 2014-2016 MedCrave Group, All rights reserved. No part of this content may be reproduced or transmitted in any form or by any means as per the standard guidelines of fair use.
Creative Commons License Open Access by MedCrave Group is licensed under a Creative Commons Attribution 4.0 International License.
Based on a work at http://medcraveonline.com
Best viewed in Mozilla Firefox | Google Chrome | Above IE 7.0 version | Opera |Privacy Policy