MOJ Proteomics & Bioinformatics (MOJPB)
ISSN: 2374-6920
Volume 4 Issue 2 - 2016
Big Data Analytics and Cancer
Amy Makler and Ramaswamy Narayanan*
Department of Biological Sciences, Charles E Schmidt College of Science, Florida Atlantic University, USA
Received: October 03, 2016 | Published: October 13, 2016
*Corresponding author: Ramaswamy Narayanan, Department of Biological Sciences, Charles E. Schmidt College of Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA, Tel: +1-561-297-2247; Fax: +1-561-297-3859; Email:
Citation: Makler A, Narayanan R (2016) Big Data Analytics and Cancer. MOJ Proteomics Bioinform 4(2): 00115. DOI: 10.15406/mojpb.2016.04.00115

Keywords: Big data; Biobank; Cloud computing; Cancer; Electronic Medical Records; Genomics; Proteogenomics


The term big data has become routine across many disciplines [1-7]. In medicine, big data generally encompasses Next Generation Sequencing (NGS) of the genomes of individual patients, the mRNA expression landscape of normal and diseased tissues, biobank tissue-derived information, clinical trials, drug efficacy and toxicology data, and electronic medical records linked to medical imaging and insurance claims data [8-14]. During his State of the Union address (January 12, 2016), President Barack Obama announced the establishment of a Cancer Moonshot initiative to accelerate cancer research. This initiative, led by Vice President Joe Biden, aims to make therapies available to a large number of cancer patients and to improve cancer prevention and early-stage detection. In May 2016, the White House released The Federal Big Data Research and Development Strategic Plan, which provides guidance for developing or expanding Federal Big Data research and development (R&D) plans.

The Accelerating Medicines Partnership (AMP), a new venture involving the US National Institutes of Health (NIH), 10 biopharmaceutical companies and several nonprofit organizations, has an initial fund of $230 million. Its overall goal is to transform current approaches to diagnostics and treatment through big data analytics, by jointly identifying and validating promising biological targets of disease. The initial therapeutic areas include Alzheimer’s disease, type 2 diabetes and two autoimmune disorders, rheumatoid arthritis and systemic lupus erythematosus (lupus). The European drug research consortium projects that it will invest more than $5 billion over the next several years to apply big data techniques, under the banner “Big Data for Better Outcomes,” to speed up clinical drug trials while developing a sustainable healthcare delivery system. In the UK, the National Institute for Health Research (NIHR) has put in place a series of initiatives to help exploit the nation’s strengths in technology, medical research and healthcare data. The Genomics England Project is expected to generate a vast amount of genetic information from 100,000 patients, with an initial focus on cancer, rare diseases and infectious diseases.

Among the numerous therapeutic areas, cancer research has accumulated especially large volumes of data [15-18]. These include datasets from thousands of patients encompassing gene expression, mutations, deletions and amplifications, and proteogenomics data [19-22]. Increasingly, basic cancer research is being integrated into translational medicine in an attempt to move discoveries closer to the clinic [13,23-25].

Key cancer-related big datasets include:

The Cancer Genome Atlas (TCGA) research network: A collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), TCGA has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, which to date incorporates 2.5 petabytes of data from tumor and matched normal tissues from more than 11,000 patients, is publicly available [26];

The International Cancer Genome Consortium (ICGC): The ICGC data (release 22, Aug 2016) in total comprises data from more than 19,290 cancer donors spanning 70 projects and 21 tumor sites. The entire dataset is securely available on the Amazon Web Services (AWS) Cloud for access by cancer researchers worldwide [27];

Cancer Genomics Hub (CGHub) at the University of California, Santa Cruz (UCSC): The Cancer Genomics Hub was established in August 2011 to provide a repository for TCGA data. CGHub has grown to be the largest database of cancer genomes in the world, storing more than 2.5 petabytes of data and serving downloads of nearly 3 petabytes per month [28];

The Catalogue of Somatic Mutations in Cancer (COSMIC): The COSMIC database is the world's largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer. The latest release (v70; Aug 2014) describes 2,002,811 coding point mutations in over one million tumor samples and across most human genes [29];

The integrated cancer knowledgebase (canSAR): The canSAR database applies machine-learning approaches to provide drug-discovery predictions. The growing database now holds the 3D structures of almost three million cavities on the surface of nearly 110,000 molecules [30]; and

The National Cancer Institute's Clinical Proteomic Technologies for Cancer initiative: This initiative leverages proteogenomic analysis through the development of the Clinical Proteomic Tumor Analysis Consortium [31]. The consortium is composed of Proteome Characterization Centers, a Data Center and a Resources Center, which together produce a unique continuum that defines the proteins translated from cancer genomes [32]. This integrative approach provides the broad scientific community with knowledge that links genotype to proteotype and ultimately phenotype. The datasets, analytically validated assays and high-quality reagents are publicly accessible. These efforts, together with other NCI programs such as the Cancer Therapy Evaluation Program (CTEP), the Early Detection Research Network (EDRN) and the Cooperative Groups, have broadened the scope of cancer research from the bench to the bedside.

Other cancer-related data resources include the Oncomine® Gene Browser (ThermoFisher Scientific) dataset, which harbors comprehensive gene profiles across thousands of cancer patient genomes from >500 sources [33]; the cBioPortal for cancer genomics, which provides visualization, analysis and download of large-scale cancer genomics datasets [34]; the US Food and Drug Administration’s Mini-Sentinel [35]; the National Patient-Centered Clinical Research Network (PCORnet) [36]; claims datasets [37]; and the American Society of Clinical Oncology’s CancerLinQ [38].

Cloud-based computing has greatly expanded the ability of small to mid-size research laboratories to mine big data in cancer research. The 1000 Genomes Project, which catalogues human sequence variation through deep sequencing of 1000 genomes worldwide [39], uses a 200 TB Amazon cloud-based data repository [40]. The Globus Genomics system [41], an Amazon cloud-based analysis and data management client, is based on the open source, web-based Galaxy platform [42] and provides an elastically scaling compute cluster infrastructure. Other data management systems that allow users to integrate large-scale genomics datasets include tranSMART [43], BioMart [44] and the Integrated Rule-oriented Data System (iRODS), an open source data management software used by research organizations and government agencies worldwide. Google, Microsoft, Oracle and IBM also provide commercial cloud storage solutions used by research institutes including the National Institutes of Health and the European Bioinformatics Institute.

In breast cancer, big data-driven genomics has generated numerous “cancer signatures” that are being adopted into standard practice [22], such as Oncotype DX [45] and MammaPrint [46-48]. Big data analytics has also been used recently to predict, with 95 percent accuracy, whether a patient is suffering from aggressive triple-negative breast cancer, a slower-moving cancer or a non-cancerous lesion [49].


Significant challenges must be overcome before the revolution in big data analytics can benefit the vast number of cancer patients [50-52]. Both basic researchers and practicing oncologists increasingly face a complex plethora of bioinformatics tools and software. Harnessing the terabytes to exabytes of data emerging from numerous studies is a daunting task. Standards need to be established so that the diverse tools work across multiple platforms. The quality of datasets, the verification of tissue integrity and the electronic medical records are some of the areas requiring considerable improvement.

The software used for Electronic Medical Records (EMRs) is still under development. Integrating EMRs with genomics data from individual patients faces considerable challenges. The GWAS big datasets encompass millions of single nucleotide polymorphisms (SNPs), amounting to terabytes of information [53,54], and deriving meaningful interpretations from these vast amounts of genetic data is difficult. Medical information is stored on multiple platforms that are often incompatible with one another [55-58], which introduces considerable complexity in deriving patient-centric information. Standards need to be introduced for EMR software.

The Ethical, Legal, and Social Implications (ELSI) of the worldwide genome initiatives continue to raise strong concerns [59]. The identification of fifty individuals from the 1000 Genomes Project using short tandem repeats and public genealogy information [60] underscores this point. The increasing use of cloud-based storage for genomics data, including GWAS data that match genotypes to phenotypes, adds to the urgent need for clear guidelines to maintain privacy and security [61]. The development of de-identification algorithms [62,63] and customized user interfaces [64] could begin to address these concerns.
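
As a rough illustration of what a record-level de-identification step can look like, the following Python sketch drops direct identifiers, replaces the patient ID with a keyed hash, and coarsens the birth date to a year. It is a minimal sketch under stated assumptions, not the procedure described in [62,63]: the field names (`patient_id`, `name`, `birth_date`, and so on), the ISO date format and the key handling are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical direct-identifier fields; a real EMR export will use other names.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def pseudonymize_id(patient_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 hash of the ID: stable across records,
    but not reproducible by anyone who lacks the key."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with direct identifiers removed,
    the patient ID pseudonymized and the birth date reduced to a year."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop direct identifiers entirely
        if field == "patient_id":
            cleaned["pseudo_id"] = pseudonymize_id(value, secret_key)
        elif field == "birth_date":
            # Assumes an ISO "YYYY-MM-DD" string; keep only the year.
            cleaned["birth_year"] = value[:4]
        else:
            cleaned[field] = value
    return cleaned


if __name__ == "__main__":
    example = {
        "patient_id": "P001",
        "name": "Jane Doe",
        "birth_date": "1960-05-17",
        "diagnosis": "breast carcinoma",
    }
    # In practice the key would be stored separately from the released data.
    print(deidentify(example, secret_key=b"example-key-not-for-production"))
```

A step like this only addresses identifiers stored as explicit fields. It offers no protection against re-identification from the genotype itself, as in the surname-inference study [60], which is why access controls and data-use guidelines remain necessary alongside such algorithms.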

These issues notwithstanding, one can anticipate that big data infrastructure will help oncologists and cancer patients around the globe in the decades to come. Big data cancer analytics, drawing on data that span clinical trials, real-world patients and clinical practice, can provide answers about treatment effectiveness and long-term outcomes.


  1. Costa FF (2014) Big data in biomedicine. Drug discovery today 19(4): 433-440.
  2. Dinov ID (2016) Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data. Gigascience 5: 12.
  3. Xue LC, Dobbs D, Bonvin AM, Honavar V (2015) Computational prediction of protein interfaces: A review of data driven methods. FEBS Lett 589(23): 3516-3526.
  4. Chen Y, Elenee Argentinis JD, Weber G (2016) IBM Watson: How Cognitive Computing Can Be Applied to Big Data Challenges in Life Sciences Research. Clin Ther 38(4): 688-701.
  5. Luo J, Wu M, Gopukumar D, Zhao Y (2016) Big Data Application in Biomedical Research and Health Care: A Literature Review. Biomed Inform Insights 8: 1-10.
  6. Rein R, Memmert D (2016) Big data and tactical analysis in elite soccer: future challenges and opportunities for sports science. Springerplus 5(1): 1410.
  7. La Salle J, Williams KJ, Moritz C (2016) Biodiversity analysis in the digital era. Philos Trans R Soc Lond B Biol Sci 371(1702): 20150337.
  8. Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, et al. (2016) A Landscape of Pharmacogenomic Interactions in Cancer. Cell 166(3): 740-754.
  9. Ciriello G, Miller ML, Aksoy BA, Senbabaoglu Y, Schultz N, Sander C (2013) Emerging landscape of oncogenic signatures across human cancers. Nature genetics 45(10): 1127-1133.
  10. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483(7391): 603-607.
  11. Costello JC, Heiser LM, Georgii E, Gonen M, Menden MP, et al. (2014) A community effort to assess and improve drug sensitivity prediction algorithms. Nature biotechnology 32(12): 1202-1212.
  12. Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, et al. (2013) Mutational landscape and significance across 12 major cancer types. Nature 502(7471): 333-339.
  13. Boonstra A, Broekhuis M (2010) Barriers to the acceptance of electronic medical records by physicians from systematic review to taxonomy and interventions. BMC Health Serv Res 10: 231.
  14. Peng H, Zhou J, Zhou Z, Bria A, Li Y, Kleissas DM, et al. (2016) Bioimage Informatics for Big Data. Adv Anat Embryol Cell Biol 219: 263-272.
  15. Coates J, Souhami L, El Naqa I (2016) Big Data Analytics for Prostate Radiotherapy. Front Oncol 6: 149.
  16. Swift SL, Stojdl DF (2016) Big Data Offers Novel Insights for Oncolytic Virus Immunotherapy. Viruses 8(2): E45.
  17. Yang Y, Dong X, Xie B, Ding N, Chen J, et al. (2015) Databases and web tools for cancer genomics study. Genomics Proteomics Bioinformatics 13(1): 46-50.
  18. Kim ES (2015) The Future of Molecular Medicine: Biomarkers, BATTLEs, and Big Data. Am Soc Clin Oncol Educ Book 22-27.
  19. Chelala C, Hahn SA, Whiteman HJ, Barry S, Hariharan D, et al. (2007) Pancreatic Expression database: a generic model for the organization, integration and mining of complex cancer datasets. BMC genomics 8: 439.
  20. Barrett JH, Iles MM, Harland M, Taylor JC, Aitken JF, et al. (2011) Genome-wide association study identifies three new melanoma susceptibility loci. Nat Genet 43(11): 1108-1113.
  21. Cancer Genome Atlas Network (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487(7407): 330-337.
  22. Dawson SJ, Rueda OM, Aparicio S, Caldas C (2013) A new genome-driven integrated classification of breast cancer and its implications. The EMBO journal 32(5): 617-628.
  23. Meyer AM, Basch E (2015) Big data infrastructure for cancer outcomes research: implications for the practicing oncologist. J Oncol Pract 11(3): 207-208.
  24. Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, et al. (2016) A Landscape of Pharmacogenomic Interactions in Cancer. Cell 166(3): 740-754.
  25. Chen B, Butte AJ (2016) Leveraging big data to transform target selection and drug discovery. Clinical pharmacology and therapeutics 99(3): 285-297.
  26. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genet 45(10): 1113-1120.
  27. International Cancer Genome Consortium, Hudson TJ, Anderson W, Artez A, Barker AD, et al. (2010) International network of cancer genome projects. Nature 464(7291): 993-998.
  28. Cline MS, Craft B, Swatloski T, Goldman M, Ma S, et al. (2013) Exploring TCGA Pan-Cancer Data at the UCSC Cancer Genomics Browser. Scientific Reports 3: 2652.
  29. Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, et al. (2015) COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res 43(Database issue): D805-D811.
  30. Tym JE, Mitsopoulos C, Coker EA, Razaz P, Schierz AC, et al. (2016) canSAR: an updated cancer research and drug discovery knowledgebase. Nucleic Acids Res 44(D1): D938-D943.
  31. Zhang B, Wang J, Wang X, Zhu J, Liu Q, et al. (2014) Proteogenomic characterization of human colon and rectal cancer. Nature 513(7518): 382-387.
  32. Whiteaker JR, Halusa GN, Hoofnagle AN, Sharma V, MacLean B, et al. (2014) CPTAC Assay Portal: a repository of targeted proteomic assays. Nature Methods 11(7): 703-704.
  33. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, et al. (2007) Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 9(2): 166-180.
  34. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, et al. (2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2(5): 401-404.
  35. Platt R, Carnahan R (2012) The U.S. Food and Drug Administration's Mini-Sentinel Program. Pharmacoepidemiology and Drug Safety 21: 1-303.
  36. Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, et al. (2014) Launching PCORnet, a national patient-centered clinical research network. JAMIA 21(4): 578-582.
  37. Porter J, Love D, Costello A, Peters A, Rudolph B (2015) All-Payer Claims Database Development Manual: Establishing a Foundation for Health Care Transparency and Informed Decision Making. APCD Council and West Health Policy Center 96: 2397-1053.
  38. Schilsky RL, Michels DL, Kearbey AH, Yu PP, Hudis CA (2014) Building a rapid learning health care system for oncology: the regulatory framework of CancerLinQ. J Clin Oncol 32(22): 2373-2379.
  39. The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526(7571): 68-74.
  40. Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, et al. (2012) The 1000 Genomes Project: data management and community access. Nature methods 9(5): 459-462.
  41. Madduri RK, Sulakhe D, Lacinski L, Liu B, Rodriguez A, et al. (2014) Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services. Concurr Comput 26(13): 2266-2279.
  42. Goecks J, Nekrutenko A, Taylor J, Galaxy Team (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8): R86.
  43. Athey BD, Braxenthaler M, Haas M, Guo Y (2013) tranSMART: An Open Source and Community-Driven Informatics and Data Sharing Platform for Clinical and Translational Research. AMIA Jt Summits Transl Sci Proc 2013: 6-8.
  44. Kasprzyk A (2011) BioMart: driving a paradigm change in biological data management. Database 2011: bar049.
  45. Paik S, Tang G, Shak S, Kim C, Baker J, et al. (2006) Gene expression and benefit of chemotherapy in women with node-negative, estrogen receptor-positive breast cancer. J Clin Oncol 24(23): 3726-3734.
  46. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871): 530-536.
  47. Jezequel P, Campone M, Gouraud W, Guerin-Charbonnel C, Leux C, et al. (2012) bc-GenExMiner: an easy-to-use online platform for gene prognostic analyses in breast cancer. Breast Cancer Res Treat 131(3): 765-775.
  48. Hudis CA (2015) Big data: Are large prospective randomized trials obsolete in the future? Breast 24 Suppl 2: S15-S18.
  49. Agner SC, Rosen MA, Englander S, Tomaszewski JE, Feldman MD, et al. (2014) Computerized Image Analysis for Identifying Triple-Negative Breast Cancers and Differentiating Them from Other Molecular Subtypes of Breast Cancer on Dynamic Contrast-enhanced MR Images: A Feasibility Study. Radiology 272(1): 91-99.
  50. Shaha SH, Sayeed Z, Anoushiravani AA, El-Othmani MM, Saleh KJ (2016) Big Data, Big Problems: Incorporating Mission, Values, and Culture in Provider Affiliations. Orthop Clin North Am 47(4): 725-732.
  51. Chatellier G, Varlet V, Blachier-Poisson C, participants of Giens Xxxi RTN (2016) "Big data" and "open data": What kind of access should researchers enjoy? Therapie 71(1): 97-105.
  52. Frelinger JA (2015) Big Data, Big Opportunities, and Big Challenges. J Investig Dermatol Symp Proc 17(2): 33-35.
  53. Ramos EM, Hoffman D, Junkins HA, Maglott D, Phan L, et al. (2014) Phenotype-Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur J Hum Genet 22(1): 144-147.
  54. Welter D, MacArthur J, Morales J, Burdett T, Hall P, et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42(D1): D1001-D1006.
  55. Farrugia G, Weinshilboum RM (2013) Challenges in implementing genomic medicine: the Mayo Clinic Center for Individualized Medicine. Clin Pharmacol Ther 94(2): 204-206.
  56. Gottesman O, Kuivaniemi H, Tromp G, Faucett WA, Li R, et al. (2013) The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet Med 15(10): 761-771.
  57. Kho AN, Rasmussen LV, Connolly JJ, Peissig PL, Starren J, et al. (2013) Practical challenges in integrating genomic data into the electronic health record. Genet Med 15(10): 772-778.
  58. Pathak J, Kho AN, Denny JC (2013) Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. JAMIA 20(e2): e206-e211.
  59. Mittelstadt BD, Floridi L (2016) The Ethics of Big Data: Current and Foreseeable Issues in Biomedical Contexts. Sci Eng Ethics 22(2): 303-341.
  60. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y (2013) Identifying personal genomes by surname inference. Science 339(6117): 321-324.
  61. Skripcak T, Belka C, Bosch W, Brink C, Brunner T, et al. (2014) Creating a data exchange strategy for radiotherapy research: towards federated databases and anonymised public datasets. Radiother Oncol 113(3): 303-309.
  62. Schell SR (2006) Creation of clinical research databases in the 21st century: a practical algorithm for HIPAA Compliance. Surg Infect (Larchmt) 7(1): 37-44.
  63. Fernandes AC, Cloete D, Broadbent MT, Hayes RD, Chang CK, et al. (2013) Development and evaluation of a de-identification procedure for a case register sourced from mental health electronic records. BMC Med Inform Decis Mak 13: 71.
  64. Patel AA, Gilbertson JR, Showe LC, London JW, Ross E, et al. (2007) A novel cross-disciplinary multi-institute approach to translational cancer research: lessons learned from Pennsylvania Cancer Alliance Bioinformatics Consortium (PCABC). Cancer Inform 3: 255-274.
© 2014-2016 MedCrave Group, All rights reserved. No part of this content may be reproduced or transmitted in any form or by any means as per the standard guidelines of fair use.
Creative Commons License Open Access by MedCrave Group is licensed under a Creative Commons Attribution 4.0 International License.