'Data Lake' refers to the emerging practice in large enterprises of storing all potentially relevant data in a Hadoop infrastructure for later analytics. Data Lakes promise to play a vital role in data analytics, and numerous vendors market them as an essential part of a comprehensive Big Data strategy. Gartner recently noted that this approach is susceptible to problems with governance, provenance, curation, and access control, and that it would be very helpful if the data were self-describing. Gartner therefore recommended strategies for adding semantic consistency to a Data Lake.
We will present a Semantic Data Lake project, architected on top of Hadoop, that accepts data of any type (e.g. CSV files, JSON, JSON-LD, XML, unstructured text). The project includes a semantic layer that leverages a distributed, parallel semantic indexing engine. This semantically indexed Data Lake can be accessed via MapReduce, Apache Spark, and SPARQL.
The project's use case was developed for a hospital chain that already complies with the Affordable Care Act (ACA) but needed a Data Lake that could provide (predictive) analytics for population research and personalized medicine. The resulting Data Lake contains internal data, data from other hospitals in the same region, and publicly available data such as a drug database and clinical trials. All data in the Semantic Data Lake has been curated and transformed to fit ontologies and vocabularies such as MeSH, SNOMED, and UMLS. In addition, all temporal relationships in the hospital data are preserved to support causal analytics.
During this presentation we will discuss and demonstrate how to add intelligence to a Data Lake in order to deliver advanced predictive analytics for Affordable Care Act compliance and cost savings.