Spark and SPARQL for the Intelligent Data Lake
Jans Aasman
Franz Inc


Wednesday, August 19, 2015
11:45 AM - 12:30 PM

Level:  Technical - Intermediate

'Data Lake' refers to the new practice in large enterprises of storing all potentially relevant data in a Hadoop infrastructure for later analytics. Data Lakes promise to play a vital role in data analytics, and numerous vendors are marketing Data Lakes as an essential part of a comprehensive Big Data strategy. Gartner recently noted that this approach is susceptible to problems with governance, provenance, curation, and access control, and that it would be very helpful if the data were self-describing. Gartner therefore recommended strategies to add semantic consistency to a Data Lake.

We will present a Semantic Data Lake project, architected on top of Hadoop, that takes as input any data type (e.g., CSV files, JSON, JSON-LD, XML, unstructured text). The project includes a semantic layer that leverages a distributed parallel semantic indexing engine. This semantically indexed Data Lake can be accessed via MapReduce, Apache Spark, and SPARQL.

The project use case was developed for a hospital chain that already adheres to the Affordable Care Act (ACA) but needed a Data Lake that could provide (predictive) analytics for population research and personalized medicine. The resulting Data Lake contains internal data, data from other hospitals in the same region, and publicly available data such as a drug database and clinical trials. All data in the Semantic Data Lake has been curated and transformed to fit ontologies and vocabularies such as MeSH, SNOMED, and UMLS. In addition, all temporal relationships in the hospital data are preserved to support causal analytics.

During this presentation we will discuss and demonstrate how to add intelligence to a Data Lake to deliver advanced predictive analytics for Affordable Care Act compliance and cost savings.

Jans Aasman started his career as an experimental and cognitive psychologist, earning his PhD in cognitive science with a detailed model of car driver behavior using Lisp and Soar. He has spent most of his professional life in telecommunications research, specializing in intelligent user interfaces and applied artificial intelligence projects. From 1995 to 2004, he was also a part-time professor in the Industrial Design department of the Technical University of Delft. Jans is currently the CEO of Franz Inc., the leading supplier of commercial, persistent, and scalable RDF database products that provide the storage layer for powerful reasoning and ontology modeling capabilities for Semantic Web applications.
