Apache® Spark™ News

New Features in Machine Learning Pipelines in Spark 1.4

Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark 1.4, significantly extends the ML library. In this post, we highlight several new features in the ML Pipelines API, including:

Huawei Opens Astro to Boost Spark

Huawei announced that the Spark SQL on HBase package is now open source and available. Dubbed Astro, the end-to-end package combines the capabilities of Spark, Spark SQL, and HBase. It helps drive Spark adoption across a broad NoSQL customer base and provides powerful online query and analytics capabilities for large-scale data processing in vertical enterprises.

Petabyte-Scale Text Processing with Spark

At Grammarly, we have long used Amazon EMR with Hadoop and Pig to support our big data processing needs. However, we were really excited about the improvements that the maturing Apache Spark offers over Hadoop and Pig, and so we set about getting Spark to work with our petabyte-scale text data set. In this post, we describe the challenges we faced in the process and the scalable working Spark setup that we arrived at as a result. See more at: http://tech.grammarly.com/blog/posts/Petabyte-Scale-Text-Processing-with-Spark.html

Huawei Bears Open Source Gifts From China

Chinese technology giant Huawei has frequently been the subject of suspicion and sanction, particularly in the United States. But it’s also a company that produces key pieces of technology infrastructure, and an active contributor to various international open source initiatives. This week, at OSCON in Portland, Huawei announced the release of a new open source project, Astro. Astro tightly integrates the database capabilities of Apache HBase with the online query and analytics power of Apache Spark, potentially bringing Spark-powered data science a step closer to the huge structured data stores locked up inside many global enterprises.

How big data analytics help hospitals stop a killer

ClearStory Data, powered by Apache Spark, provides fast-cycle, near real-time measurements on massive volumes of biosensor data. The data is analyzed by algorithms modeled after the clinical practice standards used in conventional human clinical monitoring disciplines. A patient "storyboard" identifies patients who might be at risk, based on the biosensor measurements, and alerts clinicians to them. Serum level testing can then be used to confirm the presence of systemic inflammatory response syndrome (SIRS) and/or sepsis.