Spark 1.2 introduced Machine Learning (ML) Pipelines to facilitate the creation, tuning, and inspection of practical ML workflows. Spark’s latest release, Spark 1.4, significantly extends the ML library. In this post, we highlight several new features in the ML Pipelines API, including:
Apache® Spark™ News
Huawei announced that the Spark SQL on HBase package is now open source and available. Dubbed Astro, the end-to-end package combines the capabilities of Spark, Spark SQL and HBase, helps drive Spark adoption in a broad NoSQL customer base, and provides powerful online query and analytics capabilities for large-scale data processing in vertical enterprises.
At Grammarly, we have long used Amazon EMR with Hadoop and Pig in support of our big data processing needs. However, we were really excited about the improvements that the maturing Apache Spark offers over Hadoop and Pig, and so set about getting Spark to work with our petabyte text data set. In this post, we describe the challenges we had in the process and a scalable working setup of Spark that we have discovered as a result.
Chinese technology giant Huawei has frequently been the subject of suspicion and sanction, particularly in the United States. But it’s also a company that produces key pieces of technology infrastructure, and an active contributor to various international open source initiatives. This week, at OSCON in Portland, Huawei announced the release of a new open source project, Astro. Astro tightly integrates the database capabilities of Apache HBase with the online query and analytics power of Apache Spark, potentially bringing Spark-powered data science a step closer to the huge structured data stores locked up inside many global enterprises.
ClearStory Data, powered by Apache Spark, provides fast-cycle, near real-time measurements on the massive volumes of biosensor data analyzed by algorithms modeled after clinical practice standards used in conventional human clinical monitoring disciplines. A patient "storyboard" identifies and alerts clinicians to patients who might be at risk based upon the biosensor measures. Serum level testing can then be used to confirm the presence of SIRS and/or sepsis.
Apache Spark 1.4.1 is out - download and upgrade today!
This is a joint blog post with our partner Hortonworks. Zhan Zhang is a member of technical staff at Hortonworks, where he collaborated with the Databricks team on this new feature.
IBM today pledged it would devote 3500 researchers to the open source big data project, Apache Spark. It also announced that it was open sourcing its own IBM SystemML machine learning technology in a move designed to help push it to the forefront of big data and machine learning.
At Collective, we are working not only on cool things like Machine Learning and Predictive Modeling, but also on reporting that can be tedious and boring. However, at our scale even a simple reporting application can become a challenging engineering problem. This post is based on a talk that I gave at the NY-Scala Meetup. Slides are available here.