Apache® Spark™ News

Taking Your Spark to Production Scale

At the Spark Summit 2015 conference held in San Francisco, Anil Gadre, Senior Vice President of Product Management for MapR, presented a featured keynote titled "Spark & Hadoop at Production Scale" where he highlighted how leading companies are deploying Spark with Hadoop in production. During his talk, he shared real-life customer examples of turning data into action using Spark and Hadoop, and he also discussed how advanced users are deploying Hadoop and Spark applications in one cluster with better reliability and performance at production scale.

Introducing Window Functions in Spark SQL

In this blog post, we introduce the new window function feature that was added in Spark 1.4. Window functions allow users of Spark SQL to calculate results such as the rank of a given row or a moving average over a range of input rows. They significantly improve the expressiveness of Spark’s SQL and DataFrame APIs. This blog will first introduce the concept of window functions and then discuss how to use them with Spark SQL and Spark’s DataFrame API.
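For a flavor of the API, here is a minimal sketch (assuming a Spark 1.4 spark-shell session and a small, hypothetical sales DataFrame) that computes a per-category rank and a moving average with the DataFrame window functions; see the post itself for the full treatment.

```scala
// Runs in spark-shell (Spark 1.4+); `sqlContext` is provided by the shell.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Hypothetical sales data: (product, category, revenue).
val sales = Seq(
  ("thin",       "cell phone", 6000),
  ("ultra thin", "cell phone", 5000),
  ("very thin",  "cell phone", 6000),
  ("normal",     "tablet",     1500),
  ("mini",       "tablet",     5500),
  ("big",        "tablet",     2500)
).toDF("product", "category", "revenue")

// Rank products by revenue within each category.
val byCategory = Window.partitionBy("category").orderBy($"revenue".desc)
sales.select($"product", $"category", $"revenue",
  rank().over(byCategory).alias("rank")).show()

// Moving average over the current row and the two rows before it.
sales.select($"product", $"category", $"revenue",
  avg($"revenue").over(byCategory.rowsBetween(-2, 0)).alias("moving_avg")).show()
```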

Four Things to Know about Reliable Spark Streaming with Typesafe and Databricks

Last week, we were happy to have a Typesafe co-webinar with Databricks, the company founded by the team that started the Spark research project at UC Berkeley that later became Apache Spark. Our Big Data Architect Dean Wampler and Databricks' Lead Engineer for Spark Streaming, Tathagata Das (TD), gave a one-hour presentation with Q&A on Spark Streaming, which makes it easy to build scalable, fault-tolerant streaming applications with Apache Spark. The webinar reviewed four things to know about reliable Spark Streaming; see the full recap at https://www.typesafe.com/blog/four-things-to-know-about-reliable-spark-streaming-typesafe-databricks
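The webinar itself is the best reference, but as a rough illustration of the fault-tolerance theme, here is a minimal sketch (not taken from the webinar) of a Spark Streaming word count that uses checkpointing so the driver can recover after a failure; the application name, socket source, port, and checkpoint path are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ResilientWordCount {
  // Placeholder checkpoint location and socket source; adjust for your environment.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("resilient-word-count")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)  // enables driver recovery and stateful operations

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from checkpoint data after a driver failure,
    // or create a fresh one on the first run.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```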

Python Versus R in Apache Spark

The June update to Apache Spark brought support for R, a significant enhancement that opens the big data platform to a large audience of new potential users. Support for R in Spark 1.4 also gives users an alternative to Python. But which language will emerge as the winner for doing data science in Spark? We spoke to Databricks' Ali Ghodsi for answers.

Configuring and Deploying Apache Spark

I gave this talk at the inaugural SF Spark and Friends Meetup in San Francisco during the week of this year's Spark Summit. While researching the talk, I realized there is very little material out there giving an overview of the many rich options for deploying and configuring Apache Spark. There are some vendor-specific articles targeting YARN, DSE, and so on, but I think what developers really want is a broad overview. So this post gives you that overview, though you will have to look through the slides here to dig into the meat of it. ...
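As a small taste of the configuration surface (not taken from the talk), here is a sketch of setting a few common options programmatically through SparkConf; the values are illustrative only, and the same settings can also be supplied via spark-submit flags or spark-defaults.conf.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A few common knobs; the values here are illustrative, not recommendations.
val conf = new SparkConf()
  .setAppName("deploy-config-demo")
  .setMaster("local[4]")               // or "yarn-client", "spark://host:7077", "mesos://..."
  .set("spark.executor.memory", "2g")  // per-executor heap
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.eventLog.enabled", "true")

val sc = new SparkContext(conf)
println(sc.getConf.toDebugString)      // print the resolved configuration
sc.stop()
```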

A Spark is Lit in HDInsight

Apache Spark has garnered a lot of developer attention and is often at the top of the agenda in my customer interactions. Since we announced support for Spark in HDP, we have seen broad customer adoption of our Spark offering. Our customers love Spark for the simplicity of its API, its speed of development, and its runtime performance. Spark is also democratizing machine learning, making it more approachable to more developers. Today Microsoft announced support for Spark in HDInsight, a big step toward driving customer adoption of Spark workloads on Hadoop clusters in Azure.

How-to: Do Data Quality Checks using Apache Spark DataFrames

Apache Spark’s ability to support data quality checks via DataFrames is progressing rapidly. This post explains the state of the art and future possibilities. Apache Hadoop and Apache Spark make Big Data accessible and usable so we can easily find value, but the data has to be correct first. This post focuses on that problem and how to solve it with DataFrames in Apache Spark 1.3 and 1.4. (Note: although relatively new to Spark and thus not yet supported by Cloudera at the time of this writing, DataFrames are highly worthy of exploration and experimentation. Learn more about Cloudera’s support for Apache Spark here.)
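As a rough illustration of the kinds of checks the post discusses, here is a minimal sketch (assuming a Spark 1.3+ spark-shell session and a small, hypothetical orders DataFrame) that uses plain DataFrame operations for completeness, validity, and uniqueness checks.

```scala
// Runs in spark-shell (Spark 1.3+); `sqlContext` is provided by the shell.
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Hypothetical orders data with a few deliberate problems.
val orders = Seq(
  (1, Some("alice"), 25.0),
  (2, None,          40.0),   // missing customer
  (3, Some("bob"),   -5.0),   // non-positive amount
  (3, Some("carol"), 12.5)    // duplicate order_id
).toDF("order_id", "customer_id", "amount")

// Completeness: rows missing a customer_id.
val missingCustomers = orders.filter($"customer_id".isNull).count()

// Validity: amounts must be positive.
val badAmounts = orders.filter($"amount" <= 0).count()

// Uniqueness: order_id should identify a single row.
val duplicateIds = orders.groupBy("order_id").count().filter($"count" > 1).count()

println(s"missing customers: $missingCustomers, bad amounts: $badAmounts, duplicate ids: $duplicateIds")

// describe() (Spark 1.3.1+) gives quick summary statistics for numeric columns.
orders.describe("amount").show()
```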

Leaf in the Wild: Stratio Integrates Apache Spark and MongoDB to Unlock New Customer Insights for One of World’s Largest Banks

There is no question that Apache Spark is on fire. It’s the most active big data project in the Apache Software Foundation, and was recently “blessed” by IBM, which committed 3,500 engineers to advancing it. While some are still confused about what it is, or claim it will kill Hadoop (it won’t, or at least not the non-MapReduce parts of it), there are already companies harnessing its power to build next-generation analytics applications. Stratio is one such company. With an impressive client list including BBVA, Just Eat, Santander, SAP, Sony and Telefonica, Stratio claims more projects and clients with its Apache Spark-certified Big Data (BD) platform than pretty much anyone else.

Apache Spark in the Enterprise and in China

IBM’s announcements at the recent Spark Summit in SF bode well for enterprise adoption of Spark. Ben Horowitz jokingly referred to IBM’s endorsement as akin to a rabbi blessing Spark as kosher for use in an enterprise. I recently sat down with a set of luminaries at the Spark Summit and asked them how Spark is perceived in enterprises. Below is a selection of responses...