Apache® Spark™ News

Detecting Abuse at Scale: Locality Sensitive Hashing at Uber Engineering

With 5 million Uber trips taken daily by users worldwide, it is important for Uber engineers to ensure that data is accurate. If used correctly, metadata and aggregate data can quickly detect platform abuse, from spam to fake accounts and payment fraud. Amplifying the right data signals makes detection more precise and thus more reliable. In this article, we will demonstrate how locality sensitive hashing (LSH) is used by Uber to detect fraudulent trips at scale.

Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters

Today, we are pleased to offer TensorFlowOnSpark to the community, our latest open source framework for distributed deep learning on big-data clusters. Deep learning (DL) has evolved significantly in recent years. At Yahoo, we’ve found that in order to gain insight from massive amounts of data, we need to deploy distributed deep learning. Existing DL frameworks often require us to set up separate clusters for deep learning, forcing us to create multiple programs for a machine learning pipeline (see Figure 1 below). Having separate clusters requires us to transfer large datasets between them, introducing unwanted system complexity and end-to-end learning latency.

Mesosphere Infinity: You’re 4 Words Away from a Complete Big Data System

One command, four words and users will have a next-generation big data system in place, capable of processing the streams of information flowing into their companies every second of every day. That’s the promise we’re making with the announcement of Mesosphere Infinity, a new product that combines a best-of-breed real-time analytics stack into a single package in our Datacenter Operating System (DCOS).

A Spark is Lit in HDInsight

Apache Spark has garnered a lot of developer attention and is often at the top of the agenda in my customer interactions. Since we announced support for Spark in HDP, we have seen broad customer adoption of our Spark offering. Our customers love Spark for the simplicity of its API, its speed of development and its runtime performance. Spark is also democratizing Machine Learning, making it easier and more approachable for more developers. Today Microsoft announced support for Spark in HDInsight – this is a big step towards driving customer adoption for Spark workloads on Hadoop clusters in Azure.

Leaf in the Wild: Stratio Integrates Apache Spark and MongoDB to Unlock New Customer Insights for One of World’s Largest Banks

There is no question that Apache Spark is on fire. It’s the most active big data project in the Apache Software Foundation, and was recently “blessed” by IBM, which committed 3,500 engineers to advancing it. While some are still confused about what it is, or are claiming it will kill Hadoop (which it won’t, or at least not the non-MapReduce parts of it), there are already companies today harnessing its power to build next generation analytics applications. Stratio are one such company. With an impressive client list including BBVA, Just Eat, Santander, SAP, Sony and Telefonica, Stratio claims more projects and clients with its Apache Spark-certified Big Data (BD) platform than pretty much anyone else.

Couchbase Spark Connector 1.0 Beta Release

More or less exactly two months after the second developer preview, I'm delighted to announce that we've shipped the first (and hopefully only) beta release of the Couchbase Spark Connector. It is a major step forward, bringing Spark 1.4 support as well as official documentation and lots of smaller enhancements. In particular: ....