Since the inception of Spark SQL in Spark 1.0, one of its most popular uses has been as a conduit for pulling data into the Spark platform. Early users loved Spark SQL’s support for reading data from existing Apache Hive tables as well as from the popular Parquet columnar format. We’ve since added support for other formats, such as JSON. In Spark 1.2, we’ve taken the next step to allow Spark to integrate natively with a far larger number of input sources. These new integrations are made possible through the inclusion of the new Spark SQL Data Sources API.
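To give a sense of what loading external data looked like in this era, here is a minimal Scala sketch using the Spark 1.2-era Spark SQL API to read Parquet and JSON data; the existing SparkContext `sc`, the file paths, and the table names are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext named `sc`; the file paths below are hypothetical.
val sqlContext = new SQLContext(sc)

// Load a Parquet file and a JSON file into SchemaRDDs.
val parquetData = sqlContext.parquetFile("hdfs:///data/events.parquet")
val jsonData    = sqlContext.jsonFile("hdfs:///data/events.json")

// Register them as temporary tables so they can be queried with SQL.
parquetData.registerTempTable("parquet_events")
jsonData.registerTempTable("json_events")

// Run a SQL query against the Parquet-backed table and print the results.
val recent = sqlContext.sql("SELECT * FROM parquet_events WHERE year = 2014")
recent.collect().foreach(println)
```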
Today, we are happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install packages for any version of Spark, and for developers to contribute packages.
We at Databricks are thrilled to announce the release of Spark 1.2! Spark 1.2 introduces many new features along with scalability, usability and performance improvements. This post will introduce some key features of Spark 1.2 and provide context on the priorities of Spark for this and the next release. In the next two weeks, we’ll be publishing blog posts with more details on feature additions in each of the major components. Spark 1.2 has been posted today on the Apache Spark website.
In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, data science techniques are fast becoming core components of large-scale data processing pipelines.
Samsung SDS is the business and IT solutions arm of Samsung Group. A global ICT service provider with over 17,000 employees worldwide and 6.7 billion USD in revenues, Samsung SDS tackles the challenges of some of the largest global enterprises in such industries as manufacturing, financial services, health care and retail.
More and more companies are using Apache Spark, and many Spark-based pilots are currently moving into production. On social media and at every big data conference or meetup, people describe new proofs of concept, prototypes, and production deployments built on Spark.
A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the benchmark committee and we have officially won the Daytona GraySort contest!
Apache Spark has seen phenomenal adoption, being widely viewed as the successor to Hadoop MapReduce and being deployed in clusters ranging from a handful of nodes to thousands. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large-scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes.
The brain is the most complicated organ of the body, and probably one of the most complicated structures in the universe. Its millions of neurons somehow work together to endow organisms with the extraordinary ability to interact with the world around them. Things our brains control effortlessly, such as kicking a ball or reading and understanding this sentence, have proven extremely hard to implement in a machine.
Two powerful features of Apache Spark are its native APIs in Scala, Java, and Python, and its compatibility with any Hadoop-based input or output source. This language support means that users can quickly become proficient with Spark even without experience in Scala, and can also leverage the extensive set of third-party libraries available (for example, the many data analysis libraries for Python). An illustrative sketch of the Hadoop compatibility follows below.
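As an illustration of that Hadoop compatibility, the following is a minimal Scala sketch that reads the same data both through `textFile` and through an explicit Hadoop InputFormat, then writes results back to a Hadoop-compatible sink; the application name and paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

object HadoopInputExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HadoopInputExample"))

    // Any Hadoop-supported URI works as a source: HDFS, S3, local files, and so on.
    val lines = sc.textFile("hdfs:///logs/access.log")

    // The same data can also be read through an explicit Hadoop InputFormat.
    val viaInputFormat = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs:///logs/access.log")
    println(s"Records via InputFormat: ${viaInputFormat.count()}")

    // A simple word count over the lines, written back to a Hadoop-compatible sink.
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///output/word-counts")

    sc.stop()
  }
}
```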