Apache® Spark™ News

How-to: Do Data Quality Checks using Apache Spark DataFrames

Apache Spark’s ability to support data quality checks via DataFrames is progressing rapidly. This post explains the state of the art and future possibilities. Apache Hadoop and Apache Spark make Big Data accessible and usable so we can easily find value, but that data has to be correct, first. This post will focus on this problem and how to solve it with Apache Spark 1.3 and Apache Spark 1.4 using DataFrames. (Note: although relatively new to Spark and thus not yet supported by Cloudera at the time of this writing, DataFrames are highly worthy of exploration and experimentation. Learn more about Cloudera’s support for Apache Spark here.)