Dummy-Spark Release: pure python mock of spark for testing

In a previous post, I mentioned a small project that provides a pure-Python mock of Apache Spark’s RDD object for testing and quick prototyping. Thanks to help from contributors, a good portion of the RDD API is now supported, including reading from Elasticsearch through the new Hadoop API (newAPIHadoopRDD in PySpark) with elasticsearch-hadoop, and pulling files from S3.

I’ve just published the v0.0.2 release, which can be installed with:

pip install dummyrdd==0.0.2

And used like:

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)
rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())
print(rdd.map(lambda x: x**2).collect())

In the new release, we’ve added two small bits of functionality:

Support for the new Hadoop API (newAPIHadoopRDD in PySpark) with elasticsearch-hadoop, mocked using elasticsearch-py. The functionality and the format of the returned records are intended to match the real thing 1-to-1, so PySpark programs that query ES into RDDs can be tested against the mock.
repartition, implemented for all RDDs.
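
For reference, this is roughly how an Elasticsearch read looks in PySpark with elasticsearch-hadoop. It is a sketch only: es_read_kwargs and read_index are hypothetical helpers, the index name and host are placeholders, and it assumes the mock accepts the same keyword arguments as PySpark's newAPIHadoopRDD.

```python
def es_read_kwargs(resource, nodes="localhost", port="9200"):
    """Build keyword arguments for sc.newAPIHadoopRDD (hypothetical helper)."""
    return {
        # Class names as used by elasticsearch-hadoop with PySpark.
        "inputFormatClass": "org.elasticsearch.hadoop.mr.EsInputFormat",
        "keyClass": "org.apache.hadoop.io.NullWritable",
        "valueClass": "org.elasticsearch.hadoop.mr.LinkedMapWritable",
        # elasticsearch-hadoop settings: what to read and where the cluster is.
        "conf": {"es.resource": resource, "es.nodes": nodes, "es.port": port},
    }


def read_index(sc, resource):
    """Return an RDD of documents from the given index/type resource."""
    # Assumption: sc is either a pyspark or a dummy_spark SparkContext.
    return sc.newAPIHadoopRDD(**es_read_kwargs(resource))
```

Keeping the arguments in one helper means the same call can be pointed at a real cluster or at the mock without touching the job code.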

These are in addition to the large list of implemented methods, which can be found in the README on GitHub:

https://github.com/wdm0006/DummyRDD

The post Dummy-Spark Release: pure python mock of spark for testing appeared first on Will’s Noise.