Dummy-Spark Release: a pure-Python mock of Spark for testing

In a previous post, I mentioned a little project to provide a pure-Python mock of Apache Spark’s RDD object for testing and quick prototyping.  Thanks to some help from contributors, we’ve made good progress, and a good bit of the RDD API is now supported, including reading from Elasticsearch through the new Hadoop API (newAPIHadoopRDD) with elasticsearch-hadoop, and pulling files from S3.
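
As a quick taste of the S3 piece, something like the following should work. The bucket and key here are hypothetical, and I’m assuming the mock resolves s3:// paths itself rather than through the JVM, so check the README for the exact path scheme it expects:

from dummy_spark import SparkContext, SparkConf

sc = SparkContext(master='', conf=SparkConf())

# hypothetical bucket and key; the mock reads the object and exposes its
# lines as an RDD, mirroring pyspark's textFile
lines = sc.textFile('s3://some-bucket/some/key.txt')
print(lines.count())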

I’ve just published the v0.0.2 release, which can be installed with:

pip install dummyrdd==0.0.2

And used like:

from dummy_spark import SparkContext, SparkConf

# a mock configuration and context, mirroring the pyspark API
sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# build an RDD from a local Python list, as with pyspark's parallelize
rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())                        # 5
print(rdd.map(lambda x: x**2).collect())  # [1, 4, 9, 16, 25]

In the new release, we’ve added two small bits of functionality:

  • new Hadoop API (newAPIHadoopRDD) support for elasticsearch-hadoop, mocked using elasticsearch-py.  The functionality and returned format should match the real thing 1-to-1, for testing out pyspark programs that query Elasticsearch into RDDs (see the sketch after this list).
  • repartition implemented for all RDDs (also shown at the end of the sketch below)
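
Here’s a minimal sketch of the Elasticsearch piece in use. I’m following the real pyspark newAPIHadoopRDD signature and the standard elasticsearch-hadoop class names and conf keys; the node address, index, and query below are hypothetical placeholders, so check the README for exactly which keys the mock honors:

from dummy_spark import SparkContext, SparkConf

sc = SparkContext(master='', conf=SparkConf())

# standard elasticsearch-hadoop configuration; all values are placeholders
es_conf = {
    'es.nodes': 'localhost',
    'es.port': '9200',
    'es.resource': 'my_index/my_type',
    'es.query': '{"query": {"match_all": {}}}',
}

# the mock runs the query through elasticsearch-py instead of the JVM,
# but the call and the returned RDD are meant to look identical
rdd = sc.newAPIHadoopRDD(
    inputFormatClass='org.elasticsearch.hadoop.mr.EsInputFormat',
    keyClass='org.apache.hadoop.io.NullWritable',
    valueClass='org.elasticsearch.hadoop.mr.LinkedMapWritable',
    conf=es_conf,
)
print(rdd.collect())  # (doc id, document) pairs, as with real pyspark

# repartition is mocked as well: same elements back, no real shuffling
print(rdd.repartition(4).count())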

These are in addition to the long list of already-implemented methods, which can be found in the README on GitHub:

https://github.com/wdm0006/DummyRDD
