Spark 2.0 was released in July 2016. Changes from the release notes:

  • Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
  • SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.

Documentation

Setup

$ virtualenv --python=python2.7 ve
$ . ve/bin/activate
$ pip install pyspark

PySpark

pyspark

pyspark is a Python module and a command line tool.

import pyspark
from pyspark.sql import SparkSession

# A SparkSession can be used to create DataFrames, register DataFrames as tables,
# execute SQL over tables, cache tables, and read parquet files.
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# A SparkContext represents the connection to a Spark cluster,
# and can be used to create RDDs and broadcast variables on that cluster.
sc = spark.sparkContext

The pyspark command line tool creates a SparkSession object named spark and a SparkContext object named sc for you, as if the above code were executed at startup.

jobs
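
A standalone script can be run with spark-submit instead of the interactive shell. A minimal sketch, assuming a hypothetical file job.py and the data.json used below:

$ cat job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read.json('data.json')
print(df.count())
spark.stop()

$ spark-submit job.py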

DataFrames

reading

The SparkSession object has a read attribute containing a DataFrameReader object. This object in turn has methods for creating a DataFrame object from files on the file system:
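
A sketch of the common readers; the paths here are hypothetical:

>>> df = spark.read.json('data.json')       # newline-delimited JSON
>>> df = spark.read.csv('data.csv', header=True, inferSchema=True)
>>> df = spark.read.parquet('data.parquet')
>>> df = spark.read.text('data.txt')        # single string column named "value"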

rows and columns

DataFrame

A DataFrame object represents a relational set of data. count returns the number of rows:

$ cat data.json
{"name": "John", "age": 32}
{"name": "Mary", "age": 27}

$ pyspark
>>> df = spark.read.json('data.json')
>>> df.count()
2

>>> df.columns
['age', 'name']
>>> df.age
Column<age>
>>> df['age']
Column<age>
>>> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

>>> df.head(1)
[Row(age=32, name=u'John')]
>>> df.show()
+---+----+
|age|name|
+---+----+
| 32|John|
| 27|Mary|
+---+----+

collect

to a list of rows

from a list of rows with parallelize
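
A minimal sketch of both directions, using the data.json from above:

>>> rows = df.collect()                      # to a list of Row objects
>>> len(rows)
2
>>> rdd = sc.parallelize(rows)               # distribute the list as an RDD
>>> df2 = spark.createDataFrame(rdd)         # back to a DataFrame
>>> df2.count()
2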

filter

>>> df.filter("age > 30").collect()
[Row(age=32, name=u'John')]

>>> df.filter(df.age > 30).collect()
[Row(age=32, name=u'John')]

The first form takes a string which is evaluated as a SQL expression; columns are referred to by name as identifiers in the expression.

The second form is a bit surprising: the pyspark.sql.column.Column class overloads comparison and arithmetic operators to return pyspark.sql.column.Column values, so df.age > 30 is a column expression rather than a boolean.
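
Because the overloaded operators return Column objects, conditions can be combined with & and |; the parentheses are required since these operators bind more tightly than the comparisons:

>>> df.filter((df.age > 30) & (df.name == 'John')).collect()
[Row(age=32, name=u'John')]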

select

The select method takes as arguments strings or pyspark.sql.column.Column objects. If an argument is a string, it must refer to one of the columns:

>>> df.select('age').collect()
[Row(age=32), Row(age=27)]
>>> df.select(df.age).collect()
[Row(age=32), Row(age=27)]
>>> df.select('*').collect()
[Row(age=32, name=u'John'), Row(age=27, name=u'Mary')]
>>> df.select('*', df.age + 1).collect()
[Row(age=32, name=u'John', (age + 1)=33), Row(age=27, name=u'Mary', (age + 1)=28)]

The selectExpr method takes as arguments strings, each of which is evaluated as a SQL expression:

>>> df.selectExpr('*', 'age + 1').collect()
[Row(age=32, name=u'John', (age + 1)=33), Row(age=27, name=u'Mary', (age + 1)=28)]

groupBy

groupBy takes as arguments pyspark.sql.column.Column objects, or strings that refer to columns by name; it returns a GroupedData object with aggregation methods such as avg:

$ cat data.json
{"name": "John", "age": 32, "children": {"Billy": 7, "Sally": 3}, "sex": "male"}
{"name": "Mary", "age": 27, "children": {"Rocco": 2}, "sex": "female"}
{"name": "Lynn", "age": 42, "children": {}, "sex": "female"}

$ pyspark
>>> df = spark.read.json('data.json')
>>> df.groupBy('sex').avg('age').collect()
[Row(sex=u'female', avg(age)=34.5), Row(sex=u'male', avg(age)=32.0)]

join
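
A minimal sketch, joining on the name column with a second, hypothetical DataFrame built in place:

>>> cities = spark.createDataFrame([('John', 'Boston'), ('Mary', 'Denver')],
...                                ['name', 'city'])
>>> df.join(cities, on='name', how='inner').count()
2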

orderBy

also limit, sort
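
A sketch against the DataFrame from the groupBy example; sort is an alias for orderBy, and limit keeps only the first n rows:

>>> df.select('name', 'age').orderBy('age').collect()
[Row(name=u'Mary', age=27), Row(name=u'John', age=32), Row(name=u'Lynn', age=42)]
>>> df.select('name', 'age').sort(df.age.desc()).limit(1).collect()
[Row(name=u'Lynn', age=42)]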

explode

The function pyspark.sql.functions.explode can be used to deal with data that isn't in 1st normal form:

$ cat data.json
{"name": "John", "age": 32, "children": ["Billy", "Sally"]}
{"name": "Mary", "age": 27, "children": ["Rocco"]}

$ pyspark
>>> import pyspark.sql.functions as f
>>> df = spark.read.json('data.json')
>>> df.select('name', 'age', f.explode('children').alias('child')).collect()
[Row(name=u'John', age=32, child=u'Billy'), Row(name=u'John', age=32, child=u'Sally'), Row(name=u'Mary', age=27, child=u'Rocco')]

Here is an example of using split and explode to deal with data not in 1st normal form when it is serialized as a string. The 2nd argument to split is a regular expression:

$ cat data.json
{"name": "John", "age": 32, "children": "Billy|Sally"}
{"name": "Mary", "age": 27, "children": "Rocco"}

>>> import pyspark.sql.functions as f
>>> df = spark.read.json('data.json')
>>> df.select('name', 'age', f.explode(f.split(df['children'], '\|')).alias('child')).collect()
[Row(name=u'John', age=32, child=u'Billy'), Row(name=u'John', age=32, child=u'Sally'), Row(name=u'Mary', age=27, child=u'Rocco')]

Explode can also deal with dictionaries:

$ cat data.json
{"name": "John", "age": 32, "children": {"Billy": 7, "Sally": 3}}
{"name": "Mary", "age": 27, "children": {"Rocco": 2}}

>>> import pyspark.sql.functions as f
>>> import pyspark.sql.types as t
>>> df = spark.read.json(
    'data.json',
    schema=t.StructType(
        [
            t.StructField('name', t.StringType()),
            t.StructField('age', t.IntegerType()),
            t.StructField('children', t.MapType(t.StringType(), t.IntegerType()))
        ]
    )
)
>>> df.collect()
[Row(name=u'John', age=32, children={u'Billy': 7, u'Sally': 3}), Row(name=u'Mary', age=27, children={u'Rocco': 2})]
>>> df.select('name', 'age', f.explode(df['children']).alias('child_name', 'child_age')).collect()
[Row(name=u'John', age=32, child_name=u'Billy', child_age=7), Row(name=u'John', age=32, child_name=u'Sally', child_age=3), Row(name=u'Mary', age=27, child_name=u'Rocco', child_age=2)]

The opposite of exploding can be done with the collect_list or the collect_set functions. The latter de-dupes the list:

$ cat data.json
{"name": "John", "age": 32, "children": {"Billy": 7, "Sally": 3}, "sex": "male"}
{"name": "Mary", "age": 27, "children": {"Rocco": 2}, "sex": "female"}
{"name": "Lynn", "age": 42, "children": {}, "sex": "female"}
{"name": "John", "age": 67, "children": {"Ruth": 34}, "sex": "male"}

$ pyspark
>>> import pyspark.sql.functions as f
>>> df = spark.read.json('data.json')
>>> df.groupBy('sex').agg(f.collect_list('name')).show()
+------+------------------+
|   sex|collect_list(name)|
+------+------------------+
|female|      [Mary, Lynn]|
|  male|      [John, John]|
+------+------------------+

>>> df.groupBy('sex').agg(f.collect_set('name')).show()
+------+-----------------+
|   sex|collect_set(name)|
+------+-----------------+
|female|     [Mary, Lynn]|
|  male|           [John]|
+------+-----------------+

One can also create a map:

$ cat data.json
{"name": "John", "age": 32, "children": {"Billy": 7, "Sally": 3}, "sex": "male"}
{"name": "Mary", "age": 27, "children": {"Rocco": 2}, "sex": "female"}
{"name": "Lynn", "age": 42, "children": {}, "sex": "female"}
{"name": "Bill", "age": 67, "children": {"Ruth": 34}, "sex": "male"}

$ pyspark
>>> import pyspark.sql.functions as f
>>> df = spark.read.json('data.json')
>>> df.groupBy('sex').agg(f.collect_list(f.create_map('name', 'age'))).show()
+------+----------------------------+
|   sex|collect_list(map(name, age))|
+------+----------------------------+
|female|        [[Mary -> 27], [L...|
|  male|        [[John -> 32], [B...|
+------+----------------------------+

writing

The write attribute returns a DataFrameWriter object with methods for writing the DataFrame out to the file system:
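
A sketch of the common writers; the output paths are hypothetical and must not already exist unless a mode such as 'overwrite' is set:

>>> df.write.json('out.json')
>>> df.write.parquet('out.parquet')
>>> df.write.csv('out.csv', header=True)
>>> df.write.mode('overwrite').parquet('out.parquet')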

SQL
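
As mentioned above, a SparkSession can register a DataFrame as a table and execute SQL over it. A minimal sketch:

>>> df.createOrReplaceTempView('people')
>>> adults = spark.sql('SELECT name, age FROM people WHERE age > 30')
>>> adults.columns
['name', 'age']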

RDD

RDD Programming Guide
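
A minimal sketch of creating an RDD with parallelize, applying transformations, and collecting the result:

>>> rdd = sc.parallelize(range(10))
>>> rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
[0, 4, 16, 36, 64]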

UDF

udf
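
A sketch of a user-defined function created with pyspark.sql.functions.udf, which takes a Python function and a return type from pyspark.sql.types (StringType by default):

>>> import pyspark.sql.functions as f
>>> import pyspark.sql.types as t
>>> upper_name = f.udf(lambda s: s.upper(), t.StringType())
>>> df.select(upper_name(df['name']).alias('upper_name')).columns
['upper_name']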

PY4J

https://www.py4j.org/