Skip to content

Additional stateless operators

While projection and selection are the most common stateless operators, you can use most Spark functions in Structured Streaming. The following additional Spark functions are examples of stateless operators that you can use in a stateless pipeline.

Using the withColumn function

You can use the withColumn function to add a new column, change the value of an existing column, convert the datatype of an existing column, or derive a new column from an existing column. You could, for example, CAST the data type or do a substring.

# Returns a DataFrame that creates new column named total that is the sum of columns 1 and 2.
df = df.withColumn("total", sum(df["col1"], df["col2"]))
df.show()
// Returns a DataFrame that creates new column named total that is the sum of columns 1 and 2.
val df = df.df.withColumn("total", sum(df["col1"], df["col2"]))
display(df)

Using the union function

You can use the union function to combine two or more data frames of the same schema and append one data frame to another or combine two data frames. Since the union function returns all rows from the data frames regardless of duplicate data, use the distinct function to return just one record when duplicates exist when unioning data frames.

# Returns a DataFrame that combines the rows of df1 and df2
df3 = df1.union(df2)
df3.show()
// Returns a DataFrame that combines the rows of df1 and df2
val df3 = df1.union(df2)
display(df3)

Using binary functions

You can use binary functions to serialize and deserialize data stored in binary formats, such as Protobuf or Avro. Any function in the Spark SQL Guide applies, including Protobuf and Avro.