Spark Structured Streaming: Output Modes

Output modes in Structured Streaming determine what gets written to the sink after each micro-batch is processed: the complete result table, only newly appended rows, or only updated rows.

To control how results are written to the sink, Spark Structured Streaming provides three output modes:

1. Append Mode – Only the new rows appended to the result table in the current micro-batch are written to the sink. This mode can be used only with queries where existing rows in the result table never change (e.g. a map operation on an input stream).

2. Update Mode – Only the rows that changed in the result table since the last trigger (new rows plus updated rows) are written to the sink. This mode is well suited to data such as stock prices, where the latest value for each stock keeps being updated.

3. Complete Mode – The entire updated result table is written to the sink after every trigger: the rows from previous batches plus any modified rows and any new rows from the current batch.
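To make the difference concrete, here is a minimal sketch of a socket word count where the mode is the only thing you would change. It assumes a text server on localhost:9999 (e.g. `nc -lk 9999`); the object name and port are illustrative, not from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object OutputModesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("OutputModesDemo")
      .getOrCreate()
    import spark.implicits._

    // Assumed input source: lines of text arriving on localhost:9999.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // A streaming aggregation: counts per word across all micro-batches.
    val wordCount = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Update mode: each trigger prints only the words whose count changed.
    // Swap in OutputMode.Complete to print the whole result table instead.
    val query = wordCount.writeStream
      .format("console")
      .outputMode(OutputMode.Update)
      .start()

    query.awaitTermination()
  }
}
```

Append mode would be rejected here, because an unbounded aggregation without a watermark can update existing rows.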


The outputMode() option is specified when writing output to the sink to indicate which output mode to use. You can pass either a String or a value from the OutputMode class:
.outputMode("complete") or .outputMode(OutputMode.Complete)
.outputMode("update") or .outputMode(OutputMode.Update)
.outputMode("append") or .outputMode(OutputMode.Append)

Here is the code snippet-

val query = wordCount.writeStream
      .format("console")               // sink type: kafka, console, memory or file
      .outputMode(OutputMode.Complete) // output mode to use for this sink
      .queryName("Spark Structured Streaming - Complete Output mode")
      .start()                         // starts the execution of the streaming query

query.awaitTermination()               // block until the query is stopped

Full source code for each output mode is available in my GitHub repository:
1. CompleteOutputModeDemo.scala
2. UpdateOutpuModeDemo.scala
3. AppendOutputMode

Note that append output mode is not supported when there are streaming aggregations on streaming DataFrames/Datasets without a watermark.
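A watermark is what makes append mode legal for aggregations, because it lets Spark decide when a group is final and can safely be emitted once. A minimal sketch, assuming a streaming DataFrame `events` with an event-time column named "timestamp" and a "word" column (both names are illustrative):

```scala
import org.apache.spark.sql.functions.window

// Windowed count with a watermark: events more than 10 minutes late are dropped,
// so a 5-minute window can be finalized once the watermark passes its end.
val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"), $"word")
  .count()

// Append mode now works: each finalized window is written to the sink exactly once.
val query = windowedCounts.writeStream
  .format("console")
  .outputMode("append")
  .start()
```

Without the withWatermark call, the same query would fail at start() with an AnalysisException, since Spark would have no way to know when a row stops changing.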
