Flink FAQ

Last updated: Apr 3, 2026

  • Flink is built around a distributed streaming dataflow engine written in Java and Scala. Flink runs every dataflow program in a data-parallel and pipelined fashion.
  • Apache Flink is based on the Kappa architecture. The Kappa architecture uses a single processing path, the stream: all input is accepted as a stream, and the streaming engine processes it in real time. Batch data in the Kappa architecture is treated as a special case of streaming data.

  • StreamGraph -> JobGraph -> ExecutionGraph -> physical graph
  • Program
    • It is a piece of code that is executed on the Flink Cluster.
  • Client
    • The client takes the program code, builds the job dataflow graph from it, and passes that graph to the JobManager (JM).
  • Job manager (JM)
    • It is responsible for generating the execution graph after obtaining the Job Dataflow Graph from the Client. It assigns the job to TaskManagers in the cluster and monitors its execution.
    • It also triggers checkpoints periodically.
  • Task manager (TM)
    • It is in charge of executing all of the tasks assigned to it by the JobManager. Each TaskManager executes its tasks in its own slots with the specified parallelism, and reports the status of the tasks to the JobManager.
  • Any type of data is produced as a stream of events. Data can be processed as unbounded or bounded streams.
  • Unbounded streams have a beginning but no end. They do not end and continue to provide data as it is produced.
  • Unbounded streams should be processed continuously, i.e., events should be handled as soon as they are consumed. Since the input is unbounded and will not be complete at any point in time, it is not possible to wait for all of the data to arrive.
  • Bounded streams have a beginning and an end point.
  • Bounded streams can be processed by consuming all data before doing any computations.
  • Ordered ingestion is not needed to process bounded streams since a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.
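The difference between the two processing styles can be sketched in plain Python. This is only an illustration, not Flink code: the function names and the `(timestamp, value)` sample records are assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, int]  # (timestamp, value); made-up sample shape


def process_bounded(records: List[Record]) -> List[int]:
    """Batch style: all data is present, so it can be sorted before computing."""
    ordered = sorted(records, key=lambda r: r[0])  # possible because input is finite
    return [value for _, value in ordered]


def process_unbounded(records: Iterable[Record]) -> Iterator[int]:
    """Streaming style: update and emit a result as each event arrives."""
    running_total = 0
    for _, value in records:  # a real unbounded source would never finish
        running_total += value
        yield running_total


sample = [(3, 30), (1, 10), (2, 20)]  # arrives out of order
batch_result = process_bounded(sample)
stream_results = list(process_unbounded(iter(sample)))
```

The bounded version can wait and reorder; the unbounded version has to produce a result after every event, in arrival order.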
  • The Apache Flink DataSet API is used for batch operations on bounded data sets. It is Flink's legacy batch API; recent releases deprecate it in favor of batch execution on the DataStream and Table APIs.
  • This API is available in Java, Scala, and Python.
  • It supports transformations on data sets such as filtering, mapping, aggregating, joining, and grouping.
  • The Apache Flink DataStream API is used to handle data in a continuous stream.
  • On the stream data, you can perform operations such as filtering, routing, windowing, and aggregation.
  • Data streams can be read from different sources such as message queues, files, and socket streams, and the resulting data can be written to different sinks such as the command-line terminal.
  • Table API is a relational API with an expression language similar to SQL.
  • This API is capable of batch and stream processing.
  • It is compatible with the Java and Scala DataSet and DataStream APIs.
  • Tables can be generated from internal DataSets and DataStreams as well as from external data sources. You can use this relational API to perform operations such as join, select, aggregate, and filter.
  • FlinkML is the Flink Machine Learning (ML) library.
  • A Flink streaming application consists of four key programming constructs.
      1. Stream execution environment - Every Flink streaming application requires an environment in which the streaming program is executed.
      2. Data sources - Data sources are the applications or data stores from which the Flink application reads its input data.
      3. Data streams and transformation operations - Input data from data sources is read by the Flink application in the form of data streams.
      4. Data sinks - The transformed data from data streams is consumed by data sinks, which output the data to external systems.
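To make the four constructs concrete, here is a minimal plain-Python sketch of how they fit together. `MiniEnvironment` and its methods are hypothetical stand-ins written for this example; they are not Flink classes and only mimic the shape of a streaming program (environment, source, transformations, sink, then execute).

```python
from typing import Callable, Iterable, List


class MiniEnvironment:
    """Hypothetical stand-in for a stream execution environment (not a Flink class)."""

    def __init__(self) -> None:
        self._source: Iterable = []
        self._transformations: List[Callable[[Iterable], Iterable]] = []
        self._sink: Callable[[object], None] = print

    def from_source(self, source: Iterable) -> "MiniEnvironment":
        self._source = source                       # 2. data source
        return self

    def map(self, fn: Callable) -> "MiniEnvironment":
        self._transformations.append(lambda stream: (fn(x) for x in stream))
        return self                                 # 3. transformation

    def filter(self, predicate: Callable) -> "MiniEnvironment":
        self._transformations.append(lambda stream: (x for x in stream if predicate(x)))
        return self                                 # 3. transformation

    def add_sink(self, sink: Callable[[object], None]) -> "MiniEnvironment":
        self._sink = sink                           # 4. data sink
        return self

    def execute(self) -> None:
        # Nothing runs until execute() is called; the pipeline is only declared before.
        stream = iter(self._source)
        for transform in self._transformations:
            stream = transform(stream)
        for record in stream:
            self._sink(record)


collected: List[int] = []
env = MiniEnvironment()                             # 1. execution environment
env.from_source(["a", "bb", "ccc"]).map(len).filter(lambda n: n > 1).add_sink(collected.append)
env.execute()
```

The pipeline is declared first and only runs on `execute()`, which loosely mirrors how a streaming program is described before it is submitted for execution.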

12. Explain Complex Event Processing (CEP)

  • FlinkCEP is the complex event processing library in Apache Flink. It detects event patterns in continuous streaming data. The events arrive in near real time and must be handled with high throughput and low latency. The library is mostly used on sensor data, which arrives in real time and is complex to process.
  • CEP stands for complex event processing.
  • https://www.tutorialspoint.com/apache_flink/apache_flink_libraries.htm
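As a rough illustration of what "detecting an event pattern on a stream" means, the sketch below looks for a fixed number of consecutive over-threshold readings per sensor. It is plain Python and does not use the FlinkCEP API; the function name, the record shape, and the sample readings are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Reading = Tuple[str, float]  # (sensor_id, temperature); made-up record shape


def detect_overheating(readings: Iterable[Reading], threshold: float,
                       times: int) -> Iterator[Tuple[str, List[float]]]:
    """Emit a match when a sensor reports `times` consecutive readings above `threshold`."""
    streaks: Dict[str, List[float]] = {}  # sensor_id -> current run of hot readings
    for sensor_id, temperature in readings:
        if temperature > threshold:
            hot = streaks.setdefault(sensor_id, [])
            hot.append(temperature)
            if len(hot) == times:
                yield sensor_id, list(hot)  # pattern matched
                hot.clear()                 # start looking for the next match
        else:
            streaks.pop(sensor_id, None)    # a normal reading breaks the run


sample = [("s1", 101.0), ("s2", 50.0), ("s1", 105.0),
          ("s1", 99.0), ("s2", 120.0), ("s2", 130.0)]
matches = list(detect_overheating(sample, threshold=100.0, times=2))
```

A CEP library generalizes this idea: the pattern is declared once, and the engine keeps the per-key partial matches as state.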

Apache Flink can be deployed and configured in the following ways.

  • Flink can be set up locally (on a single machine).
  • It can be deployed on a virtual machine.
  • We can run Flink as a Docker image.
  • We can configure and deploy Flink as a standalone cluster.
  • Flink can be deployed and configured on Hadoop YARN and other resource management frameworks.
  • Flink can be deployed and configured on a cloud system.
  • A savepoint is a “global backup” of a Flink job's state at a specific point in time.
  • It is used for software upgrades and configuration changes: the job can be restarted from the last savepoint.
  • Triggered by users.
  • It is kept until the user deletes it.
  • A savepoint can be regarded as a special checkpoint snapshot taken at a specific time.
  • Ways to trigger
    • the flink savepoint command
    • the flink cancel -s command, when cancelling a Flink job
    • via the REST API: **/jobs/:jobid/savepoints**
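For the REST route, a request could be prepared roughly as follows. This is a sketch under stated assumptions: the host and port, the job id `a1b2c3`, and the target directory are hypothetical, and the use of `POST` with the JSON fields `target-directory` and `cancel-job` reflects my understanding of the savepoint endpoint, so check it against the REST API documentation for your Flink version. The request is only built here, never sent.

```python
import json
import urllib.request


def build_savepoint_request(base_url: str, job_id: str, target_directory: str,
                            cancel_job: bool = False) -> urllib.request.Request:
    """Build (but do not send) a savepoint trigger request for /jobs/:jobid/savepoints."""
    body = json.dumps({
        "target-directory": target_directory,  # assumed field name, see lead-in
        "cancel-job": cancel_job,              # assumed field name, see lead-in
    })
    return urllib.request.Request(
        url=f"{base_url}/jobs/{job_id}/savepoints",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_savepoint_request("http://localhost:8081", "a1b2c3", "/tmp/flink-savepoints")
# urllib.request.urlopen(request) would actually send it to a running cluster.
```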
  • A checkpoint saves the current Flink state; it is a fault-tolerance mechanism
    • Flink state
      • Operator state, e.g. offsets
      • Keyed state, e.g. MapState, ListState, ValueState
  • It makes sure Flink can recover automatically when an error or exception occurs at runtime
  • Example: if Flink fails during chk-5, it will try to recover from chk-4
  • Managed and operated by Flink; users only need to set the parameters
  • Triggered automatically by Flink
  • Default max concurrent checkpoints = 1 -> ONLY ONE checkpoint runs at a time per Flink application
  • 3 components work on checkpoints: JM, TM, and ZK (ZooKeeper)
  • Mechanisms
    • The JM triggers checkpoints periodically
    • Once a TM task has received the CheckpointBarrier on all of its inputs, it starts its checkpoint operation and, once that completes, informs the JM. ONLY when all sink operators have finished their checkpoint and reported back to the JM is the checkpoint called completed
      • In HA mode, the checkpoint is saved in ZK as well
      • A CheckpointBarrier is a special event that flows to the downstream operators; ONLY when the sink operators have received it do we say the checkpoint is completed
      • NOTE: the CheckpointBarrier alignment (sync) time also needs to be considered
  • CheckpointCoordinator is an important class in Checkpoint op
    • Its important methods include:
      • triggerCheckpoint
      • triggerSavepoint
      • restoreSavepoint
      • restoreLatestCheckpointedState
      • receiveAcknowledgeMessage
  • What happens to checkpoints on cancellation is set via CheckpointConfig
    • DELETE_ON_CANCELLATION: delete the checkpoint when the program is canceled
    • RETAIN_ON_CANCELLATION: keep the checkpoint when the program is canceled
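The checkpoint life cycle described above (trigger by the JM, acknowledgements from the tasks, recovery from the latest completed checkpoint) can be modelled with a toy class. `MiniCheckpointCoordinator` is a hypothetical name for this sketch; it borrows method names from the list above but is not Flink's CheckpointCoordinator and ignores state storage, timeouts, and barriers.

```python
from typing import Dict, List, Optional, Set


class MiniCheckpointCoordinator:
    """Toy model of JM-side checkpoint bookkeeping (not Flink's implementation)."""

    def __init__(self, task_ids: Set[str]) -> None:
        self.task_ids = set(task_ids)
        self.pending: Dict[int, Set[str]] = {}  # checkpoint id -> tasks not yet acknowledged
        self.completed: List[int] = []          # completed checkpoint ids, oldest first
        self.next_id = 1

    def trigger_checkpoint(self) -> int:
        checkpoint_id = self.next_id
        self.next_id += 1
        self.pending[checkpoint_id] = set(self.task_ids)
        return checkpoint_id

    def receive_acknowledge(self, checkpoint_id: int, task_id: str) -> bool:
        """Record one task's ack; return True when the checkpoint becomes completed."""
        waiting = self.pending.get(checkpoint_id)
        if waiting is None:
            return False
        waiting.discard(task_id)
        if waiting:
            return False
        del self.pending[checkpoint_id]
        self.completed.append(checkpoint_id)
        return True

    def latest_completed(self) -> Optional[int]:
        """The checkpoint a restart would recover from."""
        return self.completed[-1] if self.completed else None


tasks = {"source", "map", "sink"}
coordinator = MiniCheckpointCoordinator(tasks)
for _ in range(4):                               # chk-1 .. chk-4 complete normally
    cid = coordinator.trigger_checkpoint()
    for task in tasks:
        coordinator.receive_acknowledge(cid, task)
failed = coordinator.trigger_checkpoint()        # chk-5: only one task acknowledges
coordinator.receive_acknowledge(failed, "source")
restore_from = coordinator.latest_completed()
```

Because chk-5 never collects all acknowledgements, it stays pending and a recovery would fall back to chk-4, matching the example above.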
  • Common reasons why checkpoints fail:
    • No exception handling in client or external-system code
      • e.g. JSON parse errors, timeouts, exceptions in the sink system
    • Frequent GC or not enough memory -> causes OOM
    • Network issues, machine issues
  • Things to watch when setting up checkpoints
  • A CheckpointBarrier is a special event
  • It flows along with the events from upstream operators to downstream operators
  • ONLY when the “final” operators (e.g. sink operators) have received the barrier and confirmed that the checkpoint is OK is this “checkpoint” completed
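How an operator with several inputs waits for barriers can be sketched as follows. This is a simplified plain-Python model of barrier alignment, not Flink code: the operator state is just a running sum, and the `BARRIER` marker, the channel numbering, and the sample events are assumptions made for the example.

```python
from typing import List, Optional, Tuple, Union

BARRIER = "BARRIER"                      # made-up marker for a checkpoint barrier
Event = Tuple[int, Union[int, str]]      # (input channel, record value or BARRIER)


def align_barriers(events: List[Event], num_channels: int) -> Tuple[Optional[int], int, int]:
    """Return (snapshotted state, final state, max number of buffered records)."""
    state = 0                 # operator state: running sum of processed records
    blocked = set()           # channels whose barrier has already arrived
    buffered: List[int] = []  # records held back from blocked channels
    snapshot: Optional[int] = None
    max_buffered = 0
    for channel, item in events:
        if item == BARRIER:
            blocked.add(channel)
            if len(blocked) == num_channels:
                snapshot = state           # every input delivered its barrier: snapshot
                for record in buffered:    # then process what was held back
                    state += record
                buffered.clear()
                blocked.clear()
        elif channel in blocked:
            buffered.append(item)          # belongs to the next checkpoint, so wait
            max_buffered = max(max_buffered, len(buffered))
        else:
            state += item
    return snapshot, state, max_buffered


sample = [(0, 1), (1, 10), (0, BARRIER), (0, 2), (0, 3),
          (1, 20), (1, BARRIER), (1, 30)]
snapshot, final_state, max_buffered = align_barriers(sample, num_channels=2)
```

The snapshot contains only the records that arrived before the barrier on each channel, and the longer the second barrier is delayed, the more records pile up in the buffer.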
  • Backpressure is a common concept in streaming frameworks

  • Backpressure happens when the "downstream" CAN'T keep up with the "upstream"'s processing speed

    • -> So there is a mechanism that pushes back on the upstream and asks it to slow down
  • Possible causes: network, disk I/O, frequent GC, hot keys or data skew, inefficient code, TM memory and its GC

  • It can also affect checkpoints

    • if data is delayed -> checkpoints are delayed (they take longer)
    • if “exactly once” -> operators have to wait for the delayed barrier -> more data is cached -> the checkpoint gets bigger
    • the above may cause checkpoints to fail, or OOM
  • Can be monitored via the Flink UI (version > 1.13)

  • It is not a problem in all cases: some backpressure can simply mean the system is fully using its resources. But serious backpressure can cause system delay
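The push-back effect can be shown with a small tick-based simulation: a fast upstream writes into a bounded buffer and is forced to slow down once the slower downstream lets the buffer fill up. This is a plain-Python sketch; the function name, the rates, and the buffer capacity are made-up values, and it does not model Flink's actual network buffers.

```python
from collections import deque
from typing import Tuple


def simulate_backpressure(ticks: int, produce_per_tick: int,
                          consume_per_tick: int, buffer_capacity: int) -> Tuple[int, int, int]:
    """Return (records produced, records consumed, ticks in which upstream was throttled)."""
    buffer: deque = deque()
    produced = consumed = stalled_ticks = 0
    for _tick in range(ticks):
        # Upstream: emit records, but only while the bounded buffer has room.
        emitted = 0
        while emitted < produce_per_tick and len(buffer) < buffer_capacity:
            buffer.append(produced)
            produced += 1
            emitted += 1
        if emitted < produce_per_tick:
            stalled_ticks += 1             # upstream had to slow down this tick
        # Downstream: drain at its own, slower rate.
        for _drain in range(min(consume_per_tick, len(buffer))):
            buffer.popleft()
            consumed += 1
    return produced, consumed, stalled_ticks


produced, consumed, stalled_ticks = simulate_backpressure(
    ticks=10, produce_per_tick=3, consume_per_tick=1, buffer_capacity=5)
```

Once the buffer is full, the upstream rate converges to the downstream rate, which is the slowdown that backpressure propagates up the pipeline.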

  • Components
    • ResourceManager (RM)
    • NodeManager (NM)
    • AppMaster (job manager runs in the same container)
    • Container (task manager runs on it)
  • Steps
    • step 1) the Flink client uploads the Flink jar, the app jar, and the conf to HDFS
    • step 2) the client submits the job to the YARN ResourceManager and registers resources
    • step 3) the ResourceManager allocates a container and launches the AppMaster; the AppMaster then loads the jar files, configures the environment, and launches the job manager
    • step 4) once the above steps succeed, the AppMaster knows the job manager’s IP
    • step 5) the job manager creates a new Flink conf for the task managers; this conf is also uploaded to HDFS. The AppMaster offers the Flink web service endpoint
      • NOTE: endpoints offered by YARN are ALL temporary ones; users can run multiple Flink clusters on YARN
    • step 6) the AppMaster asks the ResourceManager for resources. NodeManagers load the Flink jar and launch TaskManagers
    • step 7) once a TaskManager has launched successfully, it sends heartbeats to the job manager and is ready to execute tasks (the job manager’s commands)
  • pros:
    • “use as needed” -> can raise the memory utilization of the system
    • can run jobs based on “priority”, following the priority settings of YARN jobs
    • can automatically handle failover for the various roles
      • JobManager and TaskManager errors/exceptions… can be retried/rerun automatically by YARN
  • EventTime
    • “The time the event happened”. The most accurate time of when things happened
  • IngestionTime
    • The time when the event is ingested into Flink, i.e. the timestamp assigned by the source operator. It is taken from the system clock of the machine running the source operator
  • ProcessingTime
    • The time the event is processed, i.e. the timestamp when the transformation happens. It is taken from the system clock of the Flink task manager running the operator
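The practical difference between the time notions shows up as soon as events arrive late. The sketch below groups the same events into tumbling windows, once by event time and once by processing time. It is plain Python with made-up timestamps, the function name is an assumption, and it leaves out watermarks and everything else a real engine needs to decide when a window is complete.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (event_time_ms, processing_time_ms, value); all timestamps are made-up sample data
Event = Tuple[int, int, int]


def window_sums(events: List[Event], window_ms: int, use_event_time: bool) -> Dict[int, int]:
    """Sum values per tumbling window, keyed by the window's start timestamp."""
    sums: Dict[int, int] = defaultdict(int)
    for event_time, processing_time, value in events:
        timestamp = event_time if use_event_time else processing_time
        window_start = timestamp - (timestamp % window_ms)
        sums[window_start] += value
    return dict(sums)


events = [
    (1_000, 1_200, 1),
    (4_000, 4_100, 2),
    (2_000, 5_500, 3),  # late arrival: happened in the first window, processed in the second
    (6_000, 6_300, 4),
]
by_event_time = window_sums(events, 5_000, use_event_time=True)
by_processing_time = window_sums(events, 5_000, use_event_time=False)
```

With event time the late record is counted in the window where it actually happened; with processing time it lands in whichever window was open when it was processed.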
