Airflow) Pod killed by external SIGTERM: couldn't fix it, so falling back to retries

MightyTedKim 2022. 6. 2. 09:53

Summary

  1. Situation
  2. Cause
  3. Action

Details

Situation

 

When using SparkKubernetesOperator, the sensor suddenly dies while poking.

Looking at the result, the Spark job itself runs fine, but the sensor fails to fetch the logs, so the task is marked as failed.

[2022-06-01 20:01:18,396] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:02:18,402] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:02:18,424] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:03:18,467] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:03:18,474] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:04:18,534] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:04:18,546] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:05:14,512] {local_task_job.py:209} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2022-06-01 20:05:14,513] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 18
[2022-06-01 20:05:14,513] {taskinstance.py:1236} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-06-01 20:05:14,539] {process_utils.py:66} INFO - Process psutil.Process(pid=18, status='terminated', exitcode=0, started='19:30:15') (18) terminated with exit code 
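
For context, this is roughly the DAG shape involved: a submit task plus a sensor that pokes the SparkApplication, in the style of the cncf.kubernetes provider's example. A minimal sketch; the DAG id, schedule, namespace, manifest file, and connection id are placeholders, not taken from this post.

# Hedged sketch of a submit + monitor pair with the cncf.kubernetes provider.
# Namespace, manifest file, and connection id are assumed placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

with DAG(
    dag_id="spark_test",
    start_date=datetime(2022, 5, 31),
    schedule_interval="30 19 * * *",
    catchup=False,
) as dag:
    submit = SparkKubernetesOperator(
        task_id="submit",
        namespace="spark-jobs",
        application_file="spark-test.yaml",  # SparkApplication manifest (assumed name)
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,  # lets the sensor pull the application name from XCom
    )

    # The sensor is the task producing the "Poking:" lines in the log above;
    # the external SIGTERM arrives while it is still waiting for the app to finish.
    monitor = SparkKubernetesSensor(
        task_id="monitor",
        namespace="spark-jobs",
        application_name="{{ task_instance.xcom_pull(task_ids='submit')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
        attach_log=True,
    )

    submit >> monitor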

 

Cause

1. Metadata DB CPU usage at 100%

-> This turned out to be the cause

When I checked in Grafana, the CPU for the Helm chart's bundled PostgreSQL was set too low:

    resources:
      requests:
        cpu: 250m
        memory: 256Mi

After increasing it as below, it hasn't died once in the past week:

    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 500m
        memory: 512Mi

Rather than the chart's bundled PostgreSQL, I should set up a separate metadata DB.
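
If I do move to an external metadata DB, the override would look roughly like this with the official apache-airflow Helm chart. This is a hedged sketch: the key names come from that chart, and the host, credentials, and database name are placeholders, so adjust for your own deployment.

# Hedged sketch for the official apache-airflow chart: disable the bundled
# PostgreSQL subchart and point the metadata DB at an external instance.
# Host, user, password, and db below are placeholders.
postgresql:
  enabled: false

data:
  metadataConnection:
    user: airflow
    pass: airflow
    protocol: postgresql
    host: external-postgres.example.com
    port: 5432
    db: airflow
    sslmode: disable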

 

2. Also tried adjusting Airflow settings

Changed the settings below, but the result was the same (an airflow.cfg sketch of these options follows the list).

 

• killed_task_cleanup_time = 604800
• schedule_after_task_execution = False
• scheduler_heartbeat_sec = 60
• scheduler_health_check_threshold = 120
• job_heartbeat_sec = 60

https://github.com/apache/airflow/issues/14672
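
For reference, this is how those options map onto airflow.cfg sections. The placement below follows the Airflow 2.x configuration reference (killed_task_cleanup_time under [core], the rest under [scheduler]); double-check against your version.

# airflow.cfg sketch of the settings listed above (values as tried in this post)
[core]
killed_task_cleanup_time = 604800

[scheduler]
schedule_after_task_execution = False
scheduler_heartbeat_sec = 60
scheduler_health_check_threshold = 120
job_heartbeat_sec = 60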

 

Below are the relevant entries from the documentation:

killed_task_cleanup_time

When a task is killed forcefully, this is the amount of time in seconds that it has to cleanup after it is sent a SIGTERM, before it is SIGKILLED

schedule_after_task_execution

New in version 2.0.0.

Should the Task supervisor process perform a “mini scheduler” to attempt to schedule more tasks of the same DAG. Leaving this on will mean tasks in the same DAG execute quicker, but might starve out other dags in some circumstances

scheduler_heartbeat_sec

The scheduler constantly tries to trigger new tasks (look at the scheduler section in the docs for more information). This defines how often the scheduler should run (in seconds).

scheduler_health_check_threshold

New in version 1.10.2.

If the last scheduler heartbeat happened more than scheduler_health_check_threshold ago (in seconds), scheduler is considered unhealthy. This is used by the health check in the “/health” endpoint

job_heartbeat_sec

Task instances listen for external kill signal (when you clear tasks from the CLI or the UI), this defines the frequency at which they should listen (in seconds).

 

https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec

 

Workaround for now would be to set AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION to False

https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#schedule-after-task-execution
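
If the deployment uses the official Airflow Helm chart, that workaround can be applied as an environment variable in values.yaml. A hedged sketch, assuming the chart's top-level env: list; any equivalent way of setting the variable works.

# Hedged sketch: the workaround from the issue as an env override in values.yaml
env:
  - name: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
    value: "False"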

Action
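
I couldn't pin down a definitive fix, so as the title says, for now the task simply retries when it gets killed. A minimal sketch; the retry count and delay are arbitrary choices, not values from this post.

# Hedged sketch: let the sensor task retry after an external SIGTERM kills it.
# The retries/retry_delay values are arbitrary placeholders.
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}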

 

References

 

https://stackoverflow.com/questions/59298566/airflow-sending-sigterms-to-tasks-randomly
