Airflow) Pod killed by external SIGTERM: couldn't fix it, so falling back to retries

MightyTedKim 2022. 6. 2. 09:53

Summary

  1. Situation
  2. Cause
  3. Action

Details

Situation

 

When using SparkKubernetesOperator, the sensor suddenly dies while poking.

Looking at the result, the Spark job itself runs fine, but the sensor fails to fetch the logs, so the task is marked as failed.

[2022-06-01 20:01:18,396] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:02:18,402] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:02:18,424] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:03:18,467] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:03:18,474] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:04:18,534] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:04:18,546] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:05:14,512] {local_task_job.py:209} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2022-06-01 20:05:14,513] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 18
[2022-06-01 20:05:14,513] {taskinstance.py:1236} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-06-01 20:05:14,539] {process_utils.py:66} INFO - Process psutil.Process(pid=18, status='terminated', exitcode=0, started='19:30:15') (18) terminated with exit code 
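
For context, this is roughly the DAG shape involved: a submit task plus a sensor that pokes the SparkApplication, in the style of the cncf.kubernetes provider's example. A minimal sketch; the DAG id, schedule, namespace, manifest file, and connection id are placeholders, not taken from this post.

# Hedged sketch of a submit + monitor pair with the cncf.kubernetes provider.
# Namespace, manifest file, and connection id are assumed placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

with DAG(
    dag_id="spark_test",
    start_date=datetime(2022, 5, 31),
    schedule_interval="30 19 * * *",
    catchup=False,
) as dag:
    submit = SparkKubernetesOperator(
        task_id="submit",
        namespace="spark-jobs",
        application_file="spark-test.yaml",  # SparkApplication manifest (assumed name)
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,  # lets the sensor pull the application name from XCom
    )

    # The sensor is the task producing the "Poking:" lines in the log above;
    # the external SIGTERM arrives while it is still waiting for the app to finish.
    monitor = SparkKubernetesSensor(
        task_id="monitor",
        namespace="spark-jobs",
        application_name="{{ task_instance.xcom_pull(task_ids='submit')['metadata']['name'] }}",
        kubernetes_conn_id="kubernetes_default",
        attach_log=True,
    )

    submit >> monitor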

 

Cause

1. Metadata DB CPU usage at 100%

-> This turned out to be the cause

When I checked in Grafana, the CPU for the Helm chart's bundled PostgreSQL was set too low:

    resources:
      requests:
        cpu: 250m
        memory: 256Mi

After increasing it as below, it hasn't died once in the past week:

    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 500m
        memory: 512Mi

Rather than the chart's bundled PostgreSQL, I should set up a separate metadata DB.
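
If I do move to an external metadata DB, the override would look roughly like this with the official apache-airflow Helm chart. This is a hedged sketch: the key names come from that chart, and the host, credentials, and database name are placeholders, so adjust for your own deployment.

# Hedged sketch for the official apache-airflow chart: disable the bundled
# PostgreSQL subchart and point the metadata DB at an external instance.
# Host, user, password, and db below are placeholders.
postgresql:
  enabled: false

data:
  metadataConnection:
    user: airflow
    pass: airflow
    protocol: postgresql
    host: external-postgres.example.com
    port: 5432
    db: airflow
    sslmode: disable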

 

2. Also tried adjusting Airflow settings

Changed the settings below, but the result was the same (an airflow.cfg sketch of these options follows the list).

 

• killed_task_cleanup_time = 604800
• schedule_after_task_execution = False
• scheduler_heartbeat_sec = 60
• scheduler_health_check_threshold = 120
• job_heartbeat_sec = 60

https://github.com/apache/airflow/issues/14672
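
For reference, this is how those options map onto airflow.cfg sections. The placement below follows the Airflow 2.x configuration reference (killed_task_cleanup_time under [core], the rest under [scheduler]); double-check against your version.

# airflow.cfg sketch of the settings listed above (values as tried in this post)
[core]
killed_task_cleanup_time = 604800

[scheduler]
schedule_after_task_execution = False
scheduler_heartbeat_sec = 60
scheduler_health_check_threshold = 120
job_heartbeat_sec = 60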

 

Below are the relevant entries from the documentation:

killed_task_cleanup_time

When a task is killed forcefully, this is the amount of time in seconds that it has to cleanup after it is sent a SIGTERM, before it is SIGKILLED

schedule_after_task_execution

New in version 2.0.0.

Should the Task supervisor process perform a “mini scheduler” to attempt to schedule more tasks of the same DAG. Leaving this on will mean tasks in the same DAG execute quicker, but might starve out other dags in some circumstances

scheduler_heartbeat_sec

The scheduler constantly tries to trigger new tasks (look at the scheduler section in the docs for more information). This defines how often the scheduler should run (in seconds).

scheduler_health_check_threshold

New in version 1.10.2.

If the last scheduler heartbeat happened more than scheduler_health_check_threshold ago (in seconds), scheduler is considered unhealthy. This is used by the health check in the “/health” endpoint

job_heartbeat_sec

Task instances listen for external kill signal (when you clear tasks from the CLI or the UI), this defines the frequency at which they should listen (in seconds).

 

https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec

 

Workaround for now would be to set AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION to False

https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#schedule-after-task-execution
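
If the deployment uses the official Airflow Helm chart, that workaround can be applied as an environment variable in values.yaml. A hedged sketch, assuming the chart's top-level env: list; any equivalent way of setting the variable works.

# Hedged sketch: the workaround from the issue as an env override in values.yaml
env:
  - name: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
    value: "False"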

Action
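
I couldn't pin down a definitive fix, so as the title says, for now the task simply retries when it gets killed. A minimal sketch; the retry count and delay are arbitrary choices, not values from this post.

# Hedged sketch: let the sensor task retry after an external SIGTERM kills it.
# The retries/retry_delay values are arbitrary placeholders.
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}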

 

References

 

https://stackoverflow.com/questions/59298566/airflow-sending-sigterms-to-tasks-randomly
