Summary
- Situation
- Cause
- Action
Details
Situation
While using the SparkKubernetesOperator, the sensor suddenly dies in the middle of poking.
Looking at the result, the Spark application itself finishes normally, but the sensor cannot fetch its status/logs, so the task is marked as failed.
[2022-06-01 20:01:18,396] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:02:18,402] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:02:18,424] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:03:18,467] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:03:18,474] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:04:18,534] {spark_kubernetes.py:101} INFO - Poking: spark-test-20220531t193000-1
[2022-06-01 20:04:18,546] {spark_kubernetes.py:121} INFO - Spark application is still in state: RUNNING
[2022-06-01 20:05:14,512] {local_task_job.py:209} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2022-06-01 20:05:14,513] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 18
[2022-06-01 20:05:14,513] {taskinstance.py:1236} ERROR - Received SIGTERM. Terminating subprocesses.
[2022-06-01 20:05:14,539] {process_utils.py:66} INFO - Process psutil.Process(pid=18, status='terminated', exitcode=0, started='19:30:15') (18) terminated with exit code
Cause
1. Metadata DB CPU usage at 100%
-> This turned out to be the actual cause.
Checking in Grafana, the default PostgreSQL that ships with the Helm chart had far too little CPU allocated:
resources:
  requests:
    cpu: 250m
    memory: 256Mi
After raising it as follows, it has not died once in the past week:
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 500m
    memory: 512Mi
Rather than relying on the bundled PostgreSQL, I should set up a separate database for the metadata DB.
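For reference, a sketch of where that override might live, assuming the official apache-airflow Helm chart with its bundled PostgreSQL subchart (the exact key path, e.g. postgresql.resources vs. postgresql.primary.resources, varies by chart/subchart version):

# values.yaml (sketch) - assumes the official apache-airflow Helm chart
# with the bundled Bitnami PostgreSQL subchart enabled
postgresql:
  enabled: true
  # give the metadata DB more headroom so scheduler/sensor DB queries are not starved
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 500m
      memory: 512Mi

Something like helm upgrade airflow apache-airflow/airflow -f values.yaml would then roll the change out.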
2. Airflow settings were also tuned
The settings below were changed, but the behavior stayed the same (a sketch of applying them through the Helm chart follows the list).
• killed_task_cleanup_time = 604800
• schedule_after_task_execution = False
• scheduler_heartbeat_sec = 60
• scheduler_health_check_threshold = 120
• job_heartbeat_sec = 60
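For reference, when deploying with the official apache-airflow Helm chart these could be set through the chart's config: block in values.yaml, which is equivalent to AIRFLOW__<SECTION>__<OPTION> environment variables (a sketch; check the section each option belongs to against the configuration reference for your Airflow version):

# values.yaml (sketch) - section placement assumed from the Airflow 2.x configuration reference
config:
  core:
    killed_task_cleanup_time: 604800   # seconds a killed task gets between SIGTERM and SIGKILL
  scheduler:
    schedule_after_task_execution: "False"
    scheduler_heartbeat_sec: 60
    scheduler_health_check_threshold: 120
    job_heartbeat_sec: 60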
https://github.com/apache/airflow/issues/14672
The documentation for each option is quoted below.
killed_task_cleanup_time
When a task is killed forcefully, this is the amount of time in seconds that it has to cleanup after it is sent a SIGTERM, before it is SIGKILLED
schedule_after_task_execution
New in version 2.0.0.
Should the Task supervisor process perform a “mini scheduler” to attempt to schedule more tasks of the same DAG. Leaving this on will mean tasks in the same DAG execute quicker, but might starve out other dags in some circumstances
scheduler_heartbeat_sec
The scheduler constantly tries to trigger new tasks (look at the scheduler section in the docs for more information). This defines how often the scheduler should run (in seconds).
scheduler_health_check_threshold
New in version 1.10.2.
If the last scheduler heartbeat happened more than scheduler_health_check_threshold ago (in seconds), scheduler is considered unhealthy. This is used by the health check in the “/health” endpoint
job_heartbeat_sec
Task instances listen for external kill signal (when you clear tasks from the CLI or the UI), this defines the frequency at which they should listen (in seconds).
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec
Workaround for now would be to set AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION to False
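In a Helm-based deployment, that environment variable could be injected through the chart's env list (a sketch, assuming the official apache-airflow Helm chart):

# values.yaml (sketch) - applied to the Airflow containers
env:
  - name: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
    value: "False"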
Action
- Raised the bundled PostgreSQL CPU/memory through the Helm chart values; longer term, move the metadata DB off the bundled PostgreSQL to a dedicated database.
Reference
https://stackoverflow.com/questions/59298566/airflow-sending-sigterms-to-tasks-randomly