I'd like to share how I built a Spark image and what I ran into doing it on a VM inside an internal network.
It's nothing fancy, but it felt pretty overwhelming the first time I tried it, ha.
The post is structured like this:
1. Situation
- Building the Spark image
- Build failure on a VM in the internal network
2. Solution
- spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile
  - Comment out the sources.list rewrite
  - Set HTTP_PROXY / HTTPS_PROXY
- kubernetes/dockerfiles/spark/bindings/python/Dockerfile
  - Set pypi.org as a trusted host
  - Set HTTP_PROXY / HTTPS_PROXY
1. Situation
Building the Spark image
Infrastructure differs from place to place, so I recommend building the base image that runs Spark yourself with docker build.
I confirmed that the build works fine on EC2.
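Everything below assumes you're working from the top level of an unpacked Spark 3.1.2 distribution. If you don't have it yet, the setup looks roughly like this (the download URL is the Apache archive I'd normally use; adjust it to wherever you get your Spark tarball):

# Assumed setup: grab the Spark 3.1.2 distribution and work from its top-level directory
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
cd spark-3.1.2-bin-hadoop3.2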
First, you run a script called docker-image-tool.sh.
The script sets the uid and then runs docker build twice in a row (a rough sketch of what it actually runs is shown right after the command below).
So in total, the base image is built through these three files:
- /spark-3.1.2-bin-hadoop3.2/bin/docker-image-tool.sh
- /spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile
- /spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/bindings/python/Dockerfile
# Command used to build the base image
./bin/docker-image-tool.sh \
  -r spark-base -t 1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# The two Dockerfiles it uses
/spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile
/spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/bindings/python/Dockerfile

$ docker images
spark-base/spark      1.0.0   cc04a135019b   16 hours ago   631MB
spark-base/spark-py   1.0.0   2e5208904510   16 hours ago   635MB
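As mentioned above, the script itself is mostly a wrapper: roughly speaking it ends up issuing two docker build commands, one per Dockerfile, with the first image fed into the second as base_img. This is just a simplified sketch of that, not the exact script logic:

# Rough sketch of what docker-image-tool.sh runs under the hood (simplified)
docker build -t spark-base/spark:1.0.0 \
  -f kubernetes/dockerfiles/spark/Dockerfile .

docker build -t spark-base/spark-py:1.0.0 \
  --build-arg base_img=spark-base/spark:1.0.0 \
  -f kubernetes/dockerfiles/spark/bindings/python/Dockerfile .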
The image came out fine when I tested on a network with internet access, but I wanted to be able to build it inside the internal network too.
Below are the logs from when the build failed.
Build failure on a VM in the internal network
Building inside the VM produced two errors. The output was pretty long, but it boils down to these two:
1. apt-get
$ ./bin/docker-image-tool.sh -r spark-base -t 1.0.0 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon 356.8MB
Step 1/18 : ARG java_image_tag=11-jre-slim
Step 2/18 : FROM openjdk:${java_image_tag} ---> e4beed9b17a3
Step 3/18 : ARG spark_uid=185 ---> Using cache ---> b098f4c33b7f
Step 4/18 : RUN apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*
 ---> Running in b7962b7dc9af
+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list
+ apt-get update
Err:1 https://deb.debian.org/debian bullseye InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Err:2 http://security.debian.org/debian-security bullseye-security InRelease Connection failed [IP: 151.101.2.132 80]
Err:3 https://deb.debian.org/debian bullseye-updates InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Reading package lists...
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Failed to fetch http://security.debian.org/debian-security/dists/bullseye-security/InRelease Connection failed [IP: 151.101.2.132 80]
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye-updates/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Some index files failed to download. They have been ignored, or old ones used instead.
+ ln -s /lib /lib64
+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps
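If you want to poke at this apt failure without sitting through the whole build, the same thing can be reproduced in a throwaway container from the same base image. Just a quick sanity check, assuming openjdk:11-jre-slim is already available locally:

# Reproduce the apt handshake failure outside the Spark build (quick check)
docker run --rm -it openjdk:11-jre-slim bash -c \
  "sed -i 's|http://deb|https://deb|g' /etc/apt/sources.list && apt-get update"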
2. pip install
Processing triggers for libc-bin (2.31-13+deb11u2) ...
Requirement already satisfied: pip in /usr/lib/python3/dist-packages (20.3.4)
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))': /simple/setuptools/
Could not fetch URL https://pypi.org/simple/setuptools/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/setuptools/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))) - skipping
rm: cannot remove '/root/.cache': No such file or directory
The command '/bin/sh -c apt-get update && apt install -y python3 python3-pip && pip3 install --upgrade pip setuptools && rm -r /root/.cache && rm -rf /var/cache/apt/*' returned a non-zero code: 1
Failed to build PySpark Docker image, please refer to Docker build output for details.
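Before changing anything, it's worth checking from the VM itself whether the outside world is reachable at all, directly and via the proxy. This is just the kind of sanity check I'd run; the proxy address below is a placeholder:

# Direct access - expected to fail or hang inside the internal network
curl -v https://deb.debian.org/debian/dists/bullseye/InRelease -o /dev/null
curl -v https://pypi.org/simple/setuptools/ -o /dev/null

# Through the proxy - this should work if the proxy is the only way out
curl -v -x http://<proxy-host>:<port> https://pypi.org/simple/setuptools/ -o /dev/null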
Since it's an internal network I was about to just give up, but I dug around a bit more.
2. Solution
I went through each error in turn.
spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile
1. Comment out the sources.list rewrite
2. Set HTTP_PROXY / HTTPS_PROXY
First, apt-get: the errors are all failed TLS handshakes against the https:// Debian mirrors.
So it's an HTTPS problem, and looking at the Dockerfile there's a line that rewrites the http mirrors in sources.list to https:
RUN set -ex && \
    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
That's what was switching sources.list to https, so I commented it out.
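In the Dockerfile that just means putting a # in front of those lines, so apt keeps using the plain http:// mirrors (the full modified file is shown further down):

#RUN set -ex && \
#    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \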
I also added proxy settings. That does get the build through, but...
the ENV ends up baked into the image, and then S3 access from inside the container stops working.
ENV https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:****
Here's the error that came up when accessing S3:
http.client_exceptions.ServerTimeoutError: Connection timeout to host
telnet to the endpoint worked, but curl didn't, even though it's an endpoint that definitely returns an HTTP response.
Running curl -vvv showed the request being routed through http_proxy (so the environment variable really was the problem).
I opened a shell inside the container, unset the variables, and everything worked normally.
Looks like the ENV got baked in during docker build.
$ unset http_proxy
$ unset https_proxy
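unset only fixes the already-running container, though. A cleaner option is to never bake the proxy into the image in the first place: http_proxy / https_proxy are predefined build args in Docker, so instead of the ENV line you can pass them only at build time. A sketch of that alternative (placeholder proxy address), rather than the per-RUN prefixes I ended up using below:

# Alternative: drop the ENV line and pass the proxy only for the build,
# so it never ends up in the runtime image environment (placeholder address)
docker build \
  --build-arg http_proxy=http://<proxy-host>:<port> \
  --build-arg https_proxy=http://<proxy-host>:<port> \
  -t spark-base/spark:1.0.0 \
  -f kubernetes/dockerfiles/spark/Dockerfile .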
Final version
$ vi /spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile

ARG spark_uid=185

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

#RUN set -ex && \
#    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
RUN https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** apt-get update && \
    ln -s /lib /lib64 && \
    https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** apt-get install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*

# ... (rest of the file is unchanged)

# Specify the User that the actual main process will run as
USER ${spark_uid}
Very satisfying. I thought that was it... but then pip started acting up.
kubernetes/dockerfiles/spark/bindings/python/Dockerfile
1. Set HTTP_PROXY / HTTPS_PROXY
2. Set pypi.org as a trusted host
Building the second Dockerfile, it complained that it couldn't get through to port 443:
Could not fetch URL https://pypi.org/simple/setuptools/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443):
So I set pypi.org as a trusted host for pip and ran it again:
pip3 --trusted-host pypi.org --trusted-host files.pythonhosted.org install --upgrade pip setuptools && \
And I set the proxy for this one as well:
https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** \
    pip3 --trusted-host pypi.org --trusted-host files.pythonhosted.org install --upgrade pip setuptools && \
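If you'd rather not repeat the flags and proxy prefix on every pip call, the same settings can also live in a pip config file baked into the image. Just a sketch of what that could look like (I stuck with the inline flags), again with a placeholder proxy address:

# /etc/pip.conf (sketch)
[global]
proxy = http://<proxy-host>:<port>
trusted-host =
    pypi.org
    files.pythonhosted.org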
Final version
$ vi kubernetes/dockerfiles/spark/bindings/python/Dockerfile

ARG base_img

FROM $base_img
WORKDIR /

# Reset to root to run installation tasks
USER 0

RUN mkdir ${SPARK_HOME}/python
RUN https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** apt-get update && \
    https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** apt install -y python3 python3-pip && \
    https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** pip3 --trusted-host pypi.org --trusted-host files.pythonhosted.org install --upgrade pip setuptools && \
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir

ENTRYPOINT [ "/opt/entrypoint.sh" ]
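With both Dockerfiles patched, the build command is the same as before. The env check at the end is just to confirm that no proxy setting got baked into the runtime image (it should print nothing):

./bin/docker-image-tool.sh \
  -r spark-base -t 1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# Should print nothing if no proxy ENV ended up in the final image
docker run --rm --entrypoint env spark-base/spark-py:1.0.0 | grep -i proxy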
It was quite a detour to get here.
Thanks to this, I can now bake tools like vim, curl, netstat, and tcpdump, and libraries like py4j, into the image even inside the internal network.
This turned out longer than I planned, ha.