
Spark) Spark Base Image Docker Build (VM, Internal Network)

MightyTedKim 2021. 11. 11. 08:40

I'd like to share my experience building a Spark image inside a VM on an internal network.

It's nothing special, but it felt pretty overwhelming the first time I did it. :)

 

The post is structured as follows.

 

1. Situation

  • Building the Spark image
  • Build failure in a VM on the internal network

2. Solution

  • spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile
    • Comment out the sources.list rewrite
    • Set HTTP_PROXY, HTTPS_PROXY
  • kubernetes/dockerfiles/spark/bindings/python/Dockerfile
    • Add pypi.org as a trusted host
    • Set HTTP_PROXY, HTTPS_PROXY

1. Situation

Building the Spark image

Infrastructure differs from environment to environment, so when creating the base image that runs Spark, I recommend doing the Docker build yourself.

I confirmed that the build works fine on EC2.

 

First, you run a script called docker-image-tool.sh.

This script sets the UID and then runs two docker builds back to back.

As mentioned above, the base image is built through three steps in total:

  1. /spark-3.1.2-bin-hadoop3.2/bin/docker-image-tool.sh 
  2. /spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile
  3. /spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/bindings/python/Dockerfile
# Command used to build the base image
./bin/docker-image-tool.sh \
  -r spark-base -t 1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# The two Dockerfiles that get used
/spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile
/spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/bindings/python/Dockerfile

$  docker images
spark-base/spark             1.0.0              cc04a135019b   16 hours ago    631MB
spark-base/spark-py         1.0.0              2e5208904510   16 hours ago    635MB
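Once the images exist locally, the same script can also push them, which is handy if the internal network has its own registry. This is a sketch of my own, not from the build above; registry.internal:5000 is a placeholder address:

# Sketch: to land the images in an internal registry, build with the registry
# prefix in -r and then push (registry.internal:5000 is a placeholder)
./bin/docker-image-tool.sh -r registry.internal:5000/spark-base -t 1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r registry.internal:5000/spark-base -t 1.0.0 push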

 

When I tested on the external network the images built fine, but I wanted to be able to build on the internal network as well.

Below are the logs from when the build failed.

Build failure in a VM on the internal network

Building inside the VM produced two errors. The output was pretty long.

 

1. apt-get

$ ./bin/docker-image-tool.sh -r spark-base -t 1.0.0 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

Sending build context to Docker daemon 356.8MB
Step 1/18 : ARG java_image_tag=11-jre-slim
Step 2/18 : FROM openjdk:${java_image_tag}
 ---> e4beed9b17a3
Step 3/18 : ARG spark_uid=185
 ---> Using cache
 ---> b098f4c33b7f
Step 4/18 : RUN apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*
 ---> Running in b7962b7dc9af
+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list
+ apt-get update
Err:1 https://deb.debian.org/debian bullseye InRelease
  Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Err:2 http://security.debian.org/debian-security bullseye-security InRelease
  Connection failed [IP: 151.101.2.132 80]
Err:3 https://deb.debian.org/debian bullseye-updates InRelease
  Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Reading package lists...
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Failed to fetch http://security.debian.org/debian-security/dists/bullseye-security/InRelease Connection failed [IP: 151.101.2.132 80]
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye-updates/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Some index files failed to download. They have been ignored, or old ones used instead.
+ ln -s /lib /lib64
+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps

 

2. pip install 

Processing triggers for libc-bin (2.31-13+deb11u2) ...
Requirement already satisfied: pip in /usr/lib/python3/dist-packages (20.3.4)

WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))': /simple/setuptools/
Could not fetch URL https://pypi.org/simple/setuptools/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/setuptools/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)'))) - skipping

rm: cannot remove '/root/.cache': No such file or directory
The command '/bin/sh -c apt-get update &&     apt install -y python3 python3-pip &&     pip3 install --upgrade pip setuptools &&     rm -r /root/.cache && rm -rf /var/cache/apt/*' returned a non-zero code: 1
Failed to build PySpark Docker image, please refer to Docker build output for details.

 

I was about to give up, chalking it up to the internal network, but decided to dig a little more.


2. Solution

I went through each error one at a time.

 

spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile

1. sources.list
2. Set HTTP_PROXY, HTTPS_PROXY

First, for apt-get, the failed fetches are all https handshake errors, so it's an https-related problem.

Looking at the code below, there is a line that rewrites the http entries in sources.list to https:

RUN set -ex && \ 
    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \

This is what was switching sources.list over to https, so I commented it out.

I also set up the proxy. It works, but...

the ENV gets baked into the image, so access to S3 stops working.

ENV https_proxy=http://1**.2**.**.2**:****  http_proxy=http://1**.2**.**.2**:****

 

Below is the error that came up when accessing S3:

http.client_exceptions.ServerTimeoutError: Connection timeout to host 

 

telnet gets through, but curl doesn't, even though the endpoint definitely returns an HTTP response.

Running curl -vvv showed the request being routed through http_proxy. (The environment variables were the problem after all.)
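For what it's worth, curl can also be told to skip the proxy for a one-off check. This is a quick sketch of my own; the endpoint address is a placeholder, not the real one from this setup:

# --noproxy '*' makes curl ignore http_proxy/https_proxy for this request
curl -vvv --noproxy '*' http://s3.internal.example:9000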

 

Getting a shell inside the container, unsetting the variables, and testing again, everything worked fine.

The ENV must have gotten baked in during the Docker build.

$ unset http_proxy
$ unset https_proxy
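For reference, whether the proxies really got baked in can also be confirmed straight from the image metadata, without entering the container. A quick check against the image built above:

# Lists the ENV entries stored in the image; the proxy variables show up here if they were baked in
docker inspect -f '{{.Config.Env}}' spark-base/spark-py:1.0.0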

Final version

$ vi /spark-3.1.2-bin-hadoop3.2/kubernetes/dockerfiles/spark/Dockerfile

# Specify the User that the actual main process will run as
ARG spark_uid=185
USER ${spark_uid}

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

#RUN set -ex && \ 
#    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
RUN https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** apt-get update && \
    ln -s /lib /lib64 && \
    https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** apt-get install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*
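For reference, an approach I did not take here: Docker treats http_proxy / https_proxy as predefined build args, which are visible to RUN steps at build time but are not persisted into the image, and docker-image-tool.sh can forward build args with -b. A sketch with placeholder proxy values:

# Pass the proxy only at build time instead of hard-coding it in the Dockerfile
./bin/docker-image-tool.sh \
  -r spark-base -t 1.0.0 \
  -b http_proxy=http://<proxy-host>:<port> \
  -b https_proxy=http://<proxy-host>:<port> \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build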

Feeling pretty proud. I thought that was the end of it... but then pip started acting up.


kubernetes/dockerfiles/spark/bindings/python/Dockerfile

 

1. Set HTTP_PROXY, HTTPS_PROXY
2. Add pypi.org as a trusted host

Building the second Dockerfile, it complains that it can't get through on port 443:

Could not fetch URL https://pypi.org/simple/setuptools/: 
There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443):

 

I set pypi.org as a trusted host and ran it again.

pip3 --trusted-host pypi.org --trusted-host files.pythonhosted.org install --upgrade pip setuptools && \
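The same settings can also live in a pip config file instead of being repeated on the command line. This file is my own addition, not part of the stock Dockerfile:

# /etc/pip.conf (sketch): trusted hosts applied to every pip invocation
[global]
trusted-host =
    pypi.org
    files.pythonhosted.org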

 

I set the proxy for this one as well.

https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:**** \
pip3 --trusted-host pypi.org --trusted-host files.pythonhosted.org install --upgrade pip setuptools && 

Final version

$ vi kubernetes/dockerfiles/spark/bindings/python/Dockerfile

ARG base_img

FROM $base_img

WORKDIR /

# Reset to root to run installation tasks
USER 0

RUN mkdir ${SPARK_HOME}/python
RUN https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:****  apt-get update && \
    https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:****  apt install -y python3 python3-pip && \
    https_proxy=http://1**.2**.**.2**:**** http_proxy=http://1**.2**.**.2**:****  pip3 --trusted-host pypi.org --trusted-host files.pythonhosted.org install --upgrade pip setuptools && \
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
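With both Dockerfiles patched, re-running the same build command from earlier goes through:

# Same command as before, now against the modified Dockerfiles
./bin/docker-image-tool.sh \
  -r spark-base -t 1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build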

It took quite a detour to get here.

Thanks to this, I can now add tools like vim, curl, netstat, and tcpdump, or libraries like py4j, even inside the internal network; a sketch follows below.
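This is my own illustration, not part of the stock Dockerfile; the package list and proxy values are placeholders, but the extra tools go in through the same proxy-prefixed, trusted-host pattern:

# Extra debugging tools and Python libraries layered onto the python Dockerfile
RUN https_proxy=http://<proxy-host>:<port> http_proxy=http://<proxy-host>:<port> \
        apt-get update && \
    https_proxy=http://<proxy-host>:<port> http_proxy=http://<proxy-host>:<port> \
        apt install -y vim curl net-tools tcpdump && \
    https_proxy=http://<proxy-host>:<port> http_proxy=http://<proxy-host>:<port> \
        pip3 --trusted-host pypi.org --trusted-host files.pythonhosted.org install py4j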

 

This post ended up longer than I expected. :)

Links

GitHub - apache/spark: Apache Spark - A unified analytics engine for large-scale data processing
https://github.com/apache/spark

 

 
