In the previous post, I checked minio results with plain Python, without Kubernetes.
In this post, I check whether minio can be queried from inside a Kubernetes pod.
The YouTube tutorial splits into two big parts; this post follows the spark-operator part.
1. minio on spark: print results via minio > pyenv > spark-submit
2. kubernetes: print results in a Kubernetes pod using spark-operator
1. Reading minio data with pyspark inside a POD
1. Run a spark operator POD
> spark operator is a project released by Google that sets up the environment you need to run Spark on Kubernetes (it's easy). Note: the chart repo may need to be added first (helm repo add spark-operator https://googlecloudplatform.github.io/spark-operator, per the project README).
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
$ helm install sparkoperator spark-operator/spark-operator
NAME: sparkoperator
LAST DEPLOYED: Tue Sep 14 15:54:30 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
# k get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
minio-789b6fbc4c-tbtbw 1/1 Running 0 12d 19.233.98.111 kube01 <none> <none>
sparkoperator-spark-operator-58d4d9b449-fwrwp 1/1 Running 0 8d 19.233.89.12 kube04 <none> <none>
2. Build the image that will run pyspark
> You could use the image Google publishes, but it reportedly has errors due to compatibility issues with hadoop 2.7. You can instead build a base image yourself, matched to your environment.
> Building the image on an M1 MacBook (arm) may throw errors on centos or ubuntu.
> If you are behind a proxy, the build can fail because apt cannot reach the debian https mirrors.
- Before setting the proxy
$ ./bin/docker-image-tool.sh -r spark-base -t 1.0.0 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon 356.8MB
Step 1/18 : ARG java_image_tag=11-jre-slim
Step 2/18 : FROM openjdk:${java_image_tag} ---> e4beed9b17a3
Step 3/18 : ARG spark_uid=185 ---> Using cache ---> b098f4c33b7f
Step 4/18 : RUN set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/* ---> Running in b7962b7dc9af
+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list
+ apt-get update
Err:1 https://deb.debian.org/debian bullseye InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Err:2 http://security.debian.org/debian-security bullseye-security InRelease Connection failed [IP: 151.101.2.132 80]
Err:3 https://deb.debian.org/debian bullseye-updates InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Reading package lists...
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Failed to fetch http://security.debian.org/debian-security/dists/bullseye-security/InRelease Connection failed [IP: 151.101.2.132 80]
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye-updates/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Some index files failed to download. They have been ignored, or old ones used instead.
+ ln -s /lib /lib64
+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
- After setting the proxy
- Commented-out code
- set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list &&
- Added code
- ENV DEBIAN_FRONTEND=noninteractive \
TZ=Asia/Seoul \
https_proxy=http://1**.2**.**.2**:80** \
http_proxy=http://1**.2**.**.2**:80**
# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
#RUN set -ex && \
# sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
RUN apt-get update && \
    ln -s /lib /lib64 && \
    apt-get install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*
$ wget https://mirror.navercorp.com/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
$ tar zxvf spark-3.1.2-bin-hadoop3.2.tgz
$ cd ./spark-3.1.2-bin-hadoop3.2
# ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
## the image is named after the -r registry, e.g. <repo>/spark-py:1.0.0
$ ./bin/docker-image-tool.sh \
-r deet1107 \
-t 1.0.0 \
-p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
## my docker hub id is deet1107, so the image is created like below
$ docker images
deet1107/spark-py 1.0.0 86aed3057138 4 hours ago 880MB
$ docker push deet1107/spark-py:1.0.0
3. Build an image from the spark base image, with main.py included
$ tree sparkjob/
sparkjob/
├── Dockerfile
└── main.py
-------------
$ vi Dockerfile
FROM deet1107/spark-py:1.0.0
USER root
WORKDIR /app
COPY main.py .
-------------
$ vi main.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def load_config(spark_context: SparkContext):
    # S3A settings for minio, applied to the Hadoop configuration
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'myaccesskey')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'mysecretkey')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.path.style.access', 'true')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 'http://19.233.98.111:9000')  # the minio pod IP from k get pods -o wide
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.connection.ssl.enabled', 'false')

load_config(spark.sparkContext)

# read everything in the test bucket and print the average amount
dataframe = spark.read.json('s3a://test/*')
average = dataframe.agg({'amount': 'avg'})
average.show()
-------------
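As an aside, the same S3A settings can also be passed through the public builder API instead of reaching into the private _jsc handle: any key prefixed with spark.hadoop. is copied into the Hadoop configuration. A minimal sketch, using the same placeholder endpoint and keys as main.py above:
-------------
from pyspark.sql import SparkSession

# Same S3A settings as main.py, but via builder.config: keys prefixed
# with "spark.hadoop." are forwarded to the Hadoop configuration,
# so the private _jsc handle is not needed.
spark = (
    SparkSession.builder
    .config('spark.hadoop.fs.s3a.access.key', 'myaccesskey')
    .config('spark.hadoop.fs.s3a.secret.key', 'mysecretkey')
    .config('spark.hadoop.fs.s3a.path.style.access', 'true')
    .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    .config('spark.hadoop.fs.s3a.endpoint', 'http://19.233.98.111:9000')
    .config('spark.hadoop.fs.s3a.connection.ssl.enabled', 'false')
    .getOrCreate()
)

spark.read.json('s3a://test/*').agg({'amount': 'avg'}).show()
-------------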
# build
$ docker build -t deet1107/test-k8s:v1.0.0 .
# push
$ docker push deet1107/test-k8s:v1.0.0
4. Create a pod from the image
$ vi k8s.yml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: sparkjob #pyspark-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  imagePullSecrets:
    - regcred
  image: "deet1107/test-k8s:v1.0.0" #spark-py:v3.1.1
  imagePullPolicy: Always
  mainApplicationFile: local:///app/main.py #local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.1.2" #"3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 3.1.2
    serviceAccount: sparkoperator-spark-operator #sparkoperator
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.2
  deps:
    jars:
      - https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
      - https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
      # got an ssl error on the internal network, so I hosted the jars on a tomcat server instead
      # - http://10.***.29.77/files/sparkjob/hadoop-aws-3.2.0.jar
      # - http://10.***.29.77/files/sparkjob/aws-java-sdk-bundle-1.11.375.jar
-------------
$ kubectl apply -f k8s.yml
# check the spark-job
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
minio-789b6fbc4c-tbtbw 1/1 Running 0 12d
sparkjob-driver 0/1 Completed 0 5d16h
sparkoperator-spark-operator-58d4d9b449-fwrwp 1/1 Running 0 8d
-------------
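Besides kubectl, the SparkApplication status can also be polled from Python with the official kubernetes client. A minimal sketch, assuming the kubernetes package is installed and a kubeconfig is reachable (neither is part of the setup above):
-------------
from kubernetes import client, config

# Poll the SparkApplication created by k8s.yml through the CRD API.
# The status layout follows the spark-operator v1beta2 CRD; adjust
# the field path if your operator version differs.
config.load_kube_config()
api = client.CustomObjectsApi()
app = api.get_namespaced_custom_object(
    group='sparkoperator.k8s.io',
    version='v1beta2',
    namespace='default',
    plural='sparkapplications',
    name='sparkjob',
)
print(app.get('status', {}).get('applicationState', {}).get('state'))
-------------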
# check the spark-job logs after ~10 seconds
$ kubectl logs sparkjob-driver
(result)
+-----------+
|avg(amount)|
+-----------+
| 2800.0|
+-----------+
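If the driver log shows S3A errors instead of this table, it helps to first rule out connectivity to minio. A minimal sketch, assuming boto3 is installed (not part of the setup above) and using the same placeholder endpoint and credentials as main.py:
-------------
import boto3

# Sanity check: list the test bucket directly against minio. If this
# fails, the problem is connectivity or credentials, not the Spark job.
s3 = boto3.client(
    's3',
    endpoint_url='http://19.233.98.111:9000',
    aws_access_key_id='myaccesskey',
    aws_secret_access_key='mysecretkey',
)
for obj in s3.list_objects_v2(Bucket='test').get('Contents', []):
    print(obj['Key'], obj['Size'])
-------------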
This confirmed the example: pods talking to each other inside the cluster, with spark-submit reading the data stored in minio.
Now that hello world is done, the next step is connecting to minio instances outside the cluster :)
Thanks to Brad Sheppard the testing went smoothly (Thanks bro!)
1. minio on spark: print results via minio > pyenv > spark-submit
https://mightytedkim.tistory.com/27?category=922753