
YouTube) Following a k8s + spark + minio hands-on tutorial_2 :: mightytedkim

MightyTedKim 2021. 9. 21. 14:11

In the previous post, I checked the MinIO results with Python, without Kubernetes.

 

Previous post: Kubernetes) How to run Spark with Minio in Kubernetes_1
https://mightytedkim.tistory.com/27?category=922753

In this post, I check that the MinIO data can be read from inside a Kubernetes pod.

 

Reference: https://www.youtube.com/watch?v=ZzFdYm_DqEM&t=307s

The video is split into two broad parts; in this post I followed the spark-operator part.

1. minio on spark: output the result via minio > pyenv > spark-submit

2. kubernetes: output the result from a Kubernetes pod using the spark operator


1. Reading MinIO data with PySpark from a pod

1. Run the spark operator pod

> The spark operator is a project released by Google that sets up the environment you need to run Spark on Kubernetes. (It's easy.)

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
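Note: if the spark-operator chart repo hasn't been registered with Helm yet, the install below won't find the chart. A minimal sketch, assuming the chart is still served from the project's GitHub Pages (check the README if it has moved):

$ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
$ helm repo update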

$ helm install sparkoperator spark-operator/spark-operator
NAME: sparkoperator
LAST DEPLOYED: Tue Sep 14 15:54:30 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

# k get pods -o wide
NAME                                            READY   STATUS      RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
minio-789b6fbc4c-tbtbw                          1/1     Running     0          12d     19.233.98.111   kube01   <none>           <none>
sparkoperator-spark-operator-58d4d9b449-fwrwp   1/1     Running     0          8d      19.233.89.12    kube04   <none>           <none>
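> Besides the operator pod, you can confirm that the operator registered its CRDs, since the SparkApplication resource used later depends on them. A quick check (the resource names in the comment are the ones the operator project defines, so adjust if your chart version differs):

$ kubectl get crd | grep sparkoperator.k8s.io
# expect sparkapplications.sparkoperator.k8s.io and scheduledsparkapplications.sparkoperator.k8s.io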

2. Build the image that will run PySpark

> You can use the image Google publishes, but it reportedly has errors due to compatibility issues with Hadoop 2.7. Instead, you can build your own base image to match your environment.

> If you build the image on an M1 MacBook (ARM), you may run into errors with CentOS or Ubuntu.

> If you are behind a proxy, the build can fail because the Debian HTTPS repositories cannot be reached.
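> If the proxy is the only issue, an alternative to editing the Dockerfile (as done below) is to pass the proxy as Docker build args; http_proxy/https_proxy are predefined build args that docker build honors without an ARG declaration. A sketch with a placeholder proxy address, assuming your Spark version's docker-image-tool.sh supports the -b option:

$ ./bin/docker-image-tool.sh -r spark-base -t 1.0.0 \
  -b http_proxy=http://<proxy-host>:<port> \
  -b https_proxy=http://<proxy-host>:<port> \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build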

 

- Before the proxy change


$ ./bin/docker-image-tool.sh -r spark-base -t 1.0.0 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon 356.8MB
Step 1/18 : ARG java_image_tag=11-jre-slim
Step 2/18 : FROM openjdk:${java_image_tag} ---> e4beed9b17a3
Step 3/18 : ARG spark_uid=185 ---> Using cache ---> b098f4c33b7f
Step 4/18 : RUN set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/* ---> Running in b7962b7dc9af
+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list
+ apt-get update
Err:1 https://deb.debian.org/debian bullseye InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Err:2 http://security.debian.org/debian-security bullseye-security InRelease Connection failed [IP: 151.101.2.132 80]
Err:3 https://deb.debian.org/debian bullseye-updates InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
Reading package lists...
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Failed to fetch http://security.debian.org/debian-security/dists/bullseye-security/InRelease Connection failed [IP: 151.101.2.132 80]
W: Failed to fetch https://deb.debian.org/debian/dists/bullseye-updates/InRelease Could not handshake: Error in the pull function. [IP: 146.75.50.132 443]
W: Some index files failed to download. They have been ignored, or old ones used instead.
+ ln -s /lib /lib64
+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

- After the proxy change

  • Commented-out code
    • set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && 
  • Added code
    • ENV DEBIAN_FRONTEND=noninteractive \
          TZ=Asia/Seoul \
          https_proxy=http://1**.2**.**.2**:80** \
          http_proxy=http://1**.2**.**.2**:80**
      # Before building the docker image, first build and make a Spark distribution following
      # the instructions in http://spark.apache.org/docs/latest/building-spark.html.
      # If this docker file is being used in the context of building your images from a Spark
      # distribution, the docker build command should be invoked from the top level directory
      # of the Spark distribution. E.g.:
      # docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

      #RUN set -ex && \
      #    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
      RUN apt-get update && \
          ln -s /lib /lib64 && \
          apt-get install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && \
          mkdir -p /opt/spark && \
          mkdir -p /opt/spark/examples && \
          mkdir -p /opt/spark/work-dir && \
          touch /opt/spark/RELEASE && \
          rm /bin/sh && \
          ln -sv /bin/bash /bin/sh && \
          echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
          chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
          rm -rf /var/cache/apt/*



$ wget https://mirror.navercorp.com/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
$ tar zxvf spark-3.1.2-bin-hadoop3.2.tgz
$ cd ./spark-3.1.2-bin-hadoop3.2

# ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
## -r sets the registry/repo prefix of the image, e.g. spark-base/spark-py:1.0.0
$ ./bin/docker-image-tool.sh \
  -r deet1107 \
  -t 1.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

## my Docker Hub id is deet1107, so the image is created as below
$ docker images
deet1107/spark-py                               1.0.0               86aed3057138        4 hours ago         880MB

$ docker push deet1107/spark-py:1.0.0

 

3. Build an image that includes main.py on top of the spark-py base image

$ tree sparkjob/
sparkjob/
├── Dockerfile
├── main.py
-------------

$ vi Dockerfile
FROM deet1107/spark-py:1.0.0
USER root
WORKDIR /app
COPY main.py .
-------------

$ vi main.py
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def load_config(spark_context: SparkContext):
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'myaccesskey')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'mysecretkey')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.path.style.access', 'true')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 'http://19.233.98.111:9000') # the MinIO pod IP from `k get pods -o wide`
    spark_context._jsc.hadoopConfiguration().set('fs.s3a.connection.ssl.enabled', 'false')

load_config(spark.sparkContext)
dataframe = spark.read.json('s3a://test/*')
average = dataframe.agg({'amount': 'avg'})
average.show()
-------------
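As an aside, the same s3a options can be supplied when the session is created instead of mutating the Hadoop configuration afterwards, because Spark copies any spark.hadoop.* setting into the Hadoop configuration. A minimal sketch using the same placeholder keys and MinIO endpoint as main.py:

from pyspark.sql import SparkSession

# spark.hadoop.* settings are forwarded to the underlying Hadoop configuration,
# so this is equivalent to the load_config() calls in main.py above.
spark = (
    SparkSession.builder
    .config('spark.hadoop.fs.s3a.access.key', 'myaccesskey')
    .config('spark.hadoop.fs.s3a.secret.key', 'mysecretkey')
    .config('spark.hadoop.fs.s3a.endpoint', 'http://19.233.98.111:9000')
    .config('spark.hadoop.fs.s3a.path.style.access', 'true')
    .config('spark.hadoop.fs.s3a.connection.ssl.enabled', 'false')
    .getOrCreate()
)

dataframe = spark.read.json('s3a://test/*')
-------------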
# build
$ docker build -t deet1107/test-k8s:v1.0.0 .

# push
$ docker push deet1107/test-k8s:v1.0.0

 

4. Create the pod from the image

$ vi k8s.yml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: sparkjob #pyspark-pi
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  imagePullSecrets:
   - regcred
  image: "deet1107/test-spark-k8s:v1.1.0"  #spark-py:v3.1.1
  imagePullPolicy: Always
  mainApplicationFile: local:///app/main.py #local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.1.2" #"3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "1000m"
    memory: "512m"
    labels:
      version: 3.1.2
    serviceAccount: sparkoperator-spark-operator #sparkoperator
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.1.2
  deps:
    jars:
      - https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
      - https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
# got SSL errors on the internal network, so the jars were hosted on a Tomcat path instead
#      - http://10.***.29.77/files/sparkjob/hadoop-aws-3.2.0.jar
#      - http://10.***.29.77/files/sparkjob/aws-java-sdk-bundle-1.11.375.jar
-------------
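The two jars under deps are downloaded when the application starts, and the hadoop-aws version should stay aligned with the Hadoop version in the image (3.2.0 here matches the hadoop3.2 distribution). On a restricted network where that download fails (the reason for the commented-out Tomcat workaround above), another option is to bake the jars into the image so nothing has to be fetched at submit time. A rough sketch, assuming the hadoop3.2-based image built earlier and the standard /opt/spark/jars directory:

$ wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
$ wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
# then, in the sparkjob Dockerfile:
#   COPY hadoop-aws-3.2.0.jar aws-java-sdk-bundle-1.11.375.jar /opt/spark/jars/
# and drop the deps.jars section from k8s.yml
-------------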

$ kubectl apply -f k8s.yml

# check the spark job pods
$ kubectl get pods
NAME                                            READY   STATUS      RESTARTS   AGE
minio-789b6fbc4c-tbtbw                          1/1     Running     0          12d
sparkjob-driver                                 0/1     Completed   0          5d16h
sparkoperator-spark-operator-58d4d9b449-fwrwp   1/1     Running     0          8d
-------------
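The operator also tracks the job as a SparkApplication object, so its state and events can be checked directly as well (assuming the CRDs installed with the chart in step 1):

$ kubectl get sparkapplication sparkjob
$ kubectl describe sparkapplication sparkjob
-------------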

# after about 10 seconds, check the spark job's logs
$ kubectl logs sparkjob-driver

(result)
+-----------+
|avg(amount)|
+-----------+
|     2800.0|
+-----------+
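To run the job again (restartPolicy is Never), delete the SparkApplication and re-apply it; the driver pod is owned by it, so it gets cleaned up along with it. A small sketch:

$ kubectl delete sparkapplication sparkjob
$ kubectl apply -f k8s.yml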

 

 

This confirms the example: the pods talk to each other inside the cluster, and the spark-submit job reads the data stored in MinIO.

Now that the hello world is done, the next step is to connect to MinIO instances outside the cluster. :)

 

Thanks to Brad Sheppard, the testing went smoothly. (Thanks bro!)


Related: 1. minio on spark — output the result via minio > pyenv > spark-submit

Kubernetes) How to run Spark with Minio in Kubernetes_1
https://mightytedkim.tistory.com/27?category=922753

 
