kubernetes) rook-ceph mgr 2 setting not working

MightyTedKim 2022. 3. 14. 14:50

I tried changing the rook-ceph mgr count from 1 to 2.

Why: the mgr pod was Running, but object storage suddenly stopped once before.

I changed mgr from 1 -> 2 in cluster.yaml and re-applied it.
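
For reference, a minimal sketch of that change (assuming the standard Rook CephCluster CRD, where the knob is spec.mgr.count, and the default cluster name rook-ceph; the patch is equivalent to editing cluster.yaml and re-applying):

$ kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"mgr":{"count":2}}}'
# watch the second mgr pod (rook-ceph-mgr-b) come up
$ kubectl -n rook-ceph get pods -l app=rook-ceph-mgr -w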

It deploys fine, but the ceph dashboard doesn't work.

With curl, sometimes the page comes back and sometimes it doesn't.

For starters, the mgr error log looks like this:

debug 2022-03-14T04:57:14.890+0000 7f4561489700  0 [dashboard ERROR exception] Internal Server Error
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 46, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/home.py", line 135, in __call__
    return serve_file(full_path)
  File "/usr/lib/python3.6/site-packages/cherrypy/lib/static.py", line 77, in serve_file
    cptools.validate_since()
  File "/usr/lib/python3.6/site-packages/cherrypy/lib/cptools.py", line 116, in validate_since
    raise cherrypy.HTTPRedirect([], 304)
cherrypy._cperror.HTTPRedirect: ([], 304)

As an aside, the HTTPRedirect([], 304) above looks like CherryPy's normal If-Modified-Since handling (a 304 Not Modified) being logged as an error, so the traceback itself may be a red herring.

Hitting it with curl, there are three possible outcomes:

a redirect, the normal page, and a redirect followed by a screen with only the icon

$ curl http://10.***.**.**:30010/
This resource can be found at <a href="http://0.0.0.0:8443/">http://0.0.0.0:8443/</a>

$ curl http://10.***.**.**:30010/
<!doctype html>
<html lang="en-US">
<head>
  <meta charset="utf-8">
  <title>Ceph</title>

  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
  <link rel="icon" type="image/x-icon" id="cdFavicon" href="favicon.ico">
<link rel="stylesheet" href="styles.7918cb8dc788b3eedc95.css"></head>
<body>
  <noscript>
    <div class="noscript container"
         ng-if="false">
      <div class="jumbotron alert alert-danger">
        <h2 i18n>JavaScript required!</h2>
        <p i18n>A browser with JavaScript enabled is required in order to use this service.</p>
        <p i18n>When using Internet Explorer, please check your security settings and add this address to your trusted sites.</p>
      </div>
    </div>
  </noscript>

  <cd-root></cd-root>
<script src="runtime.fcd694c3eff5ef104b53.js" defer></script><script src="polyfills.b66d1515aae6fe3887b1.js" defer></script><script src="scripts.6bda3fa7e09a87cd4228.js" defer></script><script src="main.b78c1bf5c30e15315e18.js" defer></script></body>
</html>

$ curl http://10.***.**.**:30010/
This resource can be found at <a href="http://0.0.0.0:8443/">http://0.0.0.0:8443/</a>.
$ curl http://10.***.**.**:30010/
This resource can be found at <a href="http://0.0.0.0:8443/">http://0.0.0.0:8443/</a>.
$ curl http://10.***.**.**:30010/
<!doctype html>
(...same dashboard HTML as the normal case above...)
</html>
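
A quick way to tell which case a given request hit, without reading the body, is to print only the status code and redirect target. For the redirect case it should print something like the line below, matching the 303s in the mgr request log further down:

$ curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' http://10.***.**.**:30010/
303 http://0.0.0.0:8443/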

Presumably there is logic that switches you over to the active mgr, and my setup, which connects through a NodePort, is where it goes wrong.

Unlike when mgr was 1, the mgr pod now has two containers: a watch-active sidecar got added. Seeing the 302/303 HTTP codes, it really is a redirect.
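
To double-check which mgr is actually active (assuming the standard rook-ceph-tools toolbox deployment is installed):

$ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mgr stat
# prints JSON including "active_name" -- it should be "a" here, matching the watch-active log below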

$k logs rook-ceph-mgr-a-698cddf956-wtcpt -n rook-ceph --follow
error: a container name must be specified for pod rook-ceph-mgr-a-698cddf956-wtcpt, choose one of: 
[mgr watch-active] or one of the init containers: [chown-container-data-dir]

$k logs rook-ceph-mgr-a-698cddf956-wtcpt -n rook-ceph mgr --follow
debug 2022-03-14T05:02:08.971+0000 7efd2e534700  0 [dashboard INFO root] Redirecting to active 'http://0.0.0.0:8443/'
debug 2022-03-14T05:02:08.972+0000 7efd2e534700  0 [dashboard INFO request] [10.244.0.0:46867] [GET] [303] [0.002s] [86.0B] [6ae59ae5-084a-4791-bf44-a58b3eba29ed] /
debug 2022-03-14T05:02:11.793+0000 7efd2dd33700  0 [dashboard INFO root] Redirecting to active 'http://0.0.0.0:8443/'
debug 2022-03-14T05:02:11.794+0000 7efd2dd33700  0 [dashboard INFO request] [10.244.0.0:1655] [GET] [303] [0.002s] [86.0B] [53cf9e5b-d862-4f1e-b27f-3b7f3ba41d09] /
^C

$k logs rook-ceph-mgr-a-698cddf956-wtcpt -n rook-ceph watch-active --follow

2022-03-14 05:05:15.886318 I | op-mgr: mgr service currently set to "a", checking if need to update to "b"
2022-03-14 05:05:16.225504 I | op-mgr: no need for the mgr update since the active mgr is "a", rather than the local mgr "b"
2022-03-14 05:05:16.225532 I | cephcmd: successfully reconciled services. checking again in 15s
2022-03-14 05:05:31.231644 I | op-mgr: mgr service currently set to "a", checking if need to update to "b"
2022-03-14 05:05:31.484025 I | op-mgr: no need for the mgr update since the active mgr is "a", rather than the local mgr "b"
2022-03-14 05:05:31.484042 I | cephcmd: successfully reconciled services. checking again in 15s
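
So watch-active is the sidecar that keeps the mgr services pointed at the active daemon, reconciling every 15s. To see where dashboard traffic is actually being routed (assuming Rook's default rook-ceph-mgr-dashboard service; the same check applies to whatever NodePort service sits on 30010):

$ kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard
$ kubectl -n rook-ceph get endpoints rook-ceph-mgr-dashboard
# if the endpoint IP belongs to the standby mgr's pod, requests get the 303 redirect seen above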

Hmm, after some googling, there's a post where someone rolled back to 1 mgr because of the same problem.


After the Rook-Community meeting held past 11/15/2019 , we decided to not continue adding two managers in the rook cluster to avoid problems between k8 load balancer, k8 resiliency policies and Ceph system.
However, we continue with the modification to improve the time needed to have a new mgr pod running when the node where this pod runs is "notReady"
This modification is in:
Improve restarting time for mgr,mon,toolbox pods running in a k8's "NotReady" node

Reference: https://github.com/rook/rook/issues/1796
----------------------------------------------
FYI -- I was having issues with the dashboard:

  • lots of http failures in pieces of the dashboard
  • timeouts -- redirecting to an internal cluster IP
  • generally slow responsiveness when it did show something
I am using a loadbalancer service to access the dashboard from a host separate from the k8s cluster, and the comments above about MGRs switching triggered an 'aha!' moment: I had just increased the MGR count from 1 to 2 when it started breaking. Not sure why, but it seems that I was actually talking to BOTH MGRs -- but only one of them was actually using the service?
I reduced back to 1 MGR and the dashboard started working again.
Rook v1.7.4
Ceph v16.2.6

Reference: https://github.com/rook/rook/issues/7988
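
If you want to try the same rollback, it is just the mirror of the change above (same assumptions about the CRD field and cluster name):

$ kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"mgr":{"count":1}}}'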

 

If anyone knows a proper fix, please share!

 

References

https://github.com/rook/rook/issues/1796

https://github.com/rook/rook/issues/7988
