'데이터 엔지니어'로 성장하기

정리하는 걸 좋아하고, 남이 읽으면 더 좋아함

기타/Python

Python) pyspark dataframe overwrite

MightyTedKim 2021. 10. 12. 09:16
728x90
반응형

pandas dataframe만 사용하다가, overwrite가 필요해서 pyspark dataframe을 활용함

내가 필요한건 upsert인데, delta table도 살펴봐야겠음

 

import json
from pyspark import SparkContext, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
import os 

#java_home
os.environ['JAVA_HOME'] = '/home/java/jdk1.8.0_301'

columns = ['amount', 'id']

spark = SparkSession.builder.getOrCreate()
#df1 = spark.read.json("./orders.json")
vals = [(111, 1), (222, 2)]
df1 = spark.createDataFrame(vals, columns)
print(type(df1))
df1.printSchema()

print("show origing")
df1.show()

print("append")
newRow = spark.createDataFrame([(444,4)], columns)
newRow.show()

print("**upsert**")
df2 = df2.union(newRow)

print("show appended")
df2.show()

## 결과 ##

<class 'pyspark.sql.dataframe.DataFrame'>
root
 |-- amount: long (nullable = true)
 |-- id: long (nullable = true)

show origing
+------+---+
|amount| id|
+------+---+
|   111|  1|
|   222|  2|
+------+---+

append
+------+---+
|amount| id|
+------+---+
|   444|  4|
+------+---+

**upsert**
show appended
+------+---+
|amount| id|
+------+---+
|   111|  1|
|   222|  2|
|   444|  4|
+------+---+

 

728x90
반응형

'기타 > Python' 카테고리의 다른 글

Python) code 내에서 변수 초기화  (0) 2022.11.22
Python) parquet upsert with delta table  (0) 2021.10.19
Python) pyspark dataframe append  (0) 2021.10.13