PySpark handles large datasets well. Data often arrives split across several smaller files rather than as one single file. Appending combines those files into a single dataset, and PySpark provides a function for appending multiple DataFrames together. This article explains in detail how to append multiple DataFrames in PySpark.
John has several transaction tables available. He has four months of transactional data: April, May, June and July. Each monthly DataFrame has 6 columns, and the columns are in the same order and the same format in every table. He wants to create a single DataFrame from these tables.
Below are the key steps to follow.
- Step 1: Import the necessary modules and set up the SparkContext and SQLContext.
    import pandas as pd
    import findspark
    findspark.init()
    import pyspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    sc = SparkContext("local", "App Name")
    sql = SQLContext(sc)
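- Step 2: Use the union function to append the DataFrames together. Each DataFrame is added one at a time to the base DataFrame, with one DataFrame per union call. Note that union matches columns by position rather than by name, which is why the columns must be in the same order in every table. Please see the code below.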
    Trx_Data_4Months_Pyspark = (
        Trx_Data_Apr20_Pyspark
        .union(Trx_Data_May20_Pyspark)
        .union(Trx_Data_Jun20_Pyspark)
        .union(Trx_Data_Jul20_Pyspark)
    )
- Step 3: Check the output to confirm the result is as expected. The output DataFrame should have 400 rows, because each of the 4 transaction tables has 100 rows and union keeps every row, including duplicates. Use the show() command to display the top rows of a PySpark DataFrame.
    Trx_Data_4Months_Pyspark.show(10)

    # Print the shape of the DataFrame, i.e. number of rows and number of columns
    print((Trx_Data_4Months_Pyspark.count(), len(Trx_Data_4Months_Pyspark.columns)))