How to Left Join Two DataFrames in PySpark

As the number of fields grows in every industry and every data source, it is almost impossible to store all the variables in a single data table. In practice, data usually arrives as tables spread across multiple files. Whenever the variables need to be brought together in one table, a merge or join helps. A left join creates a table with all rows from the left table, plus the fields from the right table for rows whose keys match; left rows with no match get nulls in the right-hand fields. This article discusses how to left join DataFrames in PySpark.

How to Left Join DataFrames in Python

Amy has two DataFrames. Customer Data 1 has 10 observations, with Customer ID, First Name, Last Name and Gender; Customer ID is the primary key. Customer Data 2 has 12 observations, with Customer ID as the primary key, First Name, Last Name, Country Name and Total Spend in a year. Amy wants the Country and Total Spend information for all the customers in Customer Data 1. That is, she wants to create another DataFrame in which each record of Customer Data 1 also carries the Country and Total Spend from Customer Data 2.
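
To make the expected result concrete before turning to PySpark, here is a minimal plain-Python sketch of what a left join does to Amy's tables. The rows, the shortened column names and the `left_join` helper are all hypothetical and only illustrate the semantics; this is not how Spark implements the join.

```python
# Hypothetical miniature versions of the two customer tables (illustrative values only).
customer_data_1 = [
    {"ID": 1, "First_Name": "Ann", "Gender": "F"},
    {"ID": 2, "First_Name": "Bob", "Gender": "M"},
    {"ID": 3, "First_Name": "Cara", "Gender": "F"},
]
customer_data_2 = [
    {"ID": 1, "Country": "India", "Total_Spend": 120.0},
    {"ID": 3, "Country": "Spain", "Total_Spend": 80.5},
    {"ID": 4, "Country": "Chile", "Total_Spend": 42.0},
]


def left_join(left_rows, right_rows, key):
    """Keep every left row; attach right-side fields when the key matches, else None."""
    # Assumes the key is unique in the right table (it is a primary key in the scenario).
    right_by_key = {row[key]: row for row in right_rows}
    right_fields = [name for name in right_rows[0] if name != key]
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        joined.append({**row, **{name: match.get(name) for name in right_fields}})
    return joined


merged = left_join(customer_data_1, customer_data_2, "ID")
for row in merged:
    print(row)
```

Every customer from the left table appears exactly once. Customer 2 has no match on the right, so its Country and Total_Spend are None, and customer 4, which exists only in the right table, is dropped.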


Below are the key steps to left join PySpark DataFrames:

  • Step 1: Import all the necessary modules.
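import pandas as pd
import findspark
findspark.init()  # make the local Spark installation importable
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Start a local Spark context and wrap it in an SQLContext
sc = SparkContext("local", "App Name")
sql = SQLContext(sc)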
  • Step 2: Use the join function of a PySpark DataFrame to merge the DataFrames. Passing "left_outer" as the join type performs the left join. The key used to join the two DataFrames is defined by a condition of the form table 1 key == table 2 key.
Merged_Data = Customer_Data_1.join(Customer_Data_2,
                                   Customer_Data_1.ID == Customer_Data_2.ID,
                                   "left_outer")
  • Step 3: Check the quality of the output by inspecting the observations in the final DataFrame. Because Customer Data 1 has 10 observations and Customer ID is the primary key of Customer Data 2, the final DataFrame also has 10 observations. Use the show() command to display the top rows of a PySpark DataFrame (the first 20 by default).
Merged_Data.show()
#Print Shape of the file, i.e. number of rows and number of columns
print((Merged_Data.count(), len(Merged_Data.columns)))