How to Drop Duplicates in PySpark

PySpark, the Python API for Apache Spark, makes it easy to run complex ML algorithms on big data. Before working with large DataFrames, it is crucial to prepare the data well, and removing duplicate records is an important part of that: real-world data quality is often poor, and duplicate rows can skew analysis results. This article shows how to drop duplicates in PySpark.

Emma has customer data available for her business. During an early assessment she discovers duplicates in the dataset, so she wants to clean the data before proceeding with further analysis.

Drop Duplicates in PySpark

Below is a step-by-step approach to removing duplicates in PySpark.

  • Step 1: Import the necessary modules and set up the SparkContext and SQLContext.
# Locate the local Spark installation before importing pyspark
import findspark
findspark.init()

import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create the Spark entry points
sc = SparkContext("local", "App Name")
sql = SQLContext(sc)
  • Step 2: Use the dropDuplicates function to drop duplicate rows from the PySpark DataFrame. Since we want to drop duplicates at the row level (i.e. considering all columns), no additional parameters need to be passed. Press Ctrl+Enter, or run the cell, to create the clean dataset. To illustrate, below is the syntax for the example:
Customer_data_Pysparkdf2 = Customer_data_Pysparkdf.dropDuplicates()
  • Step 3: Check the number of rows to confirm everything looks right. The original DataFrame had 1003 rows; the new DataFrame has 1000 rows after duplicate removal. Use the count() method to count the rows of a PySpark DataFrame.
Customer_data_Pysparkdf2.count()

The output DataFrame contains 1000 rows, which is 3 fewer than the original, confirming that the duplicates are no longer present in the final dataset.

To get top certifications in PySpark and build your resume, visit here. Additionally, you can read the books listed here to build strong knowledge of PySpark.

Visit us below for the video tutorial:

Looking to practice more with this example? Drop us a note and we will email you the code file:

