How to calculate top 5 max values in PySpark

Aggregation of fields is one of the basic necessities of data analysis and data science, and PySpark provides easy ways to aggregate data and calculate metrics. Finding the top 5 maximum values for each group can also be achieved while grouping the data: the combination that helps here is a Window partition together with the rank() function. The article below explains, with the help of an example, how to calculate the top 5 max values by group in PySpark.

John has store sales data available for analysis. There are five columns in the data: Geography (country of the store), Department (industry category of the store), StoreID (unique ID of each store), Time Period (month of sales), and Revenue (total sales for the month). John wants to calculate the top 5 maximum revenue values for each Geography.


Example 1: Top 5 max values for each Geography

  • Step 1: First, import all the necessary modules and initialize Spark.
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

sc = SparkContext("local", "App Name")
sql = SQLContext(sc)
  • Step 2: Sort the data at the Geography level by the Revenue field. Use Window.partitionBy to partition by Geography and orderBy to sort the DataFrame.
#Sort by Revenue in Geography Group
window = Window.partitionBy(df1['Geography']).orderBy(df1['Revenue'].desc())
  • Step 3: Use rank() over the window to rank the rows and keep the top 5 values.
#Find top 5
df1.select('*', rank().over(window).alias('rank')).filter(col('rank') <= 5).show() 

Example 2: Top 5 max values for each Month / Time Period

  • Here we calculate the top 5 max values across each time period, so the field in the partitionBy operation will be “Time Period”.
#Sort by Revenue in Time Period Group
window = Window.partitionBy(df1['Time Period']).orderBy(df1['Revenue'].desc())
#Find top 5
df1.select('*', rank().over(window).alias('rank')).filter(col('rank') <= 5).show() 
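For comparison, the same per-group top-5 extraction in plain pandas is a one-liner: sort by Revenue and take the head of each group. This pandas counterpart, with made-up data, is only a sketch and not part of John's PySpark workflow:

```python
import pandas as pd

# Made-up revenue data for illustration
pdf = pd.DataFrame({
    "Geography": ["UK"] * 6 + ["US"] * 6,
    "Revenue":   [10, 40, 30, 60, 50, 20, 15, 45, 35, 65, 55, 25],
})

# Top 5 revenues per Geography, analogous to the Window/rank approach above
top5 = (
    pdf.sort_values("Revenue", ascending=False)
       .groupby("Geography")
       .head(5)
)
print(top5)
```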

Thus, John is able to calculate the values he needs in PySpark. This kind of extraction is a requirement in many scenarios and use cases; this example covers one of them.

To get top certifications in PySpark and build your resume, visit here. Additionally, you can read the books listed here to build strong knowledge of PySpark.


