Top 5 max values per group in PySpark
```python
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

# Initialize Spark (SQLContext is the older entry point; SparkSession is preferred in Spark 2+)
sc = SparkContext("local", "App Name")
sql = SQLContext(sc)
```
```python
# df1 is assumed to have been loaded earlier, with Geography and Revenue columns

# Sort by Revenue (descending) within each Geography group
window = Window.partitionBy(df1['Geography']).orderBy(df1['Revenue'].desc())

# Find the top 5 rows per group
df1.select('*', rank().over(window).alias('rank')).filter(col('rank') <= 5).show()
```
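Since the title mentions pandas, the same top-5-per-group logic can be sketched in plain pandas as well. The sample DataFrame below is a made-up stand-in for `df1`; `groupby(...).rank(method="min", ascending=False)` plays the role of `rank().over(window)`:

```python
import pandas as pd

# Hypothetical sample data mirroring df1 (Geography and Revenue columns assumed)
df1 = pd.DataFrame({
    "Geography": ["APAC", "APAC", "APAC", "EMEA", "EMEA", "EMEA"],
    "Revenue":   [100, 300, 200, 500, 400, 600],
})

# Rank revenue within each Geography group, highest first
df1["rank"] = df1.groupby("Geography")["Revenue"].rank(method="min", ascending=False)

# Keep rows ranked in the top 5 of their group
top5 = df1[df1["rank"] <= 5].sort_values(["Geography", "rank"])
print(top5)
```

With only three rows per group here, every row survives the `rank <= 5` filter; on real data the cut-off removes the rest of each group.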
Example 2: Top 5 max values for each Month / Time Period
```python
# Sort by Revenue (descending) within each Time Period group
window = Window.partitionBy(df1['Time Period']).orderBy(df1['Revenue'].desc())

# Find the top 5 rows per time period
df1.select('*', rank().over(window).alias('rank')).filter(col('rank') <= 5).show()
```
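One subtlety worth noting: `rank()` gives tied values the same rank and skips the following ranks, so `filter(col('rank') <= 5)` can return more than five rows per group when revenues tie at the cut-off. PySpark also provides `dense_rank` and `row_number` for other tie behaviors. A small pandas sketch of the difference (the values are made up):

```python
import pandas as pd

# Hypothetical revenues with a tie, to show how tie handling affects a top-N cut-off
s = pd.Series([600, 500, 500, 400])

# SQL RANK: ties share a rank and the next rank is skipped
sql_rank = s.rank(method="min", ascending=False)
print(sql_rank.tolist())    # [1.0, 2.0, 2.0, 4.0]

# SQL DENSE_RANK: ties share a rank, no gaps
dense_rank = s.rank(method="dense", ascending=False)
print(dense_rank.tolist())  # [1.0, 2.0, 2.0, 3.0]
```

If you need exactly N rows per group regardless of ties, `row_number()` is the usual choice instead of `rank()`.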
Thus, John can extract the values he needs in PySpark. This kind of top-N-per-group extraction comes up in many scenarios and use cases; this example covers one of them.
To get top certifications in PySpark and build your resume, visit here. Additionally, you can read the books listed here to build strong knowledge of PySpark.
📬 Stay Ahead in Data Science & AI – Subscribe to Newsletter!
- 🎯 Interview Series: Curated questions and answers for freshers and experienced candidates.
- 📊 Data Science for All: Simplified articles on key concepts, accessible to all levels.
- 🤖 Generative AI for All: Easy explanations on Generative AI trends transforming industries.
💡 Why Subscribe? Gain expert insights, stay ahead of trends, and prepare with confidence for your next interview.
👉 Subscribe here: