PySpark, the Python API for Apache Spark, makes it easy to deploy complex ML algorithms on big data. Before working with larger DataFrames, it is crucial to process the data well. Removing duplicate records is one important part of that processing. Many a time, data quality...


Data quality may be good, or it may sometimes fall short of expectations. In many cases some data cleaning is required. Sometimes the column names are not up to the mark and can have...


As the world of data grows, corporations are maintaining ever more detailed datasets. The number of columns is increasing day by day. It sometimes becomes very difficult to work with data that has many columns. So there exists a need of...


Comma-separated values (CSV) files remain one of the main formats for storing data. They can hold a small number of rows as well as large datasets. Most analyses start with reading data into the coding environment. Reading CSV...


Pandas offers some great functions for processing a dataset. A data file can contain duplicates at the row level. Dropping duplicates is very important, as the duplicate rows will create noise in any analysis. Sometimes the duplicates can...


Python code saved in Jupyter notebooks is stored in the ipynb format. This is an interactive file, with charts, data, and images all captured along with the code. Due to its interactive nature, the ipynb format is gaining popularity. Now Python code is mostly...
