Python Libraries – Pandas

Introduction to Python Libraries – Pandas

Python is a great tool for making sense of data efficiently. It was not built for data science, but it is now quite popular in that field. The reason is that, over the years, developers have added libraries that make common data tasks quick to complete. The four essential libraries that you should know about are:

  1. Pandas
  2. NumPy
  3. Matplotlib
  4. scikit-learn

For this article I will focus on Pandas. Of all the libraries, this is my favorite, for the simple reason that I like to see data in tables, split into rows and columns. For SQL users this library will be like meeting an old friend, though the functionality differs in places. You can load data from a CSV or Excel file and output it in the format that you require. We will go over a few commands that help in working with a DataFrame. As with all software, getting to know what the library can do is half the battle won.

To import Pandas, run the command below:

import pandas as pd

Then, to read in a CSV file, use this command:

df = pd.read_csv('xxxx.csv')
print(df)
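To try read_csv without a file on disk, here is a minimal sketch that reads CSV text from memory. The column names and values are made up for illustration, and io.StringIO simply stands in for a real file such as the one named above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,city,sales
Asha,Pune,120
Ben,Delhi,95
Chen,Mumbai,143
"""

# read_csv accepts a file path or any file-like object.
sales_df = pd.read_csv(io.StringIO(csv_text))

print(sales_df.shape)          # (rows, columns)
print(list(sales_df.columns))  # header names taken from the first line
print(sales_df.dtypes)         # column types inferred by pandas
```

With a real file you would pass the path instead of the StringIO object.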

The data is loaded into a DataFrame automatically, and it looks much like the CSV file itself. The command below returns the first 5 rows of the dataset, allowing you to inspect a small sample of the data.

df.head()

To remove rows, pass their index labels to drop():

df.drop(['xxx'])

The above command drops rows by index label. It returns a new DataFrame and leaves df unchanged unless you assign the result. If you add axis=1, the labels refer to the columns that you want to drop instead:

df.drop(['xxx'], axis=1)
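As a rough illustration of head() and drop() together, the sketch below uses a small made-up DataFrame. The names people, without_row and without_col are my own and not part of Pandas.

```python
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dev", "Eli", "Fay"],
        "age": [31, 45, 28, 39, 52, 24],
        "city": ["Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Goa"],
    }
)

first_rows = people.head()   # first 5 rows by default
top_two = people.head(2)     # or ask for a specific number of rows

without_row = people.drop([0])               # drop the row with index label 0
without_col = people.drop(["city"], axis=1)  # drop the 'city' column

# drop() returns a new DataFrame, so the original keeps all rows and columns.
print(people.shape, without_row.shape, without_col.shape)
```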

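You can also get summary statistics such as the sum, mean and variance of each column. If the DataFrame has text columns, pass numeric_only=True so that only the numeric columns are used.

df.sum()
df.mean()
df.var()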

You could also call the describe() function on the DataFrame to get measures such as the count, mean, standard deviation, minimum, quartiles (the 50% row is the median) and maximum in a single table, which gives you a snapshot of the data that you are working on.

df.describe()
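Here is a small sketch of these summary calls on made-up exam scores. The column names are assumptions chosen only for the example.

```python
import pandas as pd

# Hypothetical scores for illustration only.
scores = pd.DataFrame(
    {"maths": [60, 70, 80, 90], "physics": [55, 65, 75, 85]}
)

totals = scores.sum()      # column-wise sum
means = scores.mean()      # column-wise mean
variances = scores.var()   # sample variance (ddof=1) by default

# describe() collects count, mean, std, min, quartiles and max per column.
summary = scores.describe()
print(summary)
```

Each of sum(), mean() and var() returns one value per column, while describe() returns a table with one row per statistic.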

You can also combine DataFrames using the concat function, which by default stacks them one below the other (pass axis=1 to place them side by side).

pd.concat([df, df1])
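A quick sketch of concat on two made-up monthly tables follows. One thing worth noticing is that the original index labels are kept unless you ask for a fresh index with ignore_index=True.

```python
import pandas as pd

# Hypothetical monthly data for illustration only.
jan = pd.DataFrame({"item": ["pen", "ink"], "qty": [10, 4]})
feb = pd.DataFrame({"item": ["pad", "pen"], "qty": [7, 12]})

# Rows of feb are appended below jan; index labels repeat (0, 1, 0, 1).
stacked = pd.concat([jan, feb])

# ignore_index=True renumbers the rows from 0 to 3.
renumbered = pd.concat([jan, feb], ignore_index=True)

print(stacked)
print(renumbered)
```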

You could also merge DataFrames on a column that they share:

pd.merge(df, df2, on='common column name')  # df and df2 being two different DataFrames
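For SQL users, merge behaves much like a join. The sketch below uses two made-up tables that share a customer_id column; the table and column names are assumptions for the example only.

```python
import pandas as pd

# Hypothetical tables for illustration only.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]}
)
orders = pd.DataFrame(
    {"customer_id": [1, 1, 3, 4], "amount": [250, 100, 75, 60]}
)

# Default how="inner": keep only customer_id values present in both tables.
inner = pd.merge(customers, orders, on="customer_id")

# how="left": keep every customer, with NaN where there is no order.
left = customers.merge(orders, on="customer_id", how="left")

print(inner)
print(left)
```

pd.merge(a, b, ...) and a.merge(b, ...) are two spellings of the same operation.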

On many occasions you will get data in the form of a list, which you can convert into a DataFrame with this command:

pd.DataFrame(my_list)  # my_list being a Python list
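As an illustration, here are two common list shapes and how each could be turned into a DataFrame. The data and variable names are made up.

```python
import pandas as pd

# A list of rows: supply the column names yourself.
rows = [["Asha", 31], ["Ben", 45], ["Chen", 28]]
people = pd.DataFrame(rows, columns=["name", "age"])

# A list of dictionaries: the keys become the column names.
records = [{"name": "Dev", "age": 39}, {"name": "Eli", "age": 52}]
more_people = pd.DataFrame(records)

print(people)
print(more_people)
```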

If you need to select only certain columns from a DataFrame, the command below will help. The iloc indexer selects rows or columns by integer position. In this case all rows, denoted by ":", are selected along with the first 3 columns (positions 0, 1 and 2, since the end of the slice is excluded). There is also a loc indexer, which selects rows and columns by their label names.

df.iloc[:,0:3]
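To compare the two indexers side by side, here is a short sketch on a made-up four-column table. Note that a label slice with loc includes its end label, unlike a position slice with iloc.

```python
import pandas as pd

# Hypothetical table for illustration only.
table = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9], "d": [10, 11, 12]}
)

first_three = table.iloc[:, 0:3]     # all rows, columns at positions 0, 1, 2
by_label = table.loc[:, ["a", "c"]]  # all rows, columns labelled 'a' and 'c'
label_slice = table.loc[:, "a":"c"]  # label slice: 'a' through 'c' inclusive

print(first_three)
print(by_label)
print(label_slice)
```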

These basic commands will help you navigate the world of structured and unstructured data. The next blog in this series will cover some other important libraries, so stay tuned!

Contributors:

We would like to thank Mr Anil Abraham, a data evangelist who is passionate about writing blogs and articles on data science.

Klaymatrix is an analytics solution and training provider. Please visit our website, www.klaymatrix.com, for more details.