Exploratory Data Analysis In Python Using Pandas, Matplotlib And Numpy - HPC ASIA



You already know that Pandas is a powerful tool for data munging. In this tutorial, I will show you how to explore a data set using Pandas, NumPy and Matplotlib.

My goal for this project is to determine whether the gap in income per person between Africa/Latin America/Asia and Europe/North America has increased, decreased or stayed the same during the last two decades.

So let's get started.

 

Loading Files Into IPython Notebook

Using the list of countries by continent from World Atlas data, I load the countries.csv file into a Pandas DataFrame using pd.read_csv, and I name this data frame count_df.

[Image: loading a CSV file into a pandas data frame]
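As a rough sketch of this step, the snippet below reads a small in-memory stand-in for countries.csv. The column names Country and Region are assumptions, since the original file layout is not shown here; with the real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for countries.csv. The "Country" and "Region" column names
# are assumptions; the real file may use different headers.
csv_text = """Country,Region
Algeria,AFRICA
Brazil,SOUTH AMERICA
China,ASIA
France,EUROPE
"""

# pd.read_csv accepts a file path or any file-like object.
# With the real file: count_df = pd.read_csv("countries.csv")
count_df = pd.read_csv(io.StringIO(csv_text))

print(count_df.head())
print(count_df.shape)
```

Calling count_df.head() right after loading is a quick way to confirm that the header row and separators were parsed as expected.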

I load the gapminder.xlsx file as a pandas DataFrame, which I name complete_excel.

[Image: reading an Excel file into a pandas data frame]
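A minimal sketch of the Excel step, assuming gapminder.xlsx sits in the working directory and that the data lives on the first sheet. The helper name load_gapminder is hypothetical and only wraps pd.read_excel.

```python
import pandas as pd


def load_gapminder(path="gapminder.xlsx", sheet_name=0):
    """Read one sheet of an Excel workbook into a DataFrame.

    Reading .xlsx files needs an Excel engine such as openpyxl installed.
    sheet_name=0 (the first sheet) is an assumption about the workbook.
    """
    return pd.read_excel(path, sheet_name=sheet_name)


# Usage, assuming the file is present:
# complete_excel = load_gapminder("gapminder.xlsx")
# complete_excel.head()
```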

Transforming the data

In this section, I am going to transform the complete_excel data frame so that it has years as the rows and countries as the columns.

[Image: transforming a pandas data frame]

I will explain what is happening in the code line by line:

complete_excel[complete_excel.columns[0]] returns the first column of the complete_excel data frame, gdpc2011, and I set this column as the index of my data frame. But I don't want my index and the first column to be the same, so I delete this column using the drop command.

transfrom = complete_excel.drop(complete_excel.columns[0], axis = 1)
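To make the index-then-drop step concrete, here is a small sketch on a toy frame. The country labels and numbers are placeholders, and assigning to complete_excel.index is one way to set the index; the original screenshot may have done it differently.

```python
import numpy as np
import pandas as pd

# Toy stand-in for complete_excel: the first column holds country names,
# the remaining columns are years stored as floats. Values are made up.
complete_excel = pd.DataFrame(
    {
        "gdpc2011": ["Country A", "Country B", "Country C"],
        1990.0: [1000.0, 2500.0, np.nan],
        2000.0: [1200.0, 3100.0, 800.0],
    }
)

first_col = complete_excel.columns[0]

# Use the values of the first column as the row labels.
complete_excel.index = complete_excel[first_col]

# Drop the now-redundant first column; axis=1 means "drop a column".
transfrom = complete_excel.drop(first_col, axis=1)

print(transfrom)
```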


After deleting the gdp pc column, I convert the year values from floats to integers. If you want to know how the map statement applies to a data frame, you can read my detailed explanation here.

transfrom.columns = list(map(int, transfrom.columns))

[Image: converting year values in a pandas data frame]

Now I transpose this data frame, so that the years become the rows and the countries become the columns.
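The conversion and the transpose can be sketched together on a toy frame. The name by_year is hypothetical, and the numbers are placeholders.

```python
import pandas as pd

# Toy frame: countries as rows, float-valued years as columns (made-up numbers).
transfrom = pd.DataFrame(
    {1990.0: [1000.0, 2500.0], 2000.0: [1200.0, 3100.0]},
    index=["Country A", "Country B"],
)

# Convert the float year labels to integers.
# list(...) materialises the map object, which is lazy in Python 3.
transfrom.columns = list(map(int, transfrom.columns))

# Transpose: years become the row index, countries become the columns.
by_year = transfrom.T

print(by_year)
```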

Plotting a Histogram

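I plot a histogram for the year 2000, using dropna to exclude missing values for that year. Also, .ix lets me select a subset of the rows and columns from a DataFrame. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; on current versions, use .loc (label-based) or .iloc (position-based) instead.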


[Image: plotting a histogram in pandas]
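A minimal sketch of the histogram step, using .loc for row selection since .ix is no longer available in current pandas. The frame by_year and its values are placeholders, not the Gapminder data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy frame: years as rows, countries as columns (made-up numbers, one missing).
by_year = pd.DataFrame(
    {
        "Country A": [1000.0, 1200.0],
        "Country B": [2500.0, 3100.0],
        "Country C": [900.0, np.nan],
        "Country D": [400.0, 650.0],
    },
    index=[1990, 2000],
)

# .loc selects the row labelled 2000; dropna() discards countries with no value.
income_2000 = by_year.loc[2000].dropna()

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(income_2000.values, bins=5)
ax.set_xlabel("Income per person")
ax.set_ylabel("Number of countries")
ax.set_title("Income per person, 2000")
```

In a notebook the figure renders inline; in a script you would call fig.savefig or plt.show with an interactive backend.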

I am using log scale to plot the values.

[Image: histogram of the values on a log scale]
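One way to get a log-scale view is to take log10 of the values before binning, sketched below on placeholder numbers. Whether the original plot transformed the data or only the axis is not clear from the text, so treat this as an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder incomes spanning several orders of magnitude (made-up values).
income_2000 = pd.Series(
    [100.0, 1000.0, 3000.0, 10000.0, 40000.0],
    index=["Country A", "Country B", "Country C", "Country D", "Country E"],
)

# log10 compresses the long right tail so low-income countries are not
# all squeezed into the first bin.
log_income = np.log10(income_2000)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(log_income, bins=5)
ax.set_xlabel("log10(income per person)")
ax.set_ylabel("Number of countries")
```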


Merging data frames

I use the merge function to merge two data frames (data1 and count_df).

[Images: merging two data frames in pandas]
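A sketch of the merge, assuming both frames share a Country column and that data1 holds one income value per country. The column names and income numbers are placeholders, and the inner join is an assumption about how unmatched countries should be handled.

```python
import pandas as pd

# data1: income per country for one year (made-up placeholder numbers).
data1 = pd.DataFrame(
    {
        "Country": ["Algeria", "Brazil", "China", "Japan"],
        "Income": [1000.0, 2000.0, 3000.0, 4000.0],
    }
)

# count_df: country-to-continent lookup. Japan is deliberately missing.
count_df = pd.DataFrame(
    {
        "Country": ["Algeria", "Brazil", "China", "France"],
        "Region": ["AFRICA", "SOUTH AMERICA", "ASIA", "EUROPE"],
    }
)

# An inner join keeps only the countries present in both frames.
merged = pd.merge(data1, count_df, on="Country", how="inner")

print(merged)
```

Comparing len(merged) with len(data1) after the join is a cheap check for countries whose names are spelled differently in the two sources.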

Using box plots for further exploration

I generate box plots to explore the trends for the years 1900, 1990 and 2003. I encourage you to explore the trends for the years 1950, 1960, 1970, 1980, 1990, 2000 and 2010; you can use the statement years = np.arange(1950, 2011, 10) to do that (np.arange excludes the stop value, so a stop of 2010 would leave out the year 2010).

[Images: box plots in pandas]
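The sketch below draws one box plot per decade, grouped by region. It uses synthetic, randomly generated incomes in a long-format frame (one row per country and year), so the shapes it produces say nothing about the real data; the column names are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The stop value is excluded by np.arange, so 2011 is needed to include 2010.
years = np.arange(1950, 2011, 10)
regions = ["AFRICA", "ASIA", "EUROPE"]

# Synthetic data: six random "countries" per region and year.
rng = np.random.default_rng(0)
rows = [
    {"Region": region, "Year": int(year), "Income": float(income)}
    for year in years
    for region in regions
    for income in rng.lognormal(mean=8.0, sigma=1.0, size=6)
]
merged = pd.DataFrame(rows)

fig, axes = plt.subplots(1, len(years), figsize=(3 * len(years), 4))
for ax, year in zip(axes, years):
    subset = merged[merged["Year"] == year]
    # One box per region for this year.
    subset.boxplot(column="Income", by="Region", ax=ax)
    ax.set_title(str(year))
    ax.set_xlabel("")
    ax.set_yscale("log")  # incomes span orders of magnitude

fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
```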

If you explore the changes from 1950 to 2010, you can see that in most continents (especially Africa and Asia) the distribution of incomes is very skewed: most countries are in a group of low-income states, with a fat tail of high-income countries that remains approximately constant throughout the 20th century.

Conclusion

Now that you know how to explore data using Python, you are ready to start. You know everything from how to load data into Python to how to clean and visualize it and draw insights from it.

Here is a simple exercise for you to improve your data exploration skills.

Consider the distribution of income per person from two regions: Asia and South America. Estimate the average income per person across the countries in those two regions. Which region has the larger average income per person across its countries? (Use the year 2012.) Also create box plots to see the income distribution of the two continents on the dollar scale and on the log10(dollar) scale.
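As a possible starting point for the exercise, the sketch below computes a per-region average with groupby and adds a log10 column for the second set of box plots. The numbers are deliberately artificial placeholders, so the result here does not answer the exercise; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Artificial placeholder values, NOT real 2012 incomes.
df_2012 = pd.DataFrame(
    {
        "Region": ["ASIA", "ASIA", "ASIA", "SOUTH AMERICA", "SOUTH AMERICA"],
        "Income": [1000.0, 3000.0, 5000.0, 2000.0, 6000.0],
    }
)

# Average income per person across the countries in each region.
mean_income = df_2012.groupby("Region")["Income"].mean()

# Region with the larger average (on these placeholder numbers only).
richer_region = mean_income.idxmax()

# Log-transformed column for plotting on the log10(dollar) scale.
df_2012["LogIncome"] = np.log10(df_2012["Income"])

print(mean_income)
print(richer_region)
```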

If you have any additional questions please let me know.

About the Author: Manu Jeevan writes about digital analytics, data science and growth hacking. His love for data science and analytics started when he was doing his MBA. To sharpen his digital analytics and data analysis skills, he took a special interest in subjects like business analytics, market research and statistics. Currently he spends all of his time on Big-data-examiner.com; his number one goal is to share what he has learned with others.

 



  
