Exploratory Data Analysis In Python Using Pandas, Matplotlib And Numpy

By Priya Rana

Posted on April 27, 2015

You already know that Pandas is a powerful tool for data munging. In this tutorial, I will show you how to explore a data set using Pandas, NumPy and Matplotlib.

My goal for this project is to determine whether the income gap between Africa/Latin America/Asia and Europe/North America has increased, decreased or stayed the same during the last two decades.

So let's get started.

Loading Files Into IPython Notebook

Using the list of countries by continent from World Atlas data, I load the countries.csv file into a Pandas DataFrame with pd.read_csv and name this data frame count_df.

load csv file into pandas data frame
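A minimal sketch of this loading step. The real countries.csv is not reproduced here, so an in-memory stand-in is used; its rows and the column names Country and Region are assumptions made for illustration only.

```python
import io

import pandas as pd

# Stand-in for countries.csv: the rows and column names are invented.
sample_csv = io.StringIO(
    "Country,Region\n"
    "Alpha,AFRICA\n"
    "Beta,EUROPE\n"
    "Gamma,ASIA\n"
)

# pd.read_csv accepts a file path or any file-like object.
count_df = pd.read_csv(sample_csv)
print(count_df.head())
```

With the real file, the call would simply take the path, for example pd.read_csv("countries.csv").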

Next, I load the gapminder.xlsx file into a Pandas DataFrame named complete_excel.

reading an Excel file into a pandas data frame
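A hedged sketch of the Excel step. The helper load_gapminder is hypothetical, and reading the first worksheet is an assumption about how the workbook is laid out. pd.read_excel also needs an Excel engine such as openpyxl to be installed.

```python
import os

import pandas as pd


def load_gapminder(path="gapminder.xlsx"):
    """Return the first worksheet as a DataFrame, or None if the file is absent."""
    if not os.path.exists(path):
        return None
    # sheet_name=0 selects the first worksheet (an assumption about the file).
    return pd.read_excel(path, sheet_name=0)


complete_excel = load_gapminder()
if complete_excel is not None:
    print(complete_excel.shape)
```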

Transforming the data

In this section, I am going to transform the complete_excel data frame so that the years are the rows and the countries are the columns.

transforming a pandas data frame

I will explain what is happening in the code line by line:

complete_excel[complete_excel.columns[0]] returns the first column of the complete_excel data frame, and I set this column, gdpc2011, as the index of my data frame. But I don't want my index and the first column to be the same, so I delete this column using the drop command:

transfrom = complete_excel.drop(complete_excel.columns[0], axis = 1)

pandas python tutorial

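After deleting the gdp pc column, I convert the year values from floats to integers. In Python 3, map returns an iterator, so I wrap it in list before assigning it to the columns. If you want to know how the map statement applies to a data frame, you can read my detailed explanation here.

transfrom.columns = list(map(lambda x: int(x), transfrom.columns))

converting year values in pandas data frame

Now I transpose this data frame so that the years become the rows and the countries become the columns.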
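The whole transformation can be sketched end to end on a tiny stand-in table. The country names, the numbers, the "gdp pc" column label and the variable name transposed are all invented for illustration; only the sequence of steps follows the text.

```python
import pandas as pd

# Invented values laid out like the sheet described above: one row per
# country, a first column of country names, then one float-labelled
# column per year.
complete_excel = pd.DataFrame({
    "gdp pc": ["Alpha", "Beta", "Gamma"],
    1990.0: [1000.0, 2500.0, None],
    2000.0: [1500.0, 3000.0, 800.0],
})

first_col = complete_excel.columns[0]

# Use the country names as the index, then drop the now-redundant column.
complete_excel.index = complete_excel[first_col]
transfrom = complete_excel.drop(first_col, axis=1)

# The year labels arrive as floats (1990.0); turn them into integers.
transfrom.columns = [int(year) for year in transfrom.columns]

# Swap rows and columns: years become the index, countries the columns.
transposed = transfrom.transpose()
print(transposed)
```

A list comprehension is used here instead of map; both produce the same integer labels.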

Plotting a Histogram

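I am plotting a histogram for the year 2000. Here I use dropna to exclude missing values for that year. Also, .ix lets me select a subset of the rows and columns from a DataFrame. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; in current versions, .loc and .iloc do the same job.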

plotting a histogram in pandas

I am using a log scale to plot the values.

using histogram to plot
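Here is a minimal sketch of both histograms, using .loc in place of .ix. The small transposed frame and its values are made up, and the Agg backend is selected only so the sketch runs without a display; drop that line in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented stand-in: years as rows, countries as columns.
transposed = pd.DataFrame(
    {
        "Alpha": [1000.0, 1500.0],
        "Beta": [2500.0, 3000.0],
        "Gamma": [np.nan, 800.0],
        "Delta": [400.0, np.nan],
    },
    index=[1990, 2000],
)

# Select the row for 2000 and drop countries with no value that year.
income_2000 = transposed.loc[2000].dropna()

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: incomes on the dollar scale.
counts_raw, _, _ = ax_raw.hist(income_2000, bins=10)
ax_raw.set_xlabel("Income per person")

# Right panel: the same incomes on a log10 scale.
counts_log, _, _ = ax_log.hist(np.log10(income_2000), bins=10)
ax_log.set_xlabel("log10(income per person)")

fig.tight_layout()
```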

Merging data frames

I am using the merge function to merge two data frames (data1 and count_df).

merging two data frame in pandas
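A small sketch of the merge, assuming both frames share a Country column. The column names, the sample rows and the choice of an inner join are assumptions, not details taken from the original data.

```python
import pandas as pd

# Invented stand-ins for the two frames being merged.
data1 = pd.DataFrame({
    "Country": ["Alpha", "Beta", "Delta"],
    "Income": [1500.0, 3000.0, 400.0],
})
count_df = pd.DataFrame({
    "Country": ["Alpha", "Beta", "Gamma"],
    "Region": ["AFRICA", "EUROPE", "ASIA"],
})

# An inner join keeps only the countries present in both frames.
merged = pd.merge(data1, count_df, on="Country", how="inner")
print(merged)
```

Countries that appear in only one frame (Delta and Gamma here) are dropped by the inner join; how="left" would keep every row of data1 instead.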


Using Box plot for further exploration

I am generating box plots to explore the trends for the years 1900, 1990 and 2003. I encourage you to explore the trends for the years 1950, 1960, 1970, 1980, 1990, 2000 and 2010; you can use the statement years = np.arange(1950, 2011, 10) to do that. The stop value of np.arange is exclusive, so it has to be greater than 2010 for 2010 to be included.

plotting a box plot in pandas
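One way to draw a box plot per year, grouped by region, is sketched below. The long-format frame, its column names and its values are invented, and the log-scaled y axis is an optional choice; the Agg backend is there only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented long-format data: one row per country and year.
long_df = pd.DataFrame({
    "Country": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta"] * 2,
    "Region": ["AFRICA", "AFRICA", "ASIA", "ASIA", "EUROPE", "EUROPE"] * 2,
    "Year": [1990] * 6 + [2000] * 6,
    "Income": [900.0, 1200.0, 1500.0, 4000.0, 15000.0, 18000.0,
               1000.0, 1400.0, 2500.0, 7000.0, 20000.0, 24000.0],
})

# The stop value is exclusive, so 2001 is needed to include 2000.
years = np.arange(1990, 2001, 10)

fig, axes = plt.subplots(1, len(years), figsize=(10, 4), sharey=True)
for ax, year in zip(axes, years):
    subset = long_df[long_df["Year"] == year]
    # One box per region for this year's incomes.
    subset.boxplot(column="Income", by="Region", ax=ax)
    ax.set_title(str(year))
    ax.set_yscale("log")

# Clear the automatic "Boxplot grouped by ..." figure title.
fig.suptitle("")
```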


If you explore the changes from 1950 to 2010, you can see that in most continents (especially Africa and Asia) the distribution of incomes is very skewed: most countries are in a group of low-income states with a fat tail of high-income countries that remains approximately constant throughout the 20th century.

Conclusion

Now that you know how to explore data using Python, you are ready to start. You know everything from how to load data into Python to how to clean and visualize it and draw insights from it.

Here is a simple exercise for you to improve your data exploration skills.

Consider the distribution of income per person from two regions: Asia and South America. Estimate the average income per person across the countries in those two regions. Which region has the larger average of income per person across the countries in that region? (Use the year 2012). Also create boxplots to see the income distribution of the two continents on the dollar scale and log10(dollar) scale.
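As a starting point for the exercise, the regional averages can be computed with groupby. The frame below is invented and is not the 2012 data, so the numbers say nothing about the real answer; the column names are assumptions as well.

```python
import numpy as np
import pandas as pd

# Invented rows; replace them with the real merged data for 2012.
merged = pd.DataFrame({
    "Country": ["Alpha", "Beta", "Gamma", "Delta"],
    "Region": ["ASIA", "ASIA", "SOUTH AMERICA", "SOUTH AMERICA"],
    "Income": [1000.0, 3000.0, 2000.0, 6000.0],
})

# Average income per person across the countries in each region.
mean_income = merged.groupby("Region")["Income"].mean()
print(mean_income)

# A log10 column is handy for the second set of box plots.
merged["log10_income"] = np.log10(merged["Income"])
```

From here, calling boxplot with by="Region" on the Income and log10_income columns gives the two views the exercise asks for.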

If you have any additional questions, please let me know.

About the Author: Manu Jeevan writes about digital analytics, data science and growth hacking. His love for data science and analytics started when he was doing his MBA. To sharpen his digital analytics and data analysis skills, he took a special interest in subjects like business analytics, market research and statistics. Currently he spends all of his time on Big-data-examiner.com; his number one goal is to share what he has learned with others.
