Exploratory data analysis in Python using pandas, matplotlib and NumPy

Published January 4, 2016

You already know that Pandas is a powerful tool for data munging. In this tutorial, I will show you how to explore a data set using Pandas, NumPy and Matplotlib.
My goal for this project is to determine whether the gap in income per person between Africa/Latin America/Asia and Europe/North America has increased, decreased or stayed the same during the last two decades.
So let’s get started.

Loading Files Into IPython Notebook

Using the list of countries by continent from World Atlas data, I load the countries.csv file into a Pandas DataFrame with pd.read_csv and name this data frame count_df.
I then load the gapminder.xlsx file as a Pandas DataFrame named complete_excel.
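Here is a minimal sketch of the loading step. Because the original files are not reproduced here, it reads a tiny made-up CSV from memory; the column names Country and Region are assumptions and may not match the real countries.csv. The two commented lines at the end show what the calls would look like with the real files on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for countries.csv; the real column names may differ.
sample_csv = io.StringIO(
    "Country,Region\n"
    "Algeria,AFRICA\n"
    "Japan,ASIA\n"
    "France,EUROPE\n"
)

# pd.read_csv accepts a file path or any file-like object.
count_df = pd.read_csv(sample_csv)
print(count_df.head())

# With the real files on disk, the two loads would look like this
# (read_excel needs an Excel engine such as openpyxl installed):
# count_df = pd.read_csv("countries.csv")
# complete_excel = pd.read_excel("gapminder.xlsx")
```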

Transforming the data

In this section, I am going to transform the complete_excel data frame so that the years are the rows and the countries are the columns.
I will explain what is happening in the code line by line:
complete_excel[complete_excel.columns[0]] returns the first column of the complete_excel data frame, and I set this column (gdpc2011) as the index of the data frame. Since I do not want the index and the first column to hold the same values, I delete the column with the drop command.
transfrom = complete_excel.drop(complete_excel.columns[0], axis = 1)
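A minimal sketch of the index-and-drop step on a made-up miniature of the spreadsheet. The country names and income values are invented for illustration, and I am assuming the first column holds country names under the header gdpc2011.

```python
import pandas as pd

# Hypothetical miniature of the spreadsheet: the first column holds country
# names and the remaining columns are years stored as floats.
complete_excel = pd.DataFrame(
    {
        "gdpc2011": ["Algeria", "France", "Japan"],
        1990.0: [6000.0, 29000.0, 30000.0],
        2000.0: [7000.0, 34000.0, 33000.0],
    }
)

first_col = complete_excel.columns[0]

# Use the first column as the index ...
complete_excel.index = complete_excel[first_col]
# ... then drop it so the same values are not stored twice.
transfrom = complete_excel.drop(first_col, axis=1)

print(transfrom)
```

Note that drop returns a new data frame and leaves complete_excel itself unchanged.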
After deleting the gdp pc column, I convert the year values from floats to integers. If you want to know how the map statement applies to a data frame, you can read my detailed explanation here.
transfrom.columns = list(map(lambda x: int(x), transfrom.columns))
Now I transpose this data frame, so that the years become the rows and the countries become the columns.
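The conversion and the transpose can be sketched as follows. The frame below is a made-up example in the shape reached after the drop step, and the name transposed is my own placeholder rather than a name from the original notebook.

```python
import pandas as pd

# Hypothetical frame in the shape reached after the drop step:
# countries as the index, years (as floats) as the columns.
transfrom = pd.DataFrame(
    [[6000.0, 7000.0], [29000.0, 34000.0]],
    index=["Algeria", "France"],
    columns=[1990.0, 2000.0],
)

# Convert the float year labels to integers. list(...) makes the result a
# concrete list on Python 3, where map returns a lazy iterator.
transfrom.columns = list(map(int, transfrom.columns))

# .T swaps rows and columns: years become rows, countries become columns.
transposed = transfrom.T
print(transposed)
```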

Plotting a Histogram

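I am plotting a histogram for the year 2000, using dropna to exclude the missing values for that year. The .ix indexer lets me select a subset of the rows and columns from a DataFrame. Note that .ix was later deprecated and then removed in pandas 1.0; on current versions, .loc provides label-based selection instead.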
I am using a log scale to plot the values.
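A minimal sketch of the histogram step, assuming a frame with years as rows and countries as columns. The data is made up, and I am using .loc in place of .ix so that the example runs on current pandas versions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical frame: years as rows, countries as columns, with one gap.
transposed = pd.DataFrame(
    {
        "Algeria": [6000.0, 7000.0],
        "France": [29000.0, 34000.0],
        "Japan": [30000.0, 33000.0],
        "Somalia": [900.0, np.nan],
    },
    index=[1990, 2000],
)

# Select the row for the year 2000 by label and drop the missing values.
income_2000 = transposed.loc[2000].dropna()

# Histogram on a log10 scale, which spreads out the low-income countries.
fig, ax = plt.subplots()
ax.hist(np.log10(income_2000), bins=10)
ax.set_xlabel("log10(income per person)")
ax.set_ylabel("number of countries")
ax.set_title("Income per person, 2000")
```

In a notebook the figure renders inline; in a plain script you would add plt.show() at the end.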
[Figure: histogram of income per person for the year 2000, plotted on a log scale]

Merging data frames

I am using the merge function to merge two data frames (data1 and count_df).
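A minimal sketch of the merge, with made-up data. The shared key column Country, the Income column and the inner join are all assumptions on my part; the original frames may use different column names or a different join type.

```python
import pandas as pd

# Hypothetical income table: one row per country (made-up values).
data1 = pd.DataFrame(
    {
        "Country": ["Algeria", "France", "Japan"],
        "Income": [7000.0, 34000.0, 33000.0],
    }
)

# Hypothetical country-to-region lookup, standing in for count_df.
count_df = pd.DataFrame(
    {
        "Country": ["Algeria", "France", "Japan", "Brazil"],
        "Region": ["AFRICA", "EUROPE", "ASIA", "SOUTH AMERICA"],
    }
)

# Inner join on the shared key: only countries present in both frames survive.
merged = pd.merge(data1, count_df, on="Country", how="inner")
print(merged)
```

With an inner join, a country that is missing from either frame silently drops out, so it is worth comparing len(merged) with len(data1) after merging.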

Using Box plot for further exploration

I am generating box plots to explore the trends for the years 1900, 1990 and 2003. I encourage you to explore the trends for the years 1950, 1960, 1970, 1980, 1990, 2000 and 2010; you can use the statement years = np.arange(1950, 2011, 10) to do that. The stop value of np.arange is exclusive, so it has to be 2011 for 2010 to be included.
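A minimal sketch of the box plot loop, assuming a merged frame with one row per country, a Region column and one income column per year. The data and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Years 1950, 1960, ..., 2010. The stop value is exclusive, so 2011 is
# needed to include 2010.
years = np.arange(1950, 2011, 10)
print(years)  # [1950 1960 1970 1980 1990 2000 2010]

# Hypothetical merged frame: one row per country, a Region column, and one
# income column per year (made-up values, two years only).
merged = pd.DataFrame(
    {
        "Country": ["Algeria", "Kenya", "France", "Spain", "Japan", "India"],
        "Region": ["AFRICA", "AFRICA", "EUROPE", "EUROPE", "ASIA", "ASIA"],
        1990: [6000.0, 2000.0, 29000.0, 22000.0, 30000.0, 1800.0],
        2000: [7000.0, 2100.0, 34000.0, 29000.0, 33000.0, 2600.0],
    }
)

# Keep only the years that actually exist as columns in this small example.
available = [int(y) for y in years if int(y) in merged.columns]

# One box plot per available year, with countries grouped by region.
for year in available:
    merged.boxplot(column=year, by="Region")
```

Each call to boxplot with by="Region" opens a new figure, so the loop produces one figure per year.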
[Figures: box plots of income per person grouped by region for the years 1900, 1990 and 2003]
If you explore the changes from 1950 to 2010, you can see that in most continents (especially Africa and Asia) the distribution of incomes is highly skewed: most countries fall into a group of low-income states, with a fat tail of high-income countries that remains approximately constant throughout the 20th century.


Now that you know how to explore data using Python, you are ready to start. You have seen how to load data into Python, how to clean and visualize it, and how to draw insights from it.
Here is a simple exercise for you to improve your data exploration skills.
Consider the distribution of income per person in two regions: Asia and South America. Estimate the average income per person across the countries in each of those two regions, using the year 2012. Which region has the higher average income per person? Also create box plots to see the income distribution of the two continents on the dollar scale and on the log10(dollar) scale.
If you have any additional questions, please let me know.