If you have decided to learn Python as your programming language, the next question on your mind will be:
“What are the different Python libraries available to perform data analysis?”
There are many libraries available to perform data analysis in Python. Don’t worry; you don’t have to learn all of them. You only need to know five Python libraries to do most data analysis tasks. I will give a short introduction to each of these libraries and point you to some of the best tutorials for learning them.
So let’s get started.
NumPy is the foundation on which all higher-level tools for scientific Python are built. Here are some of the functionalities it provides:
- The N-dimensional array, a fast and memory-efficient multidimensional array providing vectorized arithmetic operations.
- You can apply standard mathematical operations to entire arrays of data without writing loops.
- It is very easy to transfer data to external libraries written in a low-level language (such as C or C++), and also for external libraries to return data to Python as NumPy arrays.
- Linear algebra, Fourier transforms, and random number generation.
Although NumPy does not provide high-level data analysis functionality, an understanding of NumPy arrays and array-oriented computing will help you use tools like Pandas much more effectively.
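To make the "no loops" point concrete, here is a minimal sketch of vectorized arithmetic on NumPy arrays. The temperature values and variable names are made up for illustration; only the NumPy calls themselves (`np.array`, `np.arange`, `reshape`, `mean`) are part of the library.

```python
import numpy as np

# Hypothetical sample data: four temperature readings in Celsius.
celsius = np.array([18.5, 21.0, 23.4, 19.8])

# Vectorized arithmetic: the expression is applied to every element,
# so no explicit Python loop is needed.
fahrenheit = celsius * 9 / 5 + 32

# A small 2-D array (2 rows, 3 columns) holding the integers 0..5.
matrix = np.arange(6).reshape(2, 3)

# Reduce along axis 0 to get one mean per column.
column_means = matrix.mean(axis=0)

print(fahrenheit)
print(column_means)
```

The same pattern (write the formula once, apply it to the whole array) is what Pandas builds on, which is why it pays to get comfortable with it first.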
- Scipy.org provides a brief description of the NumPy package.
- Here is an amazing tutorial that focuses entirely on the usability of NumPy.
The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. SciPy is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, with modules for optimization, linear algebra, integration, and other common tasks in data science.
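As a quick illustration of the integration and optimization routines mentioned above, here is a small sketch. The function being minimized is a toy example I made up; `integrate.quad` and `optimize.minimize_scalar` are the SciPy routines being demonstrated.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: integrate sin(x) from 0 to pi.
# The exact answer is 2, so this is easy to sanity-check.
area, estimated_error = integrate.quad(np.sin, 0, np.pi)


# Optimization: a toy objective whose minimum is at x = 3.
def objective(x):
    return (x - 3.0) ** 2


# minimize_scalar searches for the x that minimizes a one-variable function.
result = optimize.minimize_scalar(objective)

print(area)
print(result.x)
```

Note that SciPy accepts `np.sin` directly as the integrand, which shows how closely the two libraries work together.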
Tutorial: I couldn’t find a better tutorial than the one on Scipy.org, so that is where I recommend learning SciPy.
Pandas contains high-level data structures and tools designed to make data analysis fast and easy. Pandas is built on top of NumPy, which makes it easy to use in NumPy-centric applications.
- Data structures with labeled axes, supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.
- Handling missing data is easier with Pandas.
- Merge and other relational operations found in popular databases (SQL-based, for example).
Pandas is the best tool for doing data munging.
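The three features listed above (alignment on labels, missing-data handling, and SQL-style merges) can be sketched in a few lines. All of the data here, including the customer names and amounts, is hypothetical and exists only to show the behavior.

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Automatic alignment: values are matched by label, not by position.
# Labels present in only one Series produce NaN (missing) in the result.
total = a + b

# Missing-data handling: replace NaN with a default value.
filled = total.fillna(0.0)

# A SQL-style inner join on a shared key column.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [25.0, 40.0, 15.0]})
merged = pd.merge(customers, orders, on="customer_id", how="inner")

print(total)
print(filled)
print(merged)
```

Here `how="inner"` keeps only customers that have at least one order; passing `how="left"` instead would keep every customer and fill the missing order columns with NaN.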
- Quick intro to pandas
- Alfred Essa has a series of videos on Pandas. These videos should give you a good idea of basic concepts.
- Also don’t miss this tutorial by Shane Neeley. The video gives you a comprehensive intro to NumPy, SciPy, and Matplotlib.
Matplotlib is a Python module for visualization. It allows you to easily make line graphs, pie charts, histograms, and other professional-grade figures, and you can customize every aspect of a figure. When used within IPython, Matplotlib has interactive features like zooming and panning. It supports different GUI back ends on all operating systems, and can also export graphics to common vector and raster formats: PDF, SVG, JPG, PNG, BMP, GIF, etc.
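Here is a minimal sketch of a figure with a line graph and a histogram, exported as a PNG. The data is generated on the spot, and I select the non-interactive `Agg` back end as an assumption so the snippet also runs on a machine without a display; in IPython you would normally skip that line and call `plt.show()` instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive back end, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Generated sample data: one sine curve and 500 random draws.
x = np.linspace(0, 2 * np.pi, 100)
rng = np.random.default_rng(0)
samples = rng.normal(size=500)

# One figure with two side-by-side axes.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_title("Line graph")
ax_line.set_xlabel("x")
ax_line.legend()

ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram")

fig.tight_layout()

# Export to an in-memory PNG; format="svg" or format="pdf" gives vector output.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

To write a file instead of an in-memory buffer, pass a file name such as `"figure.png"` to `fig.savefig`.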
- ShowMeDo has a good tutorial on Matplotlib.
- I also recommend the cookbook from Packt Publishing. It is an amazing book for someone getting started with Matplotlib.
Scikit-learn is a Python module for machine learning built on top of SciPy. It provides a set of common machine learning algorithms to users through a consistent interface, which helps you quickly implement popular algorithms on your dataset. Have a look at the list of algorithms available in scikit-learn, and you will quickly realize that it includes tools for many standard machine-learning tasks (such as clustering, classification, and regression).
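The "consistent interface" is easiest to see in code: different algorithms are trained and evaluated with the same `fit` and `score` calls. The sketch below uses the iris dataset bundled with scikit-learn; the choice of these two classifiers and their parameters is mine, purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris dataset as feature matrix X and label vector y.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two different algorithms, chosen only as examples.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Every estimator exposes the same fit/score methods,
# so swapping algorithms does not change the surrounding code.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data

print(scores)
```

Trying another classifier is a matter of adding one more entry to the `models` dictionary.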
- Introduction to Scikit-learn
- Tutorials from Scikit-learn.org
There are also other libraries, such as NLTK (Natural Language Toolkit), Scrapy for web scraping, Pattern for web mining, and Theano for deep learning. But if you are getting started in Python, I recommend that you first get familiar with these five libraries. The tutorials I have mentioned are beginner friendly; before going through them, make sure you are familiar with the basics of Python programming.
Manu Jeevan is a Big Data blogger at BigDataExaminer, where he writes about Data Science, Python and Digital analytics.