Are you looking for Data visualization using Pandas? Here we will plot various graphs using the pandas module.
Pandas is an open-source Python library that provides ready-to-use, high-performance data structures and data analysis tools. It is built on top of NumPy and is widely used in data science and data analytics. Pandas is also useful for data visualization: its built-in plot() method lets us turn complex data into simple and useful plots with a single call. In this article, we will discuss how we can use pandas for data visualization by plotting various useful graphs.
You might also be interested in how to visualize data sets through hexagons on a Google map, how to create heatmaps of your data, and how to visualize 3D plots using Python.
Exploring datasets using pandas
Exploring and preprocessing datasets is really important in data science and Machine Learning because it helps to know the dataset clearly and make it suitable for the learning models. You can get access to the dataset and the source code from my GitHub account.
Let us first import the dataset from a CSV file using the pandas module.
    # importing pandas module
    import pandas as pd
    # importing the dataset
    data = pd.read_csv('house.csv')
    # printing few rows of the dataset
    data.head()
The dataset contains information about house prices, which depend on five different input variables, as shown above.
As you can see, there are many null values in the dataset. Pandas provides a built-in method, dropna(), to remove them. Let us remove every row that contains a null value.
    # removing null values
    data.dropna(axis=0, inplace=True)
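Before dropping rows, it can help to see how many values are actually missing in each column. The following is a minimal sketch on a made-up miniature table; the column names and values are placeholders for illustration and are not taken from house.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a housing table (values invented for illustration)
toy = pd.DataFrame({
    "price": [250000, np.nan, 410000, 320000],
    "area": [80.0, 95.0, np.nan, 120.0],
    "rooms": [3, 4, 5, 4],
})

# Count the missing values in every column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes every row that has at least one missing value
cleaned = toy.dropna(axis=0)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Here a new DataFrame is returned instead of using inplace=True, so the original table stays available for comparison.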
One of the useful methods of pandas is info(), which provides many useful details about the dataset, such as the column names, the data types, and the number of non-null values. Let us now use this method to get these details.
    # using info method to get details
    data.info()
As you can see, there is a total of 3730 observations and all the data points are numeric values.
Another useful method available in pandas is describe(), which returns summary statistics such as the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric attribute in the dataset.
    # describe function of pandas
    data.describe()
As you can see, the describe function has returned much useful information about the dataset.
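The result of describe() is itself a DataFrame, so individual statistics can be looked up by label. A small sketch, using an invented four-row table rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data (not the real house.csv values)
toy = pd.DataFrame({"price": [100, 200, 300, 400], "rooms": [2, 3, 3, 4]})

# describe() returns one row per statistic and one column per numeric variable
summary = toy.describe()

# Pick out single statistics by row label and column name
price_range = summary.loc["max", "price"] - summary.loc["min", "price"]
print(summary.loc[["mean", "min", "max"]])
print("price range:", price_range)
```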
Data visualization using the pandas module
Data visualization is an important step in the life cycle of data science and data analytics. A study or analysis is more convincing and engaging when it is presented with the help of colors and graphics. With visualization elements like graphs, charts, and maps, it becomes easier for clients to understand the underlying structure, trends, patterns, and relationships among variables within the dataset.
Python has various modules for specialized plots, for example heatmaps, 3D plots, and data plotted on a Google map. But even if you know only pandas, you can still visualize your data through a variety of plots, some of which we will discuss in this section. Note that pandas draws its plots with Matplotlib under the hood, so Matplotlib needs to be installed.
Data visualization using Pandas – Line plots
A line plot connects consecutive data points with line segments, so it shows how a variable changes along the index. Line plots are especially useful for visualizing time series data.
Visualizing line plots using the pandas module is very easy. We just need to call the plot() function. For example, see the line plot below, where we plot the line graph of the prices.
    # plotting line plots using pandas
    data['price'].plot(figsize=(10, 6), c='m')
As you can see, the height of the line at each index shows the price of that house. We can also plot more than one variable using the line plot. See the example below:
    # plotting multi-line plots using pandas
    data.drop('price', axis=1).plot(figsize=(10, 6))
As you can see, we have plotted the independent variables from the dataset on the same graph. The problem is that all the variables share the same scale, so we cannot actually see how variables with a small range, such as the latitude and longitude, are changing. We can plot them on separate plots using subplots.
    # plotting each variable on its own subplot
    data.drop('price', axis=1).plot(figsize=(10, 6), subplots=True)
This time, as you can see, each subplot has its own scale.
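To check that subplots=True gives every variable its own y-axis, here is a small sketch on synthetic data. The column names mirror the housing dataset, but the values are randomly generated placeholders and the ranges are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two columns with very different ranges
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.uniform(40, 200, size=50),
    "latitude": rng.uniform(41.0, 41.2, size=50),
})

# With subplots=True, plot() returns an array with one Axes per column
axes = toy.plot(subplots=True, figsize=(10, 6))

# Each Axes has its own y-limits, fitted to its own column
for ax in axes:
    print(ax.get_ylim())
```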
Data visualization using Pandas – Bar plots
A bar plot shows the relationship between a numeric and a categorical variable. Each category is represented as a bar, and the size of the bar represents its numeric value. Let us now use the pandas module to plot bar charts. Because the dataset is large, we will first take a random 0.9% sample of it, so that the individual bars remain readable.
We will first plot the bar chart for the number of floors.
    # taking a random 0.9% sample of the data
    dataset = data.sample(frac=0.009)
    # plotting bar charts
    dataset['floor'].plot(kind='bar', figsize=(10, 6))
As you can see, each bar on the x-axis corresponds to one sampled house, labelled by its row index, and the height of the bar on the y-axis is the number of floors of that house.
We can also plot the bar chart for more than one variable at the same time. For example, let us plot the bar chart for the number of floors and the number of rooms together.
    # copying the dataset
    bar_plot = dataset.copy()
    # dropping unwanted variables in a single call
    bar_plot.drop(['area', 'latitude', 'longitude', 'price'], axis=1, inplace=True)
    # plotting the bar chart
    bar_plot.plot(kind='bar', figsize=(10, 6))
As you can see, the orange bars show the number of rooms and the blue bars show the number of floors. Another functionality of the bar plot is that we can also plot the stacked bar plots in pandas as shown below:
    # plotting the bar chart
    bar_plot.plot(kind='bar', figsize=(10, 6), stacked=True)
As you can see, we have plotted a stacked bar chart using pandas. We can also draw the same chart with horizontal bars by using the barh function.
    # plotting the horizontal bar chart
    bar_plot.plot.barh(stacked=True, figsize=(6, 8))
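The bar charts above draw one bar per sampled row. An alternative, shown here only as a sketch, is to aggregate first so that each bar stands for one category. The example assumes we want the mean price per floor count, and the tiny table is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data (not taken from house.csv)
toy = pd.DataFrame({
    "floor": [1, 1, 2, 2, 3],
    "price": [100, 140, 200, 220, 300],
})

# Aggregate: one mean price for every distinct number of floors
mean_price = toy.groupby("floor")["price"].mean()

# One bar per category instead of one bar per row
ax = mean_price.plot(kind="bar", figsize=(10, 6))
ax.set_ylabel("mean price")
```

With this approach the whole dataset can be used, because the number of bars depends on the number of categories rather than the number of rows.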
Data visualization using Pandas – Histogram plots
A histogram shows a frequency distribution: the range of values is split into bins, and the height of each bar shows how many observations fall into that bin. It is one of the most commonly used graphs for showing frequency distributions. Let us now plot the histogram chart of the above dataset.
    # copying the dataset
    hist_plot = bar_plot.copy()
    # plotting histogram chart
    hist_plot.plot.hist(figsize=(10, 6))
In a similar way to the bar plots, we can also plot stacked histograms.
    # plotting stacked histogram chart
    hist_plot.plot.hist(figsize=(10, 6), stacked=True)
We can also plot a cumulative histogram, in which the y-axis shows the running total of observations up to each bin.
    # plotting cumulative histogram chart
    hist_plot.plot.hist(figsize=(10, 6), stacked=True, cumulative=True)
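What the cumulative option does can be reproduced by hand: count the values in each bin, then take the running sum of those counts. The sketch below uses seven invented values and four bins, both chosen only to keep the numbers easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen for illustration
values = pd.Series([1, 1, 2, 3, 3, 3, 4])

# Ordinary histogram counts for 4 equal-width bins
counts, edges = np.histogram(values, bins=4)

# A cumulative histogram shows the running total of those counts
running_total = counts.cumsum()
print(counts, running_total)

# The same running totals appear as bar heights when cumulative=True
ax = values.plot.hist(bins=4, cumulative=True, figsize=(10, 6))
```

The last bar of a cumulative histogram always reaches the total number of observations.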
Apart from the cumulative histogram chart, we can also plot the histogram chart for each of the columns separately.
    # histogram for each of the columns
    dataset.hist(figsize=(10, 6))
As you can see, we have successfully plotted histograms for each of the columns.
Data visualization using Pandas – Area plots
An area chart combines the line chart and the bar chart to show how the numeric values of one or more groups change over the progression of a second variable, typically time. An area chart is distinguished from a line chart by the addition of shading between the lines and a baseline, like in a bar chart. Let us first plot the area plot of the price variable.
    # plotting the area plot
    dataset['price'].plot.area(figsize=(10, 6))
As you can see, the area under the line has been shaded. The plot is very irregular because the sampled rows are in random order and there is no fixed trend. Let us now create a random dataset and visualize it using an area plot.
    # importing the module
    import numpy as np
    # creating a dataset
    df = pd.DataFrame(np.random.rand(20, 5), columns=['A', 'B', 'C', 'D', 'E'])
    # plotting the area plot
    df.plot.area(figsize=(10, 6))
As you can see, the area plots are stacked by default. We can also plot them without stack as shown below:
    # plotting area plot
    df.plot.area(figsize=(10, 6), stacked=False)
As you can see, this time the plots are not stacked.
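In a stacked area plot, each band sits on top of the previous ones, so the upper edge of the last band is the row-wise sum of all columns. A short sketch on random data illustrates this; the three column names are arbitrary.

```python
import numpy as np
import pandas as pd

# Random non-negative data, similar to the example above
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((20, 3)), columns=["A", "B", "C"])

# Cumulative sum across columns gives the upper edge of every stacked band
stacked_tops = df.cumsum(axis=1)

# The top of the last band equals the total of each row
print(np.allclose(stacked_tops["C"], df.sum(axis=1)))

# Stacked area plot of the same data
ax = df.plot.area(figsize=(10, 6))
```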
Data visualization using Pandas – Scatter plots
Scatter plots are used to plot data points on a horizontal and a vertical axis in an attempt to show how much one variable is affected by another. Let us now plot the scatter plot of the dataset using pandas. Unlike other plots, in scatter plots, we have to specify the variables on axes.
    # plotting the scatter plot
    data.plot.scatter(x='price', y='area', figsize=(10, 6), c='m')
Now, let us apply a little styling and visualize two different datasets on one plot.
    # creating scatter plot
    ax = data.plot.scatter(x="price", y="area", color="m", marker="*", s=50, figsize=(10, 6))
    # adding one more scatter plot on the same graph
    data.plot.scatter(x="price", y="latitude", color="g", s=100, ax=ax)
As you can see, the magenta stars show the relation between price and area, while the green dots show the relation between price and latitude. Another way of coloring the data points is to use the values of one of the columns. For example, see below:
    # plotting the scatter plot colored by the number of floors
    data.plot.scatter(x="area", y="price", c='floor', s=100, figsize=(10, 6))
As you can see, the fully black points correspond to 20 floors, and as the intensity of black decreases, the number of floors also decreases.
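When points are colored by a column, a different color scale can be requested through the colormap parameter. The sketch below uses Matplotlib's "viridis" colormap on an invented four-row table; both the data and the choice of colormap are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the real house.csv values)
toy = pd.DataFrame({
    "area": [50, 80, 120, 160],
    "price": [100, 170, 260, 330],
    "floor": [1, 5, 12, 20],
})

# Color every point by its 'floor' value, using the viridis colormap
ax = toy.plot.scatter(
    x="area", y="price", c="floor", colormap="viridis", s=100, figsize=(10, 6)
)
```

A colorbar is added next to the plot, so the mapping from color to number of floors can be read off directly.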
Data visualization using Pandas – Box plots
A box plot is a graphical rendition of statistical data based on the minimum, first quartile, median, third quartile, and maximum. The term box plot comes from the fact that the graph looks like a rectangle with lines extending from the top and bottom. Pandas can also be used to visualize box plots. Let us first find the box plot of the price variable.
    # plotting the box plot
    data['price'].plot.box(figsize=(10, 6))
All the points beyond the whiskers, the horizontal lines at the two ends of the plot, are drawn as outliers. We can also plot the box plot for multiple variables.
    # plotting the box plot
    data.drop('price', axis=1).plot.box(figsize=(10, 6))
As you can see, we have plotted the box plot for multiple variables using pandas.
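The outliers of a box plot can also be computed directly. The sketch below assumes the common convention, which is also Matplotlib's default, that whiskers extend at most 1.5 times the interquartile range beyond the quartiles. The seven prices are invented for illustration.

```python
import pandas as pd

# Hypothetical prices with one unusually large value
prices = pd.Series([100, 110, 120, 130, 140, 150, 400])

# First and third quartiles and the interquartile range (IQR)
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles count as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```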
Data visualization using Pandas – Pie plots
Pie charts are useful when we have a small number of categorical values that we need to compare. Their readability drops quickly as the number of categories grows. Let us first create a pie chart for the number of floors.
    # plotting a pie chart
    dataset['floor'].plot.pie(figsize=(10, 6))
Each wedge corresponds to one sampled house, and its size is proportional to the number of floors of that house. We can also draw a separate pie chart for each column by using subplots. Let us generate random values to demonstrate this.
    # creating a DataFrame
    pie_df = pd.DataFrame(np.random.rand(6, 3), columns=['A', 'B', 'C'])
    # plotting one pie chart per column using subplots
    pie_df.plot.pie(subplots=True, figsize=(10, 6))
As you can see, we have drawn one pie chart for each column as subplots.
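If the goal is to compare how often each number of floors occurs, one option is to count the categories first and plot the counts. This is a sketch of that alternative, with an invented six-value series standing in for the floor column.

```python
import pandas as pd

# Hypothetical floor counts for six houses
floors = pd.Series([1, 1, 2, 3, 3, 3], name="floor")

# Count how many houses have each number of floors
floor_counts = floors.value_counts().sort_index()
print(floor_counts)

# One wedge per distinct number of floors, labelled with its share
ax = floor_counts.plot.pie(figsize=(6, 6), autopct="%1.0f%%")
```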
Visualization is effective because patterns, trends, and outliers that are hard to spot in a table of numbers become obvious in a plot. In this article, we learned how to use the pandas module to visualize data with line, bar, histogram, area, scatter, box, and pie plots.