Feeds

Planet Python

Hynek Schlawack: Python Application Dependency Management in 2018
Thursday, 29 November 2018

We have more ways to manage dependencies in Python applications than ever. But how do they fare in production? Unfortunately this topic turned out to be quite polarizing and was at the center of a lot of heated debates. This is my attempt at an opinionated review through a DevOps lens.
Read more
Stack Abuse: Python Data Visualization with Matplotlib
Thursday, 29 November 2018

Introduction

Visualizing data trends is one of the most important tasks in data science and machine learning. The choice of data mining and machine learning algorithms depends heavily on the patterns identified in the dataset during the data visualization phase. In this article, we will see how to perform different types of data visualization in Python. We will use Python's Matplotlib library, which is the de facto standard for data visualization in Python. The article A Brief Introduction to Matplotlib for Data Visualization provides a very high-level introduction to the Matplotlib library and explains how to draw scatter plots, bar plots, histograms, etc. In this article, we will explore more Matplotlib functionality.

Changing Default Plot Size

The first thing we will do is change the default plot size. By default, the size of the Matplotlib plots is 6 x 4 inches in some environments (recent Matplotlib releases default to 6.4 x 4.8 inches). The default size can be checked using this command:

    import matplotlib.pyplot as plt
    print(plt.rcParams.get('figure.figsize'))

For a better view, you may need to change the default size of the Matplotlib graph. To do so, you can use the following script:

    fig_size = plt.rcParams["figure.figsize"]
    fig_size[0] = 10
    fig_size[1] = 8
    plt.rcParams["figure.figsize"] = fig_size

The above script changes the default size of the Matplotlib plots to 10 x 8 inches. Let's start our discussion with a simple line plot.

Line Plot

The line plot is the most basic plot in Matplotlib. It can be used to plot any function. Let's draw a line plot for the cube function. Take a look at the following script:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    plt.plot(x, y, 'b')
    plt.xlabel('X axis')
    plt.ylabel('Y axis')
    plt.title('Cube Function')
    plt.show()

In the script above we first import the pyplot module from the Matplotlib library. We have two numpy arrays, x and y, in our script. We used the linspace function of the numpy library to create an array of 20 evenly spaced numbers from -10 to 9.
We then cube every number (x ** 3) and assign the result to the variable y. To plot two numpy arrays, you can simply pass them to the plot function of the pyplot module. You can use the xlabel, ylabel and title functions of pyplot to label the x axis and the y axis and to set the title of the plot. The output of the script above looks like this: [figure]

Creating Multiple Plots

You can create more than one plot on one canvas using Matplotlib. To do so, use the subplot function, which specifies the grid layout and the position of the plot. Take a look at the following example:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3

    plt.subplot(2,2,1)
    plt.plot(x, y, 'b*-')
    plt.subplot(2,2,2)
    plt.plot(x, y, 'y--')
    plt.subplot(2,2,3)
    plt.plot(x, y, 'b*-')
    plt.subplot(2,2,4)
    plt.plot(x, y, 'y--')

The first argument to the subplot function is the number of rows in the grid of subplots and the second argument specifies the number of columns. A value of 2,2 specifies that there will be four graphs. The third argument is the position at which the graph will be displayed. The positions start from the top left. The plot with position 1 will be displayed in the first row and first column. Similarly, the plot with position 2 will be displayed in the first row and second column. Take a look at the third argument of the plot function. This argument defines the color, marker shape and line style of the graph. Output: [figure]

Plotting in Object-Oriented Way

In the previous section we used the plot function of the pyplot module and passed it values for the x and y coordinates. However, the same plot can be drawn in an object-oriented way. Take a look at the following script:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3

    figure = plt.figure()
    axes = figure.add_axes([0.2, 0.2, 0.8, 0.8])

The figure function of the pyplot module returns a figure object.
You can call the add_axes method on this object. The parameters passed to the add_axes method are the distance from the left and bottom of the figure and the width and height of the axes, respectively. These values should be given as fractions of the figure size. Executing the above script creates an empty set of axes, as shown in the following figure: [figure]

We have our axes, so now we can add data and labels to them. To add the data, we need to call the plot method and pass it our data. Similarly, to create labels for the x axis, the y axis and the title, we can use the set_xlabel, set_ylabel and set_title methods as shown below:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3

    figure = plt.figure()
    axes = figure.add_axes([0.2, 0.2, 0.8, 0.8])
    axes.plot(x, y, 'b')
    axes.set_xlabel('X Axis')
    axes.set_ylabel('Y Axis')
    axes.set_title('Cube function')

You can see that the output is similar to the one we got in the last section, but this time we used the object-oriented approach. You can add as many axes as you want to one figure using the add_axes method. Take a look at the following example:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    figure = plt.figure()
    axes = figure.add_axes([0.0, 0.0, 0.9, 0.9])
    axes2 = figure.add_axes([0.07, 0.55, 0.35, 0.3])  # inset axes

    axes.plot(x, y, 'b')
    axes.set_xlabel('X Axis')
    axes.set_ylabel('Y Axis')
    axes.set_title('Cube function')

    axes2.plot(x, z, 'r')
    axes2.set_xlabel('X Axis')
    axes2.set_ylabel('Y Axis')
    axes2.set_title('Square function')

Take a careful look at the script above. It has two sets of axes. The first contains the graph of the cube of the input, while the second draws the graph of the square of the same data as an inset within the cube graph.
This example makes the role of the left, bottom, width and height parameters clearer. For the first axes, the values for left and bottom are set to zero while the values for width and height are set to 0.9, which means that our outer axes will take up 90% of the width and height of the figure. For the second axes, left is set to 0.07 and bottom to 0.55, while width and height are 0.35 and 0.3 respectively. If you execute the script above, you will see a big graph for the cube function and a small graph for the square function that lies inside the graph for the cube. The output looks like this: [figure]

Subplots

Another way to create more than one plot at a time is to use the subplots function. You need to pass values for the nrows and ncols parameters. The total number of plots generated will be nrows x ncols. Let's take a look at a simple example. Execute the following script:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    fig, axes = plt.subplots(nrows=2, ncols=3)

In the output you will see 6 plots in 2 rows and 3 columns as shown below: [figure]

Next, we will use a loop to add the output of the square function to each of these graphs. Take a look at the following script:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    z = x ** 2

    figure, axes = plt.subplots(nrows=2, ncols=3)

    for rows in axes:
        for ax1 in rows:
            ax1.plot(x, z, 'b')
            ax1.set_xlabel('X - axis')
            ax1.set_ylabel('Y - axis')
            ax1.set_title('Square Function')

In the script above, we iterate over the axes returned by the subplots function and display the output of the square function on each of them. Remember, since we have axes in two rows and three columns, we have to execute a nested loop to iterate through all of them. The outer for loop iterates through the rows while the inner for loop iterates through the axes in each row.
The output of the script above looks like this: [figure]

In the output, you can see all six plots with square functions.

Changing Figure Size for a Plot

In addition to changing the default size of the graph, you can also change the figure size for specific graphs. To do so, you need to pass a value for the figsize parameter of the subplots function. The value for the figsize parameter should be passed as a tuple where the first value corresponds to the width and the second value corresponds to the height of the graph. Look at the following example to see how to change the size of a specific plot:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    figure, axes = plt.subplots(figsize = (6,8))
    axes.plot(x, z, 'r')
    axes.set_xlabel('X-Axis')
    axes.set_ylabel('Y-Axis')
    axes.set_title('Square Function')

The script above draws a plot for the square function that is 6 inches wide and 8 inches high. The output looks like this: [figure]

Adding Legends

Adding legends to a plot is very straightforward using the Matplotlib library. All you have to do is pass a value for the label parameter of the plot function. Then, after calling the plot function, you just need to call the legend function. Take a look at the following example:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    figure = plt.figure()
    axes = figure.add_axes([0,0,1,1])
    axes.plot(x, z, label="Square Function")
    axes.plot(x, y, label="Cube Function")
    axes.legend()

In the script above we compute two functions, square and cube, using the x, y and z variables. We first plot the square function and pass the value Square Function for the label parameter. This is the text displayed in the legend for the square function. Next, we plot the cube function and pass Cube Function as the value for the label parameter. The output looks like this: [figure]

In the output, you can see a legend at the top left corner.
The position of the legend can be changed by passing a value for the loc parameter of the legend function. The possible values include 1 (for the top right corner), 2 (for the top left corner), 3 (for the bottom left corner) and 4 (for the bottom right corner). Let's draw a legend at the bottom right corner of the plot. Execute the following script:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    figure = plt.figure()
    axes = figure.add_axes([0,0,1,1])
    axes.plot(x, z, label="Square Function")
    axes.plot(x, y, label="Cube Function")
    axes.legend(loc=4)

Output: [figure]

Color Options

There are several options to change the color and style of the plots. The simplest way is to pass the first letter of the color as the third argument, as shown in the following script:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    figure = plt.figure()
    axes = figure.add_axes([0,0,1,1])
    axes.plot(x, z, "r", label="Square Function")
    axes.plot(x, y, "g", label="Cube Function")
    axes.legend(loc=4)

In the script above, the string "r" has been passed as the third argument for the first plot. For the second plot, the string "g" has been passed as the third argument. In the output, the first plot will be drawn with a red solid line while the second plot will be drawn with a green solid line, as shown below: [figure]

Another way to change the color of the plot is to make use of the color parameter. You can pass the name of the color or the hexadecimal value of the color to the color parameter.
Take a look at the following example:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    figure = plt.figure()
    axes = figure.add_axes([0,0,1,1])
    axes.plot(x, z, color = "purple", label="Square Function")
    axes.plot(x, y, color = "#FF0000", label="Cube Function")
    axes.legend(loc=4)

Output: [figure]

Stack Plot

A stack plot is an extension of a bar chart or line chart that breaks down data from different categories and stacks them together so that values from different categories can easily be compared. Suppose you want to compare the goals scored by three different football players per year over the course of the last 8 years. You can create a stack plot with Matplotlib using the following script:

    import matplotlib.pyplot as plt

    year = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]

    player1 = [8,10,17,15,23,18,24,29]
    player2 = [10,14,19,16,25,20,26,32]
    player3 = [12,17,21,19,26,22,28,35]

    # empty plots whose only purpose is to create the legend entries
    plt.plot([],[], color='y', label = 'player1')
    plt.plot([],[], color='r', label = 'player2')
    plt.plot([],[], color='b', label = 'player3')

    plt.stackplot(year, player1, player2, player3, colors = ['y','r','b'])
    plt.legend()
    plt.title('Goals by three players')
    plt.xlabel('year')
    plt.ylabel('Goals')
    plt.show()

Output: [figure]

To create a stack plot in Python, you can simply use the stackplot function of the Matplotlib library. The values for the horizontal axis (here, the years) are passed as the first argument, and the series to be stacked on top of each other are passed as the second argument, third argument and so on. You can also set the color for each category using the colors parameter.

Pie Chart

A pie chart is a circular chart where different categories are marked as parts of the circle. The larger the share of a category, the larger the portion it occupies on the chart. Let's draw a simple pie chart of the goals scored by a football team from free kicks, penalties and field goals.
Take a look at the following script:

    import matplotlib.pyplot as plt

    goal_types = 'Penalties', 'Field Goals', 'Free Kicks'
    goals = [12,38,7]
    colors = ['y','r','b']

    plt.pie(goals, labels = goal_types, colors=colors, shadow = True, explode = (0.05, 0.05, 0.05), autopct = '%1.1f%%')
    plt.axis('equal')
    plt.show()

Output: [figure]

To create a pie chart in Matplotlib, the pie function is used. The first argument to the function is the list of numbers for each category. The sequence of category names is passed to the labels parameter. The list of colors for each category is passed to the colors parameter. If set to True, the shadow parameter draws shadows beneath the pie. Finally, the explode parameter offsets each wedge from the center, which separates the pie chart into individual parts. It is important to mention here that you do not have to pass the percentage for each category; you just pass the values and the percentages are calculated automatically.

Saving a Graph

Saving a graph is very easy in Matplotlib. All you have to do is call the savefig method of the figure object and pass it the path of the file that you want your graph to be saved to. Take a look at the following example:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(-10, 9, 20)
    y = x ** 3
    z = x ** 2

    figure, axes = plt.subplots(figsize = (6,8))
    axes.plot(x, z, 'r')
    axes.set_xlabel('X-Axis')
    axes.set_ylabel('Y-Axis')
    axes.set_title('Square Function')

    figure.savefig(r'E:/fig1.jpg')

The above script will save your file with the name fig1.jpg at the root of the E: drive.

Conclusion

Matplotlib is one of the most commonly used Python libraries for data visualization and plotting. The article explains some of the most frequently used Matplotlib functions with the help of different examples. Though the article covers most of the basics, this is just the tip of the iceberg.
I would suggest that you explore the official documentation for the Matplotlib library and see what more you can do with this amazing library.
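The two layout rules described in the article, how subplot(nrows, ncols, index) numbers its positions and how add_axes interprets its [left, bottom, width, height] fractions, can be checked with plain arithmetic. The following is only a sketch that does not call Matplotlib; the helper names subplot_cell and axes_rect_in_inches are hypothetical, and the 10 x 8 inch figure size is assumed from the rcParams example earlier in the article.

```python
def subplot_cell(nrows, ncols, index):
    """Return the 1-based (row, column) addressed by subplot(nrows, ncols, index).

    Positions are counted from the top-left corner, row by row.
    """
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    row, col = divmod(index - 1, ncols)
    return row + 1, col + 1


def axes_rect_in_inches(rect, figsize):
    """Convert add_axes-style fractions [left, bottom, width, height] to inches.

    rect holds fractions of the figure size; figsize is (width, height) in inches.
    Results are rounded to 3 decimals to hide floating-point noise.
    """
    left, bottom, width, height = rect
    fig_width, fig_height = figsize
    return (
        round(left * fig_width, 3),
        round(bottom * fig_height, 3),
        round(width * fig_width, 3),
        round(height * fig_height, 3),
    )


# Position 3 in a 2 x 2 grid is the second row, first column.
print(subplot_cell(2, 2, 3))
# The inset axes from the article on an assumed 10 x 8 inch figure.
print(axes_rect_in_inches([0.07, 0.55, 0.35, 0.3], (10, 8)))
```

Under these assumptions the inset axes start 0.7 inches from the left edge and 4.4 inches from the bottom, and measure 3.5 by 2.4 inches.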
Read more
PythonClub - A Brazilian collaborative blog about Python: Sorting Algorithms (Algoritmos de Ordenação)
Thursday, 29 November 2018

Hey everyone, how's it going? In the videos below, we will learn how to implement some sorting algorithms using Python.

Bubble Sort

How the algorithm works: How to implement the algorithm using Python: https://www.youtube.com/watch?v=Doy64STkwlI.
How to implement the algorithm using Python: https://www.youtube.com/watch?v=B0DFF0fE4rk.

Algorithm code

    def sort(array):
        for final in range(len(array), 0, -1):
            exchanging = False
            for current in range(0, final - 1):
                if array[current] > array[current + 1]:
                    array[current + 1], array[current] = array[current], array[current + 1]
                    exchanging = True
            if not exchanging:
                break

Selection Sort

How the algorithm works: How to implement the algorithm using Python: https://www.youtube.com/watch?v=vHxtP9BC-AA.
How to implement the algorithm using Python: https://www.youtube.com/watch?v=0ORfCwwhF_I.

Algorithm code

    def sort(array):
        for index in range(0, len(array)):
            min_index = index
            for right in range(index + 1, len(array)):
                if array[right] < array[min_index]:
                    min_index = right
            array[index], array[min_index] = array[min_index], array[index]

Insertion Sort

How the algorithm works: How to implement the algorithm using Python: https://www.youtube.com/watch?v=O_E-Lj5HuRU.
How to implement the algorithm using Python: https://www.youtube.com/watch?v=Sy_Z1pqMgko.

Algorithm code

    def sort(array):
        for p in range(0, len(array)):
            current_element = array[p]
            while p > 0 and array[p - 1] > current_element:
                array[p] = array[p - 1]
                p -= 1
            array[p] = current_element

Merge Sort

How the algorithm works: How to implement the algorithm using Python: https://www.youtube.com/watch?v=Lnww0ibU0XM.
How to implement the algorithm using Python - Part I: https://www.youtube.com/watch?v=cXJHETlYyVk.

Algorithm code

    def sort(array):
        sort_half(array, 0, len(array) - 1)

    def sort_half(array, start, end):
        if start >= end:
            return
        middle = (start + end) // 2
        sort_half(array, start, middle)
        sort_half(array, middle + 1, end)
        merge(array, start, end)

    def merge(array, start, end):
        array[start: end + 1] = sorted(array[start: end + 1])
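The merge function in the Merge Sort listing above delegates to Python's built-in sorted, so the merging itself is not shown. As an illustration only, and not the post author's code, a hand-written two-pointer merge could look like the sketch below. The extra middle parameter and the name merge_sort are assumptions introduced for this example.

```python
def merge(array, start, middle, end):
    """Merge the sorted runs array[start..middle] and array[middle+1..end] in place."""
    left = array[start:middle + 1]
    right = array[middle + 1:end + 1]
    i = j = 0
    k = start
    # Repeatedly copy the smaller head element back into the array.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal elements in their original order
            array[k] = left[i]
            i += 1
        else:
            array[k] = right[j]
            j += 1
        k += 1
    # At most one of the two runs still has elements; copy them over.
    remaining = left[i:] + right[j:]
    array[k:k + len(remaining)] = remaining


def merge_sort(array, start=0, end=None):
    """Sort array[start..end] in place with top-down merge sort."""
    if end is None:
        end = len(array) - 1
    if start >= end:
        return
    middle = (start + end) // 2
    merge_sort(array, start, middle)
    merge_sort(array, middle + 1, end)
    merge(array, start, middle, end)


data = [5, 2, 9, 1, 5, 6]
merge_sort(data)
print(data)
```

Copying the two runs into temporary lists keeps the sketch short at the cost of extra memory for each merge.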
Read more
Catalin George Festila: Python Qt5 - submenu example.
Thursday, 29 November 2018

Using my old example I will create a submenu with PyQt5.
First, you need to know the submenu works like the menu.
Let's see the result: [figure]
The source code is very simple:

    # -*- coding: utf-8 -*-
    """
    @author: catafest
    """
    import sys
    from PyQt5.QtWidgets import QMainWindow, QAction, qApp, QApplication, QDesktopWidget, QMenu
    from PyQt5.QtGui import QIcon

    class Example(QMainWindow):
        #init the example class to draw the window application
        def __init__(self):
            super().__init__()
            self.initUI()

        #create the def center to select the center of the screen
        def center(self):
            # geometry of the main window
            qr = self.frameGeometry()
            # center point of screen
            cp = QDesktopWidget().availableGeometry().center()
            # move rectangle's center point to screen's center point
            qr.moveCenter(cp)
            # top left of rectangle becomes top left of window centering it
            self.move(qr.topLeft())

        #create the init UI to draw the application
        def initUI(self):
            #create the action for the exit application with shortcut and icon
            #you can add new action for File menu and any actions you need
            exitAct = QAction(QIcon('exit.png'), '&Exit', self)
            exitAct.setShortcut('Ctrl+Q')
            exitAct.setStatusTip('Exit application')
            exitAct.triggered.connect(qApp.quit)
            #create the status bar for menu
            self.statusBar()
            #create the menu with the text File , add the exit action
            #you can add many items on menu with actions for each item
            menubar = self.menuBar()
            fileMenu = menubar.addMenu('&File')
            fileMenu.addAction(exitAct)
            # add submenu to menu
            submenu = QMenu('Submenu', self)
            # some dummy actions
            submenu.addAction('Submenu 1')
            submenu.addAction('Submenu 2')
            # add to the top menu
            menubar.addMenu(submenu)
            #resize the window application
            self.resize(640, 480)
            #draw on center of the screen
            self.center()
            #add title on windows application
            self.setWindowTitle('Simple menu')
            #show the application
            self.show()
            #close the UI class

    if __name__ == '__main__':
        #create the application
        app = QApplication(sys.argv)
        #use the UI with new class
        ex = Example()
        #run the UI
        sys.exit(app.exec_())
Read more
PyCharm: PyCharm 2018.3.1 RC Out Now
Thursday, 29 November 2018

PyCharm 2018.3.1 Release Candidate is now available, with various bug fixes. Get it now from our Confluence page.

Improved in This Version

A fix for the recently added WSL support in PyCharm 2018.3
A few fixes for Docker and Docker Compose
Fixes for the embedded terminal
Many fixes coming from WebStorm, DataGrip and IntelliJ IDEA; see the release notes for details

Interested?

Download the RC from our Confluence page.

If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm RC versions and stay up to date. You can find the installation instructions on our website. The release candidate (RC) is not an early access program (EAP) build, and does not bundle an EAP license. To use PyCharm Professional Edition RC, you will need a currently active PyCharm subscription. If none is available, a free 30-day trial will start.
Read more
Codementor: The Python API for Juniper Networks
Thursday, 29 November 2018

Learn about Juniper networks and PyEZ in this guest post by Eric Chou, the author of Mastering Python Networking – Second Edition...
Read more
Erik Marsja: Explorative Data Analysis with Pandas, SciPy, and Seaborn
Thursday, 29 November 2018

In this post we are going to learn to explore data using Python, Pandas, and Seaborn. The data we are going to explore comes from a Wikipedia article. We are going to learn how to parse data from a URL and how to explore this data by grouping it and visualizing it. More specifically, we will learn how to count missing values, group data to calculate the mean, and then visualize relationships between two variables, among other things. In previous posts we have used Pandas to import data from Excel and CSV files. Here we are going to use Pandas read_html because it supports reading HTML from URLs (https or http). To read HTML, Pandas uses one of the Python libraries LXML, Html5Lib, or BeautifulSoup4. This means that you have to make sure that at least one of these libraries is installed. In the specific Pandas read_html example here, we use BeautifulSoup4 to parse the HTML tables from the Wikipedia article.

Installing the Libraries

Before proceeding to the Pandas read_html example we are going to install the required libraries. In this post we are going to use Pandas, Seaborn, NumPy, SciPy, and BeautifulSoup4. We are going to use Pandas for parsing HTML and for plotting, Seaborn for data visualization, NumPy and SciPy for some calculations, and BeautifulSoup4 as the parser for the read_html method. Installing Anaconda is the easiest way to install all the packages needed. If you have the Anaconda distribution you can open up your terminal and type: conda install <packagename>. That is, if you need to install all packages:

    conda install numpy scipy pandas seaborn beautifulsoup4

It’s also possible to install using Pip:

    pip install numpy scipy pandas seaborn beautifulsoup4

How to Use Pandas read_html

In this section we will work with Pandas read_html to parse data from a Wikipedia article. The article we are going to parse has 6 tables, and there is some data we are going to explore in 5 of them.
We are going to look at Scoville Heat Units and Pod size of different chili pepper species.

    import pandas as pd

    url = 'https://en.wikipedia.org/wiki/List_of_Capsicum_cultivars'
    data = pd.read_html(url, flavor='bs4', header=0, encoding='UTF8')

In the code above we are, as usual, starting by importing pandas. After that we have a string variable (i.e., url) that holds the URL. We are then using Pandas read_html to parse the HTML from the URL. As with the read_csv and read_excel methods, the parameter header is used to tell Pandas read_html on which row the headers are. In this case, it’s the first row. The parameter flavor is used, here, to make use of BeautifulSoup4 as the HTML parser. If we use LXML, some columns in the dataframe will be empty. Anyway, what we get is all the tables from the URL. These tables are, in turn, stored in a list (data). In this Pandas read_html example the last table is not of interest, so we are going to remove this dataframe from the list:

    # Let's remove the last table
    del data[-1]

Merging Pandas Dataframes

The aim of this post is to explore the data, and what we need to do now is to add a column to each dataframe in the list. This column will hold information about the species, so we create a list of strings. In the following for-loop we are adding a new column, named “Species”, and we fill it with the species name from the list.

    species = ['Capsicum annum', 'Capsicum baccatum', 'Capsicum chinense',
               'Capsicum frutescens', 'Capsicum pubescens']

    for i in range(len(species)):
        data[i]['Species'] = species[i]

Finally, we are going to concatenate the list of dataframes using Pandas concat:

    df = pd.concat(data, sort=False)
    df.head()

The data we obtained using Pandas read_html can, of course, be saved locally using either Pandas to_csv or to_excel, among other methods.
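As a minimal sketch of the saving step, the snippet below writes a dataframe to CSV and reads it back. It uses an in-memory buffer instead of a file on disk so that it runs anywhere; the small dataframe and its values are made up for illustration and only stand in for the concatenated df from the post.

```python
import io

import pandas as pd

# Hypothetical stand-in for the concatenated dataframe `df` (values are invented).
df = pd.DataFrame({
    "Name": ["Pepper A", "Pepper B"],
    "Heat": ["1,000–2,000", "30,000–50,000"],
    "Species": ["Capsicum annuum", "Capsicum chinense"],
})

# Write without the index column; a file path such as "peppers.csv"
# could be passed instead of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new dataframe.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing index=False leaves out the row index, which would otherwise come back as an extra unnamed column when the file is read again.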
See the two following tutorials on how to work with these methods and file formats: Pandas Read CSV Tutorial, Pandas Excel Tutorial.

Preparing the Data

Now that we have used Pandas read_html and merged the dataframes we need to clean up the data a bit. We are going to use the method map together with lambda and regular expressions (i.e., sub, findall) to remove and extract certain things from the cells. We are also using the split and rstrip methods to split the strings into pieces. In this example we want the centimeter values. Because of the missing values in the data we have to check whether the value from a cell (x, in this case) is a string. If not, we will use NumPy’s NaN to code that it is a missing value.

    import re
    import numpy as np

    # Remove brackets and what's between them (e.g. [14])
    df['Name'] = df['Name'].map(lambda x: re.sub(r"[\(\[].*?[\)\]]", "", x)
                                if isinstance(x, str) else np.nan)

    # Pod size: get the cm value
    df['Pod size'] = df['Pod size'].map(lambda x: x.split(' ', 1)[0].rstrip('cm')
                                        if isinstance(x, str) else np.nan)

    # Take the largest number in a range
    df['Pod size'] = df['Pod size'].map(lambda x: x.split('–', 1)[-1]
                                        if isinstance(x, str) else np.nan)

    # Convert to float
    df['Pod size'] = df['Pod size'].map(lambda x: float(x))

    # Take the largest SHU
    df['Heat'] = df['Heat'].map(lambda x: re.sub(r"[\(\[].*?[\)\]]", "", x)
                                if isinstance(x, str) else np.nan)
    df['Heat'] = df['Heat'].str.replace(',', '')
    df['Heat'] = df['Heat'].map(lambda x: float(re.findall(r'\d+(?:,\d+)?', x)[-1])
                                if isinstance(x, str) else np.nan)

Explorative Data Analysis in Python

In this section we are going to explore the data using Pandas and Seaborn. First we are going to see how many missing values we have, count how many occurrences we have of one factor, and then group the data and calculate the mean values for the variables.

Counting Missing Values

The first thing we are going to do is to count the number of missing values in the different columns.
We are going to do this using the isna and sum methods:

    df.isna().sum()

Later in the post we are going to explore the relationship between the heat and the pod size of chili peppers. Note that there is a lot of missing data in both of these columns.

Counting Categorical Data in a Column

We can also count how many factors (or categorical data; i.e., strings) we have in a column by selecting that column and using the Pandas Series method value_counts:

    df['Species'].value_counts()

Aggregating by Group

We can also calculate the mean Heat and Pod size for each species using the Pandas groupby and mean methods:

    # numeric_only=True skips text columns such as Name, which recent
    # Pandas versions no longer drop silently
    df_aggregated = df.groupby('Species').mean(numeric_only=True).reset_index()
    df_aggregated

There are of course many other ways to explore your data using Pandas methods (e.g., value_counts, mean, groupby). See the posts Descriptive Statistics using Python and Data Manipulation with Pandas for more information.

Data Visualization using Pandas and Seaborn

In this section we are going to visualize the data using Pandas and Seaborn. We are going to start by exploring whether there is a relationship between the size of the chili pod (‘Pod size’) and the heat of the chili pepper (Scoville Heat Units).

Pandas Scatter Plot

In the first scatter plot, we are going to use the Pandas built-in method ‘scatter’. In this basic example we are going to have pod size on the x-axis and heat on the y-axis. We are also making the points dark blue by using the parameter c.

    ax1 = df.plot.scatter(x='Pod size', y='Heat', c='DarkBlue')

There seems to be a linear relationship between heat and pod size. However, we have an outlier in the data and the pattern may be clearer if we remove it. Thus, in the next Pandas scatter plot example we are going to subset the dataframe, taking only values under 1,400,000 SHU:

    ax1 = df.query('Heat < 1400000').plot.scatter(x='Pod size', y='Heat', c='DarkBlue', figsize=(8, 6))

We used Pandas query to select the rows where the value in the column ‘Heat’ is lower than the chosen threshold.
The resulting scatter plot shows a more convincing pattern: [figure]

We still have some possible outliers (around 300,000 – 35000 SHU) but we are going to leave them. Note that I used the parameter figsize=(8, 6) in both plots above to get the dimensions of the posted images. That is, if you want to change the dimensions of the Pandas plots you should use figsize. Now we would like to plot a regression line on the Pandas scatter plot. As far as I know, this is not possible (please comment below if you know a solution and I will add it). Therefore, we are now going to use Seaborn to visualize the data, as it gives us more control and options over our graphics.

Data Visualization using Seaborn

In this section we are going to continue exploring the data using the Python package Seaborn. We start with scatter plots.

Seaborn Scatter Plot

Creating a scatter plot using Seaborn is very easy. In the basic scatter plot example below we are, as in the Pandas example, using the parameters x and y (x-axis and y-axis, respectively). However, we also have to use the parameter data to pass our dataframe.

    import seaborn as sns

    ax = sns.regplot(x="Pod size", y="Heat", data=df.query('Heat < 1400000'))

Correlation in Python

Judging from the plot above there seems to be a relationship between the variables of interest. The next thing we are going to do is to see if this visual pattern also shows up as a statistical association (i.e., correlation). To this end, we are going to use SciPy and the pearsonr function. We start by importing pearsonr from scipy.stats.

    from scipy.stats import pearsonr

As we found out when exploring the data using Pandas groupby, there was a lot of missing data (both for heat and pod size). When calculating the correlation coefficient using Python we need to remove the missing values.
Again, we are also removing the strongest chili pepper using Pandas query.

    df_full = df[['Heat', 'Pod size']].dropna()
    df_full = df_full.query('Heat < 1400000')
    print(len(df_full))
    # Output: 31

Note, in the example above we are selecting the columns “Heat” and “Pod size” only. If we want to keep the other variables but only have complete cases we can use the subset parameter (df_full = df.dropna(subset=['Heat', 'Pod size'])). That said, we now have a subset of our dataframe with 31 complete cases and it’s time to carry out the correlation. It’s quite simple: we just put in the variables of interest. We are going to display the correlation coefficient and p-value on the scatter plot later, so we use NumPy’s round to round the values.

    corr = pearsonr(df_full['Heat'], df_full['Pod size'])
    corr = [np.round(c, 2) for c in corr]
    print(corr)
    # Output: [-0.37, 0.04]

Seaborn Correlation Plot with Trend Line

It’s time to stitch everything together! First, we are creating a text string for displaying the correlation coefficient (r=-0.37) and the p-value (p=0.04). Second, we are creating the correlation plot using Seaborn regplot, as in the previous example. To display the text we use the text method; the first parameter is the x coordinate and the second is the y coordinate. After the coordinates we have our text and the size of the font. We are also using set_title to add a title to the Seaborn plot and we are changing the x- and y-labels using the set method.

    text = 'r=%s, p=%s' % (corr[0], corr[1])
    ax = sns.regplot(x="Pod size", y="Heat", data=df_full)
    ax.text(10, 300000, text, fontsize=12)
    ax.set_title('Capsicum')
    ax.set(xlabel='Pod size (cm)', ylabel='Scoville Heat Units (SHU)')

Pandas Bar Plot Example

Now we are going to visualize some other aspects of the data. We are going to use the aggregated data (grouped by using Pandas groupby) to visualize the mean heat across species.
We start by using the Pandas bar plot method (plot.bar):

    # numeric_only=True skips text columns such as Name, which recent
    # Pandas versions no longer drop silently
    df_aggregated = df.groupby('Species').mean(numeric_only=True).reset_index()
    df_aggregated.plot.bar(x='Species', y='Heat')

In the image above, we can see that the mean heat is highest for the Capsicum chinense species. However, the bar graph may hide important information (remember, the scatter plot revealed some outliers). We are therefore continuing with a categorical scatter plot using Seaborn:

Grouped Scatter Plot with Seaborn

Here, we don’t add that much compared to the previous Seaborn scatter plot examples. However, we need to rotate the tick labels on the x-axis using set_xticklabels and the parameter rotation.

    ax = sns.catplot(x='Species', y='Heat', data=df)
    ax.set(xlabel='Capsicum Species', ylabel='Scoville Heat Units (SHU)')
    ax.set_xticklabels(rotation=70)

Conclusion

Now we have learned how to explore data using Python, Pandas, NumPy, SciPy, and Seaborn. Specifically, we have learned how to use Pandas read_html to parse HTML from a URL, clean up the data in the columns (e.g., remove unwanted information), create scatter plots both in Pandas and Seaborn, visualize grouped data, and create categorical scatter plots in Seaborn. We now have an idea of how to change the rotation of the axis tick labels, change the x- and y-axis labels, and add a title to Seaborn plots. The post Explorative Data Analysis with Pandas, SciPy, and Seaborn appeared first on Erik Marsja.
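The cleaning step from the Preparing the Data section (strip bracketed notes, drop thousands separators, keep the last number of a range) can also be written as one small, testable function. This is a sketch under assumptions rather than the post's own code: the name largest_number and the sample cell strings are made up, and it relies on ranges being written in ascending order so that the last number is the largest.

```python
import math
import re

BRACKETED = re.compile(r"[\(\[].*?[\)\]]")  # notes such as "[14]" or "(1-2 in)"
NUMBER = re.compile(r"\d+(?:\.\d+)?")       # integers and simple decimals


def largest_number(cell):
    """Return the last number found in a table cell as a float.

    Non-string cells (missing values) and strings without digits give NaN.
    """
    if not isinstance(cell, str):
        return math.nan
    # Drop bracketed notes first, then the thousands separators.
    text = BRACKETED.sub("", cell).replace(",", "")
    numbers = NUMBER.findall(text)
    return float(numbers[-1]) if numbers else math.nan


# Hypothetical cell values in the style of the Wikipedia tables.
print(largest_number("30,000–50,000 SHU[14]"))
print(largest_number("2.5–5 cm (1–2 in)"))
print(largest_number(None))
```

A function like this could be passed to map on a column, for example df['Heat'].map(largest_number), in place of the chained lambdas.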
Read more
