Planet Python

ListenData: NumPy Tutorial with Exercises
Friday, 19 April 2019

NumPy (short for 'Numerical Python' or 'Numeric Python') is one of the most essential packages for fast mathematical computation on arrays and matrices in Python. It is also very useful for handling multidimensional data, it integrates well with C, C++ and FORTRAN tools, and it provides numerous functions for Fourier transforms (FT) and linear algebra.

Python : Numpy Tutorial

Why NumPy instead of lists?
One might ask why we should prefer NumPy arrays when we can create lists holding the same data. If that question rings a bell, the following reasons may convince you:
1. NumPy arrays have contiguous memory allocation, so the same data stored as a list requires more space than as an array.
2. They are faster to work with and hence more efficient than lists.
3. They are more convenient to deal with.

NumPy vs. Pandas
Pandas is built on top of NumPy; in other words, NumPy is required for pandas to work. So pandas is not an alternative to NumPy. Instead, pandas offers additional methods and a more streamlined way of working with numerical and tabular data in Python.

Importing numpy
First you need to import the numpy library:
    import numpy as np
It is the general convention to import numpy under the alias 'np'. Without the alias you would have to write numpy.function to access a function; the alias lets you write np.function instead. Some common numpy functions are listed below:

    Function   Task
    array      Create a numpy array
    ndim       Dimension (number of axes) of the array
    shape      Size of the array (number of rows and columns)
    size       Total number of elements in the array
    dtype      Type of the elements in the array, e.g. int64
    reshape    Reshape the array without changing the original shape
    resize     Reshape the array, changing the original shape
    arange     Create a sequence of numbers as an array
    itemsize   Size in bytes of each item
    diag       Create a diagonal matrix
    vstack     Stack arrays vertically
    hstack     Stack arrays horizontally

1D array
Using numpy, an array is created with np.array:
    a = np.array([15,25,14,78,96])
    a
    Output: array([15, 25, 14, 78, 96])
    print(a)
    Output: [15 25 14 78 96]
Notice the square brackets inside np.array; omitting them raises an error. To print the array we can use print(a).

Changing the datatype
np.array( ) has an additional parameter, dtype, through which you can specify whether the elements are integers, floating-point numbers or complex numbers:
    a.dtype
    a = np.array([15,25,14,78,96], dtype = "float")
    a.dtype
Initially the datatype of 'a' was 'int32'; after the change it becomes 'float64'.
int32 refers to a number without a decimal point; '32' means the number can lie between -2147483648 and 2147483647. Similarly, int16 implies the number can lie in the range -32768 to 32767. float64 refers to a number with a decimal place.

Creating a sequence of numbers
To create a sequence of numbers we use np.arange. To get the sequence from 20 to 29 we run:
    b = np.arange(start = 20, stop = 30, step = 1)
    b
    Output: array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
In np.arange the end point is always excluded. The step argument defines the difference between two consecutive numbers; if step is not provided it defaults to 1. Suppose we want an arithmetic progression with initial term 20 and common difference 2, up to 30, with 30 excluded:
    c = np.arange(20,30,2)   # 30 is excluded
    Output: array([20, 22, 24, 26, 28])
Keep in mind that in np.arange( ) the stop argument is always excluded.

Indexing in arrays
It is important to note that Python indexing starts from 0.
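The dtype and arange behaviour described above can be checked with a short, self-contained sketch (variable names mirror the tutorial's examples, but this snippet itself is illustrative, not from the original):

```python
import numpy as np

# dtype controls how the elements are stored
a = np.array([15, 25, 14, 78, 96], dtype="float")
print(a.dtype)            # float64

# arange excludes the stop value, like Python's built-in range()
b = np.arange(20, 30)     # 20 .. 29, step defaults to 1
c = np.arange(20, 30, 2)  # common difference 2, 30 excluded
print(len(b))             # 10
print(c)                  # [20 22 24 26 28]
```

Running this confirms that the stop value never appears in the result of np.arange.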
The syntax of indexing is as follows:
    x[start:end:step] : elements from start through end (end excluded), with the given step (default 1)
    x[start:end] : elements from start through end (end excluded)
    x[start:] : elements from start through the last element
    x[:end] : elements from the beginning through end (end excluded)
To extract the 3rd element we write index 2, since indexing starts from 0:
    x = np.arange(10)
    x
    Output: [0 1 2 3 4 5 6 7 8 9]
    x[2]
    Output: 2
    x[2:5]
    Output: array([2, 3, 4])
    x[::2]
    Output: array([0, 2, 4, 6, 8])
    x[1::2]
    Output: array([1, 3, 5, 7, 9])
Note that x[2:5] selects elements from index 2 up to index 5 (exclusive). If we want to set every third element, from the start up to (but excluding) index 7, to 123, we write:
    x[:7:3] = 123
    x
    Output: array([123, 1, 2, 123, 4, 5, 123, 7, 8, 9])
To reverse a given array we write:
    x = np.arange(10)
    x[::-1]   # reversed x
    Output: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
Note that the above command does not modify the original array.

Reshaping arrays
To reshape an array we can use reshape( ):
    f = np.arange(101,113)
    f.reshape(3,4)
    f
    Output: array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])
Note that reshape() does not alter the shape of the original array. To modify the original array we can use resize( ):
    f.resize(3,4)
    f
    Output: array([[101, 102, 103, 104],
                   [105, 106, 107, 108],
                   [109, 110, 111, 112]])
If a dimension is given as -1 in a reshape, the other dimension is calculated automatically, provided the total number of elements is divisible by the given dimensions:
    f.reshape(3,-1)
    Output: array([[101, 102, 103, 104],
                   [105, 106, 107, 108],
                   [109, 110, 111, 112]])
In the above code we only specified 3 rows; Python automatically calculated the other dimension, i.e. 4 columns.

Missing Data
Missing data is represented by NaN (short for Not a Number).
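Since NaN propagates through ordinary arithmetic, a quick hedged sketch of common ways to summarise and replace missing values may help (np.nansum, np.nanmean and np.isnan are standard NumPy; this particular snippet is illustrative, not from the tutorial):

```python
import numpy as np

val = np.array([15.0, 10.0, np.nan, 3.0, 2.0])

print(val.sum())            # nan  - NaN propagates through a plain sum
print(np.nansum(val))       # 30.0 - nansum skips missing values
print(np.isnan(val).sum())  # 1    - count of missing entries

# replace the missing values with the mean of the observed ones
val[np.isnan(val)] = np.nanmean(val)
print(val)                  # [15.  10.   7.5  3.   2. ]
```

The masking assignment in the last step is the same boolean-indexing idea covered later in this tutorial.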
You can use np.nan:
    val = np.array([15, 10, np.nan, 3, 2, 5, 6, 4])
    val.sum()
    Output: nan
To ignore missing values, use np.nansum(val), which returns 45. To check whether an array contains missing values, use isnan( ):
    np.isnan(val)

2D arrays
A 2D array in numpy can be created in the following manner:
    g = np.array([(10,20,30),(40,50,60)])
    # Alternatively
    g = np.array([[10,20,30],[40,50,60]])
The dimension, total number of elements and shape are given by ndim, size and shape respectively:
    g.ndim
    Output: 2
    g.size
    Output: 6
    g.shape
    Output: (2, 3)

Creating some common matrices
numpy provides utilities to create matrices that are commonly used in linear algebra. To create a matrix of all zeros with 2 rows and 4 columns we use np.zeros( ):
    np.zeros((2,4))
    Output: array([[ 0., 0., 0., 0.],
                   [ 0., 0., 0., 0.]])
The dtype can also be specified; for a zero matrix the default dtype is 'float'. To change it to integer we write dtype = np.int16:
    np.zeros([2,4], dtype=np.int16)
    Output: array([[0, 0, 0, 0],
                   [0, 0, 0, 0]], dtype=int16)
np.empty returns a matrix of uninitialized (arbitrary) values:
    np.empty((2,3))
    Output: array([[ 2.16443571e-312, 2.20687562e-312, 2.24931554e-312],
                   [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
Note: the results may vary every time you run np.empty.
To create a matrix of ones we write np.ones( ). We can create a 3 x 3 matrix of all ones by:
    np.ones([3,3])
    Output: array([[ 1., 1., 1.],
                   [ 1., 1., 1.],
                   [ 1., 1., 1.]])
To create a diagonal matrix we can use np.diag( ).
To create a diagonal matrix with diagonal elements 14, 15, 16 and 17 we write:
    np.diag([14,15,16,17])
    Output: array([[14,  0,  0,  0],
                   [ 0, 15,  0,  0],
                   [ 0,  0, 16,  0],
                   [ 0,  0,  0, 17]])
To create an identity matrix we can use np.eye( ):
    np.eye(5, dtype = "int")
    Output: array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1]])
By default the datatype in np.eye( ) is 'float', so we pass dtype = "int" to get integers.

Reshaping 2D arrays
To get a flattened 1D array we can use ravel( ):
    g = np.array([(10,20,30),(40,50,60)])
    g.ravel()
    Output: array([10, 20, 30, 40, 50, 60])
To change the shape of a 2D array we can use reshape. Writing -1 calculates the other dimension automatically, and reshape does not modify the original array:
    g.reshape(3,-1)   # returns the array with a modified shape
    # It does not modify the original array
    g.shape
    Output: (2, 3)
As with 1D arrays, resize( ) modifies the shape of the original array:
    g.resize((3,2))
    g   # resize modifies the original array
    Output: array([[10, 20],
                   [30, 40],
                   [50, 60]])

Time for some matrix algebra
Let us create arrays A, b and B, which will be used throughout this section:
    A = np.array([[2,0,1],[4,3,8],[7,6,9]])
    b = np.array([1,101,14])
    B = np.array([[10,20,30],[40,50,60],[70,80,90]])
To get the transpose, trace and inverse we use A.transpose( ), np.trace( ) and np.linalg.inv( ) respectively:
    A.T   # transpose, same as A.transpose()
    Output: array([[2, 4, 7],
                   [0, 3, 6],
                   [1, 8, 9]])
    np.trace(A)   # trace
    Output: 14
    np.linalg.inv(A)   # inverse
    Output: array([[ 0.53846154, -0.15384615,  0.07692308],
                   [-0.51282051, -0.28205128,  0.30769231],
                   [-0.07692308,  0.30769231, -0.15384615]])
Note that transpose does not modify the original array. Matrix addition and subtraction work in the usual way:
    A + B
    Output: array([[12, 20, 31],
                   [44, 53, 68],
                   [77, 86, 99]])
    A - B
    Output: array([[ -8, -20, -29],
                   [-36, -47, -52],
                   [-63, -74, -81]])
Matrix multiplication of A and B can be accomplished by A.dot(B).
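The transpose, trace and inverse results above can be sanity-checked numerically; here is a minimal sketch (the checks themselves are not part of the original tutorial):

```python
import numpy as np

A = np.array([[2, 0, 1], [4, 3, 8], [7, 6, 9]])

# A multiplied by its inverse should give the identity (up to rounding)
A_inv = np.linalg.inv(A)
print(np.allclose(A.dot(A_inv), np.eye(3)))  # True

# the trace equals the sum of the diagonal elements
print(np.trace(A) == A.diagonal().sum())     # True

# transposing twice returns the original matrix
print(np.array_equal(A.T.T, A))              # True
```

np.allclose is used instead of exact equality because floating-point inversion introduces tiny rounding errors.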
Here A is the first matrix, on the left-hand side, and B the second matrix, on the right:
    A.dot(B)
    Output: array([[  90,  120,  150],
                   [ 720,  870, 1020],
                   [ 940, 1160, 1380]])
To solve the system of linear equations Ax = b we use np.linalg.solve( ):
    np.linalg.solve(A,b)
    Output: array([-13.92307692, -24.69230769,  28.84615385])
The eigenvalues and eigenvectors can be calculated with np.linalg.eig( ):
    np.linalg.eig(A)
    Output: (array([14.0874236 ,  1.62072127, -1.70814487]),
             array([[ 0.06599631,  0.78226966,  0.14996331],
                    [ 0.59939873, -0.54774477,  0.81748379],
                    [ 0.7977253 , -0.29669824, -0.55608566]]))
The first array contains the eigenvalues, and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the eigenvalue at the same position.

Some mathematical functions
numpy offers the usual trigonometric functions such as sine and cosine:
    B = np.array([[0,20,36],[40,50,1]])
    np.sin(B)
    Output: array([[ 0.        ,  0.91294525, -0.99177885],
                   [ 0.74511316, -0.26237485,  0.84147098]])
The result is the matrix of sin( ) applied to every element. To raise the elements to a power we use **:
    B**2
    Output: array([[   0,  400, 1296],
                   [1600, 2500,    1]], dtype=int32)
We get the matrix of the squares of all elements of B. To test whether the elements of a matrix satisfy a condition, we simply write the criterion.
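The solve and eig results can be cross-checked the same way; a short sketch using the same A and b as above (the verification code is illustrative, not from the tutorial):

```python
import numpy as np

A = np.array([[2, 0, 1], [4, 3, 8], [7, 6, 9]])
b = np.array([1, 101, 14])

# x solves Ax = b, so A.dot(x) should reproduce b
x = np.linalg.solve(A, b)
print(np.allclose(A.dot(x), b))                   # True

# eigenvalues sum to the trace and multiply to the determinant
vals, vecs = np.linalg.eig(A)
print(np.isclose(vals.sum(), np.trace(A)))        # True
print(np.isclose(vals.prod(), np.linalg.det(A)))  # True
```

The trace and determinant identities are a handy way to catch transcription errors in reported eigenvalues.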
For instance, to check whether the elements of B are greater than 25 we write:
    B > 25
    Output: array([[False, False,  True],
                   [ True,  True, False]], dtype=bool)
We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False that it is not. Similarly, np.absolute, np.sqrt and np.exp return the matrices of absolute values, square roots and exponentials respectively:
    np.absolute(B)
    np.sqrt(B)
    np.exp(B)
Now consider a matrix A of shape 3 x 3:
    A = np.arange(1,10).reshape(3,3)
    A
    Output: array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
To find the sum, minimum, maximum, mean, standard deviation and variance we use the following commands:
    A.sum()
    Output: 45
    A.min()
    Output: 1
    A.max()
    Output: 9
    A.mean()
    Output: 5.0
    A.std()   # standard deviation
    Output: 2.5819888974716112
    A.var()   # variance
    Output: 6.666666666666667
To obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively:
    A.argmin()
    Output: 0
    A.argmax()
    Output: 8
To compute these statistics for each row or column we specify the axis:
    A.sum(axis=0)   # sum of each column, moving downwards
    Output: array([12, 15, 18])
    A.mean(axis = 0)
    Output: array([ 4., 5., 6.])
    A.std(axis = 0)
    Output: array([ 2.44948974, 2.44948974, 2.44948974])
    A.argmin(axis = 0)
    Output: array([0, 0, 0], dtype=int64)
With axis = 0 the calculation moves in the downward direction, i.e. it gives the statistic for each column.
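The axis convention can be summarised in one self-contained sketch (illustrative, using the same 3 x 3 matrix):

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)

# axis=0 collapses the rows: one result per column
print(A.sum(axis=0))   # [12 15 18]

# axis=1 collapses the columns: one result per row
print(A.sum(axis=1))   # [ 6 15 24]

# without an axis, the whole array is reduced to a scalar
print(A.sum())         # 45
```

A useful way to remember it: the axis you pass is the axis that disappears from the result's shape.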
To find the minimum and the index of the maximum element for each row, we move in the rightward direction, so we write axis = 1:
    A.min(axis=1)   # min of each row
    Output: array([1, 4, 7])
    A.argmax(axis = 1)
    Output: array([2, 2, 2], dtype=int64)
To find the cumulative sum along each row we use cumsum( ):
    A.cumsum(axis=1)
    Output: array([[ 1,  3,  6],
                   [ 4,  9, 15],
                   [ 7, 15, 24]], dtype=int32)

Creating 3D arrays
Numpy also provides the facility to create 3D arrays:
    X = np.array([[[1,2,3],[4,5,6]], [[7,8,9],[10,11,12]]])
    X.shape
    X.ndim
    X.size
X contains two 2D arrays, so its shape is (2, 2, 3) and the total number of elements is 12. To calculate the sum along a particular axis we use the axis parameter:
    X.sum(axis = 0)
    Output: array([[ 8, 10, 12],
                   [14, 16, 18]])
    X.sum(axis = 1)
    Output: array([[ 5,  7,  9],
                   [17, 19, 21]])
    X.sum(axis = 2)
    Output: array([[ 6, 15],
                   [24, 33]])
axis = 0 sums the corresponding elements of the two 2D arrays; axis = 1 sums the elements of each column within each matrix; axis = 2 sums each row within each matrix.
    X.ravel()
    Output: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ravel( ) writes all the elements into a single 1D array.
Consider the same 3D array. To extract the 2nd matrix we write:
    X[1,...]   # same as X[1,:,:] or X[1]
    Output: array([[ 7,  8,  9],
                   [10, 11, 12]])
Remember that Python indexing starts from 0, which is why we wrote 1 to extract the 2nd 2D array. To extract the first element from all the rows we write:
    X[...,0]   # same as X[:,:,0]
    Output: array([[ 1,  4],
                   [ 7, 10]])

Finding the positions of elements that satisfy a condition
    a = np.array([8, 3, 7, 0, 4, 2, 5, 2])
    np.where(a > 4)
    Output: (array([0, 2, 6]),)
np.where locates the positions in the array where the elements are greater than 4.
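Besides locating positions, np.where can also choose between two values element-wise; a quick hedged sketch (the three-argument form is standard NumPy, the example itself is illustrative):

```python
import numpy as np

a = np.array([8, 3, 7, 0, 4, 2, 5, 2])

# one-argument form: positions where the condition holds
# (returns a tuple of index arrays, one per dimension)
idx = np.where(a > 4)
print(idx[0])                    # [0 2 6]

# three-argument form: pick the 2nd value where the condition is
# True and the 3rd where it is False - here, cap every element at 4
clipped = np.where(a > 4, 4, a)
print(clipped)                   # [4 3 4 0 4 2 4 2]
```

The three-argument form is a vectorized if/else and avoids an explicit Python loop.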
Indexing with Arrays of Indices
Consider a 1D array:
    x = np.arange(11,35,2)
    x
    Output: array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
We form a 1D array i which subsets the elements of x as follows:
    i = np.array([0,1,5,3,7,9])
    x[i]
    Output: array([11, 13, 21, 17, 25, 29])
In a similar manner we can create a 2D array j of indices to subset x:
    j = np.array([[0,1],[6,2]])
    x[j]
    Output: array([[11, 13],
                   [23, 15]])
Similarly we can use both i and j as 2D arrays of indices into x:
    x = np.arange(15).reshape(3,5)
    x
    i = np.array([[0,1],   # indices for the first dim
                  [2,0]])
    j = np.array([[1,1],   # indices for the second dim
                  [2,0]])
To pick the element at row i and column j, element-wise, we write:
    x[i,j]   # i and j must have equal shape
    Output: array([[ 1,  6],
                   [12,  0]])
To extract the i-th rows from the 3rd column we write:
    x[i,2]
    Output: array([[ 2,  7],
                   [12,  2]])
To take the j-th columns of every row we write:
    x[:,j]
    Output: array([[[ 1,  1],
                    [ 2,  0]],
                   [[ 6,  6],
                    [ 7,  5]],
                   [[11, 11],
                    [12, 10]]])
i.e. the j-th indices of the 1st row, then of the 2nd row, then of the 3rd row.
You can also use indexing with arrays to assign values:
    x = np.arange(10)
    x[[4,5,8,1,2]] = 0
    x
    Output: array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
0 is assigned at indices 4, 5, 8, 1 and 2 of x. When the list of indices contains repetitions, the last value assigned to a repeated index wins:
    x = np.arange(10)
    x[[4,4,2,3]] = [100,200,300,400]
    x
    Output: array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
Notice that for index 4 the value assigned is 200, not 100.
Caution: with the += operator, repeated indices are incremented only once:
    x = np.arange(10)
    x[[1,1,1,7,7]] += 1
    x
    Output: array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
Although indices 1 and 7 are repeated, they are incremented only once.

Indexing with Boolean Arrays
We create a 2D array and store a condition in b.
Where the condition holds, b is True; elsewhere it is False:
    a = np.arange(12).reshape(3,4)
    b = a > 4
    b
    Output: array([[False, False, False, False],
                   [False,  True,  True,  True],
                   [ True,  True,  True,  True]], dtype=bool)
Note that 'b' is a Boolean array with the same shape as 'a'. To select the elements of 'a' that satisfy condition 'b' we write:
    a[b]
    Output: array([ 5,  6,  7,  8,  9, 10, 11])
The result is a 1D array of the selected elements. This property can be very useful in assignments:
    a[b] = 0
    a
    Output: array([[0, 1, 2, 3],
                   [4, 0, 0, 0],
                   [0, 0, 0, 0]])
All elements of 'a' greater than 4 become 0.
As with integer indexing, we can index via Booleans. Let x be the original matrix and 'y' and 'z' the Boolean arrays selecting rows and columns:
    x = np.arange(15).reshape(3,5)
    y = np.array([True,True,False])              # first dim selection
    z = np.array([True,True,False,True,False])   # second dim selection
Writing x[y,:] selects only those rows where y is True:
    x[y,:]   # selecting rows
    Output: array([[0, 1, 2, 3, 4],
                   [5, 6, 7, 8, 9]])
    x[y]     # same thing
Writing x[:,z] selects only those columns where z is True:
    x[:,z]   # selecting columns
    Output: array([[ 0,  1,  3],
                   [ 5,  6,  8],
                   [10, 11, 13]])

Statistics on a Pandas DataFrame
Let's create a dummy data frame for illustration:
    import pandas as pd
    np.random.seed(234)
    mydata = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
                           "x2" : range(10)})
1. Calculate the mean of each column of the data frame:
    np.mean(mydata)
2. Calculate the median of each column of the data frame:
    np.median(mydata, axis=0)
axis = 0 means the median function is run on each column.
axis = 1 implies the function is run on each row.

Stacking arrays
Let us consider two arrays A and B:
    A = np.array([[10,20,30],[40,50,60]])
    B = np.array([[100,200,300],[400,500,600]])
To join them vertically we use np.vstack( ):
    np.vstack((A,B))   # stacking vertically
    Output: array([[ 10,  20,  30],
                   [ 40,  50,  60],
                   [100, 200, 300],
                   [400, 500, 600]])
To join them horizontally we use np.hstack( ):
    np.hstack((A,B))   # stacking horizontally
    Output: array([[ 10,  20,  30, 100, 200, 300],
                   [ 40,  50,  60, 400, 500, 600]])
newaxis helps transform a 1D row vector into a column vector:
    from numpy import newaxis
    a = np.array([4.,1.])
    b = np.array([2.,8.])
    a[:,newaxis]
    Output: array([[ 4.],
                   [ 1.]])
The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. It is equivalent to hstack only for 1D arrays:
    np.column_stack((a[:,newaxis], b[:,newaxis]))
    Output: array([[ 4., 2.],
                   [ 1., 8.]])
    np.hstack((a[:,newaxis], b[:,newaxis]))   # same as column_stack
    Output: array([[ 4., 2.],
                   [ 1., 8.]])

Splitting arrays
Consider an array 'z' of 15 elements:
    z = np.arange(1,16)
Using np.hsplit( ) one can split the array:
    np.hsplit(z,5)   # split z into 5 arrays
    Output: [array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9]),
             array([10, 11, 12]), array([13, 14, 15])]
It splits 'z' into 5 arrays of equal length. On passing a tuple of positions we get:
    np.hsplit(z,(3,5))
    Output: [array([1, 2, 3]), array([4, 5]),
             array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
It splits 'z' after the third and the fifth element. For 2D arrays np.hsplit( ) works as follows:
    A = np.arange(1,31).reshape(3,10)
    A
    np.hsplit(A,5)   # split A into 5 arrays
    Output: [array([[ 1,  2],
                    [11, 12],
                    [21, 22]]),
             array([[ 3,  4],
                    [13, 14],
                    [23, 24]]),
             array([[ 5,  6],
                    [15, 16],
                    [25, 26]]),
             array([[ 7,  8],
                    [17, 18],
                    [27, 28]]),
             array([[ 9, 10],
                    [19, 20],
                    [29, 30]])]
In the above command A gets split into 5 arrays of the same shape. To split after the third and the fifth column we write:
    np.hsplit(A,(3,5))
    Output: [array([[ 1,  2,  3],
                    [11, 12, 13],
                    [21, 22, 23]]),
             array([[ 4,  5],
                    [14, 15],
                    [24, 25]]),
             array([[ 6,  7,  8,  9, 10],
                    [16, 17, 18, 19, 20],
                    [26, 27, 28, 29, 30]])]

Copying
Consider an array x:
    x = np.arange(1,16)
We assign y to x and then test 'y is x':
    y = x
    y is x
    Output: True
Let us change the shape of y:
    y.shape = 3,5
Note that this alters the shape of x as well:
    x.shape
    Output: (3, 5)

Creating a view of the data
Let us store z as a view of x:
    z = x.view()
    z is x
    Output: False
Thus z is not x. Changing the shape of z does not alter the shape of x:
    z.shape = 5,3
    x.shape
    Output: (3, 5)
Changing an element of z, however, also changes x:
    z[0,0] = 1234
    x
    Output: array([[1234,    2,    3,    4,    5],
                   [   6,    7,    8,    9,   10],
                   [  11,   12,   13,   14,   15]])
Thus reshaping a view does not affect the original data, but changing the values of a view does.

Creating a copy of the data
Now let us create z as a copy of x:
    z = x.copy()
    z is x
    Output: False
Changing a value in z makes no alteration to x:
    z[0,0] = 9999
    x
    Output: array([[1234,    2,    3,    4,    5],
                   [   6,    7,    8,    9,   10],
                   [  11,   12,   13,   14,   15]])
pandas may sometimes give a 'setting with copy' warning because it cannot tell whether a new dataframe or array, created as a subset of another, is a view or a copy. In such situations the user needs to make the copy or view explicit, otherwise the results may not be what was intended.

Exercises : Numpy
1. How to extract even numbers from an array?
    arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Desired output: array([0, 2, 4, 6, 8])
Solution:
    arr[arr % 2 == 0]
2. How to find the positions where elements of x and y are the same?
    x = np.array([5,6,7,8,3,4])
    y = np.array([5,3,4,5,2,4])
Desired output: (array([0, 5]),)
Solution:
    np.where(x == y)
3. How to standardize values so that they lie between 0 and 1?
    k = np.array([5,3,4,5,2,4])
Hint: (k - min(k)) / (max(k) - min(k))
Solution:
    kmax, kmin = k.max(), k.min()
    k_new = (k - kmin)/(kmax - kmin)
4. How to calculate the percentile scores of an array?
    p = np.array([15,10, 3,2,5,6,4])
Solution:
    np.percentile(p, q=[5, 95])
5. Print the number of missing values in an array:
    p = np.array([5,10, np.nan, 3, 2, 5, 6, np.nan])
Solution:
    print("Number of missing values =", np.isnan(p).sum())

About the Author: Deepanshu founded ListenData with a simple objective: make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.
Let's Get Connected: LinkedIn

Low Kian Seong: The Human in DevOps
Friday, 19 April 2019

What was significant this week?

This week a mild epiphany came to me right after a somewhat heated and tense meeting with a team of developers and the project owner of a web project. They were angry, and they were not afraid to show it. They were miffed that the head had written them an email pretty much forcing them to participate to make our DevOps initiative a success. All kinds of expletives ran through my head to describe the team of flabby, tired-looking individuals in front of me, which belied the cool demeanour and composure I was trying so hard to maintain.

It happened. In the spur of the moment I too got engulfed in a sea of negativity and for a few minutes lost sight of the most important component, or pillar, of a successful DevOps initiative: the people. "What a bunch of mule heads!" I thought. It's as plain as day: once this initiative is a success everybody can go home earlier, everything will be more predictable, and we can do much, much more than we could before. "Why are you fighting this?!" I was ready to throw my hands up in defeat when it finally dawned on me.

"Code that powers DevOps projects doesn't write itself. People write that code." "Without people powering our initiative, we are just a few guys with a bunch of code and tools that are irrelevant."

Boom! These thoughts hit me like lightning, and in that moment I felt an equal measure of wisdom brought by this realisation and disgust at my stupidity for forgetting one of the main tenets and requirements of making the dream of a successful DevOps project come true.

It was then I realised two very important mistakes I had made so far:

First, I was reaching out horizontally to push our agenda across. Developers loved what we proposed, and that was pretty much it. It's cool and it's cutting edge. It stopped there. "Hey, thanks for sharing that cool tool! I will try it in my project when I get the chance!" is pretty much the most you can expect from such an exchange. For you to gain any traction, you have got to sell your proposed solution or improvement to the stakeholders or the decision makers. Efforts that rely on people doing the right thing, or going out of their way to do some unplanned kindness or rightness, usually result in zilch.

Second, I did not try to see the tool I was proposing through the eyes of the beholders. It was too much of a leap. Much like how, as Abraham says, it's impossible to frog-leap from sadness to happiness, so it was for the developers. They knew it was good for them, they could see it was good for them, they felt it had the potential to improve their lives, but alas, they did not internalise it. The proverbial light bulb did not turn on inside them; more correctly said, I did not do enough to turn that light on. I could see some people opening up, but when this realisation hit me, I just ended the meeting. I had not done enough to understand where the people I hoped would implement DevOps actually were. I had to do that first.

Do I miss coding? Do I miss hunkering down and prototyping my way to showcase a tool or to get something to work? Of course! Who wouldn't? But the main thing I keep going back to is: what is the main goal and expectation of the people who hired me to lead their DevOps push? Is it to wire together some tools and configure something so they can use it? At a small enough scale that is probably value enough, but when you lead horses to water you need to give them a reason to drink; just because you are drinking, you can't expect them to follow suit.

I am going to reach out more, I am going to understand more, and I am going to engage more. All the people pieces need to be in place before the other pieces start falling into place automatically. Stay tuned if this is interesting...

ListenData: Python for Data Science : Learn in 3 Days
Friday, 19 April 2019

This tutorial helps you learn Data Science with Python through examples. Python is an open source language, widely used as a high-level programming language for general-purpose programming, and it has gained high popularity in the data science world. As the data science domain rises, IBM recently predicted that demand for data science professionals would grow by more than 25% by 2020. In the PyPL Popularity of Programming Language index, Python ranks second with a 14 percent share. In the advanced analytics and predictive analytics market, it is ranked among the top 3 programming languages for advanced analytics.

Data Science with Python Tutorial

Table of Contents
Getting Started with Python
- Python 2.7 vs. 3.6
- Python for Data Science : Introduction
- How to install Python?
- Spyder Shortcut keys
- Basic programs in Python
- Comparison, Logical and Assignment Operators
Data Structures and Conditional Statements
- Python Data Structures
- Python Conditional Statements
Python Libraries
- List of popular packages (comparison with R)
- Popular python commands
- How to import a package
Data Manipulation using Pandas
- Pandas Data Structures - Series and DataFrame
- Important Pandas Functions (vs. R functions)
- Examples - Data analysis with Pandas
Data Science with Python
- Logistic Regression
- Decision Tree
- Random Forest
- Grid Search - Hyper Parameter Tuning
- Cross Validation
- Preprocessing Steps

Python 2.7 vs 3.6
Google yields thousands of articles on this topic. Some bloggers oppose 2.7 and some favour it. If you filter your search to only recent articles (late 2016 onwards), you will see that the majority of bloggers favour Python 3.6. Consider the following reasons to support Python 3.6:
1. The official end date for Python 2.7 is the year 2020, after which there will be no community support. It does not make sense to start learning 2.7 today.
2. Python 3.6 supports 95% of the top 360 python packages and almost 100% of the top packages for data science.

What's new in Python 3.6
It is cleaner and faster. It is a language for the future. It fixed major issues with the Python 2 series. Python 3 was first released in 2008, and robust versions of the 3 series have been released for 9 years since.

Key Takeaway
You should go for Python 3.6. In terms of learning, there are no major differences between Python 2.7 and 3.6, and it is not too difficult to move between the two with a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science : Introduction
Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development etc. It is equally popular in the data science world. In the advanced analytics world there have been several debates on R vs. Python. In some areas, such as the number of libraries for statistical analysis, R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of data scientists.

There are several reasons to learn Python. Some of them are as follows:
- Python works well for automating the various steps of a predictive model.
- Python has robust libraries for machine learning, natural language processing, deep learning, big data and artificial intelligence.
- Python wins over R when it comes to deploying machine learning models in production.
- It can be easily integrated with big data frameworks such as Spark and Hadoop.
- Python has a great online community.

Do you know these sites are developed in Python?
YouTube, Instagram, Reddit, Dropbox, Disqus

How to Install Python
There are two ways to download and install Python:
1. Download Anaconda. It comes with the Python software along with pre-installed popular libraries.
2. Download Python from its official website.
With the second option you have to install libraries manually.
Recommended: go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

Coding Environments
Anaconda comes with two popular IDEs:
1. Spyder
2. Jupyter (IPython) Notebook

Spyder. It is like RStudio for Python. It gives an environment where writing python code is user-friendly. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs and a console to check each and every line of code. Under the 'Variable explorer' you can access your created data files and functions. I highly recommend Spyder!

Jupyter (IPython) Notebook. Jupyter is the equivalent of markdown in R. It is useful when you need to present your work to others or create a step-by-step project report, as it can combine code, output, words and graphics.

Spyder Shortcut Keys
The following shortcut keys make you more productive in Spyder:
- Press F5 to run the entire script
- Press F9 to run the selection or line
- Press Ctrl + 1 to comment / uncomment
- Go to the front of a function and press Ctrl + I to see the documentation of the function
- Run %reset -f to clean the workspace
- Ctrl + left-click on an object to see its source code
- Ctrl + Enter executes the current cell
- Shift + Enter executes the current cell and advances the cursor to the next cell

List of arithmetic operators with examples:
    Operator   Operation             Example
    +          Addition              10 + 2 = 12
    -          Subtraction           10 - 2 = 8
    *          Multiplication        10 * 2 = 20
    /          Division              10 / 2 = 5.0
    %          Modulus (remainder)   10 % 3 = 1
    **         Power                 10 ** 2 = 100
    //         Floor division        17 // 3 = 5
    (x + (d-1)) // d   Ceiling       (17 + (3-1)) // 3 = 6

Basic Programs

Example 1
    # Basics
    x = 10
    y = 3
    print("10 divided by 3 is", x/y)
    print("remainder after 10 divided by 3 is", x%y)
Result:
    10 divided by 3 is 3.3333333333333335
    remainder after 10 divided by 3 is 1

Example 2
    x = 100
    x > 80 and x <= 95
    Output: False
    x > 35 or x < 60
    Output: True

Comparison & logical operators:
    Operator   Description                     Example
    >          Greater than                    5 > 3 returns True
    <          Less than                       5 < 3 returns False
    >=         Greater than or equal to        5 >= 3 returns True
    <=         Less than or equal to           5 <= 3 returns False
    ==         Equal to                        5 == 3 returns False
    !=         Not equal to                    5 != 3 returns True
    and        Both conditions must hold       x > 18 and x <= 35
    or         At least one condition holds    x > 35 or x < 60
    not        Opposite of condition           not(x > 7)

Assignment Operators
An assignment operator assigns a value to the declared variable. For example, x += 25 means x = x + 25.
    x = 100
    y = 10
    x += y
    print(x)
    Output: 110
In this case x += y implies x = x + y, which is x = 100 + 10. Similarly, you can use x -= y, x *= y and x /= y.

Python Data Structures
In every programming language it is important to understand the data structures. Following are some data structures used in Python.

1. List
A list is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples below: the first is an integer list, the second a string list, and the third a mixed list containing integer, string and float values.
    x = [1, 2, 3, 4, 5]
    y = ['A', 'O', 'G', 'M']
    z = ['A', 4, 5.1, 'M']

Get List Items
We can extract list items using indexes. An index starts from 0 and ends at (number of elements - 1).
    x = [1, 2, 3, 4, 5]
    x[0]
    Output: 1
    x[1]
    Output: 2
    x[4]
    Output: 5
    x[-1]
    Output: 5
    x[-2]
    Output: 4
x[0] picks the first element of the list. A negative sign tells Python to search the list from right to left, so x[-1] selects the last element. You can also select multiple elements by slicing: x[:3] returns [1, 2, 3].

2. Tuple
A tuple is similar to a list in that it is a sequence of elements. The differences between list and tuple are as follows:
- A tuple cannot be changed once constructed, whereas a list can be modified.
- A tuple is created by placing comma-separated values inside parentheses ( ).
Whereas, a list is created inside square brackets [ ].

Examples
K = (1,2,3)
State = ('Delhi','Maharashtra','Karnataka')

Perform a for loop on a tuple :
for i in State:
    print(i)
Delhi
Maharashtra
Karnataka

Detailed Tutorial : Python Data Structures

Functions

Like print(), you can create your own custom function, also called a user-defined function. Functions help you automate repetitive tasks and call reusable code in an easier way.

Rules to define a function :
- A function starts with the def keyword followed by the function name and ( )
- The function body starts after a colon (:) and is indented
- The keyword return ends a function and gives the value of the previous expression

def sum_fun(a, b):
    result = a + b
    return result

z = sum_fun(10, 15)
Result : z = 25

Suppose you want Python to assume 0 as the default value if no value is specified for parameter b.
def sum_fun(a, b=0):
    result = a + b
    return result

z = sum_fun(10)
In the above function, b is set to 0 if no value is provided for it. This does not mean that no value other than 0 can be passed; it can still be called as z = sum_fun(10, 15).

Conditional Statements (if else)

Conditional statements are commonly used in coding. They are IF ELSE statements and can be read as : "if a condition holds true, then execute something; else execute something else".
Note : The if and else statements end with a colon (:).

Example
k = 27
if k%5 == 0:
    print('Multiple of 5')
else:
    print('Not a Multiple of 5')
Result : Not a Multiple of 5

Popular python packages for Data Analysis & Visualization

Some of the leading packages in Python, along with equivalent libraries in R, are as follows :
pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line.
It is roughly comparable to the Rcpp package in R, which works around the limitation of slow speed in R. See the Numpy Tutorial.
SciPy. For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on NumPy.
Scikit-learn. A collection of machine learning algorithms. It is built on NumPy and SciPy. It can perform all the techniques that can be done in R using the glm, knn, randomForest, rpart and e1071 packages.
Matplotlib. For data visualization. It's a leading package for graphics in Python. It is equivalent to the ggplot2 package in R.
Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, impute missing values, run statistical tests and export table output to HTML format.
pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come pre-installed with Anaconda / Spyder.

Comparison of Python and R Packages by Data Mining Task

 Task                  Python Package               R Package
 IDE                   Rodeo / Spyder               RStudio
 Data Manipulation     pandas                       dplyr and reshape2
 Machine Learning      Scikit-learn                 glm, knn, randomForest, rpart, e1071
 Data Visualization    ggplot + seaborn + bokeh     ggplot2
 Character Functions   Built-In Functions           stringr
 Reproducibility       Jupyter                      knitr
 SQL Queries           pandasql                     sqldf
 Working with Dates    datetime                     lubridate
 Web Scraping          beautifulsoup                rvest

Popular Python Commands

The commands below would help you install and update new and existing packages. Let's say you want to install / uninstall the pandas package. Run these commands from the IPython console window. Don't forget to add !
before pip; otherwise it would return a syntax error.

Install Package
!pip install pandas

Uninstall Package
!pip uninstall pandas

Show Information about Installed Package
!pip show pandas

List of Installed Packages
!pip list

Upgrade a package
!pip install --upgrade pandas

How to import a package

There are multiple ways to import a package in Python. It is important to understand the difference between these styles.
1. import pandas as pd — imports the package pandas under the alias pd. A function DataFrame in the pandas package is then submitted as pd.DataFrame.
2. import pandas — imports the package without an alias; here the function DataFrame is submitted with the full package name, pandas.DataFrame.
3. from pandas import * — imports the whole package, and the function DataFrame is executed simply by typing DataFrame. It sometimes creates confusion when the same function name exists in more than one package.

Pandas Data Structures : Series and DataFrame

In the pandas package, there are two data structures - Series and DataFrame. These structures are explained below in detail.

1. Series
A Series is a one-dimensional array. You can access individual elements of a Series using their position. It's similar to a vector in R. In the example below, we are generating 5 random values.
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(5))
s1
0    2.412015
1    0.451752
2    1.174207
3    0.766348
4    0.361815
dtype: float64

Extract first and second value
You can get a particular element of a series using its index value. See the examples below :
s1[0]
2.412015
s1[1]
0.451752
s1[:3]
0    2.412015
1    0.451752
2    1.174207

2. DataFrame
It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different data types such as characters, integers, floating point values and factors. Those who are well-conversant with MS Excel can think of a data frame as an Excel spreadsheet.

Comparison of Data Types in Python and Pandas

The following table shows how Python and the pandas package store data.
 Data Type                               Pandas       Standard Python
 For character variable                  object       string
 For categorical variable                category     -
 For numeric variable without decimals   int64        int
 Numeric characters with decimals        float64      float
 For date time variables                 datetime64   -

Important Pandas Functions

The table below shows a comparison of pandas functions with R functions for various data wrangling and manipulation tasks. It would help you to memorize pandas functions. It's very handy information for programmers who are new to Python. It includes solutions for most of the frequently used data exploration tasks.

 Task                                 R                          Python (pandas package)
 Installing a package                 install.packages('name')   !pip install name
 Loading a package                    library(name)              import name as other_name
 Checking working directory           getwd()                    import os; os.getcwd()
 Setting working directory            setwd()                    os.chdir()
 List files in a directory            dir()                      os.listdir()
 Remove an object                     rm('name')                 del object
 Select Variables                     select(df, x1, x2)         df[['x1', 'x2']]
 Drop Variables                       select(df, -(x1:x2))       df.drop(['x1', 'x2'], axis = 1)
 Filter Data                          filter(df, x1 >= 100)      df.query('x1 >= 100')
 Structure of a DataFrame             str(df)                    df.info()
 Summarize dataframe                  summary(df)                df.describe()
 Get row names of dataframe "df"      rownames(df)               df.index
 Get column names                     colnames(df)               df.columns
 View Top N rows                      head(df, N)                df.head(N)
 View Bottom N rows                   tail(df, N)                df.tail(N)
 Get dimension of data frame          dim(df)                    df.shape
 Get number of rows                   nrow(df)                   df.shape[0]
 Get number of columns                ncol(df)                   df.shape[1]
 Length of data frame                 length(df)                 len(df)
 Get random 3 rows from dataframe     sample_n(df, 3)            df.sample(n=3)
 Get random 10% rows                  sample_frac(df, 0.1)       df.sample(frac=0.1)
 Check Missing Values                 is.na(df$x)                pd.isnull(df.x)
 Sorting                              arrange(df, x1, x2)        df.sort_values(['x1', 'x2'])
 Rename Variables                     rename(df, newvar = x1)    df.rename(columns={'x1': 'newvar'})

Data Manipulation with pandas - Examples

1. Import Required Packages
You can import required packages using the import statement.
In the syntax below, we are asking Python to import the numpy and pandas packages. The 'as' keyword is used to alias the package name.
import numpy as np
import pandas as pd

2. Build DataFrame
We can build a dataframe using the DataFrame() function of the pandas package.
mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
          'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
          'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
df = pd.DataFrame(mydata)
In this dataframe, we have three variables - productcode, sales, cost.
[Sample DataFrame]

To import data from a CSV file
You can use the read_csv() function from the pandas package to get data into Python from a CSV file.
mydata = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
Make sure you use double backslashes when specifying the path of the CSV file. Alternatively, you can use forward slashes in the file path inside the read_csv() function.
Detailed Tutorial : Import Data in Python

3. To see number of rows and columns
You can run the command below to find out the number of rows and columns.
df.shape
Result : (6, 3). It means 6 rows and 3 columns.

4. To view first 3 rows
The df.head(N) function can be used to check out the first N rows.
df.head(3)
     cost  productcode   sales
0  1020.0           AA  1010.0
1  1625.2           AA  1025.2
2  1204.0           AA  1404.2

5. Select or Drop Variables
To keep a single variable, you can use any of the following three methods :
df.productcode
df["productcode"]
df.loc[: , "productcode"]
To select a variable by column position, you can use df.iloc. In the example below, we are selecting the second column. The column index starts from 0, hence 1 refers to the second column.
df.iloc[: , 1]
We can keep multiple variables by specifying the desired variables inside [ ]. Also, we can make use of df.loc.
df[["productcode", "cost"]]
df.loc[ : , ["productcode", "cost"]]

Drop Variable
We can remove variables using the df.drop() function. See the example below :
df2 = df.drop(['sales'], axis = 1)

6. To summarize data frame
To summarize or explore data, you can submit the command below.
df.describe()
              cost        sales
count     6.000000     6.000000
mean   1166.150000  1242.650000
std     237.926793   230.466690
min    1003.700000  1010.000000
25%    1020.000000  1058.900000
50%    1072.000000  1205.850000
75%    1184.000000  1366.075000
max    1625.200000  1604.800000

To summarize all the character variables, you can use the following script.
df.describe(include=['O'])
Similarly, you can use df.describe(include=['float64']) to view a summary of all the numeric variables with decimals.

To select only a particular variable, you can write the following code :
df.productcode.describe()
OR
df["productcode"].describe()
count      6
unique     2
top       BB
freq       3
Name: productcode, dtype: object

7. To calculate summary statistics
We can manually find summary statistics such as count, mean and median by using the commands below.
df.sales.mean()
df.sales.median()
df.sales.count()
df.sales.min()
df.sales.max()

8. Filter Data
Suppose you are asked to apply the condition - productcode is equal to "AA" and sales greater than or equal to 1250.
df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
It can also be written like :
df1 = df.query('(productcode == "AA") & (sales >= 1250)')
In the second query, we do not need to specify the DataFrame along with the variable name.

9. Sort Data
In the code below, we arrange the data in ascending order by sales.
df.sort_values(['sales'])

10. Group By : Summary by Grouping Variable
Like SQL GROUP BY, you want to summarize a continuous variable by a classification variable. In this case, we are calculating average sales and cost by product code.
df.groupby(df.productcode).mean()
                    cost        sales
productcode
AA           1283.066667  1146.466667
BB           1049.233333  1338.833333
Instead of summarizing multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
df["sales"].groupby(df.productcode).mean()

11.
Define Categorical Variable
Let's create a classification variable - id - which contains only 3 unique values : 1/2/3.
df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
Let's define it as a categorical variable. We can use the astype() function to make id a categorical variable.
df0.id = df0["id"].astype('category')
Summarize this classification variable to check its descriptive statistics.
df0.describe()
        id
count    7
unique   3
top      2
freq     3

Frequency Distribution
You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
df['productcode'].value_counts()
BB    3
AA    3

12. Generate Histogram
A histogram is one of the methods to check the distribution of a continuous variable. In the figure shown below, there are two values for the variable 'sales' in the range 1000-1100. In the remaining intervals, there is only a single value. If you have a large dataset, you can plot a histogram to identify outliers in a continuous variable.
df['sales'].hist()
[Histogram]

13. BoxPlot
A boxplot is a method to visualize a continuous or numeric variable. It shows the minimum, Q1, Q2 (median), Q3, IQR and maximum value in a single graph.
df.boxplot(column='sales')
[BoxPlot]
Detailed Tutorial : Data Analysis with Pandas Tutorial

Data Science using Python - Examples

In this section, we cover how to perform data mining and machine learning algorithms with Python. sklearn is the most frequently used library for running data mining and machine learning algorithms. We will also cover the statsmodels library for regression techniques. The statsmodels library generates formatted output which can be used further in project reports and presentations.

1. Install the required libraries
Import the following libraries before reading or exploring data.
#Import required libraries
import pandas as pd
import statsmodels.api as sm
import numpy as np

2.
Download and import data into Python
With the use of Python libraries, we can easily get data from the web into Python.
# Read data from web
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

 Variable   Type         Description
 gre        Continuous   Graduate Record Exam score
 gpa        Continuous   Grade Point Average
 rank       Categorical  Prestige of the undergraduate institution
 admit      Binary       Admission into graduate school

The binary variable admit is the target variable.

3. Explore Data
Let's explore the data. We'll answer the following questions :
- How many rows and columns are in the data file?
- What are the distributions of the variables?
- Check if there are any outliers
- If there are outliers, treat them
- Check if there are any missing values
- Impute missing values (if any)

# See no. of rows and columns
df.shape
Result : 400 rows and 4 columns

In the code below, we rename the variable rank to 'position', as rank is already a function in Python.
# rename rank column
df = df.rename(columns={'rank': 'position'})

Summarize and plot all the columns.
# Summarize
df.describe()
# plot all of the columns
df.hist()

Categorical Variable Analysis
It is important to check the frequency distribution of a categorical variable. It helps to answer the question of whether the data is skewed.
# Summarize
df.position.value_counts(ascending=True)
1     61
4     67
3    121
2    151

Generating Crosstab
By looking at the cross tabulation report, we can check whether we have enough events against each unique value of the categorical variable.
pd.crosstab(df['admit'], df['position'])
position   1   2   3   4
admit
0         28  97  93  55
1         33  54  28  12

Number of Missing Values
We can write a simple loop to figure out the number of blank values in every variable in the dataset.
for i in list(df.columns):
    k = sum(pd.isnull(df[i]))
    print(i, k)
In this case, there are no missing values in the dataset.

4. Logistic Regression Model
Logistic regression is a special type of regression where the target variable is categorical in nature and the independent variables can be discrete or continuous.
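The missing-value loop above can be tried on a small hand-made frame (the frame below is hypothetical, since the UCLA file has no blanks); pandas and numpy are assumed to be installed:

```python
import numpy as np
import pandas as pd

# A small hypothetical frame with one blank value in 'gre'
df_demo = pd.DataFrame({'gre': [380.0, 660.0, np.nan, 640.0],
                        'gpa': [3.61, 3.67, 4.00, 3.19]})

# The loop from the tutorial: count blanks per column
for i in list(df_demo.columns):
    k = sum(pd.isnull(df_demo[i]))
    print(i, k)          # gre 1, gpa 0

# A pandas one-liner that produces the same counts
print(df_demo.isnull().sum())
```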
In this post, we will demonstrate only binary logistic regression, which takes only binary values in the target variable. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

In Python, we can write an R-style model formula y ~ x1 + x2 + x3 using the patsy and statsmodels libraries. In the formula, we need to define the variable 'position' as a categorical variable by mentioning it inside capital C(). You can also define the reference category using the reference= option.
#Reference Category
from patsy import dmatrices, Treatment
y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')

It returns two datasets - X and y. The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), the dummy variables for position, gre and gpa. Since 4 is set as the reference category, all three dummy variables are 0 for position 4. See the sample below :

 position  P_1  P_2  P_3
 3          0    0    1
 3          0    0    1
 1          1    0    0
 4          0    0    0
 4          0    0    0
 2          0    1    0

Split Data into two parts
80% of the data goes to the training dataset, which is used for building the model, and 20% goes to the test dataset, which is used for validating the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Build Logistic Regression Model
By default, regression without the formula style does not include an intercept. To include it, we have already added the Intercept column in X_train, which is used as a predictor.
#Fit Logit model
logit = sm.Logit(y_train, X_train)
result = logit.fit()
#Summary of Logistic regression model
result.summary()
result.params

                          Logit Regression Results
==============================================================================
Dep. Variable:     admit              No. Observations:     320
Model:             Logit              Df Residuals:         315
Method:            MLE                Df Model:             4
Date:              Sat, 20 May 2017   Pseudo R-squ.:        0.03399
Time:              19:57:24           Log-Likelihood:       -193.49
converged:         True               LL-Null:              -200.30
                                      LLR p-value:          0.008627
=======================================================================================
                       coef    std err        z     P>|z|    [95.0% Conf. Int.]
C(position)[T.1]     1.4933      0.440    3.392    0.001      0.630     2.356
C(position)[T.2]     0.6771      0.373    1.813    0.070     -0.055     1.409
C(position)[T.3]     0.1071      0.410    0.261    0.794     -0.696     0.910
gre                  0.0005      0.001    0.442    0.659     -0.002     0.003
gpa                 -0.4613      0.214   -2.152    0.031     -0.881    -0.041
=======================================================================================

Confusion Matrix and Odds Ratio
The odds ratio is the exponential of the parameter estimates.
#Confusion Matrix
result.pred_table()
#Odds Ratio
np.exp(result.params)

Prediction on Test Data
In this step, we take the estimates of the logit model built on the training data and apply them to the test data.
#prediction on test data
y_pred = result.predict(X_test)

Calculate Area under Curve (ROC)
# AUC on test data
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc(false_positive_rate, true_positive_rate)
Result : AUC = 0.6763

Calculate Accuracy Score
from sklearn.metrics import accuracy_score
accuracy_score([ 1 if p > 0.5 else 0 for p in y_pred ], y_test)

Decision Tree Model
Decision trees can have a continuous or categorical target variable. When it is continuous, the tree is called a regression tree; when it is categorical, it is called a classification tree. The algorithm selects, at each step, the variable that best splits the set of values. There are several criteria to find the best split; some of them are Gini, entropy and chi-square. There are several advantages of decision trees. They are simple to use and easy to understand. They require very few data preparation steps. They can handle mixed data - both categorical and continuous variables.
In terms of speed, it is a very fast algorithm.
#Drop Intercept from predictors for tree algorithms
X_train = X_train.drop(['Intercept'], axis = 1)
X_test = X_test.drop(['Intercept'], axis = 1)

#Decision Tree
from sklearn.tree import DecisionTreeClassifier
model_tree = DecisionTreeClassifier(max_depth=7)

#Fit the model:
model_tree.fit(X_train, y_train)

#Make predictions on test set
predictions_tree = model_tree.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
auc(false_positive_rate, true_positive_rate)
Result : AUC = 0.664

Important Note
Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select the best parameters by using the grid search fine-tuning technique.

Random Forest Model
A decision tree has the limitation of overfitting, which implies it does not generalize patterns. It is very sensitive to a small change in the training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomised data. It selects a random subset of variables to grow each tree. It is a more robust algorithm than a decision tree. It is one of the most popular machine learning algorithms. It is commonly used in data science competitions and is always ranked among the top 5 algorithms.
It has become a part of every data science toolkit.
#Random Forest
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

#Fit the model:
target = y_train['admit']
model_rf.fit(X_train, target)

#Make predictions on test set
predictions_rf = model_rf.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)

#Variable Importance
importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances)
importances.plot.bar()
Result : AUC = 0.6974

Grid Search - Hyperparameter Tuning
The sklearn library makes hyperparameter tuning very easy. It is a strategy to select the best parameters for an algorithm. In scikit-learn, they are passed as arguments to the constructor of the estimator classes; for example, max_features for a random forest or alpha for lasso.
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier()
target = y_train['admit']
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 3, 4]}
CV_rfc = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='roc_auc')
CV_rfc.fit(X_train, target)

#Parameters with Scores (grid_scores_ in older scikit-learn versions)
CV_rfc.cv_results_

#Best Parameters
CV_rfc.best_params_
CV_rfc.best_estimator_

#Make predictions on test set
predictions_rf = CV_rfc.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)

Cross Validation
# Cross Validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
target = y['admit']
prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
#AUC
cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')

Data Mining : Pre-Processing Steps

1.
The machine learning package sklearn requires all categorical variables in numeric form. Hence, we need to convert all character/categorical variables to numeric. This can be accomplished using the following script; in sklearn, there is already a function (LabelEncoder) for this step.
from sklearn.preprocessing import LabelEncoder
def ConverttoNumeric(df):
    cols = list(df.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    for i in cols:
        try:
            df[i] = le.fit_transform(df[i])
        except:
            print('Error in Variable : ' + i)
    return df

ConverttoNumeric(df)
[Encoding]

2. Create Dummy Variables
Suppose you want to convert categorical variables into dummy variables. It is different from the previous example as it creates dummy variables instead of converting the variable to numeric form.
productcode_dummy = pd.get_dummies(df["productcode"])
df2 = pd.concat([df, productcode_dummy], axis=1)
The output looks like below :
   AA  BB
0   1   0
1   1   0
2   1   0
3   0   1
4   0   1
5   0   1

Create k-1 Categories
To avoid multicollinearity, you can set one of the categories as the reference category and leave it out while creating dummy variables. In the script below, we are leaving out the first category.
productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
df2 = pd.concat([df, productcode_dummy], axis=1)

3. Impute Missing Values
Imputing missing values is an important step of predictive modeling. In many algorithms, if missing values are not filled, the complete row is removed. If data contains a lot of missing values, this can lead to huge data loss. There are multiple ways to impute missing values; some of the common techniques replace a missing value with the mean, median or zero. It makes sense to replace a missing value with 0 when 0 carries meaning.
For example, whether a customer holds a credit card product.

Fill missing values of a particular variable
# fill missing values with 0
df['var1'] = df['var1'].fillna(0)
# fill missing values with mean
df['var1'] = df['var1'].fillna(df['var1'].mean())

Apply imputation to the whole dataset
from sklearn.preprocessing import Imputer
# Set an imputer object
mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# Train the imputer
mean_imputer = mean_imputer.fit(df)
# Apply imputation
df_new = mean_imputer.transform(df.values)
(In newer versions of scikit-learn, Imputer has been replaced by sklearn.impute.SimpleImputer.)

4. Outlier Treatment
There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows :
- Cap extreme values at the 95th / 99th percentile, depending on the distribution
- Apply a log transformation of the variable. See below the implementation of log transformation in Python.
import numpy as np
df['var1'] = np.log(df['var1'])

5. Standardization
In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of transforming a variable to have mean zero and unit variance (standard deviation of one).
#load dataset
from sklearn.datasets import load_boston
dataset = load_boston()
predictors = dataset.data
target = dataset.target
df = pd.DataFrame(predictors, columns = dataset.feature_names)

#Apply Standardization
from sklearn.preprocessing import StandardScaler
k = StandardScaler()
df2 = k.fit_transform(df)

Next Steps
Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites, try to play around with the data and generate insights from it with the pandas package, and build statistical models using the sklearn package. I hope you found this tutorial helpful. I tried to cover all the important topics which a beginner must know about Python. Once you complete this tutorial, you can flaunt that you know how to program in Python and can implement machine learning algorithms using the sklearn package.

About Author: Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow.
He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.Let's Get Connected: LinkedIn
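As a closing exercise, the dummy-variable and standardization steps from the pre-processing section above can be replayed on the small productcode frame used earlier; pandas and numpy are assumed to be installed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
                   'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8]})

# k-1 dummy variables: the first category (AA) becomes the reference
dummies = pd.get_dummies(df['productcode'], prefix='pcode', drop_first=True)
print(dummies.columns.tolist())   # ['pcode_BB']

# Standardize sales by hand: mean zero, unit variance
z = (df['sales'] - df['sales'].mean()) / df['sales'].std()
print(z.mean())   # ~0
print(z.std())    # 1.0
```

The manual z-score matches what StandardScaler does (up to the population vs. sample standard-deviation convention), which makes it a useful sanity check.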

ListenData: Install Python Package
Friday, 19 April 2019

Python is one of the most popular programming languages for data science and analytics. It is widely used for a variety of tasks in startups and many multinational organizations. The beauty of this programming language is that it is open-source, which means it is available for free and has a very active community of developers across the world. Python developers share their solutions in the form of packages or modules with other Python users. This tutorial explains various ways to install a Python package.

Ways to Install Python Package

Method 1 : If Anaconda is already installed on your System
Anaconda is the data science platform which comes with pre-installed popular Python packages and a powerful IDE (Spyder) which has a user-friendly interface to ease the writing of Python programming scripts.
If Anaconda is installed on your system (laptop), click on Anaconda Prompt as shown in the image below.
[Anaconda Prompt]
To install a Python package or module, enter the code below in Anaconda Prompt :
pip install packagename
[Install Python Package using PIP Windows]

Method 2 : No Need of Anaconda
1. Open the RUN box using the shortcut Windows Key + R
2. Enter cmd in the RUN box. Once you press OK, it will show the command prompt screen.
[Command Prompt]
3. Search for the folder named Scripts where pip applications are stored.
[Scripts Folder]
4. In the command prompt, type cd <file location of Scripts folder>. cd refers to change directory.
For example, if the folder location is C:\Users\DELL\Python37\Scripts, you need to enter the following line in the command prompt :
cd C:\Users\DELL\Python37\Scripts
[Change Directory]
5. Type pip install packagename
[Install Package via PIP command prompt]

Method 3 : Install Python Package from IPython console
Make sure to use ! before pip when you enter the command below in the IPython console window. Otherwise it would return a syntax error.
!pip install package_name
The !
prefix tells Python to run a shell command.

Syntax Error : Installing Package using PIP
Some users face the error "SyntaxError: invalid syntax" when installing packages. To work around this issue, run the command line below in the command prompt :
python -m pip install packagename
python -m pip tells Python to import the pip module for you, then run it as a script.

Install Specific Versions of Python Package
python -m pip install Packagename==1.3      # specific version
python -m pip install "Packagename>=1.3"    # version greater than or equal to 1.3

How to load or import a package or module
Once a package is installed, the next step is to make the package usable; in other words, it is required to import the package once installed. There are several ways to load a package or module in Python :
1. import math loads the module math. Then you can use any function defined in the math module using math.function. Refer to the example below :
import math
math.sqrt(4)
2. from math import * loads the module math. Now we don't need to specify the module to use its functions.
from math import *
sqrt(4)
3. from math import sqrt, cos imports the selected functions of the module math.
4. import math as m imports the math module under the alias m.
m.sqrt(4)

Other Useful Commands
 Description                           Command
 To uninstall a package                pip uninstall package
 To upgrade a package                  pip install --upgrade package
 To search a package                   pip search "packagename"
 To check all the installed packages   pip list

Codementor: Why Django Is The Popular Python Framework Among Web Developers?
Friday, 19 April 2019

Web development with Python's Django framework offers many advantages, even for small projects: better security, less effort and a smaller investment of money in a project.

Vasudev Ram: Python's dynamic nature: sticking an attribute onto an object
Friday, 19 April 2019

By Vasudev Ram - Online Python training / SQL training / Linux training

Hi, readers,

[This is a beginner-level Python post.]

Python, being a dynamic language, has some interesting features that some static languages may not have (and vice versa too, of course). One such feature, which I noticed a while ago, is that you can add an attribute to a Python object even after it has been created. (Conditions apply.) I had used this feature some time ago to work around some implementation issue in a rudimentary RESTful server that I created as a small teaching project. It was based on the BaseHTTPServer module.

Here is a (different) simple example program, stick_attrs_onto_obj.py, that demonstrates this Python feature. My informal term for this feature is "sticking an attribute onto an object" after the object is created. Since the program is simple, and there are enough comments in the code, I will not explain it in detail.

# stick_attrs_onto_obj.py
# A program to show:
# 1) that you can "stick" attributes onto a Python object after it is created, and
# 2) one use of this technique, to count the number
# of calls to a function.
# Copyright 2019 Vasudev Ram
# Web site: https://vasudevram.github.io
# Blog: https://jugad2.blogspot.com
# Training: https://jugad2.blogspot.com/p/training.html
# Product store: https://gumroad.com/vasudevram
# Twitter: https://twitter.com/vasudevram

from __future__ import print_function

# Define a function.
def foo(arg):
    # Print something to show that the function has been called.
    print("in foo: arg = {}".format(arg))
    # Increment the "stuck-on" int attribute inside the function.
    foo.call_count += 1

# A function is also an object in Python.
# So we can add attributes to it, including after it is defined.
# I call this "sticking" an attribute onto the function object.
# The statement below defines the attribute with an initial value,
# which is changeable later, as we will see.
foo.call_count = 0

# Print its initial value before any calls to the function.
print("foo.call_count = {}".format(foo.call_count))

# Call the function a few times.
for i in range(5):
    foo(i)

# Print the attribute's value after those calls.
print("foo.call_count = {}".format(foo.call_count))

# Call the function a few more times.
for i in range(3):
    foo(i)

# Print the attribute's value after those additional calls.
print("foo.call_count = {}".format(foo.call_count))

And here is the output of the program:

$ python stick_attrs_onto_obj.py
foo.call_count = 0
in foo: arg = 0
in foo: arg = 1
in foo: arg = 2
in foo: arg = 3
in foo: arg = 4
foo.call_count = 5
in foo: arg = 0
in foo: arg = 1
in foo: arg = 2
foo.call_count = 8

There may be other ways to get the call count of a function, including using a profiler, and maybe by using a closure or decorator or other way. But this way is really simple. And as you can see from the code, it is also possible to use it to find the number of calls to the function between any two points in the program code. For that, we just have to store the call count in a variable at the first point, and subtract that value from the call count at the second point. In the above program, that would be 8 - 5 = 3, which matches the 3 that is the number of calls to function foo made by the 2nd for loop.

Enjoy.
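As the post notes, a closure or decorator can achieve the same counting; one possible decorator-based sketch (the names count_calls and wrapper are my own, not from the post):

```python
from functools import wraps

def count_calls(func):
    # Wrap func; the counter is still "stuck onto" a function object,
    # but the bookkeeping is factored out of the function body.
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.call_count += 1
        return func(*args, **kwargs)
    wrapper.call_count = 0
    return wrapper

@count_calls
def foo(arg):
    return arg

for i in range(5):
    foo(i)
print(foo.call_count)  # 5

for i in range(3):
    foo(i)
print(foo.call_count)  # 8
```

The advantage over the hand-rolled version is that any function can be made countable with a single @count_calls line, without touching its body.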