Python DataFrames

Creating a DataFrame:

import pandas as pd myDF = pd.DataFrame({"Name":["Bob","Jane","Tay"],"Age":[27,32,18],"Job":["Chef","HR","Retail"]})

- A dataframe stores data tables, including their values and (optional) labels for rows and columns

- It allows for convenient data access and manipulation

- The example above represents a table with 3 columns ("Name", "Age", and "Job"), and with 3 rows containing the specified values

- Passing a dictionary is just one way of initiate a DataFrame, you can also pass an NDArray for example

- You can add labels to the rows with the index attribute, for example myDF.index = ['row1', 'row2', 'row3']

Accessing Values:

myNum = myDF1["Age"][0] # Get the first value (index 0) in the "Age" column myCol = myDF1["Name"] # Get the "Name" column (as a Series) myDF2 = myDF1[["Name","Job"]] # Get both of the specified columns (as a new Dataframe) myDF2 = myDF1[1:3] # Get rows from index 1 up to (but not including) 3 myItm = myDF1.iloc[0,1] # Retrieve by row index and column index myDF2 = myDF1.loc["row1","Age"] # Get value at a row and column specified by name (or index if no name was specified) myDF2 = myDF1[myDF["Name"]=="Bob"] # Get DataFrame containing rows that meet a specified condition myArr = myDF1["Age"].unique() # Get the unique values (no repeats) in the specified column (as an ndarray) ...

- Simple access is similar to accessing data from a multidimensional array, but you can specify the column name and row name instead of the position indices

- You can specify the row by its name rather than index if you set the names previously

- A Series is like a list that stores a column or row of a table along with its name and data type

Removing and Replacing Values:

myDF = myDF.drop(columns=["Age"]) # Get new DataFrame without the specified columns myDF = myDF.dropna(axis=0) # Remove rows with NaN values (for preparing data) myDF = myDF.replace(1,0) # Replace 1 values with 0 myDF['C1'] = myDF['C1'].astype(int) # Convert boolean values in column 'C1' to 0 and 1 ...

- If you want to modify the current dataframe (rather than creating a new one), use the parameter inplace=True)

- You can replace any values with any other value of the same type

File Reading/Writing:

myDF1 = pd.read_csv('myFile1.csv', header=None) # Read from a comma separated value file myDF2 = pd.read_excel('myDF.xlsx') # Read from a spreadsheet file ... myDF1.to_csv('myFile3.csv') # Write to a file in comma separated value format myDF2.to_excel('myFile4.xlsx') # Write to a file as a spreadsheet ...

- Specify header=None only if the first line of the file doesn't have column labels

- Some other formats include json, sql, and hdf

- If the read data doesn't have column names, you can add them with df.columns = ['ColName1','ColName2',...]

Modification:

myDF = myDF.sort_values(['Name']) # Arrange rows by order of values in a specified column myDF['Age'] = myDF['Age']/12 # Divide all values in the column by 12 myDF['Age'] = myDF['Age'].transform('sqrt') # Take the square root of all values in the column myDF['NewCol'] = [23,11,56,47] # Add a new column called 'NewCol' with the values ...

- You can pass ascending=False as a parameter to the sort_values function to arrange them from highest to lowest values

- You can do other math operations on column values: +, -, *, ...

- The transform function supports custom lambda functions too

Combination:

myDF3 = pd.merge(myDF1, myDF2, on="Name") # Rows with the same "Name" value get combined into a new table myDF3 = pd.merge(myDF1, myDF2, on=["Name","Birthdate"]) # Select on multiple features myDF3 = pd.concat([myDF1, myDF2], ignore_index=True) # Get a concatenation of the specified dataframes ...

- You may want to select on multiple features if there are repeats of values in one of the features (e.g. 2 people named Bob, but with different birthdates)

- For a multiple-feature merge, both tables must have all the listed features (e.g. table 1 has "Name" and "Birthdate" features, and table 2 also has both)

- The result of the concat function will be a table that is the first table's rows with the second table's rows added below. For columns in one table that the other table doesn't have, NaN values will fill the empty cells

Analysis:

myDF.head(3) # Show the top 3 rows of the table myDF.tail(4) # Show the last 4 rows of the table myDF.info() # Get a summary of the DataFrame, including info such as data types and memory usage myAvg = myDF['Age'].mean() # Get the average of values in a column myCorr = myDF.corr(numeric_only=True) # Get contingency table showing correlation between columns with numeric features mySum = myDF['Hours'].sum(axis=0) # Get the total of all the values in the column 'Hours' myMin = myDF['Age'].min(axis=0) # Get the lowest value in the 'Age' column myMax = myDF['Age'].max(axis=0) # Get the highest value in the 'Age' column myShape = myDF.shape # Get the number of columns and rows myDup = myDF.duplicated() # Get column indicating whether each row is a duplicate of a previous row ...

- For functions that use the axis parameter, set it to 1 to apply it across the values in a row

- If you have a table with only numeric values, you can apply the sum/min/max function to the whole table (not only a column)

Looping Through Data:

# Iterate through rows for i,myRow in myDF.iterrows(): print(myRow['Age']) # Display the Age value in every row # Iterate through columns for myCol in myDF: print(myDF[myCol][0]) # Display the first value of every column

- If you want to get the whole rows or whole columns, don't specify the last brackets, and they will be returned as Series type

Plotting Data:

myDF1.plot.scatter(x="Age", y="Height") # Display a scatterplot between 2 columns myDF2.plot.line(x="Time", y="Height") # Display a line plot myDF3['City'].hist() # Display a histogram ...

- A scatter plot shows data points with features represented on the x and y axes

- A histogram counts the values for each category found in the data

Challenge

Create a dataframe that includes columns for age, weight, and height. Add 7 values for each column, age values should be 21, 22, or 23. Plot a histogram of the age values. Create a scatter plot of weight with height, and show correlation values (as contignency table) comparing the features. Display the minimum and maximum heights. Multiply the heights by a factor (as if to convert them to different units).

Completed