Python DataFrames
Creating a DataFrame:
- A DataFrame stores tabular data, including the values and (optional) labels for rows and columns
- It allows for convenient data access and manipulation
- For example, a dictionary with the keys "Name", "Age", and "Job", each mapped to a list of 3 values, represents a table with 3 columns and 3 rows containing the specified values
- Passing a dictionary is just one way to initialize a DataFrame; you can also pass a NumPy ndarray, for example
- You can add labels to the rows with the index attribute, for example myDF.index = ['row1', 'row2', 'row3']
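The points above can be sketched as follows. This is a minimal illustration, not the original example: the names, ages, and jobs are made-up values, and arrDF is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Build a table from a dictionary: keys become column names,
# and each list holds that column's values (illustrative data).
myDF = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [34, 28, 45],
    "Job": ["Engineer", "Teacher", "Nurse"],
})

# Label the rows through the index attribute.
myDF.index = ["row1", "row2", "row3"]

# A NumPy ndarray also works; column labels are passed separately.
arrDF = pd.DataFrame(np.array([[1, 2], [3, 4], [5, 6]]), columns=["a", "b"])

print(myDF)
print(arrDF)
```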
Accessing Values:
- Simple access is similar to accessing data from a multidimensional array, but you can specify the column name and row name instead of the position indices
- You can specify a row by its name rather than its positional index, provided you set the row names beforehand
- A Series is like a list that stores a column or row of a table along with its name and data type
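A short sketch of label-based and position-based access, assuming a small made-up table. It uses loc for labels and iloc for positions; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Carol"], "Age": [34, 28, 45]},
    index=["row1", "row2", "row3"],
)

# Access one cell by row label and column label.
age_by_label = df.loc["row2", "Age"]

# Access the same cell by row position and column position.
age_by_position = df.iloc[1, 1]

# A whole column or a whole row comes back as a Series,
# which carries its name and data type along with the values.
age_column = df["Age"]
first_row = df.loc["row1"]

print(age_by_label, age_by_position)
print(age_column.name, age_column.dtype)
print(first_row.name)
```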
Removing and Replacing Values:
- If you want to modify the current dataframe (rather than creating a new one), use the parameter inplace=True
- You can replace any value with another value; using a replacement of the same type keeps the column's data type unchanged
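As a rough sketch of removing and replacing, assuming the drop and replace methods are the ones meant here (the data and variable names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [34, 28, 45],
    "Job": ["Engineer", "Teacher", "Nurse"],
})

# Without inplace, drop returns a new DataFrame and leaves df untouched.
without_job = df.drop(columns=["Job"])

# With inplace=True, df itself is modified (here the first row is removed).
df.drop(index=[0], inplace=True)

# Replace one value with another of the same type in a single column.
new_ages = df["Age"].replace(28, 29)

print(without_job)
print(df)
print(new_ages)
```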
File Reading/Writing:
- Specify header=None only if the first line of the file doesn't have column labels
- Other supported formats include JSON, SQL, and HDF
- If the read data doesn't have column names, you can add them with df.columns = ['ColName1','ColName2',...]
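A minimal sketch of reading and writing CSV data. To keep it self-contained, an in-memory text buffer stands in for a file; in practice you would pass a file path instead. The CSV contents are made up.

```python
import io
import pandas as pd

# A CSV whose first line is data, not column labels (illustrative contents).
csv_text = "Alice,34\nBob,28\n"

# header=None tells read_csv not to treat the first line as labels.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Until names are assigned, the columns are simply numbered 0, 1, ...
df.columns = ["Name", "Age"]

# Write the table back out; index=False leaves the row labels out of the output.
out = io.StringIO()
df.to_csv(out, index=False)

print(df)
print(out.getvalue())
```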
Modification:
- You can pass ascending=False as a parameter to the sort_values function to sort the rows from highest to lowest value
- You can do other math operations on column values: +, -, *, ...
- The transform function supports custom lambda functions too
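The three operations above might look like this on a small made-up table. The new column names AgeInMonths and AgeNextYear are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"], "Age": [34, 28, 45]})

# Sort rows from the highest Age to the lowest.
by_age_desc = df.sort_values("Age", ascending=False)

# Arithmetic on a column applies to every value in it.
df["AgeInMonths"] = df["Age"] * 12

# transform accepts a custom lambda and returns a column of the same length.
df["AgeNextYear"] = df["Age"].transform(lambda a: a + 1)

print(by_age_desc)
print(df)
```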
Combination:
- You may want to merge on multiple features if there are repeated values in one of the features (e.g. 2 people named Bob, but with different birthdates)
- For a multiple-feature merge, both tables must have all the listed features (e.g. table 1 has "Name" and "Birthdate" features, and table 2 also has both)
- The result of the concat function will be a table that is the first table's rows with the second table's rows added below. For columns in one table that the other table doesn't have, NaN values will fill the empty cells
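A sketch of a multiple-feature merge and a concat, using made-up tables. The table names people, cities, and extra, and all of their values, are assumptions for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Bob", "Bob", "Alice"],
    "Birthdate": ["1990-01-01", "1985-06-15", "1992-03-10"],
    "Job": ["Teacher", "Chef", "Engineer"],
})
cities = pd.DataFrame({
    "Name": ["Bob", "Bob", "Alice"],
    "Birthdate": ["1990-01-01", "1985-06-15", "1992-03-10"],
    "City": ["Oslo", "Lima", "Kyoto"],
})

# Merging on both features keeps the two Bobs apart: one output row per person.
merged = pd.merge(people, cities, on=["Name", "Birthdate"])

# Merging on Name alone pairs every Bob with every Bob, producing extra rows.
merged_on_name = pd.merge(people, cities, on="Name")

# concat stacks the second table's rows below the first;
# cells for columns that a table lacks are filled with NaN.
extra = pd.DataFrame({"Name": ["Dan"], "Pet": ["Cat"]})
stacked = pd.concat([people, extra], ignore_index=True)

print(merged)
print(len(merged_on_name))
print(stacked)
```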
Analysis:
- For functions that use the axis parameter, set it to 1 to apply the function across the values in each row (the usual default, 0, applies it down each column)
- If you have a table with only numeric values, you can apply the sum/min/max functions to the whole table (not only a column); by default the result has one value per column
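A small sketch of the axis parameter on an all-numeric table. The column names and scores are made up.

```python
import pandas as pd

scores = pd.DataFrame({"math": [90, 70, 80], "art": [60, 85, 75]})

# Default axis=0: one result per column.
col_sums = scores.sum()

# axis=1: one result per row.
row_sums = scores.sum(axis=1)

# Chaining the call reduces the per-column results to a single number.
overall_max = scores.max().max()

print(col_sums)
print(row_sums)
print(overall_max)
```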
Looping Through Data:
- If you want whole rows or whole columns rather than single values, omit the last pair of brackets; each row or column is then returned as a Series
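One possible way to loop through a table, assuming iterrows and items are acceptable here (the original loop is not shown, so this is only a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [34, 28]})

# Two pairs of brackets reach a single value: column first, then row label.
first_age = df["Age"][0]

# Dropping the last pair of brackets returns the whole column as a Series.
age_column = df["Age"]

# iterrows yields (row label, row as a Series) pairs.
seen_rows = []
for label, row in df.iterrows():
    seen_rows.append((label, row["Name"], row["Age"]))

# items yields (column name, column as a Series) pairs.
seen_columns = []
for col_name, column in df.items():
    seen_columns.append((col_name, len(column)))

print(first_age)
print(seen_rows)
print(seen_columns)
```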
Plotting Data:
- A scatter plot shows data points with features represented on the x and y axes
- A histogram counts how many values fall into each bin (a range of values or a category) found in the data
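A sketch of both plot types using pandas' built-in plotting, which relies on matplotlib being installed. The data and column names are made up, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 60, 71, 80, 88],
    "grade": [9, 9, 10, 10, 10],
})

# One figure with two side-by-side axes, one per plot.
fig, (scatter_ax, hist_ax) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot: one feature on the x axis, another on the y axis.
df.plot.scatter(x="hours", y="score", ax=scatter_ax)

# Histogram: counts how many grade values fall into each of 2 bins.
df["grade"].plot.hist(bins=2, ax=hist_ax)

# In an interactive session, plt.show() would display the figure.
```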
Challenge
Create a dataframe that includes columns for age, weight, and height. Add 7 values for each column; age values should be 21, 22, or 23. Plot a histogram of the age values. Create a scatter plot of weight against height, and show correlation values (as a contingency table) comparing the features. Display the minimum and maximum heights. Multiply the heights by a factor (as if to convert them to different units).