Store list in dataframe R
At times, you may need to convert a Pandas DataFrame into a list in Python. But how would you do that? To accomplish this task, you can use tolist as follows:

    df.values.tolist()

In this short guide, you’ll see an example of using tolist to convert a Pandas DataFrame into a list.

Example of using tolist to Convert Pandas DataFrame into a List

Let’s say that you have the following data about products and prices:

    product   price
    Tablet      250
    Printer     100
    Laptop     1200
    Monitor     300
You then decide to capture that data in Python using a Pandas DataFrame. At a certain point, you realize that you’d like to convert that DataFrame into a list. To accomplish this goal, create the DataFrame first and then apply df.values.tolist() to it.
Here is the full Python code:

    import pandas as pd

    data = {'product': ['Tablet', 'Printer', 'Laptop', 'Monitor'],
            'price': [250, 100, 1200, 300]
            }

    df = pd.DataFrame(data)

    products_list = df.values.tolist()
    print(products_list)

And once you run the code, you’ll get the following multi-dimensional list (i.e., list of lists):

    [['Tablet', 250], ['Printer', 100], ['Laptop', 1200], ['Monitor', 300]]

Optionally, you may further confirm that you got a list by adding print(type(products_list)) at the bottom of the code:

    import pandas as pd

    data = {'product': ['Tablet', 'Printer', 'Laptop', 'Monitor'],
            'price': [250, 100, 1200, 300]
            }

    df = pd.DataFrame(data)

    products_list = df.values.tolist()
    print(products_list)
    print(type(products_list))

As you can see, the original DataFrame was indeed converted into a list:

    [['Tablet', 250], ['Printer', 100], ['Laptop', 1200], ['Monitor', 300]]
    <class 'list'>

Convert an Individual Column in the DataFrame into a List

Let’s say that you’d like to convert the ‘product’ column into a list. You can then use the following template in order to convert an individual column in the DataFrame into a list:

    df['column_name'].values.tolist()

Here is the complete Python code to convert the ‘product’ column into a list:

    import pandas as pd

    data = {'product': ['Tablet', 'Printer', 'Laptop', 'Monitor'],
            'price': [250, 100, 1200, 300]
            }

    df = pd.DataFrame(data)

    product = df['product'].values.tolist()
    print(product)

Run the code, and you’ll get the following list:

    ['Tablet', 'Printer', 'Laptop', 'Monitor']

What if you want to append an additional item (e.g., Keyboard) to the ‘product’ list?
In that case, simply add the following syntax:

    product.append('Keyboard')

So the complete Python code would look like this:

    import pandas as pd

    data = {'product': ['Tablet', 'Printer', 'Laptop', 'Monitor'],
            'price': [250, 100, 1200, 300]
            }

    df = pd.DataFrame(data)

    product = df['product'].values.tolist()
    product.append('Keyboard')
    print(product)

You’ll now see ‘Keyboard’ at the end of the list:

    ['Tablet', 'Printer', 'Laptop', 'Monitor', 'Keyboard']

An Opposite Scenario

Sometimes, you may face the opposite situation, where you’ll need to convert a list to a DataFrame. If that’s the case, you may want to check the guide that explains how to convert a list to a DataFrame in Python.
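For a quick feel of that opposite direction, here is a minimal sketch. It assumes the same product data used earlier in this guide and relies on the pd.DataFrame constructor with its columns argument; the variable name round_trip is made up for illustration.

```python
import pandas as pd

# Assumed starting point: the list of lists produced earlier in this guide.
products_list = [['Tablet', 250], ['Printer', 100], ['Laptop', 1200], ['Monitor', 300]]

# Each inner list becomes one row; the column names are supplied explicitly.
df = pd.DataFrame(products_list, columns=['product', 'price'])
print(df)

# Converting back with tolist should reproduce the original list of lists.
round_trip = df.values.tolist()
print(round_trip == products_list)
```

Without the columns argument, pandas labels the columns 0 and 1, so passing the names up front keeps the DataFrame consistent with the one built from the dictionary.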
    # Add a list to data.frame in single 'element' slot

    # 1) Make data.frame from named list (names are optional)
    rowA <- as.list(c(1,2,3,4,5))
    names(rowA) <- c("A", "B", "C", "D", "E")
    data.df <- data.frame(rowA, stringsAsFactors=FALSE)
    data.df
    #   A B C D E
    # 1 1 2 3 4 5

    # 2) Make lists you want to add to data frame:
    numeric.X <- as.numeric(c(100,200,300,400))

    # 3) Add List as new column to data.frame:
    data.df$X <- list(numeric.X)
    data.df
    #   A B C D E                  X
    # 1 1 2 3 4 5 100, 200, 300, 400

Approximate time: 60 min

Learning Objectives
Dataframes

Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select specific data from them we need to specify the “coordinates” of that data. We use the same square bracket notation, but rather than providing a single index, two indices are required. Within the square brackets, row numbers come first, followed by column numbers (the two are separated by a comma). Let’s explore the metadata dataframe. For example:

    metadata[1, 1]   # element from the first row in the first column of the data frame

    metadata[1, 3]   # element from the first row in the 3rd column

Now if you only wanted to select based on rows, you would provide the index for the rows and leave the columns index blank. The key here is to include the comma, to let R know that you are accessing a 2-dimensional data structure:

    metadata[3, ]    # vector containing all elements in the 3rd row

If you were selecting specific columns from the data frame, the rows index is left blank:

    metadata[ , 3]   # vector containing all elements in the 3rd column

Just like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values:

    metadata[ , 1:2]       # dataframe containing first two columns

    metadata[c(1,3,6), ]   # dataframe containing first, third and sixth rows

For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Is celltype in column 1 or 2? Oh, right… it is in column 1.) In some cases, the column number for a variable can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, which also makes your code easier to read and your intentions clearer.

    metadata[1:3 , "celltype"]   # elements of the celltype column corresponding to the first three samples

You can do operations on a particular column by selecting it using the $ sign.
In this case, the entire column is a vector. For instance, to extract all the genotypes from our dataset, we can use:

    metadata$genotype

You can use colnames(metadata) or names(metadata) to remind yourself of the column names. We can then supply index values to select specific values from that vector. For example, if we wanted the genotype information for the first five samples in metadata:

    colnames(metadata)

    metadata$genotype[1:5]

The $ allows you to select a single column by name. To select multiple columns by name, you need to concatenate a vector of strings that correspond to column names:

    metadata[, c("genotype", "celltype")]

             genotype celltype
    sample1        Wt    typeA
    sample2        Wt    typeA
    sample3        Wt    typeA
    sample4        KO    typeA
    sample5        KO    typeA
    sample6        KO    typeA
    sample7        Wt    typeB
    sample8        Wt    typeB
    sample9        Wt    typeB
    sample10       KO    typeB
    sample11       KO    typeB
    sample12       KO    typeB

While there is no equivalent $ syntax to select a row by name, you can select specific rows using the row names. To remember the names of the rows, you can use the rownames() function:

    rownames(metadata)

    metadata[c("sample10", "sample12"),]

Selecting using indices with logical operators

With dataframes, similar to vectors, we can build a logical vector from a specific column and use it to return only the rows of the dataframe that have TRUE at the same position or index in the logical vector.

    idx <- metadata$celltype == "typeA"

    metadata[idx, ]

Selecting indices with logical operators using the which() function

As you might have guessed, we can also use the which() function to return the indices for which the logical expression is TRUE.
For example, we can find the indices where the celltype is typeA within the metadata dataframe:

    idx <- which(metadata$celltype == "typeA")

    metadata[idx, ]

Or we could find the indices for the metadata replicates 2 and 3:

    idx <- which(metadata$replicate > 1)

    metadata[idx, ]

Let’s save this output to a variable:

    sub_meta <- metadata[idx, ]

Exercise

Subset the metadata dataframe to return only the rows of data with a genotype of KO.
Lists

Selecting components from a list requires a slightly different notation, even though in theory a list is a vector (that contains multiple data structures). To select a specific component of a list, you need to use double bracket notation [[]]. Let’s use the list1 that we created previously, and index the second component:

    list1[[2]]

What do you see printed to the console? Using the double bracket notation is useful for accessing the individual components whilst preserving the original data structure. When creating this list we know we had originally stored a dataframe in the second component. With the class function we can check if that is what we retrieve:

    comp2 <- list1[[2]]

    class(comp2)

You can also reference what is inside the component by adding an additional bracket. For example, in the first component we have a vector stored.

    list1[[1]]

    [1] "ecoli" "human" "corn"

Now, if we wanted to reference the first element of that vector we would use:

    list1[[1]][1]

    [1] "ecoli"

You can also do the same for dataframes and matrices, although with larger datasets it is not advisable. Instead, it is better to save the contents of a list component to a variable (as we did above) and further manipulate it. Also, it is important to note that when selecting components we can only access one at a time. To access multiple components of a list, see the note below.
Exercises

Let’s practice inspecting lists. Create a list named random with the following components: metadata, age, list1, samplegroup, and number.
Assigning names to the components in a list can help identify what each list component contains, as well as facilitate the extraction of values from list components. Adding names to components of a list uses the same function as adding names to the columns of a dataframe, names(). Let’s check and see if list1 has names for the components:

    names(list1)

When we created the list we had combined the species vector with a dataframe df and the number variable. Let’s assign the original names to the components:

    names(list1) <- c("species", "df", "number")

    names(list1)

Now that we have named our list components, we can extract components using the $, similar to extracting columns from a dataframe. To obtain a component of a list using the component name, use list_name$component_name. To extract the df dataframe from the list1 list:

    list1$df

Now we have three ways that we could extract a component from a list. Let’s extract the species vector from list1:

    list1[[1]]

    list1[["species"]]

    list1$species

Exercise

Let’s practice combining ways to extract data from the data structures we have covered so far:
Writing to file

Everything we have done so far has only modified the data in R; the files have remained unchanged. Whenever we want to save our datasets to file, we need to use a write function in R. To write our data frame to file in comma separated format (.csv), we can use the write.csv function. We need to supply two arguments: the variable name of the data structure you are exporting, and the path and filename that you are exporting to. The delimiter is already set, so columns will be separated by a comma:

    write.csv(sub_meta, file="data/subset_meta.csv")

Similar to reading in data, there are a wide variety of functions available allowing you to export data in specific formats. Another commonly used function is write.table, which allows you to specify the delimiter you wish to use. This function is commonly used to create tab-delimited files.
Writing a vector of values to file requires a different function than the functions available for writing dataframes. You can use write() to save a vector of values to file. For example:

    write(glengths, file="data/genome_lengths.txt", ncolumns=1)
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.