Change the name of one column
colnames(df)[colnames(df)=="old_name"] <- "new_name"
Here df is the name of the data frame, and old_name has to match the existing column name exactly.
Change the names of all columns at once
colnames(df) <- c("col1", "col2")
Be sure to have as many items in c() as there are columns in df.
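As a minimal sketch of both renaming approaches, assuming a toy data frame whose column names (id, old_name) are made up for illustration:

```r
# Toy data frame; the column names are hypothetical.
df <- data.frame(id = 1:3, old_name = c("a", "b", "c"))

# Rename a single column by matching its current name.
colnames(df)[colnames(df) == "old_name"] <- "new_name"
after_one <- colnames(df)

# Rename every column at once; the vector must have one name per column.
new_names <- c("col1", "col2")
stopifnot(length(new_names) == ncol(df))
colnames(df) <- new_names
```

The explicit length check is optional, but it fails early with a clear message instead of leaving some columns with missing names.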
Change the type of a variable
var <- as.data.frame(var)
Change a string to a float
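If var looks like a string that contains floats but is actually stored as a factor (for instance, a text column read by read.csv() with stringsAsFactors = TRUE, which was the default before R 4.0), as.numeric(var) won’t give you the numbers you expect. It returns the internal integer codes of the factor levels, so you lose the decimal part of var, and even the integer part is wrong.
To avoid that, convert to character first:
as.numeric(as.character(var))
On a genuine character vector, as.numeric(var) works directly.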
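One likely explanation for this behaviour is that var is a factor rather than a character vector. The sketch below assumes that case, with made-up values, and compares the two conversions:

```r
# Assumption: var is a factor, as older read.csv() defaults would produce.
var <- factor(c("1.5", "10.25", "3.75"))

# Direct conversion returns the internal level codes, not the values.
codes <- as.numeric(var)

# Going through character recovers the numbers written in the labels.
values <- as.numeric(as.character(var))

# Equivalent form that converts each level only once, then indexes by code.
values_alt <- as.numeric(levels(var))[var]
```

Running class(var) or str(var) first is a quick way to check whether you are dealing with a factor at all.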
Check if a variable is NA
To check whether a value is missing, do not use df$var == "NA" or something similar; use is.na(df$var) instead.
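A small sketch of why the comparison fails, using a hypothetical three-row data frame: comparing a missing value with anything yields NA rather than TRUE, while is.na() gives a clean logical vector.

```r
# Toy data with one missing value.
df <- data.frame(var = c(1, NA, 3))

# The comparison cannot flag the missing value: the result is NA for that row.
cmp <- df$var == "NA"

# is.na() returns TRUE exactly where the value is missing.
miss <- is.na(df$var)

# Count the missing values in the column.
n_missing <- sum(miss)
```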
Create an empty data frame
df <- data.frame(var1 = NA, var2 = NA)
Each variable will be a new column. Note that the result is not strictly empty: it contains a single row filled with NA.
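If you need a data frame with no rows at all, one possible variant is to give each column a zero-length vector. The column types chosen here (numeric and character) are arbitrary assumptions for the sake of the example:

```r
# One row filled with NA: this is what data.frame(var1 = NA, var2 = NA) builds.
df_na <- data.frame(var1 = NA, var2 = NA)

# Zero rows, with a declared (here arbitrary) type for each column.
df_empty <- data.frame(var1 = numeric(0), var2 = character(0))

rows_na <- nrow(df_na)
rows_empty <- nrow(df_empty)
```

Starting from zero rows avoids having to delete the placeholder NA row after appending real data.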
Drop one or more columns
Vanilla solution
df[2] <- NULL
This code will drop the second column.
To drop more than one column, use:
df[2:5] <- NULL
The columns from 2 to 5 will be dropped.
Source (with even more options).
With dplyr
Requires the dplyr package.
df <- select(df, -col1, -col23, -col47)
Instead of col1, col23 and col47, use the actual names of the columns in the data frame.
Extract data of a given cell
Two ways to do so.
The first one is the most elegant. It is:
df$colname[x]
x is the row number.
The second one is by knowing the numbers of the row and the column of the cell:
df[x, y]
x is the row number and y is the column number. Omitting y (df[x, ]) returns the whole row x, and omitting x (df[, y]) returns the whole column y.
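The different forms can be compared on a toy data frame (the column names colname and other are placeholders):

```r
# Toy data frame with three rows and two columns.
df <- data.frame(colname = c(10, 20, 30), other = c("a", "b", "c"))

# Cell in row 2 of the column named colname.
cell_by_name <- df$colname[2]

# Same cell, addressed by row and column number.
cell_by_index <- df[2, 1]

# Omitting the column returns the whole row as a one-row data frame.
row_2 <- df[2, ]

# Omitting the row returns the whole column as a plain vector.
col_1 <- df[, 1]
```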
Extract data of a given cell based on the content of another cell
Let’s say you want to know the value of price in the rows where currency is EUR. The previous code could be tricky, especially because it requires you to look up row numbers in the data frame, something you don’t want to do.
To extract price when currency is EUR, use the following code:
df$price[df$currency == "EUR"]
Also, see this for more details on the conditions
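Here is a small sketch of this kind of conditional extraction, with invented prices and currencies. It also shows how a second condition can be combined with &:

```r
# Toy data; the values are invented for illustration.
df <- data.frame(
  price = c(10, 12.5, 8, 20),
  currency = c("EUR", "USD", "EUR", "GBP")
)

# Prices of the rows whose currency is EUR.
eur_prices <- df$price[df$currency == "EUR"]

# Several conditions can be combined with & (and) or | (or).
big_eur_prices <- df$price[df$currency == "EUR" & df$price > 9]
```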
Isolate part of a data frame based on certain values
Let’s say you have a data frame df with prices in two currencies, EUR and USD, and you want to keep only the observations in one of those currencies. Use this:
df <- df[ df$currency %in% c("EUR"), ]
subset() might also help (more).
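As a rough comparison of the two approaches on invented data, both the %in% filter and subset() keep the same rows here:

```r
# Toy data; the values are invented for illustration.
df <- data.frame(
  price = c(10, 12.5, 8),
  currency = c("EUR", "USD", "EUR")
)

# Logical indexing: keep rows whose currency is in the given set.
eur_index <- df[df$currency %in% c("EUR"), ]

# subset() evaluates the condition inside the data frame, so no df$ prefix.
eur_subset <- subset(df, currency == "EUR")
```

%in% becomes more convenient than == once you want to keep several values, for instance c("EUR", "USD").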
Percents
In the scales package, use the percent (or percent_format) function:
percent(x, accuracy = NULL, scale = 100, prefix = "", suffix = "%", big.mark = " ", decimal.mark = ".", trim = TRUE)
This example will produce a percent with 1 decimal digit:
percent(x = (var1_2018 - var1_2017)/var1_2017, accuracy = .1)
Arguments (copied from source):
accuracy: Number to round to, NULL for automatic guess.
scale: A scaling factor: x will be multiplied by scale before formatting (useful if the underlying data is on another scale, e.g. for computing percentages or thousands).
prefix, suffix: Symbols to display before and after value.
big.mark: Character used between every 3 digits to separate thousands.
decimal.mark: The character to be used to indicate the numeric decimal point.
trim: Logical, if FALSE, values are right-justified to a common width (see base::format()).
...: Other arguments passed on to base::format().
x: A numeric vector to format.
digits: Deprecated, use accuracy instead.
unit: The units to append.
sep: The separator between the number and the unit label.
Remove empty rows/lines
Sometimes, because of formatting or for other reasons, some rows may be empty while still present in the data frame. Removing them is simple:
df <- read.csv2("file.csv", sep = ";", header = TRUE, na.strings = c("", "NA"))
df <- remove_empty(df, which = c("rows", "cols"))
na.strings = c("", "NA") ensures that every blank value will be turned into NA. This is needed for remove_empty to work (needs the janitor package, see here).
remove_empty removes any row or column entirely made of NA (so this is why we needed the first line). remove_empty_rows and remove_empty_columns are deprecated.
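If you would rather not depend on janitor, the same idea can be approximated in base R. This is a sketch of an alternative technique on invented data, not what remove_empty does internally:

```r
# Toy data: row 2 and column b contain only NA.
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

# A row is empty when every one of its cells is NA.
empty_rows <- rowSums(is.na(df)) == ncol(df)

# A column is empty when every one of its cells is NA.
empty_cols <- colSums(is.na(df)) == nrow(df)

# Keep everything else; drop = FALSE keeps the result a data frame.
cleaned <- df[!empty_rows, !empty_cols, drop = FALSE]
```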
Remove NA
Let’s say you want to remove all observations that have an empty value for a given column/variable. Here it is:
df <- dplyr::filter(df, !is.na(var))
It requires the dplyr package (source).
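For comparison, a base R version of the same filter could look like the sketch below. The data frame and its columns (id, var) are invented:

```r
# Toy data with two missing values in var.
df <- data.frame(id = 1:4, var = c(5, NA, 7, NA))

# Keep only the rows where var is not missing.
df_complete <- df[!is.na(df$var), ]
```

The trailing comma inside the brackets matters: it tells R to filter rows and keep all columns.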
Reorder columns in a dataframe
Vanilla
Two solutions are available:
# reorder by column name
df <- df[c("A", "B", "C")]
# reorder by column index
df <- df[c(1,3,2)]
With package
Requires dplyr.
df <- select(df, col2, col1)
df will be reordered from col1, col2 to col2, col1. Note that select() keeps only the columns you list.
Replace NA
With 0 (zeros)
To replace all the NA in a data frame with 0 (source):
df[is.na(df)] <- 0
To replace only the NA in a column (variable):
df$var[is.na(df$var)] <- 0
With a value contained in another column
You want to replace var1 with the value of var2 when var1 is NA. Intuitively, you want to write:
df$var1[is.na(df$var1)] <- df$var2
But this code will run into problems. If only some values of var1 are missing, the two sides of the assignment have different lengths: the left side selects only the missing rows, while the right side is the whole var2 column. R then fills the gaps with the first values of var2, in order. For instance, if only the 9th row of var1 is empty, it will be replaced not with the 9th row of var2, but with the 1st. You will also get the warning "number of items to replace is not a multiple of replacement length".
To avoid that, you need to use ifelse():
df$var1 <- ifelse(is.na(df$var1), df$var2, df$var1)
What this code does is simple: if df$var1 is empty (NA), it will be replaced by var2. If not, it will be replaced by itself, so nothing will change.
This avoids the discrepancy mentioned earlier because ifelse() works element by element: the value in row i of var1 can only be replaced by the value in row i of var2.
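The sketch below, on invented data, shows the ifelse() approach next to an equivalent one that applies the same row selection to both sides of the assignment, which is another way to keep var1 and var2 aligned:

```r
# Toy data: rows 2 and 4 of var1 are missing.
df <- data.frame(var1 = c(1, NA, 3, NA), var2 = c(10, 20, 30, 40))

# Element-wise choice: take var2 where var1 is NA, otherwise keep var1.
filled_ifelse <- ifelse(is.na(df$var1), df$var2, df$var1)

# Alternative: index both sides with the same logical vector.
missing <- is.na(df$var1)
filled_index <- df$var1
filled_index[missing] <- df$var2[missing]
```

Both vectors contain the var2 value from the same row as each gap, so row 2 gets 20 and row 4 gets 40.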
Retrieve the numbers and the names of all columns and all rows
ncol(df)
colnames(df)
nrow(df)
rownames(df)
Sort dataframe
Let’s say you want to sort a data frame by one specific column/variable. Here is the code to do that (arrange() and desc() come from the dplyr package):
df <- arrange(df, desc(var))
var is the variable used to sort df. To sort df in the opposite direction (ascending order), remove desc().
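Without dplyr, a comparable result can be obtained with order() from base R. This is an alternative sketch on invented data, not a description of how arrange() works:

```r
# Toy data; the values are invented for illustration.
df <- data.frame(name = c("a", "b", "c"), var = c(2, 3, 1))

# Ascending order: order() returns the row positions sorted by var.
df_asc <- df[order(df$var), ]

# Descending order, the counterpart of arrange(df, desc(var)).
df_desc <- df[order(df$var, decreasing = TRUE), ]
```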