Data manipulation

This page describes how to manipulate data in R.

colnames(df)[colnames(df)=="old_name"] <- "new_name"

Where df in the name of the data frame. old_name has to match the original name.


To check if a variable is empty, do not use df$var == "NA" or something similar, but$var).

df <- data.frame(var1 = NA, var2 = NA)

Each variable will be a new column.

Require the dplyr package.

df = select(df, -col1., -col23, -col47)

Instead of col1, use the actual name of the column in the data frame.

Two ways to do so.

The first one is the most elegant. It is:


x is the row number.

The second one is by knowing the numbers of the row and the column of the cell:

df[x, y]

x is the row number and y is the column number. Omitting y will return the values of the row x, and conversely if omitting x.


Extract data of a given cell based on the content of another cell

Let's say you want to know the value of the cell price when the cell currency is EUR. The previous code could be tricky, especially because it requires you to look at the data frame, something you don't want to.

To extract price when currency is EUR, use the following code:

df$price[df$currency == "EUR"]

Also, see this for more details on the conditions.

In the scales package, use the percent (or percent_format) function:

percent(x, accuracy = NULL, scale = 100, prefix = "",
  suffix = "%", big.mark = " ", decimal.mark = ".", trim = TRUE,

This exemple will produce a percent with 1 decimal digit:

percent(x = (var1_2018 - var1_2017)/var1_2017, accuracy = .1)

Arguments (copied from source):

  • accuracy: Number to round to, NULL for automatic guess.
  • scale: A scaling factor: x will be multiply by scale before formating (useful if the un- derlying data is on another scale, e.g. for computing percentages or thousands).
  • prefix, suffix: Symbols to display before and after value.
  • big.mark: Character used between every 3 digits to separate thousands.
  • decimal.mark: The character to be used to indicate the numeric decimal point.
  • trim: Logical, if FALSE, values are right-justified to a common width (see base::format())
  • ...: Other arguments passed on to base::format().
  • x: A numeric vector to format.
  • digits: Deprecated, use accuracy instead.
  • unit: The units to append.
  • sep: The separator between the number and the unit label.

To replace all the NA in a data frame with 0 (source):

df[] <- 0

To replace only the NA in a column (variable):

df$var[$var)] <- 0

Sometime, because of formatting or for others reasons, some rows may be empty while still present in the data frame. To remove them, it's simple:

df <- read.csv2("file.csv", sep = ";", header = TRUE, na.strings = c("", "NA"))
df <- remove_empty(df,  which = c("rows", "cols"))

na.strings = c("", "NA") ensures that every blank value will be turned into NA. This is needed for remove_empty to work (needs janitor package

remove_empty removes any row or column entirely made of NA (so this is why we needed the first line). remove_empty_rows and remove_empty_columns are deprecated.

Let's say you want to remove all observations that have an empty value for a given column/variable. Here it is:

df <- df[ ! df$var %in% c("", "NA"), ]


Enter your comment. Wiki syntax is allowed:
If you can't read the letters on the image, download this .wav file to get them read to you.
  • Last modified: 6 weeks ago
  • by Olivier Simard-Casanova