
Suggestion for get_dupes: return additional variable "dupe_group" #371

Open
jeffmun opened this issue Apr 30, 2020 · 5 comments
Labels
seeking comments Users and any interested parties should please weigh in - this is in a discussion phase!

Comments

@jeffmun

jeffmun commented Apr 30, 2020

Feature request

I use get_dupes() all the time. Thank you!
I would find it very helpful to index each set of dupes with an additional variable.
Something like "dupe_group" or "dupe_index"?

In the code below, I've used `data.table::frank` to create the variable I'm looking for, but it would be great to have this automatically embedded in get_dupes().

Thanks for your work on janitor.


The following illustrates what I'm thinking of:

library(janitor)
library(dplyr)
library(data.table)

mydupes <- mtcars %>% get_dupes(cyl, gear)
mydupes$dupe_group <- frank(mydupes, cyl, gear, ties.method = "dense")
mydupes %>% dplyr::select(cyl, gear, dupe_count, dupe_group)

This returns the output below. Having the "dupe_group" variable lets me immediately operate on each set of duplicates.

# A tibble: 30 x 4
     cyl  gear dupe_count dupe_group
   <dbl> <dbl>      <int>      <int>
 1     4     4          8          1
 2     4     4          8          1
 3     4     4          8          1
 4     4     4          8          1
 5     4     4          8          1
 6     4     4          8          1
 7     4     4          8          1
 8     4     4          8          1
 9     4     5          2          2
10     4     5          2          2
11     6     3          2          3
12     6     3          2          3
13     6     4          4          4
14     6     4          4          4
15     6     4          4          4
16     6     4          4          4
17     8     3         12          5
18     8     3         12          5
19     8     3         12          5
20     8     3         12          5
21     8     3         12          5
22     8     3         12          5
23     8     3         12          5
24     8     3         12          5
25     8     3         12          5
26     8     3         12          5
27     8     3         12          5
28     8     3         12          5
29     8     5          2          6
30     8     5          2          6
@sfirke sfirke added the seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! label Apr 30, 2020
@sfirke
Owner

sfirke commented Apr 30, 2020

I like this idea! I don't think it would be hard to implement either; any of the approaches here would probably do it: https://stackoverflow.com/questions/6112803/how-to-create-a-consecutive-index-based-on-a-grouping-variable-in-a-dataframe. We probably ought to test them for performance, but dplyr::group_indices looks simple (I don't wish to add data.table as a dependency of janitor).

Any thoughts from other users, either in terms of how to implement this or how it should work for the user?
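For anyone following along, here's a minimal sketch of that dense-index idea using only dplyr, i.e. no data.table dependency. This is just an illustration of the approach, not janitor's implementation; `cur_group_id()` is the modern replacement for the superseded `group_indices()`:

```r
# Sketch only: tag each duplicate set with a dense group index
# using dplyr alone (cur_group_id() requires dplyr >= 1.0).
library(dplyr)
library(janitor)

mtcars %>%
  get_dupes(cyl, gear) %>%
  group_by(cyl, gear) %>%
  mutate(dupe_group = cur_group_id()) %>%
  ungroup()
```

Because get_dupes() already sorts by the key columns, the resulting indices come out consecutive within each duplicate set.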

@jzadra
Contributor

jzadra commented Apr 30, 2020

I think this is a nice idea as well. And super easy to implement with probably zero performance hit.

I wonder whether it should be on by default or require an argument to get the output? I suggest the latter, as I think it isn't useful to most people unless they specifically want it in order to accomplish something. `add_dupe_group_index = FALSE` by default, maybe.

Here's an approach I've just tested that seems super simple and shouldn't add any noticeable performance decrease:

  1. Filter `dupe_count > 1` at line 37 of the existing code rather than at line 44.
  2. Add `row_number()` to counts.

Sorry for not doing a PR, my git2r stopped working with ssh again and I don't have time to get it running right now.

get_dupes <- function(dat, ..., add_dupe_index = FALSE) {
  
  expr <- rlang::expr(c(...))
  pos <- tidyselect::eval_select(expr, data = dat)
  
  names(dat)[pos] <- names(pos) #allows for renaming within get_dupes() consistent with select()
  
  if (rlang::dots_n(...) == 0) { # if no tidyselect variables are specified, check the whole data.frame
    var_names <- names(dat)
    nms <- rlang::syms(var_names)
    message("No variable names specified - using all columns.\n")
  } else {
    var_names <- names(pos)
    nms <- rlang::syms(var_names)
  }
  
  dupe_count <- NULL # to appease NOTE for CRAN; does nothing.
  
  # calculate counts to join back to main df
  counts <- dat %>%
    dplyr::count(!!! nms, name = "dupe_count") %>% 
    dplyr::filter(dupe_count > 1)
  
  if (add_dupe_index) counts <- counts %>% dplyr::mutate(dupe_group = dplyr::row_number())
  
  # join new count vector to main data.frame
  dupes <- suppressMessages(dplyr::inner_join(counts, dat))
  
  dupes <- dupes %>%
    dplyr::ungroup() %>%
    dplyr::arrange(!!! nms)
  
  # shorten the message for data.frames with many columns
  if (length(var_names) > 10) {
    var_names <- c(var_names[1:9], paste("... and", length(var_names) - 9, "other variables"))
  }
  if (nrow(dupes) == 0) {
    message(paste0("No duplicate combinations found of: ", paste(var_names, collapse = ", ")))
  }
  dupes
}

And now:

> mtcars %>% select(cyl, gear) %>% get_dupes(cyl, gear)
   cyl gear dupe_count
1    4    4          8
2    4    4          8
3    4    4          8
4    4    4          8
5    4    4          8
6    4    4          8
7    4    4          8
8    4    4          8
9    4    5          2
10   4    5          2
11   6    3          2
12   6    3          2
13   6    4          4
14   6    4          4
15   6    4          4
16   6    4          4
17   8    3         12
18   8    3         12
19   8    3         12
20   8    3         12
21   8    3         12
22   8    3         12
23   8    3         12
24   8    3         12
25   8    3         12
26   8    3         12
27   8    3         12
28   8    3         12
29   8    5          2
30   8    5          2
> mtcars %>% select(cyl, gear) %>% get_dupes(cyl, gear, add_dupe_index = T)
   cyl gear dupe_count dupe_group
1    4    4          8          1
2    4    4          8          1
3    4    4          8          1
4    4    4          8          1
5    4    4          8          1
6    4    4          8          1
7    4    4          8          1
8    4    4          8          1
9    4    5          2          2
10   4    5          2          2
11   6    3          2          3
12   6    3          2          3
13   6    4          4          4
14   6    4          4          4
15   6    4          4          4
16   6    4          4          4
17   8    3         12          5
18   8    3         12          5
19   8    3         12          5
20   8    3         12          5
21   8    3         12          5
22   8    3         12          5
23   8    3         12          5
24   8    3         12          5
25   8    3         12          5
26   8    3         12          5
27   8    3         12          5
28   8    3         12          5
29   8    5          2          6
30   8    5          2          6
@higgi13425

This sounds like a good idea. Would it make sense to add a function like keep_first_dupe(), that would keep only the first occurrence of each duplicated observation? That is something I find myself doing fairly often...

@jzadra
Contributor

jzadra commented Feb 4, 2022

I think it's easier to just follow it up with distinct(), unless I'm missing something?
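For the record, a quick sketch of that follow-up, assuming the goal is one row per duplicated combination (a dedicated `keep_first_dupe()` helper remains hypothetical here):

```r
# Sketch: keep only the first row of each duplicate set by
# following get_dupes() with distinct() on the key columns.
library(dplyr)
library(janitor)

mtcars %>%
  get_dupes(cyl, gear) %>%
  distinct(cyl, gear, .keep_all = TRUE)
```

With `.keep_all = TRUE`, distinct() retains the first matching row's remaining columns, which is the "keep first occurrence" behavior described above.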

@higgi13425

higgi13425 commented Feb 4, 2022 via email

4 participants