Truncate the length of variable names #332

Open
statzhero opened this issue Feb 27, 2020 · 11 comments

@statzhero

Feature requests

Just to confirm: there's no option yet to truncate variable names?

E.g. clean_names(x, max = 24)

@sfirke
Owner

sfirke commented Feb 27, 2020

There is not. I am open to adding this, either as a separate function or as part of clean_names.

Would you shorten it by truncating from the end, then forcing uniqueness? I sometimes have very long survey questions with identical beginnings and in those cases I take maybe the first and last 10 characters, separated by a _.
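For illustration, the "truncate from the end, then force uniqueness" option could look roughly like the base-R sketch below. `truncate_names()` and its `max` argument are hypothetical names borrowed from the request above, not an existing janitor function, and `make.unique()` stands in for whatever de-duplication would actually be used.

```r
# Hypothetical helper (not part of janitor): cut each name to at most
# `max` characters, then de-duplicate with base::make.unique().
truncate_names <- function(x, max = 24) {
  short <- substr(x, 1, max)
  make.unique(short, sep = "_")
}

survey_cols <- c(
  "please_indicate_your_level_of_agreement_my_manager",
  "please_indicate_your_level_of_agreement_my_team",
  "age"
)
truncate_names(survey_cols, max = 24)
```

Names with identical beginnings collapse to the same truncated stem and are only told apart by the numeric suffix, which is the weakness the first-and-last-characters variant tries to avoid.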

@sfirke sfirke added the "seeking comments" label (Users and any interested parties should please weigh in - this is in a discussion phase!) Feb 27, 2020
@statzhero
Author

statzhero commented Feb 27, 2020

"shorten it by truncating from the end, then forcing uniqueness"

Yes, that sounds most appealing. base::abbreviate also has some interesting features.

@jzadra
Contributor

jzadra commented Feb 27, 2020

I second abbreviate(). I use it for exactly this purpose and it has highly configurable arguments.

@sfirke
Owner

sfirke commented Mar 7, 2020

Is this the same as #201? That's a couple of votes then. I wasn't aware of base::abbreviate but it looks like it's been missing from my life.

I like the idea, but this would add a bunch of new arguments to clean_names to control the truncation in abbreviate. abbreviate seems pretty good already. If I would just be building on that, then maybe an immediate solution is to add a link to abbreviate in the clean_names documentation? I think most people can write the extra line to follow clean_names with abbreviate; the hard part is knowing it exists.
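The "extra line" could be as small as the sketch below. It uses only base R and assumes the names have already been cleaned (for example by clean_names()); the data frame, its column names, and the choice of `minlength = 12` are made up for illustration.

```r
# Made-up data frame whose names are assumed to be cleaned already,
# e.g. by janitor::clean_names(); only base R is used below.
df <- data.frame(
  respondent_identifier_number = 1:2,
  respondent_household_income = c(10, 20)
)

# abbreviate() returns a named character vector; unname() drops the names
# so that only the abbreviations are assigned as column names.
names(df) <- unname(abbreviate(names(df), minlength = 12))
names(df)
```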

@billdenney
Collaborator

I'll try to pull something like this into the #340 rewrite.

@sfirke
Owner

sfirke commented Mar 7, 2020 via email

@sfirke sfirke self-assigned this Mar 12, 2020
@billdenney
Collaborator

I realized that I missed this feature in the clean_names() rewrite. What is the advantage of abbreviate() over simply using substr()? Also, abbreviate() appears not to always actually shorten the text. So, if you were looking for something with <=2 characters, you may not get that.
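A small comparison may help here. The sketch below uses made-up column names and only base R. It shows that substr() gives a hard cap on length but can produce duplicates, while abbreviate() treats `minlength` as a lower bound and, by default, may return longer strings to keep results unique.

```r
x <- c("total_household_income", "total_household_size")

# substr() gives a hard upper bound on length, but may create duplicates.
by_substr <- substr(x, 1, 8)
nchar(by_substr)
anyDuplicated(by_substr) > 0

# abbreviate() treats `minlength` as a minimum, not a cap: with the default
# strict = FALSE it may return longer strings to keep them unique.
by_abbrev <- abbreviate(x, minlength = 8)
nchar(by_abbrev)

# strict = TRUE asks abbreviate() to observe `minlength` strictly; its
# documentation notes that this may return non-unique strings.
by_strict <- abbreviate(x, minlength = 8, strict = TRUE)
nchar(by_strict)
```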

@sfirke
Owner

sfirke commented Mar 20, 2020

I'm not finding abbreviate useful for my cases where the variable is a long text string from a survey.

x <- "Please indicate your level of agreement with the following statement: my manager takes actions to make me feel valued."

This is not useful:

abbreviate(x, minlength = 30, method = "both.sides")
> "Plsiyloawtfsttmnt:mmtatmmfvld."

I much prefer:

> paste(stringr::str_sub(x, 1, 15), stringr::str_sub(x, start = -15), sep = "_")
[1] "Please indicate_me feel valued."

As a better use of 30 characters. I may be biased by this being my main use case for this problem.
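The first-and-last approach above can be wrapped in a small helper, sketched here in base R. `head_tail()` and its arguments `n` and `sep` are hypothetical names, not part of janitor, and leaving short names untouched is an added assumption.

```r
# Hypothetical helper: for names longer than 2 * n + nchar(sep), keep the
# first and last `n` characters joined by `sep`; other names are unchanged.
head_tail <- function(x, n = 15, sep = "_") {
  len <- nchar(x)
  too_long <- len > 2 * n + nchar(sep)
  x[too_long] <- paste(
    substr(x[too_long], 1, n),
    substr(x[too_long], len[too_long] - n + 1, len[too_long]),
    sep = sep
  )
  x
}

questions <- c(
  "Please indicate your level of agreement: my manager values me in 2019",
  "Please indicate your level of agreement: my manager values me in 2020",
  "Age"
)
head_tail(questions, n = 10)
```

Because the tail is kept, questions that differ only at the end (such as a year) stay distinguishable without relying on a numeric suffix.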

@jzadra
Contributor

jzadra commented Mar 20, 2020

I agree that a truncated version is more useful than a long abbreviation. I played around with abbreviate() a little and couldn't get anything much better.

One thing that occurred to me: are there particular pieces of a long colname that carry more critical information and that we would want to try to keep? For instance numerals ("Please indicate your level in 2019", "Please indicate your level in 2020"), or capitals, etc.

If so, we could check for these markers and make sure they are always kept.

Another thing would be to combine truncation with abbreviation for a string above some number of chars - truncate the first 15 or 20 keeping the special markers, then abbreviate the rest?

@billdenney
Collaborator

@jzadra, I was thinking that if we went down a path with a lot of controls for how abbreviation happens, then we may want another pair of functions (e.g. make_abbrev_names() and abbrev_names(); "abbrev" is specifically abbreviated because that makes me smile). Then, you could run:

data %>%
  abbrev_names([all the controls for abbreviation]) %>%
  clean_names([all the controls for cleaning])

In my mind, the part that fits in make_clean_names() (and therefore clean_names()) is the part that @sfirke suggested: Keep the beginning and end of the name with an underscore in the middle. In general, I think that the beginning and end of the column name contain most of the more critical information. (In your example, "2019" and "2020" are both at the end, and the "Please indicate" would give clarity to a column class.)

One other point I'd suggest: abbreviated names would not be guaranteed to stay within the requested number of characters, because duplicated column names get a suffix appended (_1 and counting up from there). The guarantee would be that, before de-duplication, each name has at most the requested number of characters. Does that seem reasonable?
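To make the proposed guarantee concrete, here is a base-R sketch with made-up names. `make.unique()` is only a stand-in for the de-duplication step; the exact suffix scheme janitor would apply is not assumed here.

```r
max_len <- 10
raw <- c("satisfaction_with_manager", "satisfaction_with_team", "age")

# Before de-duplication every name has at most max_len characters.
truncated <- substr(raw, 1, max_len)

# After de-duplication the appended suffix can push a name past max_len.
deduped <- make.unique(truncated, sep = "_")

nchar(truncated)
nchar(deduped)
```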

For my use cases, I find myself often doing the following:

data %>%
  clean_names() %>%
  rename(
    # make the names what I actually want them to be, but
    # start from something known to be ok and unique
  )
@jzadra
Contributor

jzadra commented Mar 20, 2020

That makes sense to me. And I often do something similar to your use case - let clean_names() get them most of the way there and unique, and then modify what needs to be changed manually.

@sfirke sfirke removed this from the v2.0 milestone Apr 5, 2020
4 participants