Truncate the length of variable names #332

statzhero · 2020-02-27T16:57:14Z

Feature requests

Just to confirm: there's no option yet to truncate variable names?

E.g. clean_names(x, max = 24)

sfirke · 2020-02-27T17:00:57Z

There is not. I am open to adding this, either as a separate function or as part of clean_names.

Would you shorten it by truncating from the end, then forcing uniqueness? I sometimes have very long survey questions with identical beginnings and in those cases I take maybe the first and last 10 characters, separated by a _.

statzhero · 2020-02-27T18:45:57Z

> shorten it by truncating from the end, then forcing uniqueness

Yes, that sounds most appealing. base::abbreviate also has some interesting features.

jzadra · 2020-02-27T22:55:47Z

I second abbreviate(). I use it for exactly this purpose and it has highly configurable arguments.

sfirke · 2020-03-07T05:29:39Z

Is this the same as #201? That's a couple of votes then. I wasn't aware of base::abbreviate but it looks like it's been missing from my life.

I like the idea, but this would mean adding a bunch of new arguments to clean_names to control the truncation in abbreviate. abbreviate seems pretty good already. If I would just be building on that, then maybe an immediate solution is to add a link to abbreviate in the clean_names documentation? I think most people can write the extra line to follow clean_names with abbreviate; the hard part is knowing it exists.
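
For readers who have not used it, here is a minimal sketch of that "extra line". It uses only base R. The column names are made up for illustration and stand in for the output of clean_names(); the minlength value is arbitrary.

```r
# Hypothetical, already-cleaned column names (stand-ins for clean_names() output)
long_names <- c(
  "please_indicate_your_level_of_agreement",
  "respondent_identification_number",
  "id"
)

# abbreviate() shortens each name toward `minlength` characters.
# It returns a named vector (names = original strings), so drop the names.
short_names <- unname(abbreviate(long_names, minlength = 12))

print(short_names)
print(nchar(short_names))
```

With a data frame, the same call would be applied to names(df) after cleaning.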

billdenney · 2020-03-07T13:00:33Z

I'll try to pull something like this into the #340 rewrite.

sfirke · 2020-03-07T14:15:24Z

Sounds good, if you think it doesn't add too much for something that already exists in modular form in abbreviate. Maybe just a max length argument, with a note in the docs that finer control of truncation can be achieved through abbreviate?

billdenney · 2020-03-20T14:17:04Z

I realized that I missed this feature in the clean_names() rewrite. What is the advantage of abbreviate() over simply using substr()? Also, abbreviate() does not always seem to shorten the text to the requested length. So, if you were looking for something with <=2 characters, you may not get that.

sfirke · 2020-03-20T14:56:53Z

I'm not finding abbreviate useful for my cases where the variable is a long text string from a survey.

x <- "Please indicate your level of agreement with the following statement: my manager takes actions to make me feel valued."

This is not useful:

> abbreviate(x, minlength = 30, method = "both.sides")
"Plsiyloawtfsttmnt:mmtatmmfvld."

I much prefer:

> paste(stringr::str_sub(x, 1, 15), stringr::str_sub(x, start = -15), sep = "_")
[1] "Please indicate_me feel valued."

I find that a better use of 30 characters. I may be biased by this being my main use case for this problem.
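
The head-plus-tail approach could be wrapped up roughly as follows. This is only a sketch in base R: truncate_middle() and its arguments are hypothetical names, not anything in janitor, and the result is n_head + n_tail characters plus the separator.

```r
# Hypothetical helper: keep the first `n_head` and last `n_tail` characters,
# joined by `sep`. Strings that are already short enough are returned unchanged.
truncate_middle <- function(x, n_head = 15, n_tail = 15, sep = "_") {
  len <- nchar(x)
  too_long <- len > n_head + n_tail + nchar(sep)
  shortened <- paste(
    substr(x, 1, n_head),
    substr(x, len - n_tail + 1, len),
    sep = sep
  )
  ifelse(too_long, shortened, x)
}

x <- "Please indicate your level of agreement with the following statement: my manager takes actions to make me feel valued."
truncate_middle(x)
# [1] "Please indicate_me feel valued."
truncate_middle("short name")
# [1] "short name"
```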

jzadra · 2020-03-20T15:36:14Z

I agree that a truncated version is more useful than a long abbreviation. I played around with abbreviate() a little and couldn't get anything much better.

One thing that occurred to me: are there special things to watch for that are more critical pieces of info in a long colname that we want to try to keep? For instance numerals - "Please indicate your level in 2019", "Please indicate your level in 2020", or capitals, etc?

If so, we could check for these markers and make sure they are always kept.

Another thing would be to combine truncation with abbreviation for a string above some number of chars - truncate the first 15 or 20 keeping the special markers, then abbreviate the rest?
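
As a rough illustration of the "keep the markers" idea, the sketch below truncates a name and then re-attaches any runs of digits found in the part that was cut off. keep_numerals() is a hypothetical helper written in base R, and treating only digits as markers is an assumption made for the example.

```r
# Hypothetical helper: keep the first `n_head` characters, then append any
# digit runs found in the discarded remainder so they are not lost.
keep_numerals <- function(x, n_head = 15, sep = "_") {
  head_part <- substr(x, 1, n_head)
  rest <- substr(x, n_head + 1, nchar(x))
  # One character vector of digit runs per input string (empty if none)
  digits <- regmatches(rest, gregexpr("[0-9]+", rest))
  marker <- vapply(digits, paste, character(1), collapse = sep)
  ifelse(nzchar(marker), paste(head_part, marker, sep = sep), head_part)
}

keep_numerals(c(
  "Please indicate your level in 2019",
  "Please indicate your level in 2020"
))
# [1] "Please indicate_2019" "Please indicate_2020"
```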

billdenney · 2020-03-20T15:50:06Z

@jzadra, I was thinking that if we went down a path with a lot of controls for how abbreviation happens, then we may want another pair of functions (e.g. make_abbrev_names() and abbrev_names(); "abbrev" is specifically abbreviated because that makes me smile). Then, you could run:

data %>%
  abbrev_names([all the controls for abbreviation]) %>%
  clean_names([all the controls for cleaning])

In my mind, the part that fits in make_clean_names() (and therefore clean_names()) is the part that @sfirke suggested: Keep the beginning and end of the name with an underscore in the middle. In general, I think that the beginning and end of the column name contain most of the more critical information. (In your example, "2019" and "2020" are both at the end, and the "Please indicate" would give clarity to a column class.)

One other item I'd suggest: abbreviated names would not be guaranteed to have the requested number of characters, because duplicated column names get a suffix appended (_1, counting up from there). The guarantee would be that before de-duplication, each name has no more than the requested number of characters. Does that seem reasonable?
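
To make the caveat concrete, here is a small sketch in base R. truncate_then_unique() is a hypothetical name, and make.unique() is used only as a stand-in for whatever de-duplication clean_names() applies; the exact suffix scheme may differ.

```r
# Hypothetical helper: cut every name to `max_len` characters, then
# de-duplicate by appending _1, _2, ... to repeated values.
truncate_then_unique <- function(x, max_len) {
  truncated <- substr(x, 1, max_len)
  make.unique(truncated, sep = "_")
}

nms <- c("please_indicate_2019", "please_indicate_2020", "respondent")
out <- truncate_then_unique(nms, max_len = 10)
print(out)
# [1] "please_ind"   "please_ind_1" "respondent"
print(nchar(out))
# [1] 10 12 10
```

The second name ends up longer than max_len, which is the case the guarantee would not cover.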

For my use cases, I find myself often doing the following:

data %>%
  clean_names() %>%
  rename(
    # make the names what I actually want them to be, but
    # start from something known to be ok and unique
  )

jzadra · 2020-03-20T15:59:25Z

That makes sense to me. And I often do something similar to your use case - let clean_names() get them most of the way there and unique, and then modify what needs to be changed manually.

sfirke added the "seeking comments" label (Users and any interested parties should please weigh in - this is in a discussion phase!) Feb 27, 2020

sfirke added this to the v2.0 milestone Mar 7, 2020

sfirke mentioned this issue Mar 7, 2020

argument to clean_names() for maximum length of names? #201

Closed

sfirke self-assigned this Mar 12, 2020

sfirke removed this from the v2.0 milestone Apr 5, 2020
