tidytext::unnest_tokens usage examples

리로7 2022. 5. 9. 23:02

This post collects usage examples for the unnest_tokens function from tidytext, a library for text data analysis. First, define a sample dataset named d.

library(tidytext)
library(dplyr)
library(janeaustenr)

d <- tibble(txt = prideprejudice)
d

# Output
# A tibble: 13,030 × 1
   txt                                                                      
   <chr>                                                                    
 1 "PRIDE AND PREJUDICE"                                                    
 2 ""                                                                       
 3 "By Jane Austen"                                                         
 4 ""                                                                       
 5 ""                                                                       
 6 ""                                                                       
 7 "Chapter 1"                                                              
 8 ""                                                                       
 9 ""                                                                       
10 "It is a truth universally acknowledged, that a single man in possession"
# … with 13,020 more rows

In a simple call to unnest_tokens, the token argument defaults to "words", so the text is split into individual words.

# Both calls below produce the same result.
d %>%
  unnest_tokens(word, txt)

d %>%
  unnest_tokens(output = word, input = txt, token = "words")

# Output
# A tibble: 122,204 × 1
   word     
   <chr>    
 1 pride    
 2 and      
 3 prejudice
 4 by       
 5 jane     
 6 austen   
 7 chapter  
 8 1        
 9 it       
10 is       
# … with 122,194 more rows
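
The output above is all lowercase because unnest_tokens lowercases tokens by default (to_lower = TRUE). As a minimal sketch, the snippet below compares the default with to_lower = FALSE and tries one other built-in option, "ngrams". The one-row tibble small is an assumption made for illustration and is not part of the original example.

```r
library(tidytext)
library(dplyr)

# A one-row sample (hypothetical), so the result does not depend on
# how text is combined across rows.
small <- tibble(txt = "It is a truth universally acknowledged")

# Default: token = "words", and tokens are lowercased.
words_lower <- small %>%
  unnest_tokens(word, txt)

# Keep the original capitalization.
words_cased <- small %>%
  unnest_tokens(word, txt, to_lower = FALSE)

# Two-word sequences; n is passed on to the "ngrams" tokenizer.
bigrams <- small %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)

words_lower$word[1]  # "it"
words_cased$word[1]  # "It"
nrow(bigrams)        # 5 (six words give five bigrams)
```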

You are not limited to the built-in token options: you can also pass your own function as token. There is one constraint, however, which the help page describes as follows.

token
Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "character_shingles", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", "regex", "tweets" (tokenization by word that preserves usernames, hashtags, and URLS ), and "ptb" (Penn Treebank). If a function, should take a character vector and return a list of character vectors of the same length.

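In other words, the function must take a character vector (of any length n) and return a list of the same length n, where each element is a character vector of tokens. This is hard to put into words, but an example makes it clearer. Given a single vector holding two strings, the function returns a single list holding two results. That is the kind of function token expects.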

stringr::str_split(c('alkj b slkdjf', 'skljf b slfffskdjf'), pattern = ' ')

# Output
[[1]]
[1] "alkj"   "b"      "slkdjf"

[[2]]
[1] "skljf"      "b"          "slfffskdjf"
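
The condition can also be checked directly before handing a function to unnest_tokens. The sketch below uses a hypothetical helper, returns_token_list, which is not part of tidytext. It only mirrors the rule quoted from the help page: the result must be a list of character vectors with one element per input string.

```r
# Hypothetical helper (not part of tidytext): TRUE when f(x, ...) returns
# a list of character vectors with one element per input string.
returns_token_list <- function(f, x, ...) {
  out <- f(x, ...)
  is.list(out) &&
    length(out) == length(x) &&
    all(vapply(out, is.character, logical(1)))
}

x <- c("alkj b slkdjf", "skljf b slfffskdjf")

returns_token_list(stringr::str_split, x, pattern = " ")  # TRUE
returns_token_list(strsplit, x, split = " ")              # TRUE
returns_token_list(toupper, x)                            # FALSE: a character vector, not a list
```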

Passing a function that satisfies this condition as token gives the result below.

d %>%
  unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")

# Output
# A tibble: 124,032 × 1
   word       
   <chr>      
 1 "pride"    
 2 "and"      
 3 "prejudice"
 4 ""         
 5 "by"       
 6 "jane"     
 7 "austen"   
 8 ""         
 9 ""         
10 ""         
# … with 124,022 more rows
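
The empty strings in this result come from the blank lines in the text: splitting "" on a space returns "". That is also why the row count (124,032) is higher than with the default word tokenizer (122,204). One way to drop them is an ordinary dplyr filter after tokenizing. This is a minimal sketch on a hypothetical three-row sample, small, not the author's own code.

```r
library(tidytext)
library(dplyr)

# Hypothetical sample that includes one blank line.
small <- tibble(txt = c("PRIDE AND PREJUDICE", "", "By Jane Austen"))

tokens <- small %>%
  unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")

# Drop the empty token produced by the blank line.
tokens_clean <- tokens %>%
  filter(word != "")

nrow(tokens)        # 7, including one ""
nrow(tokens_clean)  # 6
```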

If you pass as token a function that does not return a list with the same length as the input, you get the error shown below.

d %>%
  unnest_tokens(word, txt, token = stringr::str_c, pattern = " ")

# Error
Error: Expected output of tokenizing function to be a list of length 13030
Run `rlang::last_error()` to see where the error occurred.
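
The same failure can be reproduced on a small input and then repaired by wrapping the function so that it returns a list. The sketch below uses base R's toupper, which returns a character vector, and a hypothetical wrapper named split_upper. Both are illustrative assumptions, not part of the original example.

```r
library(tidytext)
library(dplyr)

# Hypothetical two-row sample.
small <- tibble(txt = c("pride and prejudice", "by jane austen"))

# toupper() returns a character vector, not a list, so the length check fails.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  small %>% unnest_tokens(word, txt, token = toupper),
  error = function(e) conditionMessage(e)
)
msg

# Hypothetical wrapper: returns one character vector per input string.
split_upper <- function(x) strsplit(toupper(x), " ", fixed = TRUE)

# to_lower = FALSE keeps the uppercase tokens produced by the wrapper.
fixed <- small %>%
  unnest_tokens(word, txt, token = split_upper, to_lower = FALSE)
fixed$word
```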

License: Attribution, NonCommercial, NoDerivatives
