All Questions
23 questions
0 votes · 0 answers · 53 views
Issues with Sokuon Conversion in pykakasi Library for Japanese to Romaji Translation in Python
I am attempting to use the pykakasi library in Python to convert Japanese text to Romaji. However, I am encountering issues with the conversion of sokuon (促音). Here is the code I am using:
import ...
1 vote · 2 answers · 1k views
Python: bold characters are duplicated when extracting text from a PDF with pdfplumber/pdfminer
Goal: extract the text of Chinese financial reports
Implementation: use the Python pdfplumber/pdfminer packages to extract PDF text to a txt file
Problem: text that is bold in the PDF appears duplicated in the extracted txt output
...
1 vote · 1 answer · 447 views
Pinyin packages: accuracy and efficiency
I am looking to get the pinyin of Simplified Mandarin characters, and have come across two packages:
pinyin 0.4.0 which is 6 years old (GitHub repo here)
pinyin_jyutping_sentence which is 2> years ...
1 vote · 0 answers · 168 views
How to parse and encode Chinese Characters in Jupyter Notebook?
I want to train a really basic NLP model on Chinese characters, but read_csv doesn't really work.
I was also wondering if there is any way to extract the different parts of the character, like for ...
2 votes · 2 answers · 753 views
Tokenizing Chinese text with keras.preprocessing.text.Tokenizer
keras.preprocessing.text.Tokenizer doesn't work correctly with Chinese text. How can I modify it to work on Chinese text?
from keras.preprocessing.text import Tokenizer
def fit_get_tokenizer(data, ...
0 votes · 2 answers · 60 views
Identify elements in a specific language, e.g. Chinese
I have a dataset that, simplified, looks similar to this:
call_id <- c("001","002","003","004","005","012","024")
transcript <- ...
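A minimal sketch of one way to flag transcripts that contain Chinese script, written in Python rather than the R used in the question. The name `contains_han` and the sample transcripts are hypothetical, and the check covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). Japanese kanji share that block, so this detects Han script rather than the Chinese language as such.

```python
import re

# Assumption: "Chinese" is approximated by the basic CJK Unified Ideographs
# block. Extension blocks and Japanese kanji are not distinguished.
HAN_RE = re.compile(r"[\u4e00-\u9fff]")


def contains_han(text):
    """Return True if the text contains at least one Han character."""
    return HAN_RE.search(text) is not None


# Hypothetical sample data, not taken from the question.
transcripts = ["hello, how can I help?", "你好，请问有什么可以帮您？"]
flags = [contains_han(t) for t in transcripts]
```

The same character class should carry over to regex functions in other languages that support Unicode escapes.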
0 votes · 0 answers · 255 views
Spell Check/DidYouMean for Japanese language
Looking for ideas for implementing Spellcheck/DidYouMean for the Japanese language (mostly).
The target for spellcheck is search queries; the search engine is built on Solr, but the solution is not bound to ...
-4 votes · 5 answers · 239 views
Is there an R function to extract number amounts from a string of Chinese characters?
I have a character vector like d:
d <- c("您尾号1234卡11月11日00:03转入人民币1,500.00元,余额人民币1,501.12元",
"您尾号3256卡11月11日00:03转出人民币678.12元,余额人民币1,501.12元",
"您尾号7894卡11月11日00:03取现0....
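One possible approach, sketched in Python rather than R: a regular expression that matches only amounts written with two decimal places, which skips the card number, date, and time in the sample messages. The name `extract_amounts` and the two-decimal assumption are illustrative and not taken from the question.

```python
import re
from decimal import Decimal

# Assumption: every amount has exactly two decimal places, optionally with
# comma thousands separators (e.g. 1,500.00 or 678.12).
AMOUNT_RE = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2}")


def extract_amounts(message):
    """Return all monetary amounts in the message as Decimal values."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(message)]


# The two complete sample messages from the question.
messages = [
    "您尾号1234卡11月11日00:03转入人民币1,500.00元,余额人民币1,501.12元",
    "您尾号3256卡11月11日00:03转出人民币678.12元,余额人民币1,501.12元",
]
amounts = [extract_amounts(m) for m in messages]
```

Decimal is used instead of float so that currency values keep their exact two-digit precision.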
0 votes · 1 answer · 103 views
Where to find a Japanese-Chinese dictionary resource
Hey, I am trying to provide Japanese-Chinese translation functionality for my project. I have found Rikaichan, a Chrome plugin that provides pop-up Japanese-English translation. Rikaichan ...
1 vote · 2 answers · 1k views
Module import issue with a Japanese Tokenizer
I am trying to get the JapaneseTokenizer working in Python, but I am having trouble with one of the modules it depends on. Here is the trace of the errors I am getting:
/Users/home/PycharmProjects/...
0 votes · 2 answers · 659 views
RASA: how to use Japanese (tokenization with MeCab)
RASA is known to be an effective bot framework.
Parts of the stack such as RASA NLU and RASA Core are really useful.
After trying it hands-on, I found that it is amazing, especially with English text. I give another ...
2 votes · 2 answers · 1k views
How to split CJK text into words?
I use JavaScript to create a transliteration. I am wondering whether it is possible to split CJK text into a sequence of words, defined according to some word segmentation standard. Any alternative?
...
6 votes · 3 answers · 5k views
spaCy Japanese Tokenizer
I am trying to use spaCy's Japanese tokenizer.
import spacy
Question= 'すぺいんへ いきました。'
nlp(Question.decode('utf8'))
I am getting the error below:
TypeError: Expected unicode, got spacy.tokens.token....
2 votes · 1 answer · 528 views
C# Japanese morphological analyzers
I can't find any Japanese morphological analyzers for C#. Can anyone please suggest one?
1 vote · 2 answers · 403 views
Determine whether a romanized name is Japanese or not, preferably in Ruby
How can I determine whether a romanized name is likely, or unlikely, to be a Japanese name?
"Yukihiro Matsumoto".likely_to_be_japanese? # => true
"John Smith".likely_to_be_japanese? # => false
...