This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

sudachipy is able to work as python module. #2

Closed


Kensuke-Mitsuzawa

Currently sudachipy does not work as a Python module because of the following issues.

  1. Importing the package fails because of its relative import statements.
  2. setup.py is incomplete: it omits some module directories.
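To illustrate the first point: the thread does not say which import statements fail, so the following is only a sketch under the assumption that the culprit is Python 2 style implicit relative imports (e.g. a bare `import config` inside the package), which Python 3 no longer resolves against the package. The package name `demo_pkg` and its modules are hypothetical; the sketch builds a throwaway package that uses the explicit form `from . import config` and imports it.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a throwaway package named demo_pkg (hypothetical) under root."""
    pkg = root / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "config.py").write_text("SETTING = 'default'\n")
    # Explicit relative import: resolved against the enclosing package,
    # so it works no matter where the package is installed.
    (pkg / "tokenizer.py").write_text(
        "from . import config\n"
        "\n"
        "def setting():\n"
        "    return config.SETTING\n"
    )


def import_demo_tokenizer() -> str:
    """Import demo_pkg.tokenizer from a temporary directory and call it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_package(root)
        sys.path.insert(0, str(root))
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_pkg.tokenizer")
            return module.setting()
        finally:
            # Undo the path change and drop the cached demo modules.
            sys.path.remove(str(root))
            for name in list(sys.modules):
                if name == "demo_pkg" or name.startswith("demo_pkg."):
                    del sys.modules[name]
```

If `tokenizer.py` instead contained a bare `import config`, the import would only succeed when the package directory itself happened to be on `sys.path`, which is not the case after a normal install.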

Placing the system dictionary by hand is also dull work. It can be automated with a Makefile.

By the way, sudachipy does not seem to support mode "C". I expect developers would welcome this mode, and I hope it will come some day :)

@sorami sorami self-requested a review May 8, 2018 12:00
@sorami
Collaborator

sorami commented May 8, 2018

Thank you for your comments and code!

Using SudachiPy as a Python module

Does it still fail after the $pip install -e . installation step? My colleagues and I have confirmed that it works fine with that step.

Yes, $pip install -e . is not the final form of usage. We aim to develop it into a stable version, then register it on PyPI so that you can install it like any other public Python library with $pip install sudachipy (still under development ...).

Downloading and locating dictionary file

Yes, I totally agree with you that placing the system dictionary manually is dull work.

Your Makefile method works, but what we are planning is to do this step from the code itself, similar to NLTK (e.g., import nltk; nltk.download()) or spaCy (e.g., $python -m spacy download en). That's our goal; you currently need to do that dull work because we haven't implemented that part yet ... Sorry about that.
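As a rough illustration of what such an in-code download step could look like, here is a minimal sketch. It is not SudachiPy's API: the function name `download_dictionary` and all of its parameters are hypothetical, and the caller has to supply the real dictionary URL and target directory, neither of which is given in this thread.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dictionary(url: str, dest_dir: Path, filename: str) -> Path:
    """Fetch a dictionary file into dest_dir unless it is already there.

    Hypothetical helper: url, dest_dir and filename are all chosen by the caller.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / filename
    if target.exists():
        # Already downloaded; skip the network round trip.
        return target

    # Write to a temporary name first so an interrupted download
    # never leaves a truncated dictionary under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target
```

A call would look like `download_dictionary(url, resources_dir, "system_core.dic")`, where `url` and `resources_dir` are placeholders for values the project would have to define.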

C mode splitting

I suspect that this is not about the code but the dictionary.

We have core and full dictionaries, and different versions of each as we update the vocabulary. The problem is that sometimes you cannot reproduce the examples in the documentation.

Say, with this setup,

import json

from sudachipy import tokenizer
from sudachipy import dictionary
from sudachipy import config

with open(config.SETTINGFILE, "r", encoding="utf-8") as f:
    settings = json.load(f)
tokenizer_obj = dictionary.Dictionary(settings).create()

With the current system_full.dic,

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬品安全管理責任者']

But with the current system_core.dic, the result is the following, because the word is not in the dictionary.

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize(mode, "医薬品安全管理責任者")]
# => ['医薬品', '安全', '管理', '責任者']
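To check this kind of difference side by side, a small helper along the following lines might be handy. It is only a sketch: `surfaces` and `compare_dictionaries` are hypothetical names, and the code relies on nothing beyond the `tokenize(mode, text)` and `surface()` calls shown above. How each tokenizer object is pointed at a particular dictionary file is not covered in this thread, so that part is left to the caller.

```python
from typing import Any, Dict, List


def surfaces(tokenizer_obj: Any, mode: Any, text: str) -> List[str]:
    """Return the surface strings produced by one tokenizer for one mode."""
    return [m.surface() for m in tokenizer_obj.tokenize(mode, text)]


def compare_dictionaries(
    tokenizers: Dict[str, Any], mode: Any, text: str
) -> Dict[str, List[str]]:
    """Tokenize the same text with several tokenizers, keyed by a label.

    tokenizers maps a label such as "full" or "core" to a tokenizer object
    that was created with the corresponding dictionary.
    """
    return {
        label: surfaces(tokenizer_obj, mode, text)
        for label, tokenizer_obj in tokenizers.items()
    }
```

Called as `compare_dictionaries({"full": full_tokenizer, "core": core_tokenizer}, mode, "医薬品安全管理責任者")`, it returns one list of surfaces per label, which makes dictionary-dependent splits easy to spot.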

We hear about the same issue from various people (e.g., Clarify the definition of core and non_core lexicon · Issue #34 · WorksApplications/Sudachi). We are sorry for the confusion, and we will tidy up the documentation so that it corresponds to the actual situation.

Misc

Yes, I think we need to add a package_data entry to setup.py, as you suggested.
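For reference, such a setup.py could look roughly like the sketch below. The `resources/*` pattern, the excluded `tests` directory, and the helper name `setup_kwargs` are assumptions made for illustration; the project's actual layout is not shown in this thread.

```python
# Sketch of the relevant setup.py pieces (names and patterns are assumptions).

# Assumption: non-Python files such as the settings file live under
# sudachipy/resources/ and must be shipped with the package.
PACKAGE_DATA = {"sudachipy": ["resources/*"]}


def setup_kwargs() -> dict:
    """Collect the keyword arguments that would be passed to setuptools.setup()."""
    # Imported lazily so the constant above can be inspected without setuptools.
    from setuptools import find_packages

    return {
        "name": "SudachiPy",
        # find_packages() picks up every directory containing __init__.py,
        # so subpackages are not missed when new ones are added.
        "packages": find_packages(exclude=["tests", "tests.*"]),
        "package_data": PACKAGE_DATA,
    }


# In setup.py itself:
#     from setuptools import setup
#     setup(**setup_kwargs())
```

Using `find_packages()` rather than a hand-written package list is one way to avoid the missing module directories mentioned in the PR description.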

Adding an example.py (or writing one in README.md) would be a nice addition.

So I would like to close this PR, but we are very thankful to you for raising these issues in public, and we more than welcome questions or finer-grained PRs.

@sorami sorami closed this May 8, 2018

izziiyt pushed a commit that referenced this pull request Jul 7, 2019