I have many plain text files in a folder in my workspace. I want to search through them all at once.
Right now, I am creating Notes and copy-pasting the contents of each file into its respective Note.
Is there a better way to do it?
Thank you!
Thank you for the excellent question! The unfortunate short answer is: no, at the moment neither the DMX platform nor its Webclient supports searching file contents. Actually we at DMX have had this wish for quite a while, not only for plain text, but PDF, DOC … as well. Technically this is quite easy using e.g. Apache Tika (we had already implemented it once in DM3):
https://tika.apache.org/
Are you a Java/JavaScript developer? I could support you in developing a file-search plugin for the DMX platform. It would a) provide a RESTful backend service, and b) extend the DMX Webclient, e.g. with new commands/dialogs etc. See the (still incomplete) DMX Developer Guide and our plugin starter project:
DMX Developer Guide — DMX documentation
https://github.com/dmx-systems/dmx-plugin-template
Welcome to DMX!
Hi Jörg!
Thank you for the quick reply and for all the work you and your teammates do on DMX! I like your idea very much!
Actually we at DMX have had this wish for quite a while, not only for plain text, but PDF, DOC … as well.
Yeeeeeeesss!
What about audio/video?
Imagine DMX being able to automatically make connections/associations between Named Entities (PROPN, LOC, NOUN) which Spacy (for example) could generate from text documents and transcribed audio/video files?!
PROCESS:
.mp4, .wav → speech to text (subtitles .srt) → spacy → automatic ingestion into DMX
This is actually what I am trying to build.
Right now, I am writing small Python scripts which automate different parts of the PROCESS (converting to text, speech-to-text, .spacydoc generation). I love the results I get from Spacy! It gives me enough motivation to continue fighting the programming part.
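To illustrate one step of the PROCESS above, here is a minimal sketch of turning .srt subtitles into plain text that could then be handed to Spacy. It uses only the Python standard library. The function name srt_to_text and the timing regex are assumptions made for this example, not part of any existing script or of DMX, and it only handles the common "index, timing line, text" cue layout.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Drop cue numbers and timing lines from SRT content, keep the spoken text.

    Hypothetical helper for illustration only.
    """
    kept = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Drop the numeric cue index if present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        # Drop the timing line if present.
        if lines and TIMING.match(lines[0]):
            lines = lines[1:]
        kept.extend(lines)
    # Join everything into one string, ready for further text processing.
    return " ".join(kept)
```

The returned string is ordinary text, so it could be fed into whatever NLP step follows, for instance a Spacy pipeline that extracts entities.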
I have experience in both Java and JavaScript, but I am a pretty bad programmer and am only just discovering the DMX Developer Guide…
How difficult would it be to make such a plugin for DMX? What do you think would be the main challenges?
Any help would be much appreciated!
Thanks for tika!
Hi gridbard,
that's so great to hear that you have development skills and are keen to accept the plugin challenge! You have a very interesting project here, I think, and we could architect it as individual plugins, reusable by other DMX applications, e.g. I would keep the content search aspect independent from the Spacy aspect.
What do you think would be the main challenges?
From my perspective these are the main challenges:
- To come up with a first, very simple but working solution soon, as the basis for all following versions, instead of wanting too much too early. For example, I would do the file content search before utilizing Spacy and, regarding search, I would do plain text before utilizing Tika.
- What could the file content search GUI look like in the DMX Webclient? How would the user use it? How are search results presented and navigated? Answers are constrained by the actual extension points the DMX Webclient currently provides. (An alternative would be to build a custom frontend independently of the Webclient, but at the moment I think we don't want that.) In a later phase I can also consider adding new extension points to the Webclient, but for the first search version we have to live with what is there already, I guess.
- How can we separate the application into independent, reusable modules/services? E.g. there could be one module dmx-file-search (which might depend on Tika), and one dmx-spacy module, bridging the Spacy (Python) library. Note: in DMX there is actually no distinction between an application and a module, everything is a plugin. A DMX plugin can contain both a (RESTful) backend service (and a data model) and frontend components for the Webclient.
- DMX challenges you with quite a learning curve. There is its particular property-less associative data model, its server-side (Java) API, and all the Webclient (JavaScript, Vue.js) extension points. Unfortunately the DMX Developer Guide is incomplete, in particular regarding the Webclient. On the other hand, there are quite a lot of real-world example projects available for you to explore. Of course, I'll try to provide you with all the support you need here in the forum.
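To make the "plain text first" idea from the first challenge concrete, here is a minimal sketch of a content search over a folder of .txt files. It is written in Python (the language already used for the scripts in this thread) and runs outside of DMX. It is not the DMX API and not how a DMX plugin would be structured. The function name search_text_files and its parameters are assumptions made for illustration.

```python
from pathlib import Path


def search_text_files(folder, query, pattern="*.txt"):
    """Return (file name, line number, line) for every line containing query.

    Hypothetical helper for illustration only. Matching is a simple
    case-insensitive substring test, with no index and no ranking.
    """
    needle = query.lower()
    hits = []
    # rglob also descends into subfolders; sorted() keeps the output stable.
    for path in sorted(Path(folder).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits
```

Calling search_text_files("my-folder", "tika") would list every matching line together with its file name and line number. A linear scan like this is fine for a first version; a real fulltext index only becomes interesting once the number of files grows.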
Resources:
- A starter project demonstrating how to extend the DMX Webclient: https://github.com/dmx-systems/dmx-plugin-template
- A fully-fledged custom frontend and backend service: https://github.com/dmx-systems/dmx-zukunftswerk
These are my first thoughts so far. Let us know what you think.
Cheers!
Hi @gribard!
Since you have not sent any update, I am just curious to learn whether @jri's comment was of any use to you. Were you able to make any progress on your project? Do you need further help?
Thank you very much in advance!
Juergen
Meanwhile there are 2 new DMX plugins available which allow searching PDF files by content within the DMX Webclient. For developers, making the fulltext search available for .txt files as well would be quite an easy task.
About the 2 new plugins:
- dmx-pdf-search - fulltext search for PDF files, works transparently in DMX Webclient’s search dialog
- dmx-tesseract - integrates the Tesseract OCR engine, works together with the dmx-pdf-search plugin
The custom fulltext index feature requires DMX 5.3.4, which was released yesterday. See the Changelog.
I’m available for questions and support.