1,465
questions
-1
votes
0
answers
34
views
How can I extract text with coordinates from a PDF using JavaScript? [closed]
I’m working on a project where I need to extract text along with its coordinates from a PDF document using JavaScript. I’m familiar with PDF.js, but I’m unsure how to use it to get the coordinates of ...
0
votes
1
answer
74
views
Ordering of text from docx documents in document.xml
I am trying to extract text from docx files, but the text comes out in a jumbled order: text at the bottom of the page or in an arbitrary text box is extracted first, and then the text from ...
1
vote
0
answers
39
views
Extracting text from PDF files with different structures fails: only a portion of the text is extracted. How do I do it properly?
I am trying to extract text from CVs in PDF format. I came up with this script, but it does not extract all the text, and I have trouble identifying the different blocks of the ...
0
votes
0
answers
47
views
Capturing Formatted Numbering from DOCX Files in Python
I'm working on a Python project where I need to extract text from DOCX files, preserving the formatted numbering. I've encountered a peculiar issue that I'm hoping someone can help me solve.
The ...
0
votes
0
answers
43
views
403 Clients Error: Forbidden for url: https://something.org/anotherthing/more%20things%20andfile.pdf
I was scraping a website and tried to open a URL to a PDF file to extract text from its pages. Unfortunately, I keep getting the following error message:
403 Clients Error: Forbidden for url: https://...
0
votes
0
answers
59
views
Guidance on Extracting Compliance Items from PDF documents by fine-tuning an LLM
I need some guidance on extracting large compliance items from raw PDF documents. I have a CSV with these compliance items, and I want to fine-tune an LLM so that when it reads any new PDF document it can ...
3
votes
1
answer
83
views
Parsing formulas efficiently using regex and Polars
I am trying to parse a series of mathematical formulas and need to extract variable names efficiently using Polars in Python.
Regex support in Polars seems to be limited, particularly with look-around ...
-1
votes
1
answer
51
views
Extracting Text from PDFs with Python Without Including Comments
I have been trying to extract text from PDF files to automate a significant and tedious part of my job using Python. With the help of ChatGPT, I have written multiple lines of code. However, I am ...
1
vote
1
answer
92
views
Accurately Detecting randomly rotated Text in Images
I'm trying to detect text from items, which may be rotated in various directions. I've tried using Tesseract, EasyOCR, and EAST for text detection and extraction, but I am encountering issues with ...
1
vote
0
answers
53
views
AWS Textract With AWS Signature Version 4 Using Go Lang
I have 3 credentials:
host
acckey
secretkey
These come from AWS. I am using the AWS Signature Version 4 method.
I want to use the Textract feature from AWS with Golang. I have built the code and have a ...
0
votes
2
answers
79
views
How to convert a string in Python to separate strings [closed]
I have a pandas dataframe with only one column containing symbols. I need to separate those symbols in groups of 13 and 39 inside a single string.
symbol
3IINFOTECH
3MINDIA
3PLAND
20MICRONS
3RDROCK
...
0
votes
0
answers
51
views
How to extract data from a PDF along with its position?
Currently, I'm using Google's Vision AI to extract information about dates and prices from PDF files. I proceed with the following steps:
Extract the text from the PDF. The result received from ...
0
votes
0
answers
35
views
Extracting structured data from user query
I want to extract structured data from the query provided by the user.
For example,
user query: I need data for females above the age of 3
output : {
    min_age: 3,
    max_age: None,
    sex: female
}
These are ...
1
vote
0
answers
77
views
Improving OCR accuracy with pytesseract for processing manga images
def get_string(img_path):
    img = cv2.imread(img_path)
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
...
2
votes
0
answers
115
views
lopdf Rust PDF - only getting text
I'm brand new to Rust and am trying to read a PDF file with lopdf.
I've been trying out various examples, but I am just getting characters. I need all the characters, such as spaces, tabs, and line breaks, for regex.
Is ...