-1 votes
0 answers
34 views

How can I extract text with coordinates from a PDF using JavaScript? [closed]

I’m working on a project where I need to extract text along with its coordinates from a PDF document using JavaScript. I’m familiar with PDF.js, but I’m unsure how to use it to get the coordinates of ...
Vikram Ray's user avatar
  • 1,164
0 votes
1 answer
74 views

Order of text from docx documents in document.xml

I am trying to extract text from docx files, but the extracted text comes out in the wrong order: text at the bottom of the document or in a random text box is extracted first, and then the text from ...
vignesh's user avatar
1 vote
0 answers
39 views

Extracting text from a PDF file with different structures fails: only a portion of the text is extracted

I am trying to extract text from a CV in PDF format. I came up with this script, but I have a problem. The script does not extract all the text, and I have trouble identifying the different blocks of the ...
emma's user avatar
  • 343
0 votes
0 answers
47 views

Capturing Formatted Numbering from DOCX Files in Python

I'm working on a Python project where I need to extract text from DOCX files, preserving the formatted numbering. I've encountered a peculiar issue that I'm hoping someone can help me solve. The ...
Anshuman Sharma's user avatar
0 votes
0 answers
43 views

403 Client Error: Forbidden for url: https://something.org/anotherthing/more%20things%20andfile.pdf

I was scraping a website and tried to open a URL to a PDF file to extract text from its pages. Unfortunately, I keep getting the following error message: 403 Client Error: Forbidden for url: https://...
Bacha's user avatar
  • 11
0 votes
0 answers
59 views

Guidance on Extracting Compliance Items from PDF documents by fine-tuning an LLM

I need some guidance on extracting large compliance items from raw PDF documents. I have a CSV file with these compliance items, and I want to fine-tune an LLM such that if it reads any new PDF documents it can ...
Daremitsu's user avatar
  • 643
3 votes
1 answer
83 views

Parsing formulas efficiently using regex and Polars

I am trying to parse a series of mathematical formulas and need to extract variable names efficiently using Polars in Python. Regex support in Polars seems to be limited, particularly with look-around ...
Oyibo's user avatar
  • 97
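For the formula-parsing question above, one way to avoid look-around is to match slightly more than needed and filter afterwards. The following is a minimal sketch using only Python's standard `re` module, not Polars itself; the names `IDENT` and `variable_names` are hypothetical, and it assumes variables are plain identifiers and that any identifier directly followed by `(` is a function call rather than a variable.

```python
import re

# Capture an identifier plus an optional "(" that follows it.
# No look-ahead is used: the "(" is captured and checked afterwards.
IDENT = re.compile(r"\b([A-Za-z_]\w*)\b\s*(\(?)")


def variable_names(formula):
    """Return the distinct variable names in a formula, in order of first use."""
    seen = []
    for name, paren in IDENT.findall(formula):
        if paren:
            # Identifier followed by "(" is treated as a function name.
            continue
        if name not in seen:
            seen.append(name)
    return seen


print(variable_names("a + b2 * sin(c_1) - 3.5*a + 1e5"))
```

The leading `\b` keeps the exponent in a literal such as `1e5` from being read as a variable called `e5`. Whether this filtering step fits a given Polars pipeline depends on details the excerpt does not show.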
-1 votes
1 answer
51 views

Extracting Text from PDFs with Python Without Including Comments

I have been trying to extract text from PDF files to automate a significant and tedious part of my job using Python. With the help of ChatGPT, I have written multiple lines of code. However, I am ...
MDMT's user avatar
  • 1
1 vote
1 answer
92 views

Accurately Detecting randomly rotated Text in Images

I'm trying to detect text from items, which may be rotated in various directions. I've tried using Tesseract, EasyOCR, and EAST for text detection and extraction, but I am encountering issues with ...
Agura's user avatar
  • 11
1 vote
0 answers
53 views

AWS Textract With AWS Signature Version 4 Using Go Lang

I have three credentials from AWS: host, acckey, and secretkey. I am using the AWS Signature Version 4 method, and I want to use the Textract feature from AWS with Golang. I have built the code and have a ...
Hafi Ihza Farhana's user avatar
0 votes
2 answers
79 views

How to convert a string in python to separate strings [closed]

I have a pandas dataframe with only one column containing symbols. I need to separate those symbols in groups of 13 and 39 inside a single string. symbol 3IINFOTECH 3MINDIA 3PLAND 20MICRONS 3RDROCK ...
Hamza Ahmed's user avatar
  • 1,751
0 votes
0 answers
51 views

How to extract data from a PDF and their position?

Currently, I'm using Google's Vision AI to extract information about dates and prices from pdf files. I proceed with the following steps: Extract the text from the PDF. The result received from ...
Đạt Vũ Trọng's user avatar
0 votes
0 answers
35 views

Extracting structured data from user query

I want to extract structured data from the query provided by the user. For example, user query: I need data for females above the age of 3 output : { min_age: 3, max_age: None, sex: female } These are ...
llms_query's user avatar
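The example in the question above (a query mapped to `min_age`, `max_age`, and `sex`) can be sketched as a small rule-based baseline. This is only an illustration under stated assumptions, not the asker's method: the function name `parse_query` and the keyword lists are hypothetical, and free-form queries would likely need something more robust than regular expressions.

```python
import re


def parse_query(query):
    """Extract min_age, max_age and sex from a simple English query."""
    text = query.lower()
    result = {"min_age": None, "max_age": None, "sex": None}

    # Lower bound: "above 3", "over the age of 3", "older than 3".
    lower = re.search(r"\b(?:above|over|older than)\b(?: the age of)?\s+(\d+)", text)
    if lower:
        result["min_age"] = int(lower.group(1))

    # Upper bound: "below 10", "under the age of 10", "younger than 10".
    upper = re.search(r"\b(?:below|under|younger than)\b(?: the age of)?\s+(\d+)", text)
    if upper:
        result["max_age"] = int(upper.group(1))

    # Check the female keywords first; word boundaries keep "male"
    # from matching inside "female".
    if re.search(r"\b(?:females?|women|woman|girls?)\b", text):
        result["sex"] = "female"
    elif re.search(r"\b(?:males?|men|man|boys?)\b", text):
        result["sex"] = "male"

    return result


print(parse_query("I need data for females above the age of 3"))
```

Fields with no matching phrase stay `None`, which mirrors the `max_age: None` in the question's expected output.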
1 vote
0 answers
77 views

Improving OCR accuracy with pytesseract for processing manga images

def get_string(img_path):
    img = cv2.imread(img_path)
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ...
Myat Thet's user avatar
2 votes
0 answers
115 views

lopdf Rust PDF - only getting text

I'm brand new to Rust and am trying to read a PDF file with lopdf. I've been trying out various examples, but I am just getting characters. I need all the chars, like spaces, tabs, line breaks, etc., for regex. Is ...
diogenes's user avatar
  • 2,067
