I have a task that is to convert cable schedules into an Excel spreadsheet. I have tried a few different OCR (Optical Character Recognition) approaches such as websites, code in R (tesseract), JavaScript, data from picture in Excel, and I have looked into C.

I believe the main 3 problems I have are:

- If there are multiple rows with the same information in a column, there is just an arrow pointing down for however long it is.
- Many of the tools I have tried seem to think that the table randomly splits into 2 columns.
- The layout of the documents might be an issue, as it is in the format of an engineering drawing that has been exported from AutoCAD with the reference grid all around it.

It is a non-editable PDF that was made by basically drawing it in AutoCAD, so it doesn't technically contain text, which is why I am trying OCR. The layout of the exported Excel file doesn't matter as long as the data is right, as I can just manually copy and paste the columns into the correct format.

I have approximately 350 of these to do and only December/January to do it. Between searching for other methods and entering data manually, I can complete about 4 every day, assuming no distractions. As those numbers show, that does not fit the timeline, which is why I am asking whether anyone here knows of any options.

What I have tried so far:

1. They either output a corrupted table or just add the table as an image in Excel.
2. I don't have access to this, and I haven't pursued access, because Adobe's online version has the same issues as 1.
3. It converts everything wrong. Even with basic words like "TRAY" it fails to get a single letter correct.
4. This method was found on Stack Overflow, however it struggles with the layout of the cable schedule and is only able to extract a few parts. This may still be viable, as I am not very experienced with either R or tesseract.
5. Basically makes hieroglyphs from the document.
6. This is what I have been using to extract the larger columns, however it struggles with some of the smaller columns, so about half of it still needs to be entered manually.

Below is an example of what I am trying to extract. It is mainly the data inside the red box that I need to extract; all the cable schedules are in this format.

We do not answer questions for finding or using programming tools on this site. However, from a software engineering point of view, I think the main question is how much automation is worth achieving, and for which parts of the process, to either keep the total effort small or to hold the deadline.

On the one end of the spectrum, there is the idea to find or create the ultimate OCR tool, where one just feeds a PDF into a black box and gets a perfectly ready-made Excel document out of it. On the other end of the spectrum, one does not invest any time in automation and tries to solve the problem manually, maybe by assigning a lot of people to the task (maybe some low-price workers found over some crowdworking portal).

Neither approach is likely to be ideal: the first one is technically unrealistic, the second one will probably cause too much management effort. Coordinating a bigger group of people for this task requires finding them, making a contract, teaching them the requirements, applying quality assurance, and many more things which can in the end cost more time and money than they save. Given your current timeline, Brooks's law is likely to kill the project.

So your best option is probably to find a middle ground. My idea here would be to solve parts of the task automatically, and some parts of it manually:

- Start with a document template which is already formatted correctly and contains all repeating text, like the headlines.
- If the template needs some variation, write a VBA macro which applies this variation.
- Use existing OCR software which allows extracting text from a rectangular frame one draws interactively over a column in the PDF, then copy the text into the Excel template.
- Implement a VBA macro for easier drawing / redrawing of the vertical arrows.

And now think further in this direction after creating the first few tables: are there more tedious tasks which you can automate here? Maybe some consistency checking / quality assurance tasks? But always keep the classic XKCD on automation in mind.
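The down arrows in the cable schedules, which stand for "same value as the row above", are one example of a small step that could be automated once a column has been extracted. The following is only a minimal sketch in Python, not the VBA macro mentioned in this thread. It assumes the extracted table is already available as a list of rows of strings and that arrows show up as a `↓` character; the marker set, the function names `fill_down` and `rows_with_wrong_width`, and the sample values are all hypothetical.

```python
# Hypothetical marker set: whatever the OCR step emits for a down arrow.
ARROW_MARKERS = frozenset({"↓"})


def fill_down(rows, markers=ARROW_MARKERS):
    """Replace arrow cells with the value from the same column one row up."""
    filled = []
    previous = None  # last fully resolved row
    for row_number, row in enumerate(rows, start=1):
        new_row = []
        for col, cell in enumerate(row):
            value = cell.strip()
            if value in markers:
                # An arrow needs a resolved cell directly above it.
                if previous is None or col >= len(previous):
                    raise ValueError(
                        f"row {row_number}, column {col + 1}: "
                        "arrow has no value above it"
                    )
                value = previous[col]
            new_row.append(value)
        filled.append(new_row)
        previous = new_row
    return filled


def rows_with_wrong_width(rows, expected_columns):
    """Simple consistency check: 1-based numbers of rows whose column count is off."""
    return [
        number
        for number, row in enumerate(rows, start=1)
        if len(row) != expected_columns
    ]


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = [
        ["C-101", "TRAY", "4mm"],
        ["C-102", "↓", "↓"],
        ["C-103", "↓", "6mm"],
    ]
    for line in fill_down(sample):
        print(line)
    print(rows_with_wrong_width(sample + [["C-104", "TRAY"]], 3))
```

The width check is the kind of cheap quality-assurance pass that can flag rows where a tool has split or merged columns, so that only those rows need a manual look.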