
    Automated Phrase Mining from Massive Text Corpora


    Abstract

    Phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications, including information extraction/retrieval, taxonomy construction, and topic modelling. Most existing methods rely on complex, trained linguistic analysers, and are therefore likely to perform poorly on text corpora from new domains and genres without extra, expensive adaptation. None of the state-of-the-art models, even the data-driven ones, is fully automated, because they require human experts to design rules or label phrases. In this project, a novel framework for automated phrase mining, AutoPhrase, was proposed; it supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Moreover, AutoPhrase can be extended to model single-word quality phrases.


    Existing System

    In the existing system, topic modelling was performed on phrases by first mining phrases, segmenting each document into single- and multi-word phrases, and then using the constraints from segmentation in the topic model. First, to address the scalability issue, an efficient phrase mining technique extracts frequent significant phrases and segments the text simultaneously. It uses frequent phrase mining and a statistical significance measure to segment the text while filtering out false candidate phrases. Second, to ensure a systematic method of assigning latent topics to phrases, a simple but effective topic model is proposed: by restricting all constituent terms within a phrase to share the same latent topic, a phrase can be assigned the topic of its constituent words.

    For example, by frequent phrase mining and context-specific statistical significance ranking, the following titles can be segmented as follows. Title 1: [mining frequent patterns] without candidate generation: a [frequent pattern tree] approach. Title 2: [frequent pattern mining]: current status and future directions. The tokens grouped together are constrained to share the same topic assignment. After merging "frequent" and "pattern", it is only necessary to test whether "frequent pattern tree" is a significant phrase in order to decide whether to keep "frequent pattern" as a phrase in a given title.

    This TopMine method has the following advantages. Its phrase mining algorithm efficiently extracts candidate phrases and the aggregate statistics needed to prune them. Requiring no domain knowledge or specific linguistic rule sets, the method is purely data-driven, and it allows efficient and accurate filtering of false candidate phrases. Segmentation induces a bag-of-phrases representation for documents, which is incorporated as a constraint into the topic model, eliminating the need for additional latent variables for the phrases. The model complexity is reduced and the conformity of topic assignments within each phrase is maintained.
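The merge-and-test loop described above can be sketched as follows. This is a minimal illustration of significance-guided agglomerative merging, not the actual TopMine implementation: the threshold `alpha`, the t-statistic-style score, and the use of only boundary tokens between multi-word units are simplifying assumptions.

```python
from collections import Counter

def significance(count_xy, count_x, count_y, total):
    """T-statistic-style significance of the adjacent pair (x, y):
    (observed - expected) / sqrt(observed), assuming token independence."""
    if count_xy == 0:
        return 0.0
    expected = count_x * count_y / total
    return (count_xy - expected) / count_xy ** 0.5

def segment(tokens, docs, alpha=1.0):
    """Greedily merge the most significant adjacent pair of units until no
    pair exceeds alpha, yielding a bag-of-phrases segmentation of `tokens`.
    Counts are gathered from `docs`, a list of tokenised documents."""
    total = sum(len(d) for d in docs)
    unigrams = Counter(t for d in docs for t in d)
    bigrams = Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))
    units = [(t,) for t in tokens]  # each unit is a tuple of tokens
    while len(units) > 1:
        # Approximation: score a pair of units by their boundary tokens only.
        scores = [
            significance(bigrams.get((u[-1], v[0]), 0),
                         unigrams[u[-1]], unigrams[v[0]], total)
            for u, v in zip(units, units[1:])
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < alpha:
            break
        units[best:best + 2] = [units[best] + units[best + 1]]
    return [" ".join(u) for u in units]
```

On a toy corpus where "frequent pattern" co-occurs far more often than chance, the loop merges that pair first and then stops, leaving the segmentation ["frequent pattern", "mining"].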


    Proposed System

    Phrase mining refers to the process of automatically extracting high-quality phrases (e.g., scientific terms and general entity names) from a given corpus. The proposed system is a novel automated phrase mining framework, AutoPhrase, which goes beyond SegPhrase to further avoid additional manual labelling effort and to enhance performance, mainly by means of the following two new techniques.

    • Robust Positive-Only Distant Training. Many high-quality phrases are freely available in general knowledge bases, and they can easily be obtained at a scale far larger than what human experts could produce. Domain-specific corpora usually contain some quality phrases that are also encoded in general knowledge bases, even when no other domain-specific knowledge base exists. Therefore, for distant training, the existing high-quality phrases available in general knowledge bases such as Wikipedia and Freebase are leveraged, removing the need for additional manual labelling. Samples of positive labels are built from the general knowledge bases and samples of negative labels from the given domain corpus, and a number of base classifiers are trained independently. The predictions of these classifiers are then aggregated; their independence helps reduce the noise from the negative labels.

    • POS-Guided Phrasal Segmentation. There is a trade-off between accuracy and domain independence when incorporating linguistic processors into a phrase mining method. On the domain-independence side, accuracy may be limited without linguistic knowledge, and it is difficult to support multiple languages well if the method is completely language-blind. On the accuracy side, relying on complex, trained linguistic analysers may hurt the domain independence of the method; for example, it is expensive to adapt dependency parsers to special domains like clinical reports. As a compromise, a pre-trained part-of-speech (POS) tagger is incorporated to further enhance performance when one is available for the language of the document collection. POS-guided phrasal segmentation leverages the shallow syntactic information in POS tags to guide the phrasal segmentation model in locating phrase boundaries more accurately.
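The robust positive-only distant training idea can be sketched as an ensemble whose members see only small, noisy samples of the label pools. This is a simplified illustration, not AutoPhrase's actual classifier (the paper trains perturbed decision trees on richer phrase features); the single-feature stumps, the sample size `k`, and the two-dimensional feature vectors in the usage example are all assumptions made for brevity.

```python
import random

def train_ensemble(pos_feats, neg_feats, T=100, k=10, seed=7):
    """Train T independent one-feature stump classifiers, each on a small
    random draw from the positive pool (knowledge-base phrases) and the
    noisy negative pool (corpus candidates, which may hide quality phrases).
    Averaging many such weak, independent votes dampens the label noise."""
    rng = random.Random(seed)
    dims = len(pos_feats[0])
    stumps = []
    for _ in range(T):
        pos = rng.sample(pos_feats, min(k, len(pos_feats)))
        neg = rng.sample(neg_feats, min(k, len(neg_feats)))
        d = rng.randrange(dims)  # each stump inspects one random feature
        pos_mean = sum(x[d] for x in pos) / len(pos)
        neg_mean = sum(x[d] for x in neg) / len(neg)
        thr = (pos_mean + neg_mean) / 2
        # Orient the stump so that "above threshold" means positive.
        sign = 1 if pos_mean >= neg_mean else -1
        stumps.append((d, thr, sign))
    return stumps

def quality(stumps, feat):
    """Averaged vote of the ensemble: an estimated phrase quality in [0, 1]."""
    votes = sum(1 for d, thr, sign in stumps if sign * (feat[d] - thr) > 0)
    return votes / len(stumps)
```

A candidate whose (hypothetical) statistical features resemble the knowledge-base positives then receives a quality score near 1, while one resembling the negative pool scores near 0.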


    Architecture


