Corpora/notes/plan
Background
This document provides internal notes (which may be rough or out of date) for Task Force (DE), i.e. the combination of activities (D) on very large corpora and (E) on word embeddings.
Work Plan
The initial NLPL work plan foresaw two strands of corpora-related activities: one seeking to enable project-wide access to licensed corpora (e.g. the GigaWord collections from the LDC), and another aiming to simplify the creation of and access to large text collections derived from Wikipedia and the Common Crawl.