Language Identification, Transliteration, and Translation Service


Project scope
Categories
Information technology, Software development, Databases, Media
Skills
GitHub, language translation, Scrum (software development), language interpretation, Extensible Markup Language (XML), metadata, testbed, language identification, software project management, transliteration
Our organization provides identification services for the global media and entertainment industry. (EIDR IDs are to movies and TV as ISBNs are to books, VINs are to cars, and UPC/EAN codes are to consumer products.) Descriptive metadata records for media programs are submitted in a wide variety of languages and scripts. We need to identify the languages used and produce normalized (transliterated and translated) versions for display and de-duplication.
We need to develop a service that will read each record in our database and:
- Determine which language is used for each field
- If the script is not in the Latin-1 character set, then:
  - Transliterate selected fields to Latin-1 (Romanize)
  - Translate other fields to English
- Store the updated records in our database
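The per-record flow above can be sketched roughly as follows. This is a minimal illustration, not the actual service design: the field names (`title`, `description`), the record shape (a flat dictionary of field name to text), and the three callables `identify`, `transliterate`, and `translate` are all hypothetical stand-ins for whichever language tools and API bindings the team selects. Only the Latin-1 check relies on real behavior, namely Python's built-in `latin-1` codec.

```python
from typing import Callable, Dict

# Hypothetical field groupings; the real record schema is not specified here.
TRANSLITERATE_FIELDS = {"title"}
TRANSLATE_FIELDS = {"description"}


def is_latin1(text: str) -> bool:
    """Return True if every character can be encoded as ISO-8859-1 (Latin-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def normalize_record(
    record: Dict[str, str],
    identify: Callable[[str], str],
    transliterate: Callable[[str, str], str],
    translate: Callable[[str, str], str],
) -> Dict[str, Dict[str, str]]:
    """Tag each field with its language and add a normalized form if needed.

    The three callables are placeholders for external language tools:
      identify(text) -> language code
      transliterate(text, language) -> Romanized text
      translate(text, language) -> English text
    """
    updated = {}
    for name, value in record.items():
        language = identify(value)
        entry = {"original": value, "language": language}
        if not is_latin1(value):
            if name in TRANSLITERATE_FIELDS:
                entry["normalized"] = transliterate(value, language)
            elif name in TRANSLATE_FIELDS:
                entry["normalized"] = translate(value, language)
        updated[name] = entry
    return updated
```

Passing the language tools in as callables keeps the routing logic independent of any one vendor, which suits the testbed goal: a different identification or translation tool can be swapped in without touching the record-handling code. Storing the result back in the database is left out because it depends on the registry's own API.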
This will involve several different steps for the students, including:
- Familiarizing themselves with commercial language translation and transliteration tools
- Familiarizing themselves with our XML-based API
- Developing an architecture that will review and update our existing records and act as a testbed for future language tool development
- Selecting the best technologies and tools for this project, given our existing technology stack and available resources
- Building, testing, tuning, and deploying the service
- Developing comprehensive documentation describing the service for operations and ongoing maintenance
By the end of the project, students should demonstrate:
- An improved understanding of linguistic terms and challenges
- Familiarity with commercial language tools
- Familiarity with common software project tools, including GitHub and Jira
- Familiarity with the Scrum and Kanban project frameworks
- An improved understanding of the global media market
Final deliverables should include:
- A working language identification, transliteration, and translation service
- A presentation covering the alternatives explored, the decisions made, and the final product produced
Students will become part of our software development team. They will receive direct supervision and mentoring from our Technology Director and will have access to our professional developers for technical advice and assistance. The project will be broken down into a series of smaller deliverables with ongoing review and detailed feedback at each stage.
About the company
The Entertainment Identifier Registry Association (EIDR) is a not-for-profit industry association that supplies the global entertainment supply chain with universal identifiers for a broad array of audio visual objects. EIDR IDs are to movies, TV, games, and podcasts as ISBNs are to books, VINs are to cars, or UPC/EAN codes are to consumer products. The EIDR registry is, and always has been, read-for-free, though we do restrict write-access to authorized parties only. Our identifiers are critical to applications throughout the media and entertainment industry from production to public presentation, by archives, and in academic citation. Our Board includes Amazon, Google, Gracenote, NBCUniversal, Paramount, Sony Pictures, Disney, Warner Bros, and Xperi.