A look into my project

"I don't quite know how to put it into words, but I feel for the audience that I have; I know them." - Idina Menzel

Welcome to my third blog post for Outreachy.

Today I will be talking straight about my project: Synchronising Wikidata and Wikipedias using pywikibot.

Outreachy suggests that we write this post to explain the project to someone who has "never heard of [my] community before and have never worked on this type of project before." My community is Wikimedia (note the 'm' in the middle). It is often unknown, or known but confused with something else, yet its raison d'etre is something well known: Wikipedia, that omniscient repository of knowledge, the answerer of our questions and the final decider of our arguments about that obscure event or fact.

My project centers on three things: Wikipedia, Wikidata and the Python programming language. I will explain each in brief, along with the organization I am interning with:

  • Wikimedia Foundation - The Wikimedia Foundation is a non-profit organization that hosts Wikipedia and other free knowledge projects, provides the essential infrastructure for running them, and advances the cause of open source and knowledge access for all.
  • Wikipedia - Wikipedia is a free, multilingual online encyclopedia written and maintained by a community of volunteer contributors through a model of open collaboration, using a wiki-based editing system.
  • Wikidata - Wikidata is a structured data repository linked to Wikipedia and the other Wikimedia projects. It holds structured data about a huge number of concepts, including every topic covered by a Wikipedia article, and many scientific papers and other topics. It also includes the interlanguage links between Wikipedia articles in different languages, links from Wikipedia to Commons, and between other Wikimedia projects. (definition from project description)
  • Python - Python is a popular, interpreted, high-level, general-purpose programming language. It's an ideal choice for this project since the scripts I am building upon are written in it, and there's an excellent Python library (Pywikibot) to interface with Wikipedia and Wikidata, which are the targets of my project.

With these key components defined, my project in a nutshell is to import data from Wikipedia to Wikidata using Python.

Simple it seems, right?

Except it's not really so...

Wikipedia is vast (see amazing details here: Size of Wikipedia), so it's often overwhelming to decide where to start, what to start with and even how to start. I am, however, happy with all the guidance and help from my mentor, Mike Peel.

Additionally, Wikipedia uses prose to describe events and topics, which makes its content quite hard for machines to access (of course it is meant for humans to read, but these days you cannot ignore the needs of machines and tools too!).

During the preliminary work for the Outreachy application, I created these Python modules to demonstrate how to do these extractions. While the actual logic differs from module to module, they share common logic where they need to. They adapt to the various ways we can extract data from the free-form text of Wikipedia and export it to a structured-data repository, that is, Wikidata.
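
To give a rough feel for what "extracting data from free-form text" can mean, here is a minimal sketch. It is a simplified illustration and not code from the modules themselves: the sample wikitext, the function names and the dictionary layout are all made up for this post. It only shows the idea of pulling one value (a date of birth) out of article markup and shaping it as structured data. P569 is Wikidata's "date of birth" property. Actually saving the value to Wikidata would be done through Pywikibot and is left out here.

```python
import re

# Hypothetical sample of article markup; real articles are far messier.
SAMPLE_WIKITEXT = """{{Infobox person
| name        = Ada Lovelace
| birth_date  = {{birth date|1815|12|10|df=y}}
| occupation  = Mathematician
}}
'''Ada Lovelace''' was an English mathematician and writer.
"""

# Matches a {{birth date|YYYY|MM|DD...}} template and captures the three numbers.
BIRTH_DATE_RE = re.compile(
    r"\{\{\s*birth date\s*\|\s*(\d{4})\s*\|\s*(\d{1,2})\s*\|\s*(\d{1,2})",
    re.IGNORECASE,
)


def extract_birth_date(wikitext):
    """Return the birth date as a YYYY-MM-DD string, or None if not found."""
    match = BIRTH_DATE_RE.search(wikitext)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


def to_statement(property_id, value):
    """Bundle a property ID and a value into a plain dict (illustrative layout only)."""
    return {"property": property_id, "value": value}


# P569 is the Wikidata property for "date of birth".
statement = to_statement("P569", extract_birth_date(SAMPLE_WIKITEXT))
print(statement)
```

A single regular expression like this only covers one way of writing a date. Handling the many other ways editors write the same fact is where most of the real work goes.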

With the start of the Outreachy project proper, I continued the work here, which is still ongoing.

Thanks for reading... Until next time.