Extracting Data from Wikidata Using SPARQL and Python

Jelle van Kerkvoorde
Published in ITNEXT
3 min read · Mar 27, 2023


Wikidata is a free, multilingual knowledge graph that contains structured data on a vast array of topics, ranging from people and organizations to locations and events. While the Wikidata web interface allows you to browse and search the data, sometimes you may want to extract and work with the data programmatically. In this tutorial, we will demonstrate how to use the SPARQL query language and the SPARQLWrapper library in Python to query and extract data from Wikidata and convert the results into a pandas DataFrame.

Getting Started

To get started, we will need to install the SPARQLWrapper library, a Python wrapper around SPARQL endpoints: it sends a query to a remote service and converts the response into a format that is convenient to work with. You can install the library using pip:

pip install SPARQLWrapper

We will also need the pandas library, which we will use to store and manipulate the data we extract from Wikidata:

pip install pandas

Writing a SPARQL Query

The first step in extracting data from Wikidata is to write a SPARQL query. SPARQL is a query language for RDF data that allows you to query and retrieve data from RDF triplestores. Wikidata exposes a SPARQL endpoint that allows you to query the data using SPARQL.
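To make the idea of an endpoint concrete, here is a minimal sketch of how a query can be turned into a plain HTTP GET URL for the Wikidata Query Service. It only builds the URL and does not send a request. The helper name `build_request_url` and the example query are our own illustrations, not part of any library.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request_url(sparql_query: str) -> str:
    """Return the GET URL that asks the endpoint to run `sparql_query`.

    Hypothetical helper for illustration only.
    """
    # `query` carries the SPARQL text; `format=json` asks for JSON results.
    params = urlencode({"query": sparql_query, "format": "json"})
    return f"{WIKIDATA_ENDPOINT}?{params}"


# A deliberately tiny query, just to show the URL encoding.
example_query = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
print(build_request_url(example_query))
```

In practice we will let SPARQLWrapper build and send this request for us, but it helps to know that nothing more exotic than an HTTP call is involved.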

For this tutorial, we will write a simple SPARQL query that retrieves the name, location, and founding date of all cities in the United States:

query = """
SELECT ?city ?cityLabel ?location ?locationLabel ?founding_date
WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515.
  ?city wdt:P17 wd:Q30.
  ?city wdt:P625 ?location.
  ?city wdt:P571 ?founding_date.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
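The short names in the query, such as wd:Q30 and wdt:P17, are prefixed names: the Wikidata endpoint predefines the prefixes, so we do not have to declare them ourselves. As a small sketch, the snippet below expands a few of them into full IRIs. The `expand` helper and the `PREFIXES` dictionary are our own illustration and cover only the four prefixes used in this query.

```python
# Namespace prefixes used in the query, as predefined by the Wikidata Query Service.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",        # entities (Q-ids)
    "wdt": "http://www.wikidata.org/prop/direct/",  # direct property values (P-ids)
    "wikibase": "http://wikiba.se/ontology#",
    "bd": "http://www.bigdata.com/rdf#",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'wd:Q30' into its full IRI.

    Hypothetical helper for illustration only.
    """
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


for name in ("wd:Q515", "wd:Q30", "wdt:P31", "wdt:P571"):
    print(f"{name:10} -> {expand(name)}")
```

The expanded form of wd:Q30 is the same kind of IRI that comes back in the ?city column of the results.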

Let’s break down this query:

  • In Wikidata, entities are identified by unique identifiers called “Q-ids”. For example, “Q30” is the Q-id for the United States of America. Properties, on the other hand, are identified by unique identifiers called “P-ids”. For example, “P571” is the P-id for the inception property.
  • SELECT ?city ?cityLabel ?location ?locationLabel ?founding_date selects the variables we want to retrieve from the query. In this case, we want to retrieve the city, its label, its location, its location label, and its founding date. Because ?location holds a coordinate value rather than an entity, ?locationLabel simply repeats that value.
  • WHERE {…} specifies the conditions that the results must meet. In this case, we are looking for entities that meet the following conditions:
      • The entity is an instance of city (Q515), or of any class that is a subclass of city (?city wdt:P31/wdt:P279* wd:Q515).
      • The entity’s country is the United States (?city wdt:P17 wd:Q30).
      • The entity has a coordinate location (?city wdt:P625 ?location).
      • The entity has an inception date, which we treat as its founding date (?city wdt:P571 ?founding_date).
  • SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } retrieves the labels for the entities returned in the query. We specify that we want the labels in English.
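The property path wdt:P31/wdt:P279* is the least obvious part of the query, so here is a toy sketch of what it matches: one "instance of" step followed by zero or more "subclass of" steps. The data below is invented for illustration and is not real Wikidata content, and `matches_path` is a hypothetical helper, not something the endpoint exposes.

```python
# Hypothetical toy graph, not real Wikidata content.
INSTANCE_OF = {  # plays the role of wdt:P31
    "Springfield": {"big city"},
    "Lake Placid": {"village"},
}
SUBCLASS_OF = {  # plays the role of wdt:P279
    "big city": {"city"},
    "city": {"human settlement"},
    "village": {"human settlement"},
}


def matches_path(item: str, target_class: str) -> bool:
    """Return True if item --P31--> some class --P279*--> target_class."""
    # Start from the classes the item is a direct instance of.
    to_visit = list(INSTANCE_OF.get(item, ()))
    seen = set()
    while to_visit:
        cls = to_visit.pop()
        if cls == target_class:
            return True
        if cls not in seen:
            seen.add(cls)
            # Follow subclass links upward (the `*` means zero or more steps).
            to_visit.extend(SUBCLASS_OF.get(cls, ()))
    return False


print(matches_path("Springfield", "city"))  # True: big city is a subclass of city
print(matches_path("Lake Placid", "city"))  # False: village never leads to city
```

Without the P279* part, the query would only find items whose "instance of" value is exactly city and would miss items typed with a more specific class.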

Querying Wikidata Using SPARQLWrapper

Now that we have our SPARQL query, we can use the SPARQLWrapper library in Python to query the data from Wikidata. Here is an example Python script that executes the query and stores the results in a Pandas DataFrame:
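What follows is a minimal sketch of such a script rather than a definitive implementation. The class name `WikiDataQueryResults` and the method `load_as_dataframe` match the usage shown further down; everything else (the `flatten_bindings` helper, the `_fetch_bindings` method, and the placeholder user-agent string) is our own assumption. Wikimedia asks clients to identify themselves, so replace the user agent with one that describes your tool.

```python
from typing import Any, Dict, List

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def flatten_bindings(bindings: List[Dict[str, Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Turn SPARQL JSON bindings into plain {variable: value} rows."""
    # Each binding maps a variable name to {"type": ..., "value": ...};
    # only the "value" part is kept.
    return [{var: cell["value"] for var, cell in row.items()} for row in bindings]


class WikiDataQueryResults:
    """Run a SPARQL query against Wikidata and expose the result as a DataFrame."""

    def __init__(self, query: str, user_agent: str = "wikidata-tutorial-example/0.1"):
        self.query = query
        # Assumed placeholder; use a string that identifies your own tool.
        self.user_agent = user_agent

    def _fetch_bindings(self) -> List[Dict[str, Dict[str, Any]]]:
        # Imported here so the class can be defined without SPARQLWrapper installed.
        from SPARQLWrapper import JSON, SPARQLWrapper

        sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent=self.user_agent)
        sparql.setQuery(self.query)
        sparql.setReturnFormat(JSON)
        response = sparql.query().convert()
        return response["results"]["bindings"]

    def load_as_dataframe(self):
        """Execute the query and return a pandas DataFrame, one row per result."""
        import pandas as pd

        return pd.DataFrame(flatten_bindings(self._fetch_bindings()))
```

Keeping the flattening step in its own function means it can be tried on a hand-written sample of bindings without touching the network.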

This class can be used by anyone who wants to query data from Wikidata using Python, and then manipulate the results as a Pandas DataFrame.

When you run the class with the above query:

data_extracter = WikiDataQueryResults(query)
df = data_extracter.load_as_dataframe()
print(df.head())

It will result in:

                                     city         founding_date                            location         cityLabel                       locationLabel
0  http://www.wikidata.org/entity/Q486868  1888-01-01T00:00:00Z  Point(-117.755833333 34.060833333)            Pomona  Point(-117.755833333 34.060833333)
1   http://www.wikidata.org/entity/Q43301  1872-01-01T00:00:00Z  Point(-119.792222222 36.781666666)            Fresno  Point(-119.792222222 36.781666666)
2    http://www.wikidata.org/entity/Q5917  1909-02-17T00:00:00Z  Point(-117.999722222 33.692777777)  Huntington Beach  Point(-117.999722222 33.692777777)
3  http://www.wikidata.org/entity/Q509604  1906-01-20T00:00:00Z  Point(-124.157222222 40.598055555)           Fortuna  Point(-124.157222222 40.598055555)
4  http://www.wikidata.org/entity/Q159260  1777-01-01T00:00:00Z         Point(-121.966666666 37.35)       Santa Clara         Point(-121.966666666 37.35)
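All values come back as strings, so a natural next step is to parse them. The sketch below shows one way to split the Point(longitude latitude) text into numbers and to turn the timestamp into a Python datetime. Both helper names are hypothetical. The pattern assumes plain decimal coordinates like the ones above, and the timestamp format assumes four-digit years, so dates that Python's datetime cannot represent would need separate handling.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# Matches 'Point(<lon> <lat>)' with plain decimal numbers.
_POINT_PATTERN = re.compile(r"^Point\((-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)\)$")


def parse_point(wkt: str) -> Tuple[float, float]:
    """Split 'Point(lon lat)' into (longitude, latitude) as floats."""
    match = _POINT_PATTERN.match(wkt)
    if match is None:
        raise ValueError(f"not a recognised point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as '1888-01-01T00:00:00Z' as a UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


longitude, latitude = parse_point("Point(-117.755833333 34.060833333)")
print(longitude, latitude)
print(parse_timestamp("1888-01-01T00:00:00Z").year)
```

On the DataFrame itself, these helpers could be applied column by column, for example with `df["location"].map(parse_point)` and `df["founding_date"].map(parse_timestamp)`.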

In Conclusion

Wikidata is a vast source of structured knowledge that can be accessed programmatically using SPARQL queries. With the help of the SPARQLWrapper library in Python, it is easy to execute SPARQL queries against the Wikidata SPARQL endpoint and retrieve the results as a Pandas DataFrame. The retrieved data can be used for a wide range of purposes, such as data analysis, data visualization, or machine learning. The ability to programmatically access and analyze data from Wikidata opens up exciting possibilities for researchers, developers, and data scientists to explore and leverage this rich source of knowledge.


Data Hobbyist | Computer Vision Engineer | Python Wizard | Geodata Enthusiast