Wikidata

From Apertium
Revision as of 09:41, 21 June 2016 by Unhammer (talk | contribs) (→‎Getting labels from dumps)

Here's an example query to get proper name translations for countries in Nynorsk/Bokmål/Danish from Wikidata:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
SELECT * WHERE {
  ?p wdt:P31/wdt:P279 wd:Q6256 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "nn" .
        ?p rdfs:label ?nnName .
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "no,nb" .
        ?p rdfs:label ?nbName .
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "da" .
        ?p rdfs:label ?daName .
  }
 } LIMIT 10

You can paste that into https://query.wikidata.org/ to get the first 10 hits.

Hover over identifiers like "wd:Q6256" to see what they refer to, or look them up at URLs like https://www.wikidata.org/wiki/Q6256 (items) or https://www.wikidata.org/wiki/Property:P279 (properties).

To get more hits, click the "🔗Link▼" button, then right-click "REST Endpoint" and copy the link. You can fetch that URL with curl, with an increased LIMIT, and save the output to a file, e.g.

curl -o result.xml 'https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=PREFIX+wikibase%3A+%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0D%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+p%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0D%0APREFIX+v%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fstatement%2F%3E%0D%0ASELECT+*+WHERE+%7B%0D%0A+%3Fp+wdt%3AP31%2Fwdt%3AP279+wd%3AQ6256+.%0D%0A++SERVICE+wikibase%3Alabel+%7B%0D%0A++++bd%3AserviceParam+wikibase%3Alanguage+%22nn%22+.%0D%0A++++++++%3Fp+rdfs%3Alabel+%3FnnName+.%0D%0A++%7D%0D%0A++SERVICE+wikibase%3Alabel+%7B%0D%0A++++bd%3AserviceParam+wikibase%3Alanguage+%22no%2Cnb%22+.%0D%0A++++++++%3Fp+rdfs%3Alabel+%3FnbName+.%0D%0A++%7D%0D%0A++SERVICE+wikibase%3Alabel+%7B%0D%0A++++bd%3AserviceParam+wikibase%3Alanguage+%22da%22+.%0D%0A++++++++%3Fp+rdfs%3Alabel+%3FdaName+.%0D%0A++%7D%0D%0A+%7D+LIMIT+1000%0D%0A'
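
Hand-editing a percent-encoded URL like the one above is error-prone. As a minimal sketch, the same kind of URL can be built from a readable query with Python's standard library. The endpoint is the one from the curl command; the function name `build_sparql_url` and its `limit` parameter are made up for this illustration, and the query passed in is assumed not to end in a LIMIT clause already.

```python
from urllib.parse import urlencode

# Endpoint copied from the curl command above.
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


def build_sparql_url(query, limit=None, endpoint=ENDPOINT):
    """Return a GET URL for `query` (hypothetical helper).

    If `limit` is given, a LIMIT clause is appended, so `query`
    should not already end in one.
    """
    if limit is not None:
        query = f"{query.rstrip()}\nLIMIT {int(limit)}"
    # urlencode percent-encodes the query and turns spaces into '+'.
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT * WHERE {
  ?p wdt:P31/wdt:P279 wd:Q6256 .
}
"""
    # Paste the printed URL into curl -o result.xml '...'
    print(build_sparql_url(query, limit=1000))
```

The printed URL can be passed to curl in single quotes, as in the command above.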

Only where names differ

This query is meant to list only the countries whose Nynorsk and Danish names differ, which might be a more interesting list:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
SELECT * WHERE {
  ?p wdt:P31/wdt:P279 wd:Q6256 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "nn" .
        ?p rdfs:label ?nnName .
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "da" .
        ?p rdfs:label ?daName .
  }
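  # Compare the plain strings: labels carry language tags, so "X"@nn and "X"@da are different literals.
  FILTER (STR(?nnName) != STR(?daName))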
 } LIMIT 10
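
The same comparison can also be done locally on a downloaded result file. The following is a rough sketch that assumes the file is in the standard SPARQL XML results format and that the variables are called nnName and daName as in the query; the function names are invented for this example. It compares the label strings only and ignores the xml:lang attribute.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}


def rows_from_sparql_xml(xml_data):
    """Yield one {variable: value} dict per <result> element."""
    root = ET.fromstring(xml_data)
    for result in root.iterfind("sr:results/sr:result", NS):
        row = {}
        for binding in result.iterfind("sr:binding", NS):
            value = binding[0]  # the <uri>, <literal> or <bnode> child
            row[binding.get("name")] = value.text or ""
        yield row


def differing(rows, a="nnName", b="daName"):
    """Keep rows where both labels are bound and their strings differ."""
    return [r for r in rows if a in r and b in r and r[a] != r[b]]
```

For example, `differing(rows_from_sparql_xml(open("result.xml", "rb").read()))` would return the rows of a downloaded result.xml whose two names are not identical.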

Getting labels from dumps

See https://www.wikidata.org/wiki/Wikidata:Database_download for the dumps; downloads are at https://dumps.wikimedia.org/wikidatawiki/entities/ – then you can run:

$ bzcat wikidata-20160229-all.json.bz2 \
  | grep '^{' | sed 's/,$//' \
  | jq -c '{ "da":.labels.da.value, "sv":.labels.sv.value, "nn":.labels.nn.value }'

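(The dump is one huge JSON array with one entity per line. The grep+sed drops the array brackets and the trailing commas, so jq reads each line as a separate document instead of trying to fit the whole array in memory.)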
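
If jq is not at hand, the same line-by-line trick can be sketched in Python. This is only an illustration under the assumption that the dump has one entity per line inside a single JSON array; the function names are hypothetical, and the language list simply mirrors the jq expression above.

```python
import bz2
import json


def iter_entities(lines):
    """Yield one entity dict per dump line, skipping the '[' and ']' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # array brackets or blank lines
        # Every entity line except the last ends in a comma.
        yield json.loads(line.rstrip(","))


def labels(entity, langs=("da", "sv", "nn")):
    """Return {lang: label or None}, like the jq expression above."""
    found = entity.get("labels") or {}
    return {lang: found.get(lang, {}).get("value") for lang in langs}


def dump_labels(path):
    """Stream label dicts from a .json.bz2 dump without loading it whole."""
    # bz2.open in text mode decompresses lazily while iterating over lines.
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for entity in iter_entities(dump):
            yield labels(entity)
```

Looping over `dump_labels("wikidata-20160229-all.json.bz2")` and printing `json.dumps(row, ensure_ascii=False)` for each row gives output similar to the jq pipeline.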

You'll get some silliness like {"da":"CSS","sv":"Cascading Style Sheets","nn":"Stilark"} but there's probably some gold in there as well.


A simple way to get only toponyms is to check that the entry "claims" a coordinate location (property P625), i.e.

$ bzcat wikidata-20160229-all.json.bz2 \
  | grep '^{' | sed 's/,$//' \
  | jq -c 'select(.claims.P625) | { "da":.labels.da.value, "sv":.labels.sv.value, "nn":.labels.nn.value }'

(You can't just grep for '"datatype":"globe-coordinate"' or P625, since it has to be the top-level entry which has that property. A simple grep also matches properties-of-properties, e.g. the entry for Borges, who was buried at a coordinate location.)
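
The difference between the two checks can be illustrated with a small Python sketch. The two entities below are heavily simplified, made-up stand-ins for real dump entries (placeholder ids, most fields omitted), and the function names are hypothetical; P119 is the "place of burial" property.

```python
import json


def has_coordinates(entity):
    """True only if the entity itself has a P625 (coordinate location) claim."""
    claims = entity.get("claims") or {}
    return bool(claims.get("P625"))


def mentions_p625(entity):
    """Roughly what a plain grep tests: P625 anywhere in the JSON text."""
    return '"P625"' in json.dumps(entity)


# Simplified illustrations, not real dump entries.
place = {
    "id": "Q_PLACE",
    "claims": {"P625": [{"mainsnak": {"property": "P625"}}]},
}
person = {
    "id": "Q_PERSON",
    "claims": {
        "P119": [{
            "mainsnak": {"property": "P119"},
            # The coordinates belong to the burial statement, not the person.
            "qualifiers": {"P625": [{"property": "P625"}]},
        }]
    },
}

for entity in (place, person):
    print(entity["id"], has_coordinates(entity), mentions_p625(entity))
```

Only the first entity passes the top-level check, while the text search matches both.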


More info at https://meta.wikimedia.org/wiki/Grants:Learning_patterns/Using_Wikidata_to_make_Machine_Translation_dictionary_entries

See also