-4

I have a large text file, around 200,000 lines of word translations. I want to keep the translated text, which appears after the tab.

abaxial van  osovine
abbacy  opatstvo
abbaino     kora
abbatial    opatski
abbe    opat
abbé    opat
abbé    sveæenik
hematological parameters    hematološki pokazatelji

How can I strip all characters before the first instance of a tab?

Martin Erlic
  • `text.split("\\s{2}\\s*")` – Hovercraft Full Of Eels Dec 18 '17 at 00:41
  • Just tested, and only removes strings before two spaces, but not for 1, 3, 4, etc... – Martin Erlic Dec 18 '17 at 00:49
  • For some reason `String content = line.substring(line.lastIndexOf("\t") + 1);` worked for me. – Martin Erlic Dec 18 '17 at 00:57
  • @MartinErlic That's weird, because `\t` is **not a space**. Perhaps your question should be: "How do I remove all text before the first **tab** character?". Or last, as your code snippet does it. – Andreas Dec 18 '17 at 01:10
  • Interesting. I didn't realize that. They all seem to be tabs, but some are of different sizes. – Martin Erlic Dec 18 '17 at 01:13
  • @MartinErlic Then perhaps you should figure out what the data is, before you try to manipulate it. Use a good text editor that will show you spaces and tabs, e.g. [Notepad++](https://notepad-plus-plus.org/). – Andreas Dec 18 '17 at 01:14
  • Confused!!! Question title *"remove all characters **after** a tab"*. Question text: *"strip all characters **before** two spaces"*. Is it tab or 2 spaces? Is it text before or after that needs to be eliminated? – Andreas Dec 18 '17 at 01:16
  • If the question isn't about spaces, then please edit the text of the question to reflect that. – Keara Dec 18 '17 at 01:17
  • We've had better questions on this site, to be sure. – Hovercraft Full Of Eels Dec 18 '17 at 01:24
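
The comments above suggest the delimiter is a single tab character. A minimal sketch of the non-regex approach under that assumption follows; the class and method names are hypothetical. It uses `indexOf` rather than `lastIndexOf`, which only makes a difference when a line contains more than one tab.

```java
class FirstTab {
    /** Returns the text after the first tab, or the whole line if it has no tab. */
    static String afterFirstTab(String line) {
        int tab = line.indexOf('\t');          // -1 when the line contains no tab
        return tab < 0 ? line : line.substring(tab + 1);
    }

    public static void main(String[] args) {
        // Sample lines from the question, assuming the gap is a single tab.
        System.out.println(afterFirstTab("abbacy\topatstvo"));
        System.out.println(afterFirstTab("hematological parameters\thematološki pokazatelji"));
    }
}
```

For a 200,000-line file, a method like this could be applied to each line as it is read, so no regex needs to be compiled or matched at all.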

2 Answers

2

You can use this regex to match everything before the translation:

 .+? {2,}

Try this regex online: https://regex101.com/r/P0TY1k/1

Pass this regex to replaceAll on your string; replaceAll returns the result as a new string rather than modifying the original:

yourString.replaceAll(".+? {2,}", "");

EDIT: If the delimiter is not 2 spaces but a tab, you can try this regex instead:

.+?(?: {2,}|\t)
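
As a rough sketch of how that pattern might be applied line by line (the class and method names are hypothetical, and the leading `^` plus `replaceFirst` are additions not in the regex above, made so that only the text before the first delimiter is removed):

```java
import java.util.regex.Pattern;

class StripBeforeDelimiter {
    // The pattern above, anchored at the start of the line and compiled once.
    // "\\t" in Java source is the regex escape \t, which matches a tab.
    private static final Pattern PREFIX = Pattern.compile("^.+?(?: {2,}|\\t)");

    /** Removes everything up to and including the first tab or run of 2+ spaces. */
    static String translationOnly(String line) {
        return PREFIX.matcher(line).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(translationOnly("abaxial van  osovine")); // prints "osovine"
        System.out.println(translationOnly("abbé\topat"));           // prints "opat"
    }
}
```

A line with no delimiter is returned unchanged, because the pattern does not match it.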
Sweeper
  • Sorry, good answer, but question was wrong, and the "2 spaces" was actually a tab character. See [comment to question](https://stackoverflow.com/questions/47860805/remove-all-characters-after-the-first-instance-of-a-character-that-follows-more#comment82686635_47860805). – Andreas Dec 18 '17 at 01:12
  • Question says *"first instance of ..."*, so shouldn't the regex begin with `.+?`? With a [greedy quantifier](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#greedy) like `.+`, it will replace all up to *last* instance, not first. A [reluctant quantifier](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#reluc) like `.+?` will stop at first instance. – Andreas Dec 18 '17 at 01:21
  • @Andreas I see that the original/translation pairs are all in separate lines, so that wouldn't make a difference would it? Edited anyway. – Sweeper Dec 18 '17 at 01:23
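
The greedy versus reluctant point in these comments can be illustrated with a small sketch. The two-tab input line is a made-up example (the sample data in the question appears to have one delimiter per line), and the class and method names are hypothetical.

```java
class GreedyVsReluctant {
    /** Greedy: .+ backtracks from the end of the line, so it stops at the LAST tab. */
    static String stripGreedy(String line) {
        return line.replaceFirst("^.+\\t", "");
    }

    /** Reluctant: .+? grows one character at a time, so it stops at the FIRST tab. */
    static String stripReluctant(String line) {
        return line.replaceFirst("^.+?\\t", "");
    }

    public static void main(String[] args) {
        String line = "source\tfirst\tsecond"; // hypothetical line with two tabs
        System.out.println(stripGreedy(line));    // prints "second"
        System.out.println(stripReluctant(line)); // prints "first" + tab + "second"
    }
}
```

On a line with exactly one tab, both variants return the same result, which is the situation Sweeper describes.
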
0

You could use a regex to do the string manipulation fairly efficiently.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

/**
 * Splits a translation line into two named groups (key, value) around the
 * first run of two or more whitespace characters.<br>
 * Group 1 (key) is the text before the whitespace run.<br>
 * Group 2 (value) is the text after the whitespace run.<br>
 */
private static final Pattern TRANSLATION_PATTERN = Pattern.compile("(?<key>.*?)\\s\\s+(?<value>.*)");

public static String grabTextAfterTwoSpaces(String input) {
    Matcher matcher = TRANSLATION_PATTERN.matcher(input);

    /*
     * You have to call .matches() for the regex to actually be applied.
     */
    if (!matcher.matches()) {
        throw new IllegalArgumentException(String.format("Provided input:[%s] did not contain two spaces", input));
    }

    return matcher.group("value");
}

public static void main(String[] args) {
    System.out.println(grabTextAfterTwoSpaces("abaxial van  osovine"));
    System.out.println(grabTextAfterTwoSpaces("abbacy  opatstvo"));
    System.out.println(grabTextAfterTwoSpaces("abbaino     kora"));
    System.out.println(grabTextAfterTwoSpaces("abbatial    opatski"));
    System.out.println(grabTextAfterTwoSpaces("abbe    opat"));
    System.out.println(grabTextAfterTwoSpaces("abbé    opat"));
    System.out.println(grabTextAfterTwoSpaces("abbé    sveæenik"));
    System.out.println(grabTextAfterTwoSpaces("abbacy  opatstvo"));

    System.out.println(grabTextAfterTwoSpaces("hematological parameters    hematološki pokazatelji"));
}

}


If you read the group named "value", you get everything after the run of two or more spaces. The program prints:

osovine

opatstvo

kora

opatski

opat

opat

sveæenik

opatstvo

hematološki pokazatelji
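
If the delimiter turns out to be a tab, as the comment below says, the same named-group idea could be adapted roughly as follows. This is only a sketch under that assumption; the class name, the method name, and the choice to return both halves as a map entry are hypothetical.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TabSplit {
    // key: everything before the first tab; value: everything after it.
    private static final Pattern TAB_PATTERN = Pattern.compile("(?<key>[^\\t]*)\\t(?<value>.*)");

    /** Splits a line at its first tab into (source word, translation). */
    static Map.Entry<String, String> split(String line) {
        Matcher matcher = TAB_PATTERN.matcher(line);
        if (!matcher.matches()) {
            throw new IllegalArgumentException(
                    String.format("Provided input:[%s] did not contain a tab", line));
        }
        return new AbstractMap.SimpleImmutableEntry<>(matcher.group("key"), matcher.group("value"));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> pair = split("abbacy\topatstvo");
        System.out.println(pair.getValue()); // prints "opatstvo"
    }
}
```

Because `[^\\t]*` cannot cross a tab, the key always ends at the first tab, so no reluctant quantifier is needed here.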

  • Sorry, good answer, but question was wrong, and the "2 spaces" was actually a tab character. See [comment to question](https://stackoverflow.com/questions/47860805/remove-all-characters-after-the-first-instance-of-a-character-that-follows-more#comment82686635_47860805). – Andreas Dec 18 '17 at 01:12